Hadoop Installation and Running KMeans Clustering
with MapReduce Program on Hadoop
Introduction
The general topics that I will cover in this document are the Hadoop installation (Section 1) and running a KMeans clustering MapReduce program on Hadoop (Section 2).
1 Hadoop Installation
I will install Hadoop in single-node cluster mode on my personal computer using the following environment.
1. Ubuntu 12.04
2. Java JDK 1.7.0 update 21
3. Hadoop 1.2.1 (stable)
1.1 Prerequisites
Before installing Hadoop, the following prerequisites must be satisfied on our system
1. Sun JDK / OpenJDK
I use the Sun JDK from Oracle instead of OpenJDK. The package can be downloaded from here:
http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html
2. Hadoop installer packages
In this report, I will use Hadoop version 1.2.1 (stable). The package can be downloaded
from here: http://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/
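For example, the archive can be fetched from the command line with wget (any download method works; the URL is the one given above):
$ cd ~/Downloads
$ wget http://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1.tar.gz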
1.1.1 Configuring Java
On my computer I have several Java versions installed: Java 7 and Java 6. To run Hadoop, I need to configure
which version will be used. I decided to use the newer version (Java 1.7 update 21), so the following are the
steps for configuring the latest Java version as the system default.
1. Configure java
$ sudo update-alternatives --config java
There are 2 choices for the alternative java (providing /usr/bin/java).
Selection Path Priority Status
------------------------------------------------------------
0 /usr/lib/jvm/jdk1.6.0_45/bin/java 1 auto mode
1 /usr/lib/jvm/jdk1.6.0_45/bin/java 1 manual mode
* 2 /usr/lib/jvm/jdk1.7.0_21/bin/java 1 manual mode
Press enter to keep the current choice[*], or type selection number: 2
2. Configure javac
$ sudo update-alternatives --config javac
There are 2 choices for the alternative javac (providing /usr/bin/javac).
Selection Path Priority Status
------------------------------------------------------------
0 /usr/lib/jvm/jdk1.6.0_45/bin/javac 1 auto mode
1 /usr/lib/jvm/jdk1.6.0_45/bin/javac 1 manual mode
* 2 /usr/lib/jvm/jdk1.7.0_21/bin/javac 1 manual mode
Press enter to keep the current choice[*], or type selection number: 2
3. Configure javaws
$ sudo update-alternatives --config javaws
There are 2 choices for the alternative javaws (providing /usr/bin/javaws).
Selection Path Priority Status
------------------------------------------------------------
0 /usr/lib/jvm/jdk1.6.0_45/bin/javaws 1 auto mode
1 /usr/lib/jvm/jdk1.6.0_45/bin/javaws 1 manual mode
* 2 /usr/lib/jvm/jdk1.7.0_21/bin/javaws 1 manual mode
Press enter to keep the current choice[*], or type selection number: 2
4. Check the configuration
To make sure the latest java, javac, and javaws are successfully configured, I use the following command
tid@dbubuntu:~$ java -version
java version "1.7.0_21"
Java(TM) SE Runtime Environment (build 1.7.0_21-b11)
Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode)
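As an additional sanity check, javac can be verified the same way (the build string should match the selected JDK):
$ javac -version
javac 1.7.0_21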
1.1.2 Hadoop installer
After downloading the Hadoop installer package, we need to extract it into the desired directory. I
downloaded the Hadoop installer and placed it in the ~/Downloads directory.
To extract the Hadoop installer package in the local directory, I use the following command
$ tar -xzvf hadoop-1.2.1.tar.gz
1.2 System configuration
In this section I will explain step by step how to set up and prepare the system for a Hadoop single-node
cluster on my local computer. The system configuration consists of the network configuration for setting
up the host names, moving the extracted Hadoop package into the desired folder, enabling SSH, and adding
users and changing folder permissions.
1.2.1 Network configuration
In the Hadoop network configuration, all of the machines should have an alias instead of a bare IP address. To
configure network aliases, we can edit /etc/hosts on every machine that we will use as a Hadoop
master or slave. In my case, since I am using a single node, the configuration can be done with these
steps
1. Open the file /etc/hosts as a sudoer,
$ sudo nano /etc/hosts
2. I add the following lines; the second entry contains my machine's IP address,
127.0.0.1 localhost
164.125.50.127 localhost
Note :
For configuring a multi-node cluster (separate master and slave machines), the following configuration for
/etc/hosts would be used instead
# /etc/hosts (for Hadoop Master and Slave)
192.168.0.1 master
192.168.0.2 slave
1.2.2 User configuration
For security reasons, it is better to create a dedicated user for Hadoop on each machine; however, since I am
working on my local computer, I also keep using the existing user. The following commands add a new group
and user
$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser
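Since I keep working with my existing user (tid), that user should also be a member of the hadoop group so it can access the Hadoop files later; a minimal sketch, assuming the user name tid used in the rest of this document:
$ sudo usermod -aG hadoop tid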
1.2.3 Configuring SSH
Hadoop requires SSH access to manage its nodes. In this configuration, I set up SSH access to
localhost both for the hduser account created in the previous section and for the existing local user.
1. To generate an SSH key for hduser, we can use the following command
$ su - hduser
Password:
hduser@dbubuntu:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
/home/hduser/.ssh/id_rsa already exists.
Overwrite (y/n)? y
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
b8:68:c1:4b:d1:fe:4b:40:2c:1c:b4:37:db:5f:76:ee hduser@dbubuntu
The key's randomart image is:
+--[ RSA 2048]----+
| .o |
| . = |
| = * |
| . * = |
| + = S o . |
| . + + . o o |
| + . o . . |
| . . . . |
| . E |
+-----------------+
hduser@dbubuntu:~$
2. Enable SSH access to the local machine with the newly created key
$ cat /home/hduser/.ssh/id_rsa.pub >> /home/hduser/.ssh/authorized_keys
Note :
Since I also use the existing local user, I perform the steps above twice: once for hduser and once for user tid.
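To verify that key-based login works (and to store the host fingerprint on the first connection), we can simply try connecting once for each user:
$ ssh localhost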
1.2.4 Extracting Hadoop installer
I copy the extracted Hadoop package from the ~/Downloads directory into the desired folder. In this case I use
/usr/local to hold the Hadoop installation.
1. Command for copying the extracted Hadoop package into /usr/local
$ sudo cp -r ~/Downloads/hadoop-1.2.1 /usr/local/
$ cd /usr/local
2. In order to make Hadoop easier to access and to simplify handling future version upgrades, I create a
symbolic link named hadoop pointing to hadoop-1.2.1
$ sudo ln -s hadoop-1.2.1 hadoop
3. Change the ownership of the hadoop folder so that it will be accessible by user tid
$ sudo chown -R tid:hadoop hadoop
1.3 Hadoop Configuration
After successfully configuring the network, folders, and the other settings, in this part I will explain the
Hadoop configuration step by step for each machine. Since I placed Hadoop inside
/usr/local/hadoop, that will be the working directory, and all of the Hadoop configuration files are
located in its conf directory.
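One file not covered separately below is conf/hadoop-env.sh, where Hadoop normally needs JAVA_HOME to be set explicitly; a minimal sketch, assuming the JDK path chosen in section 1.1.1:
# conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_21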
1.3.1 conf/masters
Since this configuration is for a single node, the content of the masters file should be left at its default, which is
localhost
Note : For a multi-node cluster we can add the master's alias name according to the network setup and
configuration in section 1.2.1
1.3.2 conf/slaves
As with the masters file, the default content of the slaves file should be
localhost
Note : for a multi-node cluster, we can add the master and slave aliases (following the network
configuration in section 1.2.1); for example, the slaves file might look like this
slave1
slave2
slave3
1.3.3 conf/core-site.xml
For the core site configuration I use the directory /tmp/hadoop/app as the base temporary directory. In this file, I give
the core configuration of the cluster. Initially, the file looks like the following
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
</configuration>
Then I change the configuration to look like this
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/tmp/hadoop/app</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>
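Because hadoop.tmp.dir points to /tmp/hadoop/app, it can help to create this directory up front and give our user ownership of it (an optional preparation step; the path is the one configured above):
$ sudo mkdir -p /tmp/hadoop/app
$ sudo chown tid:hadoop /tmp/hadoop/app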
1.3.4 conf/mapred-site.xml
I change mapred-site.xml so that it looks like the following
<configuration>
<!-- In: conf/mapred-site.xml -->
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>
The host and port that
the MapReduce job tracker runs
at. If "local", then jobs are run
in-process as a single map
and reduce task.
</description>
</property>
</configuration>
1.3.5 conf/hdfs-site.xml
For hdfs-site, since this is a single-node cluster, I set the replication factor to only 1; if we have several
machines we can increase the replication factor.
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can
be specified when the file is created.
The default is used if replication
is not specified in create time.
</description>
</property>
</configuration>
1.3.6 Formatting HDFS via Namenode
Before starting the cluster, we have to format the Hadoop Distributed File System. It can be formatted more than
once; however, formatting the namenode erases all of the data in HDFS, so we need to be careful,
otherwise we can lose our data
tid@dbubuntu:~$ /usr/local/hadoop/bin/hadoop namenode -format
1.4 Running Hadoop
After setting all of the needed configuration, we can finally start Hadoop. For running the Hadoop daemons,
there are several alternatives.
1.4.1 Starting all daemons at once
To start all of the Hadoop services, I use the following command
tid@dbubuntu:/usr/local/hadoop$ bin/start-all.sh
Then, I check whether all of the daemons are already running using the following command
tid@dbubuntu:/usr/local/hadoop$ jps
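On a healthy single-node setup, the jps listing should show the five Hadoop 1.x daemons together with Jps itself; the process IDs below are placeholders and will differ on every machine:
12011 NameNode
12121 DataNode
12243 SecondaryNameNode
12324 JobTracker
12465 TaskTracker
12587 Jps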
1.4.2 Stopping all daemons
To stop all of the daemons, we can use the following command
tid@dbubuntu:/usr/local/hadoop$ bin/stop-all.sh
2 Hadoop MapReduce Program (KMeans Clustering in MapReduce)
In this part I will explain how I run a MapReduce program, based on the reference from Thomas's blog post
on KMeans clustering in MapReduce (http://codingwiththomas.blogspot.kr/2011/05/k-means-clustering-with-mapreduce.html),
with some modifications.
2.1 Eclipse IDE Setup
There are several ways to set up the IDE environment so that we can easily create MapReduce
programs in Eclipse. For ease of development and setup, I use a self-built Eclipse plugin, created
by building a jar from the Hadoop libraries. The following are the steps for setting up the Eclipse IDE.
Step 1: copy the pre-built Eclipse plugin for Hadoop into the plugins directory of Eclipse
Figure 1 Eclipse hadoop plugin inside plugins directory of eclipse
Step 2: restart Eclipse; then, in the perspective switcher in the upper right corner, we will see an additional
perspective, where we choose the MapReduce perspective
Figure 2 MapReduce perspective in eclipse IDE
Step 3: add the server in the MapReduce locations panel.
In my case, because my server runs on the local machine and is named localhost, the details look
like the following
Figure 3 Configuration for adding Hadoop Location
After adding the server, on the right side, along with the Project Explorer, we will see the HDFS file
explorer, and Eclipse is now ready for developing MapReduce applications
Figure 4 HDFS Directory explorer
2.2 Source Code Preparation
In this part, I will describe how to prepare the project, packages, and classes for the KMeans clustering
MapReduce program.
Create new project
1. First, we need to create a new MapReduce project by clicking New Project in the upper left
corner of Eclipse; when the following window pops up, choose MapReduce Project
Figure 5 Choosing MapReduce project
2. Fill in the project name and check the referenced Hadoop location. By default, if we are using
the Eclipse plugin for Hadoop, the folder will point to our Hadoop installation folder; then
click "Finish"
Create Package and Class
For the KMeans clustering MapReduce program, based on Thomas's reference, we need to create two
packages: one package for the clustering model, consisting of the model classes for the vector, the distance
measure, and the cluster center (Vector.java, DistanceMeasurer.java, and ClusterCenter.java), and the
other package for the Main, Mapper, and Reducer classes (KMeansClusteringJob.java, KMeansMapper.java,
and KMeansReducer.java). The resulting source tree is sketched below.
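As a result, the project source tree will contain the following packages and classes (the same names are used in the listings that follow):
src/
  com/clustering/model/
    Vector.java
    DistanceMeasurer.java
    ClusterCenter.java
  com/clustering/mapreduce/
    KMeansMapper.java
    KMeansReducer.java
    KMeansClusteringJob.java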
1. Model class in the com.clustering.model package
Model Class : Vector.java
package com.clustering.model;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.io.WritableComparable;
public class Vector implements WritableComparable<Vector>{
private double[] vector;
public Vector(){
super();
}
public Vector(Vector v){
super();
int l= v.vector.length;
this.vector= new double[l];
System.arraycopy(v.vector, 0,this.vector, 0, l);
}
public Vector(double x, double y){
super();
this.vector = new double []{x,y};
}
@Override
public void readFields(DataInput in) throws IOException {
// TODO Auto-generated method stub
int size = in.readInt();
vector = new double[size];
for(int i=0;i<size;i++)
vector[i]=in.readDouble();
}
@Override
public void write(DataOutput out) throws IOException {
// TODO Auto-generated method stub
out.writeInt(vector.length);
for(int i=0;i<vector.length;i++)
out.writeDouble(vector[i]);
}
@Override
public int compareTo(Vector o) {
// TODO Auto-generated method stub
boolean equals = true;
for (int i=0;i<vector.length;i++){
if (vector[i] != o.vector[i]) {
equals = false;
break;
}
}
if(equals)
return 0;
else
return 1;
}
public double[] getVector(){
return vector;
}
public void setVector(double[]vector){
this.vector=vector;
}
public String toString(){
return "Vector [vector=" + Arrays.toString(vector) + "]";
}
}
2. Distance Measurement Class
Distance Measurement class : DistanceMeasurer.java
package com.clustering.model;
public class DistanceMeasurer {
public static final double measureDistance(ClusterCenter center,
Vector v) {
double sum = 0;
int length = v.getVector().length;
for (int i = 0; i < length; i++) {
sum += Math.abs(center.getCenter().getVector()[i]
- v.getVector()[i]);
}
return sum;
}
}
3. ClusterCenter
Cluster Center definition : ClusterCenter.java
package com.clustering.model;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;
public class ClusterCenter implements WritableComparable<ClusterCenter> {
private Vector center;
public ClusterCenter() {
super();
this.center = null;
}
public ClusterCenter(ClusterCenter center) {
super();
this.center = new Vector(center.center);
}
public ClusterCenter(Vector center) {
super();
this.center = center;
}
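// Note: despite its name, converged() returns true when the given center differs from
// this one, i.e. when the center has NOT yet converged. The reducer uses it to count
// centers that still moved, and the driver keeps iterating while that counter is > 0.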
public boolean converged(ClusterCenter c) {
return compareTo(c) == 0 ? false : true;
}
@Override
public void write(DataOutput out) throws IOException {
center.write(out);
}
@Override
public void readFields(DataInput in) throws IOException {
this.center = new Vector();
center.readFields(in);
}
@Override
public int compareTo(ClusterCenter o) {
return center.compareTo(o.getCenter());
}
/**
* @return the center
*/
public Vector getCenter() {
return center;
}
@Override
public String toString() {
return "ClusterCenter [center=" + center + "]";
}
}
After configuring the class model, the next one is MapReduce Classes, which consist of Mapper,
Reducer, and finally the Main class
1. Mapper class
Mapper class : KMeansMapper.java
package com.clustering.mapreduce;
import java.io.IOException;
import java.util.LinkedList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.mapreduce.Mapper;
import com.clustering.model.ClusterCenter;
import com.clustering.model.DistanceMeasurer;
import com.clustering.model.Vector;
public class KMeansMapper extends
Mapper<ClusterCenter, Vector, ClusterCenter, Vector>{
List<ClusterCenter> centers = new LinkedList<ClusterCenter>();
@Override
protected void setup(Context context) throws IOException,
InterruptedException {
super.setup(context);
Configuration conf = context.getConfiguration();
Path centroids = new Path(conf.get("centroid.path"));
FileSystem fs = FileSystem.get(conf);
SequenceFile.Reader reader = new SequenceFile.Reader(fs,
centroids,
conf);
ClusterCenter key = new ClusterCenter();
IntWritable value = new IntWritable();
while (reader.next(key, value)) {
centers.add(new ClusterCenter(key));
}
reader.close();
}
@Override
protected void map(ClusterCenter key, Vector value, Context context)
throws IOException, InterruptedException {
ClusterCenter nearest = null;
double nearestDistance = Double.MAX_VALUE;
for (ClusterCenter c : centers) {
double dist = DistanceMeasurer.measureDistance(c,
value);
if (nearest == null) {
nearest = c;
nearestDistance = dist;
} else {
if (nearestDistance > dist) {
nearest = c;
nearestDistance = dist;
}
}
}
context.write(nearest, value);
}
}
2. Reducer class
Reducer class : KMeansReducer.java
package com.clustering.mapreduce;
import java.io.IOException;
import java.util.LinkedList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.mapreduce.Reducer;
import com.clustering.model.ClusterCenter;
import com.clustering.model.Vector;
public class KMeansReducer extends
Reducer<ClusterCenter, Vector, ClusterCenter, Vector>{
public static enum Counter{
CONVERGED
}
List<ClusterCenter> centers = new LinkedList<ClusterCenter>();
protected void reduce(ClusterCenter key, Iterable<Vector> values,
Context context) throws IOException,
InterruptedException{
Vector newCenter = new Vector();
List<Vector> vectorList = new LinkedList<Vector>();
int vectorSize = key.getCenter().getVector().length;
newCenter.setVector(new double[vectorSize]);
for(Vector value :values){
vectorList.add(new Vector(value));
for(int i=0;i<value.getVector().length;i++){
newCenter.getVector()[i]+=value.getVector()[i];
}
}
for(int i=0;i<newCenter.getVector().length;i++){
newCenter.getVector()[i] =
newCenter.getVector()[i]/vectorList.size();
}
ClusterCenter center = new ClusterCenter(newCenter);
centers.add(center);
for(Vector vector:vectorList){
context.write(center, vector);
}
if(center.converged(key))
context.getCounter(Counter.CONVERGED).increment(1);
}
protected void cleanup(Context context) throws
IOException,InterruptedException{
super.cleanup(context);
Configuration conf = context.getConfiguration();
Path outPath = new Path(conf.get("centroid.path"));
FileSystem fs = FileSystem.get(conf);
fs.delete(outPath,true);
final SequenceFile.Writer out = SequenceFile.createWriter(fs,
context.getConfiguration(),
outPath, ClusterCenter.class, IntWritable.class);
final IntWritable value = new IntWritable(0);
for(ClusterCenter center:centers){
out.append(center, value);
}
out.close();
}
}
3. Main class
Main class : KMeansClusteringJob.java
package com.clustering.mapreduce;
import java.io.IOException;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import com.clustering.model.ClusterCenter;
import com.clustering.model.Vector;
public class KMeansClusteringJob {
private static final Log LOG =
LogFactory.getLog(KMeansClusteringJob.class);
public static void main(String[] args) throws IOException,
InterruptedException, ClassNotFoundException {
int iteration = 1;
Configuration conf = new Configuration();
conf.set("num.iteration", iteration + "");
Path in = new Path("files/clustering/import/data");
Path center = new Path("files/clustering/import/center/cen.seq");
conf.set("centroid.path", center.toString());
Path out = new Path("files/clustering/depth_1");
Job job = new Job(conf);
job.setJobName("KMeans Clustering");
job.setMapperClass(KMeansMapper.class);
job.setReducerClass(KMeansReducer.class);
job.setJarByClass(KMeansMapper.class);
SequenceFileInputFormat.addInputPath(job, in);
FileSystem fs = FileSystem.get(conf);
if (fs.exists(out))
fs.delete(out, true);
if (fs.exists(center))
fs.delete(center, true);
if (fs.exists(in))
fs.delete(in, true);
final SequenceFile.Writer centerWriter = SequenceFile.createWriter(fs,
conf, center, ClusterCenter.class, IntWritable.class);
final IntWritable value = new IntWritable(0);
centerWriter.append(new ClusterCenter(new Vector(1, 1)), value);
centerWriter.append(new ClusterCenter(new Vector(5, 5)), value);
centerWriter.close();
final SequenceFile.Writer dataWriter = SequenceFile.createWriter(fs,
conf, in, ClusterCenter.class, Vector.class);
dataWriter.append(new ClusterCenter(new Vector(0, 0)), new Vector(1, 2));
dataWriter.append(new ClusterCenter(new Vector(0, 0)), new Vector(16, 3));
dataWriter.append(new ClusterCenter(new Vector(0, 0)), new Vector(3, 3));
dataWriter.append(new ClusterCenter(new Vector(0, 0)), new Vector(2, 2));
dataWriter.append(new ClusterCenter(new Vector(0, 0)), new Vector(2, 3));
dataWriter.append(new ClusterCenter(new Vector(0, 0)), new Vector(25, 1));
dataWriter.append(new ClusterCenter(new Vector(0, 0)), new Vector(7, 6));
dataWriter.append(new ClusterCenter(new Vector(0, 0)), new Vector(6, 5));
dataWriter.append(new ClusterCenter(new Vector(0, 0)), new Vector(-1, -23));
dataWriter.close();
SequenceFileOutputFormat.setOutputPath(job, out);
job.setInputFormatClass(SequenceFileInputFormat.class);
job.setOutputFormatClass(SequenceFileOutputFormat.class);
job.setOutputKeyClass(ClusterCenter.class);
job.setOutputValueClass(Vector.class);
job.waitForCompletion(true);
long counter = job.getCounters()
.findCounter(KMeansReducer.Counter.CONVERGED).getValue();
iteration++;
while (counter > 0) {
conf = new Configuration();
conf.set("centroid.path", center.toString());
conf.set("num.iteration", iteration + "");
job = new Job(conf);
job.setJobName("KMeans Clustering " + iteration);
job.setMapperClass(KMeansMapper.class);
job.setReducerClass(KMeansReducer.class);
job.setJarByClass(KMeansMapper.class);
in = new Path("files/clustering/depth_" +
(iteration - 1) + "/");
out = new Path("files/clustering/depth_" +
iteration);
SequenceFileInputFormat.addInputPath(job, in);
if (fs.exists(out))
fs.delete(out, true);
SequenceFileOutputFormat.setOutputPath(job, out);
job.setInputFormatClass(SequenceFileInputFormat.class);
job.setOutputFormatClass(SequenceFileOutputFormat.class);
job.setOutputKeyClass(ClusterCenter.class);
job.setOutputValueClass(Vector.class);
job.waitForCompletion(true);
iteration++;
counter = job.getCounters()
.findCounter(KMeansReducer.Counter.CONVERGED).getValue();
}
Path result = new Path("files/clustering/depth_" +
(iteration - 1) + "/");
FileStatus[] stati = fs.listStatus(result);
for (FileStatus status : stati) {
if (!status.isDir() &&
!status.getPath().toString().contains("/_")) {
Path path = status.getPath();
LOG.info("FOUND " + path.toString());
SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
ClusterCenter key = new ClusterCenter();
Vector v = new Vector();
while (reader.next(key, v)) {
LOG.info(key + " / " + v);
}
reader.close();
}
}
}
}
The final project listing will look like this
Figure 6 File Listing for KMeansMapReduce Program
2.3 Run the program
Unlike the wordcount program, where we have to prepare the input files ourselves, in the KMeans clustering
program the input is defined inside the KMeansClusteringJob class.
To run the KMeans clustering job, since we have already configured Eclipse, we can run the
program directly inside Eclipse: by selecting the main class (KMeansClusteringJob.java) we can
run the project as a Hadoop application
Figure 7 Run Project as hadoop application
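Alternatively, the job can also be submitted from the command line once the project has been exported as a jar; a minimal sketch, assuming the jar has been exported from Eclipse as kmeans.jar (a name chosen here only for illustration):
$ cd /usr/local/hadoop
$ bin/hadoop jar ~/kmeans.jar com.clustering.mapreduce.KMeansClusteringJob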
The input (as defined in the KMeansClusteringJob class)
Vector [vector=[16.0, 3.0]]
Vector [vector=[7.0, 6.0]]
Vector [vector=[6.0, 5.0]]
Vector [vector=[25.0, 1.0]]
Vector [vector=[1.0, 2.0]]
Vector [vector=[3.0, 3.0]]
Vector [vector=[2.0, 2.0]]
Vector [vector=[2.0, 3.0]]
Vector [vector=[-1.0, -23.0]]
Output from Thomas’s Blog :
ClusterCenter [center=Vector [vector=[13.5, 3.75]]] / Vector [vector=[16.0, 3.0]]
ClusterCenter [center=Vector [vector=[13.5, 3.75]]] / Vector [vector=[7.0, 6.0]]
ClusterCenter [center=Vector [vector=[13.5, 3.75]]] / Vector [vector=[6.0, 5.0]]
ClusterCenter [center=Vector [vector=[13.5, 3.75]]] / Vector [vector=[25.0, 1.0]]
ClusterCenter [center=Vector [vector=[1.4, -2.6]]] / Vector [vector=[1.0, 2.0]]
ClusterCenter [center=Vector [vector=[1.4, -2.6]]] / Vector [vector=[3.0, 3.0]]
ClusterCenter [center=Vector [vector=[1.4, -2.6]]] / Vector [vector=[2.0, 2.0]]
ClusterCenter [center=Vector [vector=[1.4, -2.6]]] / Vector [vector=[2.0, 3.0]]
ClusterCenter [center=Vector [vector=[1.4, -2.6]]] / Vector [vector=[-1.0, -23.0]]
Output of my KMeansClusteringJob :
file:/home/tid/eclipse/workspace/MRClustering/files/clustering/depth_3/part-r-00000
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector
[vector=[13.5, 3.75]]] / Vector [vector=[16.0, 3.0]]
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector
[vector=[13.5, 3.75]]] / Vector [vector=[7.0, 6.0]]
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector
[vector=[13.5, 3.75]]] / Vector [vector=[6.0, 5.0]]
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector
[vector=[13.5, 3.75]]] / Vector [vector=[25.0, 1.0]]
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector
[vector=[1.4, -2.6]]] / Vector [vector=[1.0, 2.0]]
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector
[vector=[1.4, -2.6]]] / Vector [vector=[3.0, 3.0]]
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector
[vector=[1.4, -2.6]]] / Vector [vector=[2.0, 2.0]]
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector
[vector=[1.4, -2.6]]] / Vector [vector=[2.0, 3.0]]
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector
[vector=[1.4, -2.6]]] / Vector [vector=[-1.0, -23.0]]
Complete Log Result of My KMeansClusteringJob
14/04/08 15:50:35 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your
platform... using builtin-java classes where applicable
14/04/08 15:50:35 INFO compress.CodecPool: Got brand-new compressor
14/04/08 15:50:35 WARN mapred.JobClient: Use GenericOptionsParser for parsing the
arguments. Applications should implement Tool for the same.
14/04/08 15:50:35 WARN mapred.JobClient: No job jar file set. User classes may not be
found. See JobConf(Class) or JobConf#setJar(String).
14/04/08 15:50:35 INFO input.FileInputFormat: Total input paths to process : 1
14/04/08 15:50:35 INFO mapred.JobClient: Running job: job_local1343624176_0001
14/04/08 15:50:35 INFO mapred.LocalJobRunner: Waiting for map tasks
14/04/08 15:50:35 INFO mapred.LocalJobRunner: Starting task:
attempt_local1343624176_0001_m_000000_0
14/04/08 15:50:36 INFO util.ProcessTree: setsid exited with exit code 0
14/04/08 15:50:36 INFO mapred.Task: Using ResourceCalculatorPlugin :
org.apache.hadoop.util.LinuxResourceCalculatorPlugin@24bec229
14/04/08 15:50:36 INFO mapred.MapTask: Processing split:
file:/home/tid/eclipse/workspace/MRClustering/files/clustering/import/data:0+558
14/04/08 15:50:36 INFO mapred.MapTask: io.sort.mb = 100
14/04/08 15:50:36 INFO mapred.MapTask: data buffer = 79691776/99614720
14/04/08 15:50:36 INFO mapred.MapTask: record buffer = 262144/327680
14/04/08 15:50:36 INFO compress.CodecPool: Got brand-new decompressor
14/04/08 15:50:36 INFO compress.CodecPool: Got brand-new decompressor
14/04/08 15:50:36 INFO mapred.MapTask: Starting flush of map output
14/04/08 15:50:36 INFO mapred.MapTask: Finished spill 0
14/04/08 15:50:36 INFO mapred.Task: Task:attempt_local1343624176_0001_m_000000_0 is done.
And is in the process of commiting
14/04/08 15:50:36 INFO mapred.LocalJobRunner:
14/04/08 15:50:36 INFO mapred.Task: Task 'attempt_local1343624176_0001_m_000000_0' done.
14/04/08 15:50:36 INFO mapred.LocalJobRunner: Finishing task:
attempt_local1343624176_0001_m_000000_0
14/04/08 15:50:36 INFO mapred.LocalJobRunner: Map task executor complete.
14/04/08 15:50:36 INFO mapred.Task: Using ResourceCalculatorPlugin :
org.apache.hadoop.util.LinuxResourceCalculatorPlugin@7b4b3d0e
14/04/08 15:50:36 INFO mapred.LocalJobRunner:
14/04/08 15:50:36 INFO mapred.Merger: Merging 1 sorted segments
14/04/08 15:50:36 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of
total size: 380 bytes
14/04/08 15:50:36 INFO mapred.LocalJobRunner:
14/04/08 15:50:36 INFO mapred.Task: Task:attempt_local1343624176_0001_r_000000_0 is done.
And is in the process of commiting
14/04/08 15:50:36 INFO mapred.LocalJobRunner:
14/04/08 15:50:36 INFO mapred.Task: Task attempt_local1343624176_0001_r_000000_0 is allowed
to commit now
14/04/08 15:50:36 INFO output.FileOutputCommitter: Saved output of task
'attempt_local1343624176_0001_r_000000_0' to files/clustering/depth_1
14/04/08 15:50:36 INFO mapred.LocalJobRunner: reduce > reduce
14/04/08 15:50:36 INFO mapred.Task: Task 'attempt_local1343624176_0001_r_000000_0' done.
14/04/08 15:50:36 INFO mapred.JobClient: map 100% reduce 100%
14/04/08 15:50:36 INFO mapred.JobClient: Job complete: job_local1343624176_0001
14/04/08 15:50:36 INFO mapred.JobClient: Counters: 21
14/04/08 15:50:36 INFO mapred.JobClient: File Output Format Counters
14/04/08 15:50:36 INFO mapred.JobClient: Bytes Written=537
14/04/08 15:50:36 INFO mapred.JobClient: File Input Format Counters
14/04/08 15:50:36 INFO mapred.JobClient: Bytes Read=574
14/04/08 15:50:36 INFO mapred.JobClient: FileSystemCounters
14/04/08 15:50:36 INFO mapred.JobClient: FILE_BYTES_READ=2380
14/04/08 15:50:36 INFO mapred.JobClient: FILE_BYTES_WRITTEN=106876
14/04/08 15:50:36 INFO mapred.JobClient: com.clustering.mapreduce.KMeansReducer$Counter
14/04/08 15:50:36 INFO mapred.JobClient: CONVERGED=2
14/04/08 15:50:36 INFO mapred.JobClient: Map-Reduce Framework
14/04/08 15:50:36 INFO mapred.JobClient: Reduce input groups=2
14/04/08 15:50:36 INFO mapred.JobClient: Map output materialized bytes=384
14/04/08 15:50:36 INFO mapred.JobClient: Combine output records=0
14/04/08 15:50:36 INFO mapred.JobClient: Map input records=9
14/04/08 15:50:36 INFO mapred.JobClient: Reduce shuffle bytes=0
14/04/08 15:50:36 INFO mapred.JobClient: Physical memory (bytes) snapshot=0
14/04/08 15:50:36 INFO mapred.JobClient: Reduce output records=9
14/04/08 15:50:36 INFO mapred.JobClient: Spilled Records=18
14/04/08 15:50:36 INFO mapred.JobClient: Map output bytes=360
14/04/08 15:50:36 INFO mapred.JobClient: Total committed heap usage (bytes)=355991552
14/04/08 15:50:36 INFO mapred.JobClient: CPU time spent (ms)=0
14/04/08 15:50:36 INFO mapred.JobClient: Virtual memory (bytes) snapshot=0
14/04/08 15:50:36 INFO mapred.JobClient: SPLIT_RAW_BYTES=139
14/04/08 15:50:36 INFO mapred.JobClient: Map output records=9
14/04/08 15:50:36 INFO mapred.JobClient: Combine input records=0
14/04/08 15:50:36 INFO mapred.JobClient: Reduce input records=9
14/04/08 15:50:36 WARN mapred.JobClient: Use GenericOptionsParser for parsing the
arguments. Applications should implement Tool for the same.
14/04/08 15:50:36 WARN mapred.JobClient: No job jar file set. User classes may not be
found. See JobConf(Class) or JobConf#setJar(String).
14/04/08 15:50:36 INFO input.FileInputFormat: Total input paths to process : 1
14/04/08 15:50:37 INFO mapred.JobClient: Running job: job_local1426850290_0002
14/04/08 15:50:37 INFO mapred.LocalJobRunner: Waiting for map tasks
14/04/08 15:50:37 INFO mapred.LocalJobRunner: Starting task:
attempt_local1426850290_0002_m_000000_0
14/04/08 15:50:37 INFO mapred.Task: Using ResourceCalculatorPlugin :
org.apache.hadoop.util.LinuxResourceCalculatorPlugin@26adcd34
14/04/08 15:50:37 INFO mapred.MapTask: Processing split:
file:/home/tid/eclipse/workspace/MRClustering/files/clustering/depth_1/part-r-00000:0+521
14/04/08 15:50:37 INFO mapred.MapTask: io.sort.mb = 100
14/04/08 15:50:37 INFO mapred.MapTask: data buffer = 79691776/99614720
14/04/08 15:50:37 INFO mapred.MapTask: record buffer = 262144/327680
14/04/08 15:50:37 INFO mapred.MapTask: Starting flush of map output
14/04/08 15:50:37 INFO mapred.MapTask: Finished spill 0
14/04/08 15:50:37 INFO mapred.Task: Task:attempt_local1426850290_0002_m_000000_0 is done.
And is in the process of commiting
14/04/08 15:50:37 INFO mapred.LocalJobRunner:
14/04/08 15:50:37 INFO mapred.Task: Task 'attempt_local1426850290_0002_m_000000_0' done.
14/04/08 15:50:37 INFO mapred.LocalJobRunner: Finishing task:
attempt_local1426850290_0002_m_000000_0
14/04/08 15:50:37 INFO mapred.LocalJobRunner: Map task executor complete.
14/04/08 15:50:37 INFO mapred.Task: Using ResourceCalculatorPlugin :
org.apache.hadoop.util.LinuxResourceCalculatorPlugin@3740f768
14/04/08 15:50:37 INFO mapred.LocalJobRunner:
14/04/08 15:50:37 INFO mapred.Merger: Merging 1 sorted segments
14/04/08 15:50:37 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of
total size: 380 bytes
14/04/08 15:50:37 INFO mapred.LocalJobRunner:
14/04/08 15:50:37 INFO mapred.Task: Task:attempt_local1426850290_0002_r_000000_0 is done.
And is in the process of commiting
14/04/08 15:50:37 INFO mapred.LocalJobRunner:
14/04/08 15:50:37 INFO mapred.Task: Task attempt_local1426850290_0002_r_000000_0 is allowed
to commit now
14/04/08 15:50:37 INFO output.FileOutputCommitter: Saved output of task
'attempt_local1426850290_0002_r_000000_0' to files/clustering/depth_2
14/04/08 15:50:37 INFO mapred.LocalJobRunner: reduce > reduce
14/04/08 15:50:37 INFO mapred.Task: Task 'attempt_local1426850290_0002_r_000000_0' done.
14/04/08 15:50:38 INFO mapred.JobClient: map 100% reduce 100%
14/04/08 15:50:38 INFO mapred.JobClient: Job complete: job_local1426850290_0002
14/04/08 15:50:38 INFO mapred.JobClient: Counters: 21
14/04/08 15:50:38 INFO mapred.JobClient: File Output Format Counters
14/04/08 15:50:38 INFO mapred.JobClient: Bytes Written=537
14/04/08 15:50:38 INFO mapred.JobClient: File Input Format Counters
14/04/08 15:50:38 INFO mapred.JobClient: Bytes Read=537
14/04/08 15:50:38 INFO mapred.JobClient: FileSystemCounters
14/04/08 15:50:38 INFO mapred.JobClient: FILE_BYTES_READ=5088
14/04/08 15:50:38 INFO mapred.JobClient: FILE_BYTES_WRITTEN=212938
14/04/08 15:50:38 INFO mapred.JobClient: com.clustering.mapreduce.KMeansReducer$Counter
14/04/08 15:50:38 INFO mapred.JobClient: CONVERGED=2
14/04/08 15:50:38 INFO mapred.JobClient: Map-Reduce Framework
14/04/08 15:50:38 INFO mapred.JobClient: Reduce input groups=2
14/04/08 15:50:38 INFO mapred.JobClient: Map output materialized bytes=384
14/04/08 15:50:38 INFO mapred.JobClient: Combine output records=0
14/04/08 15:50:38 INFO mapred.JobClient: Map input records=9
14/04/08 15:50:38 INFO mapred.JobClient: Reduce shuffle bytes=0
14/04/08 15:50:38 INFO mapred.JobClient: Physical memory (bytes) snapshot=0
14/04/08 15:50:38 INFO mapred.JobClient: Reduce output records=9
14/04/08 15:50:38 INFO mapred.JobClient: Spilled Records=18
14/04/08 15:50:38 INFO mapred.JobClient: Map output bytes=360
14/04/08 15:50:38 INFO mapred.JobClient: Total committed heap usage (bytes)=555352064
14/04/08 15:50:38 INFO mapred.JobClient: CPU time spent (ms)=0
14/04/08 15:50:38 INFO mapred.JobClient: Virtual memory (bytes) snapshot=0
14/04/08 15:50:38 INFO mapred.JobClient: SPLIT_RAW_BYTES=148
14/04/08 15:50:38 INFO mapred.JobClient: Map output records=9
14/04/08 15:50:38 INFO mapred.JobClient: Combine input records=0
14/04/08 15:50:38 INFO mapred.JobClient: Reduce input records=9
14/04/08 15:50:38 WARN mapred.JobClient: Use GenericOptionsParser for parsing the
arguments. Applications should implement Tool for the same.
14/04/08 15:50:38 WARN mapred.JobClient: No job jar file set. User classes may not be
found. See JobConf(Class) or JobConf#setJar(String).
14/04/08 15:50:38 INFO input.FileInputFormat: Total input paths to process : 1
14/04/08 15:50:38 INFO mapred.JobClient: Running job: job_local466621791_0003
14/04/08 15:50:38 INFO mapred.LocalJobRunner: Waiting for map tasks
14/04/08 15:50:38 INFO mapred.LocalJobRunner: Starting task:
attempt_local466621791_0003_m_000000_0
14/04/08 15:50:38 INFO mapred.Task: Using ResourceCalculatorPlugin :
org.apache.hadoop.util.LinuxResourceCalculatorPlugin@373fdd1a
14/04/08 15:50:38 INFO mapred.MapTask: Processing split:
file:/home/tid/eclipse/workspace/MRClustering/files/clustering/depth_2/part-r-00000:0+521
14/04/08 15:50:38 INFO mapred.MapTask: io.sort.mb = 100
14/04/08 15:50:38 INFO mapred.MapTask: data buffer = 79691776/99614720
14/04/08 15:50:38 INFO mapred.MapTask: record buffer = 262144/327680
14/04/08 15:50:38 INFO mapred.MapTask: Starting flush of map output
14/04/08 15:50:38 INFO mapred.MapTask: Finished spill 0
14/04/08 15:50:38 INFO mapred.Task: Task:attempt_local466621791_0003_m_000000_0 is done.
And is in the process of commiting
14/04/08 15:50:38 INFO mapred.LocalJobRunner:
14/04/08 15:50:38 INFO mapred.Task: Task 'attempt_local466621791_0003_m_000000_0' done.
14/04/08 15:50:38 INFO mapred.LocalJobRunner: Finishing task:
attempt_local466621791_0003_m_000000_0
14/04/08 15:50:38 INFO mapred.LocalJobRunner: Map task executor complete.
14/04/08 15:50:38 INFO mapred.Task: Using ResourceCalculatorPlugin :
org.apache.hadoop.util.LinuxResourceCalculatorPlugin@78d3cdb9
14/04/08 15:50:38 INFO mapred.LocalJobRunner:
14/04/08 15:50:38 INFO mapred.Merger: Merging 1 sorted segments
14/04/08 15:50:38 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of
total size: 380 bytes
14/04/08 15:50:38 INFO mapred.LocalJobRunner:
14/04/08 15:50:38 INFO mapred.Task: Task:attempt_local466621791_0003_r_000000_0 is done.
And is in the process of commiting
14/04/08 15:50:38 INFO mapred.LocalJobRunner:
14/04/08 15:50:38 INFO mapred.Task: Task attempt_local466621791_0003_r_000000_0 is allowed
to commit now
14/04/08 15:50:38 INFO output.FileOutputCommitter: Saved output of task
'attempt_local466621791_0003_r_000000_0' to files/clustering/depth_3
14/04/08 15:50:38 INFO mapred.LocalJobRunner: reduce > reduce
14/04/08 15:50:38 INFO mapred.Task: Task 'attempt_local466621791_0003_r_000000_0' done.
14/04/08 15:50:39 INFO mapred.JobClient: map 100% reduce 100%
14/04/08 15:50:39 INFO mapred.JobClient: Job complete: job_local466621791_0003
14/04/08 15:50:39 INFO mapred.JobClient: Counters: 20
14/04/08 15:50:39 INFO mapred.JobClient: File Output Format Counters
14/04/08 15:50:39 INFO mapred.JobClient: Bytes Written=537
14/04/08 15:50:39 INFO mapred.JobClient: File Input Format Counters
14/04/08 15:50:39 INFO mapred.JobClient: Bytes Read=537
14/04/08 15:50:39 INFO mapred.JobClient: FileSystemCounters
14/04/08 15:50:39 INFO mapred.JobClient: FILE_BYTES_READ=7796
14/04/08 15:50:39 INFO mapred.JobClient: FILE_BYTES_WRITTEN=318992
14/04/08 15:50:39 INFO mapred.JobClient: Map-Reduce Framework
14/04/08 15:50:39 INFO mapred.JobClient: Reduce input groups=2
14/04/08 15:50:39 INFO mapred.JobClient: Map output materialized bytes=384
14/04/08 15:50:39 INFO mapred.JobClient: Combine output records=0
14/04/08 15:50:39 INFO mapred.JobClient: Map input records=9
14/04/08 15:50:39 INFO mapred.JobClient: Reduce shuffle bytes=0
14/04/08 15:50:39 INFO mapred.JobClient: Physical memory (bytes) snapshot=0
14/04/08 15:50:39 INFO mapred.JobClient: Reduce output records=9
14/04/08 15:50:39 INFO mapred.JobClient: Spilled Records=18
14/04/08 15:50:39 INFO mapred.JobClient: Map output bytes=360
14/04/08 15:50:39 INFO mapred.JobClient: Total committed heap usage (bytes)=754712576
14/04/08 15:50:39 INFO mapred.JobClient: CPU time spent (ms)=0
14/04/08 15:50:39 INFO mapred.JobClient: Virtual memory (bytes) snapshot=0
14/04/08 15:50:39 INFO mapred.JobClient: SPLIT_RAW_BYTES=148
14/04/08 15:50:39 INFO mapred.JobClient: Map output records=9
14/04/08 15:50:39 INFO mapred.JobClient: Combine input records=0
14/04/08 15:50:39 INFO mapred.JobClient: Reduce input records=9
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: FOUND
file:/home/tid/eclipse/workspace/MRClustering/files/clustering/depth_3/part-r-00000
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector
[vector=[13.5, 3.75]]] / Vector [vector=[16.0, 3.0]]
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector
[vector=[13.5, 3.75]]] / Vector [vector=[7.0, 6.0]]
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector
[vector=[13.5, 3.75]]] / Vector [vector=[6.0, 5.0]]
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector
[vector=[13.5, 3.75]]] / Vector [vector=[25.0, 1.0]]
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector
[vector=[1.4, -2.6]]] / Vector [vector=[1.0, 2.0]]
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector
[vector=[1.4, -2.6]]] / Vector [vector=[3.0, 3.0]]
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector
[vector=[1.4, -2.6]]] / Vector [vector=[2.0, 2.0]]
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector
[vector=[1.4, -2.6]]] / Vector [vector=[2.0, 3.0]]
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector
[vector=[1.4, -2.6]]] / Vector [vector=[-1.0, -23.0]]
References
- Hadoop installation on a single-node cluster - http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
- KMeans clustering with MapReduce - http://codingwiththomas.blogspot.kr/2011/05/k-means-clustering-with-mapreduce.html
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
ThomasParaiso2
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
Peter Spielvogel
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 

Recently uploaded (20)

Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 

and changing folder permissions.

1.2.1 Network configuration
In the Hadoop network configuration, all of the machines should have an alias instead of a bare IP address. To configure the aliases, we can edit /etc/hosts on every machine that we will use as Hadoop master or slave. In my case, since I am using a single node, the configuration can be done with these steps
1. Open the file /etc/hosts as sudoers,
$ sudo nano /etc/hosts
2. I add the following lines; the second entry is my own IP address,
127.0.0.1 localhost
164.125.50.127 localhost
Note: For configuring the local computer into pseudo-distributed mode, the following configuration for /etc/hosts is used
# /etc/hosts (for Hadoop Master and Slave)
192.168.0.1 master
192.168.0.2 slave

1.2.2 User configuration
For security reasons, it is better to create a dedicated user for Hadoop on each machine; however, since I am working on my local computer, I also keep using the existing user. The following commands add the new user and group
$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser

1.2.3 Configuring SSH
Hadoop requires SSH access to manage its nodes. In this configuration, I configure SSH access to localhost both for the hduser created in the previous section and for the local existing user.
1. To generate an ssh key for hduser, we can use the following command
$ su - hduser
Password:
hduser@dbubuntu:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
/home/hduser/.ssh/id_rsa already exists.
Overwrite (y/n)? y
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
b8:68:c1:4b:d1:fe:4b:40:2c:1c:b4:37:db:5f:76:ee hduser@dbubuntu
The key's randomart image is:
+--[ RSA 2048]----+
| .o |
| . = |
| = * |
| . * = |
| + = S o . |
| . + + . o o |
| + . o . . |
| . . . . |
| . E |
+-----------------+
hduser@dbubuntu:~$
2. Enable ssh access to the local machine with the newly created key
$ cat /home/hduser/.ssh/id_rsa.pub >> /home/hduser/.ssh/authorized_keys
Note: Since I also use the local existing user, I do the step above twice, once more for user tid.

1.2.4 Extracting Hadoop installer
I copy the extracted Hadoop package from the ~/Downloads directory into the desired folder. In this case I use /usr/local to locate the Hadoop installation.
1. Command for copying the extracted Hadoop directory into /usr/local (the recursive flag is needed because it is a directory)
$ cp -R ~/Downloads/hadoop-1.2.1 /usr/local/
$ cd /usr/local
2. In order to make Hadoop easy to access and to handle future version updates, I create a symbolic link from hadoop-1.2.1 to a hadoop directory
$ ln -s hadoop-1.2.1 hadoop
3. Change the folder ownership of hadoop, so it will be accessible by user tid
$ sudo chown -R tid:hadoop hadoop
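Before moving on to the Hadoop configuration files, it does not hurt to verify the two preparations above; a quick check of my own (not part of the original steps): ssh to localhost should now log in without asking for a password (accept the host-key prompt the first time), and the hadoop binary behind the symlink should report version 1.2.1.
$ ssh localhost
$ exit
$ /usr/local/hadoop/bin/hadoop version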
1.3 Hadoop Configuration
After successfully configuring the network, folders, and users, in this part I will explain the Hadoop configuration step by step for each machine. Since I located my Hadoop inside /usr/local/hadoop, that will be the active directory, and all of the Hadoop configuration files are located in its conf directory.

1.3.1 conf/masters
Since this configuration is for a single node, the content of the masters file by default should look like this
localhost
Note: For a multi node cluster we can add the master alias name according to the network setup and configuration in section 1.2.1

1.3.2 conf/slaves
Same as the masters file, the default setup for slaves should look like this
localhost
Note: For a multi node cluster, we can add the master and slave aliases (considering the network configuration in section 1.2.1); for example the slaves file might look like this
slave1
slave2
slave3

1.3.3 conf/core-site.xml
For the core site configuration I specify the temporary directory location as /tmp/hadoop/app. In this file, I give the core site configuration of the cluster. Initially, the file looks like the following
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
</configuration>
then I change the configuration to look like this
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop/app</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
    <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
  </property>
</configuration>

1.3.4 conf/mapred-site.xml
I change mapred-site.xml so that it looks like the following
<configuration>
  <!-- In: conf/mapred-site.xml -->
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
    <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
  </property>
</configuration>
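One practical note of my own about the hadoop.tmp.dir chosen above: Hadoop should create /tmp/hadoop/app on demand, but creating it up front and making sure it is owned by the user that runs the daemons (tid in this report) avoids permission surprises:
$ sudo mkdir -p /tmp/hadoop/app
$ sudo chown -R tid:hadoop /tmp/hadoop/app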
1.3.5 conf/hdfs-site.xml
For hdfs-site, since this is a single node cluster, I set the number of replications to only 1; if we have several machines we can increase the replication factor.
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified at create time.</description>
  </property>
</configuration>

1.3.6 Formatting HDFS via the Namenode
Before starting the cluster, we should format the Hadoop file system. It can be formatted more than once; however, formatting the namenode erases all of the data in HDFS, so we need to be careful or we can lose our data
tid@dbubuntu:~$ /usr/local/hadoop/bin/hadoop namenode -format

1.4 Running Hadoop
After setting all the needed configuration, we can finally start Hadoop. For running the Hadoop daemons, there are several alternatives.

1.4.1 Starting all daemons at once
For starting all of the Hadoop services, I use the following command
tid@dbubuntu:/usr/local/hadoop$ bin/start-all.sh
Then I check whether all of the daemons are already running using the following command
tid@dbubuntu:/usr/local/hadoop$ jps
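If everything started correctly, the jps output should list the five daemons of a Hadoop 1.x single node setup alongside their process IDs; a sketch of what to expect (the PIDs will of course differ):
4825 NameNode
5051 DataNode
5280 SecondaryNameNode
5370 JobTracker
5601 TaskTracker
5812 Jps
The NameNode web interface should then be reachable at http://localhost:50070 and the JobTracker web interface at http://localhost:50030.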
1.4.2 Stopping all daemons
For stopping all daemons, we can use the following command
tid@dbubuntu:/usr/local/hadoop$ bin/stop-all.sh

2 Hadoop MapReduce Program (KMeans Clustering in MapReduce)
In this part I will explain how I run a MapReduce program, based on the reference from Thomas's blog post on KMeans Clustering in MapReduce (http://codingwiththomas.blogspot.kr/2011/05/k-means-clustering-with-mapreduce.html) with some modifications.

2.1 Eclipse IDE Setup
There are several ways to set up the IDE environment so we can easily create a MapReduce program in Eclipse. For ease of development and setup, I am using a self-built Eclipse plugin, created as a jar from the Hadoop library. The following is the step-by-step setup of the Eclipse IDE.
Step 1: Copy the pre-built Eclipse plugin for Hadoop into the plugins directory of Eclipse
Figure 1 Eclipse Hadoop plugin inside the plugins directory of Eclipse
Step 2: Restart Eclipse; in the perspective chooser in the upper right corner we will see an additional perspective, and there we choose the MapReduce perspective
Figure 2 MapReduce perspective in the Eclipse IDE
Step 3: Add the server in the Map/Reduce locations panel. In my case, because my server is located on the local machine, named localhost, the details look like the following
Figure 3 Configuration for adding a Hadoop Location
After adding the server, on the right side, along with the project explorer, we will see the HDFS file explorer, and Eclipse is now ready to use for developing a MapReduce application
Figure 4 HDFS Directory explorer
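One detail that is easy to get wrong in this dialog (my own note, not from the original text): the Map/Reduce Master port should match mapred.job.tracker from section 1.3.4 (54311) and the DFS Master port should match fs.default.name from section 1.3.3 (54310). A quick way to confirm that the daemons are actually listening on those ports, with illustrative output:
$ netstat -lnt | grep -E '54310|54311'
tcp        0      0 127.0.0.1:54310         0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:54311         0.0.0.0:*               LISTEN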
2.2 Source Code Preparation
In this part, I will describe how to prepare the project, packages, and classes for the KMeans Clustering MapReduce program.
Create new project
1. First, we need to create a new MapReduce project by clicking new project in the upper left corner of Eclipse; after the following window pops up, choose Map/Reduce Project
Figure 5 Choosing MapReduce project
2. Fill in the project name and check the referenced location of Hadoop. By default, if we are using the Eclipse plugin for Hadoop, the folder will point to our Hadoop installation folder; then click "Finish".
Create Package and Class
For the KMeans Clustering MapReduce program, based on Thomas's reference, we need to create two packages: one package for the clustering model, consisting of the model class for Vector, the distance measure, and the ClusterCenter definition (Vector.java, DistanceMeasurer.java, and ClusterCenter.java), and the other package for the Main, Mapper, and Reducer classes (KMeansClusteringJob.java, KMeansMapper.java, and KMeansReducer.java).
1. com.clustering.model package
Model Class : Vector.java

package com.clustering.model;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.io.WritableComparable;

public class Vector implements WritableComparable<Vector> {

    private double[] vector;

    public Vector() {
        super();
    }

    public Vector(Vector v) {
        super();
        int l = v.vector.length;
        this.vector = new double[l];
        System.arraycopy(v.vector, 0, this.vector, 0, l);
    }

    public Vector(double x, double y) {
        super();
        this.vector = new double[] { x, y };
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        int size = in.readInt();
        vector = new double[size];
        for (int i = 0; i < size; i++)
            vector[i] = in.readDouble();
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(vector.length);
        for (int i = 0; i < vector.length; i++)
            out.writeDouble(vector[i]);
    }

    @Override
    public int compareTo(Vector o) {
        // 0 if all components are equal, 1 otherwise
        boolean equals = true;
        for (int i = 0; i < vector.length; i++) {
            if (vector[i] != o.vector[i]) {
                equals = false;
                break;
            }
        }
        if (equals)
            return 0;
        else
            return 1;
    }

    public double[] getVector() {
        return vector;
    }

    public void setVector(double[] vector) {
        this.vector = vector;
    }

    public String toString() {
        return "Vector [vector=" + Arrays.toString(vector) + "]";
    }
}

2. Distance Measurement Class
Distance Measurement class : DistanceMeasurer.java

package com.clustering.model;
public class DistanceMeasurer {

    // Manhattan (L1) distance between a cluster center and a vector
    public static final double measureDistance(ClusterCenter center, Vector v) {
        double sum = 0;
        int length = v.getVector().length;
        for (int i = 0; i < length; i++) {
            sum += Math.abs(center.getCenter().getVector()[i] - v.getVector()[i]);
        }
        return sum;
    }
}

3. ClusterCenter
Cluster Center definition : ClusterCenter.java

package com.clustering.model;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class ClusterCenter implements WritableComparable<ClusterCenter> {

    private Vector center;

    public ClusterCenter() {
        super();
        this.center = null;
    }

    public ClusterCenter(ClusterCenter center) {
        super();
        this.center = new Vector(center.center);
    }

    public ClusterCenter(Vector center) {
        super();
        this.center = center;
    }

    // returns true when this center differs from c, i.e. the center has moved
    public boolean converged(ClusterCenter c) {
        return compareTo(c) == 0 ? false : true;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        center.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.center = new Vector();
        center.readFields(in);
    }

    @Override
    public int compareTo(ClusterCenter o) {
        return center.compareTo(o.getCenter());
    }

    /**
     * @return the center
     */
    public Vector getCenter() {
        return center;
    }

    @Override
    public String toString() {
        return "ClusterCenter [center=" + center + "]";
    }
}

After configuring the model classes, the next ones are the MapReduce classes, which consist of the Mapper, the Reducer, and finally the Main class.

1. Mapper class
Mapper class : KMeansMapper.java

package com.clustering.mapreduce;

import java.io.IOException;
import java.util.LinkedList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.mapreduce.Mapper;

import com.clustering.model.ClusterCenter;
import com.clustering.model.DistanceMeasurer;
import com.clustering.model.Vector;

public class KMeansMapper extends Mapper<ClusterCenter, Vector, ClusterCenter, Vector> {

    List<ClusterCenter> centers = new LinkedList<ClusterCenter>();

    // load the current cluster centers from the centroid SequenceFile before mapping
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        super.setup(context);
        Configuration conf = context.getConfiguration();
        Path centroids = new Path(conf.get("centroid.path"));
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, centroids, conf);
        ClusterCenter key = new ClusterCenter();
        IntWritable value = new IntWritable();
        while (reader.next(key, value)) {
            centers.add(new ClusterCenter(key));
        }
        reader.close();
    }

    // assign each input vector to its nearest cluster center
    @Override
    protected void map(ClusterCenter key, Vector value, Context context) throws IOException, InterruptedException {
        ClusterCenter nearest = null;
        double nearestDistance = Double.MAX_VALUE;
        for (ClusterCenter c : centers) {
            double dist = DistanceMeasurer.measureDistance(c, value);
            if (nearest == null) {
                nearest = c;
                nearestDistance = dist;
            } else {
                if (nearestDistance > dist) {
                    nearest = c;
                    nearestDistance = dist;
                }
            }
        }
        context.write(nearest, value);
    }
}

2. Reducer class
Reducer class : KMeansReducer.java

package com.clustering.mapreduce;

import java.io.IOException;
import java.util.LinkedList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.mapreduce.Reducer;

import com.clustering.model.ClusterCenter;
import com.clustering.model.Vector;

public class KMeansReducer extends Reducer<ClusterCenter, Vector, ClusterCenter, Vector> {

    public static enum Counter {
        CONVERGED
    }

    List<ClusterCenter> centers = new LinkedList<ClusterCenter>();
    // recompute each center as the component-wise mean of the vectors assigned to it
    protected void reduce(ClusterCenter key, Iterable<Vector> values, Context context) throws IOException, InterruptedException {
        Vector newCenter = new Vector();
        List<Vector> vectorList = new LinkedList<Vector>();
        int vectorSize = key.getCenter().getVector().length;
        newCenter.setVector(new double[vectorSize]);
        for (Vector value : values) {
            vectorList.add(new Vector(value));
            for (int i = 0; i < value.getVector().length; i++) {
                newCenter.getVector()[i] += value.getVector()[i];
            }
        }
        for (int i = 0; i < newCenter.getVector().length; i++) {
            newCenter.getVector()[i] = newCenter.getVector()[i] / vectorList.size();
        }
        ClusterCenter center = new ClusterCenter(newCenter);
        centers.add(center);
        for (Vector vector : vectorList) {
            context.write(center, vector);
        }
        // count every center that still moved in this iteration
        if (center.converged(key))
            context.getCounter(Counter.CONVERGED).increment(1);
    }

    // rewrite the centroid file with the updated centers for the next iteration
    protected void cleanup(Context context) throws IOException, InterruptedException {
        super.cleanup(context);
        Configuration conf = context.getConfiguration();
        Path outPath = new Path(conf.get("centroid.path"));
        FileSystem fs = FileSystem.get(conf);
        fs.delete(outPath, true);
        final SequenceFile.Writer out = SequenceFile.createWriter(fs, context.getConfiguration(), outPath, ClusterCenter.class, IntWritable.class);
        final IntWritable value = new IntWritable(0);
        for (ClusterCenter center : centers) {
            out.append(center, value);
        }
        out.close();
    }
}
3. Main class
Main class : KMeansClusteringJob.java

package com.clustering.mapreduce;

import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

import com.clustering.model.ClusterCenter;
import com.clustering.model.Vector;

public class KMeansClusteringJob {

    private static final Log LOG = LogFactory.getLog(KMeansClusteringJob.class);

    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        int iteration = 1;
        Configuration conf = new Configuration();
        conf.set("num.iteration", iteration + "");

        Path in = new Path("files/clustering/import/data");
        Path center = new Path("files/clustering/import/center/cen.seq");
        conf.set("centroid.path", center.toString());
        Path out = new Path("files/clustering/depth_1");

        Job job = new Job(conf);
        job.setJobName("KMeans Clustering");
        job.setMapperClass(KMeansMapper.class);
        job.setReducerClass(KMeansReducer.class);
        job.setJarByClass(KMeansMapper.class);

        SequenceFileInputFormat.addInputPath(job, in);
        FileSystem fs = FileSystem.get(conf);
        // remove data left over from a previous run
        if (fs.exists(out))
            fs.delete(out, true);
        if (fs.exists(center))
            fs.delete(center, true);
        if (fs.exists(in))
            fs.delete(in, true);

        // the two initial k-center vectors
        final SequenceFile.Writer centerWriter = SequenceFile.createWriter(fs, conf, center, ClusterCenter.class, IntWritable.class);
        final IntWritable value = new IntWritable(0);
        centerWriter.append(new ClusterCenter(new Vector(1, 1)), value);
        centerWriter.append(new ClusterCenter(new Vector(5, 5)), value);
        centerWriter.close();

        // the input vectors
        final SequenceFile.Writer dataWriter = SequenceFile.createWriter(fs, conf, in, ClusterCenter.class, Vector.class);
        dataWriter.append(new ClusterCenter(new Vector(0, 0)), new Vector(1, 2));
        dataWriter.append(new ClusterCenter(new Vector(0, 0)), new Vector(16, 3));
        dataWriter.append(new ClusterCenter(new Vector(0, 0)), new Vector(3, 3));
        dataWriter.append(new ClusterCenter(new Vector(0, 0)), new Vector(2, 2));
        dataWriter.append(new ClusterCenter(new Vector(0, 0)), new Vector(2, 3));
        dataWriter.append(new ClusterCenter(new Vector(0, 0)), new Vector(25, 1));
        dataWriter.append(new ClusterCenter(new Vector(0, 0)), new Vector(7, 6));
        dataWriter.append(new ClusterCenter(new Vector(0, 0)), new Vector(6, 5));
        dataWriter.append(new ClusterCenter(new Vector(0, 0)), new Vector(-1, -23));
        dataWriter.close();

        SequenceFileOutputFormat.setOutputPath(job, out);
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setOutputKeyClass(ClusterCenter.class);
        job.setOutputValueClass(Vector.class);

        job.waitForCompletion(true);

        long counter = job.getCounters().findCounter(KMeansReducer.Counter.CONVERGED).getValue();
        iteration++;

        // iterate until no center moved in the previous pass
        while (counter > 0) {
            conf = new Configuration();
            conf.set("centroid.path", center.toString());
            conf.set("num.iteration", iteration + "");
            job = new Job(conf);
            job.setJobName("KMeans Clustering " + iteration);
            job.setMapperClass(KMeansMapper.class);
            job.setReducerClass(KMeansReducer.class);
            job.setJarByClass(KMeansMapper.class);

            in = new Path("files/clustering/depth_" + (iteration - 1) + "/");
            out = new Path("files/clustering/depth_" + iteration);

            SequenceFileInputFormat.addInputPath(job, in);
            if (fs.exists(out))
                fs.delete(out, true);
            SequenceFileOutputFormat.setOutputPath(job, out);
            job.setInputFormatClass(SequenceFileInputFormat.class);
            job.setOutputFormatClass(SequenceFileOutputFormat.class);
            job.setOutputKeyClass(ClusterCenter.class);
            job.setOutputValueClass(Vector.class);

            job.waitForCompletion(true);
            iteration++;
            counter = job.getCounters().findCounter(KMeansReducer.Counter.CONVERGED).getValue();
        }

        // print the final assignment from the last depth_* directory
        Path result = new Path("files/clustering/depth_" + (iteration - 1) + "/");
        FileStatus[] stati = fs.listStatus(result);
        for (FileStatus status : stati) {
            if (!status.isDir() && !status.getPath().toString().contains("/_")) {
                Path path = status.getPath();
                LOG.info("FOUND " + path.toString());
                SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
                ClusterCenter key = new ClusterCenter();
                Vector v = new Vector();
                while (reader.next(key, v)) {
                    LOG.info(key + " / " + v);
                }
                reader.close();
            }
        }
    }
}
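Before running the job, note that the classes above can also be compiled and packaged outside Eclipse. A minimal sketch, assuming the sources live under src/ in the workspace project and using kmeans.jar as the jar name (both assumptions of mine, not taken from the report):
$ cd ~/eclipse/workspace/MRClustering
$ mkdir -p bin
$ javac -classpath "/usr/local/hadoop/hadoop-core-1.2.1.jar:/usr/local/hadoop/lib/*" \
      -d bin src/com/clustering/model/*.java src/com/clustering/mapreduce/*.java
$ jar -cvf kmeans.jar -C bin .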
The final project listing will look like this
Figure 6 File Listing for the KMeans MapReduce Program

2.3 Run the program
Unlike the wordcount program, for which we have to prepare the input files, in the KMeansClustering program the input is defined inside the KMeansClusteringJob class. For running the KMeansClustering job, since we have already configured Eclipse, we can run the program natively inside Eclipse. So, by pointing to the Main class (KMeansClusteringJob.java), we can run the project as a Hadoop application.
Figure 7 Run Project as Hadoop application

The Input (as defined in the KMeansClusteringJob class)
Vector [vector=[16.0, 3.0]]
Vector [vector=[7.0, 6.0]]
Vector [vector=[6.0, 5.0]]
Vector [vector=[25.0, 1.0]]
Vector [vector=[1.0, 2.0]]
Vector [vector=[3.0, 3.0]]
Vector [vector=[2.0, 2.0]]
Vector [vector=[2.0, 3.0]]
Vector [vector=[-1.0, -23.0]]

Output from Thomas's blog:
ClusterCenter [center=Vector [vector=[13.5, 3.75]]] / Vector [vector=[16.0, 3.0]]
ClusterCenter [center=Vector [vector=[13.5, 3.75]]] / Vector [vector=[7.0, 6.0]]
ClusterCenter [center=Vector [vector=[13.5, 3.75]]] / Vector [vector=[6.0, 5.0]]
ClusterCenter [center=Vector [vector=[13.5, 3.75]]] / Vector [vector=[25.0, 1.0]]
ClusterCenter [center=Vector [vector=[1.4, -2.6]]] / Vector [vector=[1.0, 2.0]]
ClusterCenter [center=Vector [vector=[1.4, -2.6]]] / Vector [vector=[3.0, 3.0]]
ClusterCenter [center=Vector [vector=[1.4, -2.6]]] / Vector [vector=[2.0, 2.0]]
ClusterCenter [center=Vector [vector=[1.4, -2.6]]] / Vector [vector=[2.0, 3.0]]
ClusterCenter [center=Vector [vector=[1.4, -2.6]]] / Vector [vector=[-1.0, -23.0]]

Output of my KMeansClusteringJob:
file:/home/tid/eclipse/workspace/MRClustering/files/clustering/depth_3/part-r-00000
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector [vector=[13.5, 3.75]]] / Vector [vector=[16.0, 3.0]]
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector [vector=[13.5, 3.75]]] / Vector [vector=[7.0, 6.0]]
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector [vector=[13.5, 3.75]]] / Vector [vector=[6.0, 5.0]]
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector [vector=[13.5, 3.75]]] / Vector [vector=[25.0, 1.0]]
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector [vector=[1.4, -2.6]]] / Vector [vector=[1.0, 2.0]]
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector [vector=[1.4, -2.6]]] / Vector [vector=[3.0, 3.0]]
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector [vector=[1.4, -2.6]]] / Vector [vector=[2.0, 2.0]]
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector [vector=[1.4, -2.6]]] / Vector [vector=[2.0, 3.0]]
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector [vector=[1.4, -2.6]]] / Vector [vector=[-1.0, -23.0]]
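The outputs above come from running the job inside Eclipse, which is why the complete log below uses the LocalJobRunner and file:/ paths under the Eclipse workspace; the per-iteration results are then plain local directories, and the job can alternatively be submitted to the running cluster with the jar built earlier. A sketch of both (kmeans.jar is my assumed jar name; with the cluster configuration on the classpath the relative files/clustering/... paths will be created in HDFS rather than in the workspace):
$ ls ~/eclipse/workspace/MRClustering/files/clustering/
depth_1  depth_2  depth_3  import
tid@dbubuntu:/usr/local/hadoop$ bin/hadoop jar ~/eclipse/workspace/MRClustering/kmeans.jar com.clustering.mapreduce.KMeansClusteringJob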
Complete Log Result of My KMeansClusteringJob
14/04/08 15:50:35 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/04/08 15:50:35 INFO compress.CodecPool: Got brand-new compressor
14/04/08 15:50:35 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/04/08 15:50:35 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
14/04/08 15:50:35 INFO input.FileInputFormat: Total input paths to process : 1
14/04/08 15:50:35 INFO mapred.JobClient: Running job: job_local1343624176_0001
14/04/08 15:50:35 INFO mapred.LocalJobRunner: Waiting for map tasks
14/04/08 15:50:35 INFO mapred.LocalJobRunner: Starting task: attempt_local1343624176_0001_m_000000_0
14/04/08 15:50:36 INFO util.ProcessTree: setsid exited with exit code 0
14/04/08 15:50:36 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@24bec229
14/04/08 15:50:36 INFO mapred.MapTask: Processing split: file:/home/tid/eclipse/workspace/MRClustering/files/clustering/import/data:0+558
14/04/08 15:50:36 INFO mapred.MapTask: io.sort.mb = 100
14/04/08 15:50:36 INFO mapred.MapTask: data buffer = 79691776/99614720
14/04/08 15:50:36 INFO mapred.MapTask: record buffer = 262144/327680
14/04/08 15:50:36 INFO compress.CodecPool: Got brand-new decompressor
14/04/08 15:50:36 INFO compress.CodecPool: Got brand-new decompressor
14/04/08 15:50:36 INFO mapred.MapTask: Starting flush of map output
14/04/08 15:50:36 INFO mapred.MapTask: Finished spill 0
14/04/08 15:50:36 INFO mapred.Task: Task:attempt_local1343624176_0001_m_000000_0 is done. And is in the process of commiting
14/04/08 15:50:36 INFO mapred.LocalJobRunner:
14/04/08 15:50:36 INFO mapred.Task: Task 'attempt_local1343624176_0001_m_000000_0' done.
14/04/08 15:50:36 INFO mapred.LocalJobRunner: Finishing task: attempt_local1343624176_0001_m_000000_0
14/04/08 15:50:36 INFO mapred.LocalJobRunner: Map task executor complete.
14/04/08 15:50:36 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@7b4b3d0e
14/04/08 15:50:36 INFO mapred.LocalJobRunner:
14/04/08 15:50:36 INFO mapred.Merger: Merging 1 sorted segments
14/04/08 15:50:36 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 380 bytes
14/04/08 15:50:36 INFO mapred.LocalJobRunner:
14/04/08 15:50:36 INFO mapred.Task: Task:attempt_local1343624176_0001_r_000000_0 is done. And is in the process of commiting
14/04/08 15:50:36 INFO mapred.LocalJobRunner:
14/04/08 15:50:36 INFO mapred.Task: Task attempt_local1343624176_0001_r_000000_0 is allowed to commit now
14/04/08 15:50:36 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1343624176_0001_r_000000_0' to files/clustering/depth_1
14/04/08 15:50:36 INFO mapred.LocalJobRunner: reduce > reduce
14/04/08 15:50:36 INFO mapred.Task: Task 'attempt_local1343624176_0001_r_000000_0' done.
14/04/08 15:50:36 INFO mapred.JobClient: map 100% reduce 100%
14/04/08 15:50:36 INFO mapred.JobClient: Job complete: job_local1343624176_0001
14/04/08 15:50:36 INFO mapred.JobClient: Counters: 21
14/04/08 15:50:36 INFO mapred.JobClient: File Output Format Counters
14/04/08 15:50:36 INFO mapred.JobClient: Bytes Written=537
14/04/08 15:50:36 INFO mapred.JobClient: File Input Format Counters
14/04/08 15:50:36 INFO mapred.JobClient: Bytes Read=574
14/04/08 15:50:36 INFO mapred.JobClient: FileSystemCounters
14/04/08 15:50:36 INFO mapred.JobClient: FILE_BYTES_READ=2380
14/04/08 15:50:36 INFO mapred.JobClient: FILE_BYTES_WRITTEN=106876
14/04/08 15:50:36 INFO mapred.JobClient: com.clustering.mapreduce.KMeansReducer$Counter
14/04/08 15:50:36 INFO mapred.JobClient: CONVERGED=2
14/04/08 15:50:36 INFO mapred.JobClient: Map-Reduce Framework
14/04/08 15:50:36 INFO mapred.JobClient: Reduce input groups=2
14/04/08 15:50:36 INFO mapred.JobClient: Map output materialized bytes=384
14/04/08 15:50:36 INFO mapred.JobClient: Combine output records=0
14/04/08 15:50:36 INFO mapred.JobClient: Map input records=9
14/04/08 15:50:36 INFO mapred.JobClient: Reduce shuffle bytes=0
14/04/08 15:50:36 INFO mapred.JobClient: Physical memory (bytes) snapshot=0
14/04/08 15:50:36 INFO mapred.JobClient: Reduce output records=9
14/04/08 15:50:36 INFO mapred.JobClient: Spilled Records=18
14/04/08 15:50:36 INFO mapred.JobClient: Map output bytes=360
14/04/08 15:50:36 INFO mapred.JobClient: Total committed heap usage (bytes)=355991552
14/04/08 15:50:36 INFO mapred.JobClient: CPU time spent (ms)=0
14/04/08 15:50:36 INFO mapred.JobClient: Virtual memory (bytes) snapshot=0
14/04/08 15:50:36 INFO mapred.JobClient: SPLIT_RAW_BYTES=139
14/04/08 15:50:36 INFO mapred.JobClient: Map output records=9
14/04/08 15:50:36 INFO mapred.JobClient: Combine input records=0
14/04/08 15:50:36 INFO mapred.JobClient: Reduce input records=9
14/04/08 15:50:36 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/04/08 15:50:36 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
14/04/08 15:50:36 INFO input.FileInputFormat: Total input paths to process : 1
14/04/08 15:50:37 INFO mapred.JobClient: Running job: job_local1426850290_0002
14/04/08 15:50:37 INFO mapred.LocalJobRunner: Waiting for map tasks
14/04/08 15:50:37 INFO mapred.LocalJobRunner: Starting task: attempt_local1426850290_0002_m_000000_0
14/04/08 15:50:37 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@26adcd34
14/04/08 15:50:37 INFO mapred.MapTask: Processing split: file:/home/tid/eclipse/workspace/MRClustering/files/clustering/depth_1/part-r-00000:0+521
14/04/08 15:50:37 INFO mapred.MapTask: io.sort.mb = 100
14/04/08 15:50:37 INFO mapred.MapTask: data buffer = 79691776/99614720
14/04/08 15:50:37 INFO mapred.MapTask: record buffer = 262144/327680
14/04/08 15:50:37 INFO mapred.MapTask: Starting flush of map output
14/04/08 15:50:37 INFO mapred.MapTask: Finished spill 0
14/04/08 15:50:37 INFO mapred.Task: Task:attempt_local1426850290_0002_m_000000_0 is done. And is in the process of commiting
14/04/08 15:50:37 INFO mapred.LocalJobRunner:
14/04/08 15:50:37 INFO mapred.Task: Task 'attempt_local1426850290_0002_m_000000_0' done.
14/04/08 15:50:37 INFO mapred.LocalJobRunner: Finishing task: attempt_local1426850290_0002_m_000000_0
14/04/08 15:50:37 INFO mapred.LocalJobRunner: Map task executor complete.
14/04/08 15:50:37 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@3740f768
14/04/08 15:50:37 INFO mapred.LocalJobRunner:
14/04/08 15:50:37 INFO mapred.Merger: Merging 1 sorted segments
14/04/08 15:50:37 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 380 bytes
14/04/08 15:50:37 INFO mapred.LocalJobRunner:
14/04/08 15:50:37 INFO mapred.Task: Task:attempt_local1426850290_0002_r_000000_0 is done. And is in the process of commiting
14/04/08 15:50:37 INFO mapred.LocalJobRunner:
14/04/08 15:50:37 INFO mapred.Task: Task attempt_local1426850290_0002_r_000000_0 is allowed to commit now
14/04/08 15:50:37 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1426850290_0002_r_000000_0' to files/clustering/depth_2
14/04/08 15:50:37 INFO mapred.LocalJobRunner: reduce > reduce
14/04/08 15:50:37 INFO mapred.Task: Task 'attempt_local1426850290_0002_r_000000_0' done.
14/04/08 15:50:38 INFO mapred.JobClient: map 100% reduce 100%
14/04/08 15:50:38 INFO mapred.JobClient: Job complete: job_local1426850290_0002
14/04/08 15:50:38 INFO mapred.JobClient: Counters: 21
14/04/08 15:50:38 INFO mapred.JobClient: File Output Format Counters
14/04/08 15:50:38 INFO mapred.JobClient: Bytes Written=537
14/04/08 15:50:38 INFO mapred.JobClient: File Input Format Counters
14/04/08 15:50:38 INFO mapred.JobClient: Bytes Read=537
14/04/08 15:50:38 INFO mapred.JobClient: FileSystemCounters
14/04/08 15:50:38 INFO mapred.JobClient: FILE_BYTES_READ=5088
14/04/08 15:50:38 INFO mapred.JobClient: FILE_BYTES_WRITTEN=212938
14/04/08 15:50:38 INFO mapred.JobClient: com.clustering.mapreduce.KMeansReducer$Counter
14/04/08 15:50:38 INFO mapred.JobClient: CONVERGED=2
14/04/08 15:50:38 INFO mapred.JobClient: Map-Reduce Framework
14/04/08 15:50:38 INFO mapred.JobClient: Reduce input groups=2
14/04/08 15:50:38 INFO mapred.JobClient: Map output materialized bytes=384
14/04/08 15:50:38 INFO mapred.JobClient: Combine output records=0
14/04/08 15:50:38 INFO mapred.JobClient: Map input records=9
14/04/08 15:50:38 INFO mapred.JobClient: Reduce shuffle bytes=0
14/04/08 15:50:38 INFO mapred.JobClient: Physical memory (bytes) snapshot=0
14/04/08 15:50:38 INFO mapred.JobClient: Reduce output records=9
14/04/08 15:50:38 INFO mapred.JobClient: Spilled Records=18
14/04/08 15:50:38 INFO mapred.JobClient: Map output bytes=360
14/04/08 15:50:38 INFO mapred.JobClient: Total committed heap usage (bytes)=555352064
14/04/08 15:50:38 INFO mapred.JobClient: CPU time spent (ms)=0
14/04/08 15:50:38 INFO mapred.JobClient: Virtual memory (bytes) snapshot=0
14/04/08 15:50:38 INFO mapred.JobClient: SPLIT_RAW_BYTES=148
14/04/08 15:50:38 INFO mapred.JobClient: Map output records=9
14/04/08 15:50:38 INFO mapred.JobClient: Combine input records=0
14/04/08 15:50:38 INFO mapred.JobClient: Reduce input records=9
14/04/08 15:50:38 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/04/08 15:50:38 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
14/04/08 15:50:38 INFO input.FileInputFormat: Total input paths to process : 1
14/04/08 15:50:38 INFO mapred.JobClient: Running job: job_local466621791_0003
14/04/08 15:50:38 INFO mapred.LocalJobRunner: Waiting for map tasks
14/04/08 15:50:38 INFO mapred.LocalJobRunner: Starting task: attempt_local466621791_0003_m_000000_0
14/04/08 15:50:38 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@373fdd1a
14/04/08 15:50:38 INFO mapred.MapTask: Processing split: file:/home/tid/eclipse/workspace/MRClustering/files/clustering/depth_2/part-r-00000:0+521
14/04/08 15:50:38 INFO mapred.MapTask: io.sort.mb = 100
14/04/08 15:50:38 INFO mapred.MapTask: data buffer = 79691776/99614720
14/04/08 15:50:38 INFO mapred.MapTask: record buffer = 262144/327680
14/04/08 15:50:38 INFO mapred.MapTask: Starting flush of map output
14/04/08 15:50:38 INFO mapred.MapTask: Finished spill 0
14/04/08 15:50:38 INFO mapred.Task: Task:attempt_local466621791_0003_m_000000_0 is done. And is in the process of commiting
14/04/08 15:50:38 INFO mapred.LocalJobRunner:
14/04/08 15:50:38 INFO mapred.Task: Task 'attempt_local466621791_0003_m_000000_0' done.
14/04/08 15:50:38 INFO mapred.LocalJobRunner: Finishing task: attempt_local466621791_0003_m_000000_0
14/04/08 15:50:38 INFO mapred.LocalJobRunner: Map task executor complete.
14/04/08 15:50:38 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@78d3cdb9
14/04/08 15:50:38 INFO mapred.LocalJobRunner:
14/04/08 15:50:38 INFO mapred.Merger: Merging 1 sorted segments
14/04/08 15:50:38 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 380 bytes
14/04/08 15:50:38 INFO mapred.LocalJobRunner:
14/04/08 15:50:38 INFO mapred.Task: Task:attempt_local466621791_0003_r_000000_0 is done. And is in the process of commiting
14/04/08 15:50:38 INFO mapred.LocalJobRunner:
14/04/08 15:50:38 INFO mapred.Task: Task attempt_local466621791_0003_r_000000_0 is allowed to commit now
14/04/08 15:50:38 INFO output.FileOutputCommitter: Saved output of task 'attempt_local466621791_0003_r_000000_0' to files/clustering/depth_3
14/04/08 15:50:38 INFO mapred.LocalJobRunner: reduce > reduce
14/04/08 15:50:38 INFO mapred.Task: Task 'attempt_local466621791_0003_r_000000_0' done.
14/04/08 15:50:39 INFO mapred.JobClient: map 100% reduce 100%
14/04/08 15:50:39 INFO mapred.JobClient: Job complete: job_local466621791_0003
14/04/08 15:50:39 INFO mapred.JobClient: Counters: 20
14/04/08 15:50:39 INFO mapred.JobClient: File Output Format Counters
14/04/08 15:50:39 INFO mapred.JobClient: Bytes Written=537
14/04/08 15:50:39 INFO mapred.JobClient: File Input Format Counters
14/04/08 15:50:39 INFO mapred.JobClient: Bytes Read=537
14/04/08 15:50:39 INFO mapred.JobClient: FileSystemCounters
14/04/08 15:50:39 INFO mapred.JobClient: FILE_BYTES_READ=7796
14/04/08 15:50:39 INFO mapred.JobClient: FILE_BYTES_WRITTEN=318992
14/04/08 15:50:39 INFO mapred.JobClient: Map-Reduce Framework
14/04/08 15:50:39 INFO mapred.JobClient: Reduce input groups=2
14/04/08 15:50:39 INFO mapred.JobClient: Map output materialized bytes=384
14/04/08 15:50:39 INFO mapred.JobClient: Combine output records=0
14/04/08 15:50:39 INFO mapred.JobClient: Map input records=9
14/04/08 15:50:39 INFO mapred.JobClient: Reduce shuffle bytes=0
14/04/08 15:50:39 INFO mapred.JobClient: Physical memory (bytes) snapshot=0
14/04/08 15:50:39 INFO mapred.JobClient: Reduce output records=9
14/04/08 15:50:39 INFO mapred.JobClient: Spilled Records=18
14/04/08 15:50:39 INFO mapred.JobClient: Map output bytes=360
14/04/08 15:50:39 INFO mapred.JobClient: Total committed heap usage (bytes)=754712576
14/04/08 15:50:39 INFO mapred.JobClient: CPU time spent (ms)=0
14/04/08 15:50:39 INFO mapred.JobClient: Virtual memory (bytes) snapshot=0
14/04/08 15:50:39 INFO mapred.JobClient: SPLIT_RAW_BYTES=148
14/04/08 15:50:39 INFO mapred.JobClient: Map output records=9
14/04/08 15:50:39 INFO mapred.JobClient: Combine input records=0
14/04/08 15:50:39 INFO mapred.JobClient: Reduce input records=9
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: FOUND file:/home/tid/eclipse/workspace/MRClustering/files/clustering/depth_3/part-r-00000
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector [vector=[13.5, 3.75]]] / Vector [vector=[16.0, 3.0]]
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector [vector=[13.5, 3.75]]] / Vector [vector=[7.0, 6.0]]
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector [vector=[13.5, 3.75]]] / Vector [vector=[6.0, 5.0]]
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector [vector=[13.5, 3.75]]] / Vector [vector=[25.0, 1.0]]
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector [vector=[1.4, -2.6]]] / Vector [vector=[1.0, 2.0]]
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector [vector=[1.4, -2.6]]] / Vector [vector=[3.0, 3.0]]
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector [vector=[1.4, -2.6]]] / Vector [vector=[2.0, 2.0]]
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector [vector=[1.4, -2.6]]] / Vector [vector=[2.0, 3.0]]
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector [vector=[1.4, -2.6]]] / Vector [vector=[-1.0, -23.0]]

References
• Hadoop installation on a single node cluster - http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
• KMeans Clustering with MapReduce - http://codingwiththomas.blogspot.kr/2011/05/k-means-clustering-with-mapreduce.html