Hadoop Installation and Running KMeans Clustering with MapReduce Program on Hadoop
Introduction
The general issues I cover in this document are Hadoop installation (Section 1) and running a KMeans clustering MapReduce program on Hadoop (Section 2).
1 Hadoop Installation
I will install Hadoop in single node cluster mode on my personal computer using the following environment:
1. Ubuntu 12.04
2. Java JDK 1.7.0 update 21
3. Hadoop 1.2.1 (stable)
1.1 Prerequisites
Before installing Hadoop, the following prerequisites must be in place on our system:
1. Sun JDK /Open JDK
I use the Sun JDK from Oracle instead of OpenJDK; the package can be downloaded from here:
http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html
2. Hadoop installer package
In this report, I use Hadoop version 1.2.1 (stable). The package can be downloaded from here: http://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/
1.1.1 Configuring Java
On my computer I have several Java versions installed: Java 7 and Java 6. To run Hadoop programs, I need to configure which version to use. I decided to use the newer version (Java 1.7 update 21), so the following are the step-by-step instructions for configuring the latest Java version.
1. Configure java
$ sudo update-alternatives --config java
There are 2 choices for the alternative java (providing /usr/bin/java).
Selection Path Priority Status
------------------------------------------------------------
0 /usr/lib/jvm/jdk1.6.0_45/bin/java 1 auto mode
1 /usr/lib/jvm/jdk1.6.0_45/bin/java 1 manual mode
* 2 /usr/lib/jvm/jdk1.7.0_21/bin/java 1 manual mode
Press enter to keep the current choice[*], or type selection number: 2
2. Configure javac
$ sudo update-alternatives --config javac
There are 2 choices for the alternative javac (providing /usr/bin/javac).
Selection Path Priority Status
------------------------------------------------------------
0 /usr/lib/jvm/jdk1.6.0_45/bin/javac 1 auto mode
1 /usr/lib/jvm/jdk1.6.0_45/bin/javac 1 manual mode
* 2 /usr/lib/jvm/jdk1.7.0_21/bin/javac 1 manual mode
Press enter to keep the current choice[*], or type selection number: 2
3. Configure javaws
$ sudo update-alternatives --config javaws
There are 2 choices for the alternative javaws (providing /usr/bin/javaws).
Selection Path Priority Status
------------------------------------------------------------
0 /usr/lib/jvm/jdk1.6.0_45/bin/javaws 1 auto mode
1 /usr/lib/jvm/jdk1.6.0_45/bin/javaws 1 manual mode
* 2 /usr/lib/jvm/jdk1.7.0_21/bin/javaws 1 manual mode
Press enter to keep the current choice[*], or type selection number: 2
4. Check the configuration
To make sure the latest java, javac, and javaws are configured successfully, I use the following command:
tid@dbubuntu:~$ java -version
java version "1.7.0_21"
Java(TM) SE Runtime Environment (build 1.7.0_21-b11)
Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode)
1.1.2 Hadoop installer
After downloading the Hadoop installer package, we need to extract it into the desired directory. I downloaded the Hadoop installer into the ~/Downloads directory.
To extract the package in that directory, I use the following command:
$ tar -xzvf hadoop-1.2.1.tar.gz
1.2 System configuration
In this section I explain step by step how to set up and prepare the system for a Hadoop single node cluster on my local computer. The system configuration consists of the network configuration for the host names, moving the extracted Hadoop package into the desired folder, enabling SSH, and adding users and changing folder permissions.
1.2.1 Network configuration
In the Hadoop network configuration, all of the machines should have an alias instead of a bare IP address. To configure network aliases, we edit /etc/hosts on every machine that we will use as a Hadoop master or slave. In my case, since I am using a single node, the configuration is done in these steps:
1. Open the file /etc/hosts as a sudoer:
$ sudo nano /etc/hosts
2. I add the following lines, where the second entry is my machine's IP address:
127.0.0.1 localhost
164.125.50.127 localhost
Note:
For a multi-node (master and slave) setup, the following /etc/hosts configuration would be used instead:
# /etc/hosts (for Hadoop Master and Slave)
192.168.0.1 master
192.168.0.2 slave
1.2.2 User configuration
For security reasons, it is better to create a dedicated Hadoop user on each machine; however, since I am working on my local computer, I also keep my existing user. The following commands add the new group and user:
$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser
1.2.3 Configuring SSH
Hadoop requires SSH access to manage its nodes. In this configuration, I set up SSH access to localhost both for the hduser created in the previous section and for the existing local user.
1. To generate an SSH key for hduser, we can use the following command:
$ su - hduser
Password:
hduser@dbubuntu:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
/home/hduser/.ssh/id_rsa already exists.
Overwrite (y/n)? y
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
b8:68:c1:4b:d1:fe:4b:40:2c:1c:b4:37:db:5f:76:ee hduser@dbubuntu
The key's randomart image is:
+--[ RSA 2048]----+
| .o |
| . = |
| = * |
| . * = |
| + = S o . |
| . + + . o o |
| + . o . . |
| . . . . |
| . E |
+-----------------+
hduser@dbubuntu:~$
2. Enable SSH access to the local machine with the newly created key:
$ cat /home/hduser/.ssh/id_rsa.pub >> /home/hduser/.ssh/authorized_keys
Note:
Since I also use the existing local user, I perform the steps above twice: once for hduser and once for user tid.
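Optionally, we can verify that passwordless login now works (the first connection will ask to confirm the host key):
$ ssh localhost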
1.2.4 Extracting Hadoop installer
I copy the extracted Hadoop package from the ~/Downloads directory into the desired folder. In this case I use /usr/local for the Hadoop installation.
1. Copy the extracted Hadoop directory into /usr/local:
$ cp -r ~/Downloads/hadoop-1.2.1 /usr/local/
$ cd /usr/local
2. To make Hadoop easy to access and to simplify handling future version updates, I create a symbolic link from hadoop-1.2.1 to a hadoop directory:
$ ln -s hadoop-1.2.1 hadoop
3. Change the ownership of the hadoop folder so it is accessible by user tid:
$ sudo chown -R tid:hadoop hadoop
1.3 Hadoop Configuration
After successfully configuring the network, folders, and users, in this part I explain the Hadoop configuration step by step. Since I located Hadoop inside /usr/local/hadoop, that is the working directory, and all of the Hadoop configuration files are located in its conf directory.
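Note: conf/hadoop-env.sh must also point JAVA_HOME at the installed JDK; assuming the JDK path used in section 1.1.1, the relevant line would be
export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_21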
1.3.1 conf/masters
Since this configuration is for a single node, the content of the masters file should by default simply be
localhost
Note: for a multi-node cluster we can add the master alias's name, following the network setup and configuration in section 1.2.1.
1.3.2 conf/slaves
As with the masters file, the default setup for slaves should be
localhost
Note: for a multi-node cluster, we can add the master and slave aliases (following the network configuration in section 1.2.1); for example, the slaves file might look like this:
slave1
slave2
slave3
1.3.3 conf/core-site.xml
For the core site configuration I specify the temporary directory /tmp/hadoop/app. In this file, I give the core site configuration of the cluster. Initially, the file looks like the following:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
</configuration>
Then I change the configuration to look like this:
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/tmp/hadoop/app</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>
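Note that the directory given in hadoop.tmp.dir must exist and be writable by the Hadoop user; assuming the value above, it can be prepared with
$ mkdir -p /tmp/hadoop/app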
1.3.4 conf/mapred-site.xml
I change mapred-site.xml so that it looks like the following:
<configuration>
<!-- In: conf/mapred-site.xml -->
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map and
reduce task.</description>
</property>
</configuration>
1.3.5 conf/hdfs-site.xml
In hdfs-site.xml, for a single node cluster, I set the replication factor to only 1; if we have several machines, we can increase the number of replicas.
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can
be specified when the file is created.
The default is used if replication
is not specified in create time.
</description>
</property>
</configuration>
1.3.6 Formatting HDFS via Namenode
Before starting the cluster, we must format the Hadoop file system. It can be formatted more than once; however, formatting the namenode erases all of the data in HDFS, so we need to be careful, otherwise we can lose our data:
tid@dbubuntu:~$ /usr/local/hadoop/bin/hadoop namenode -format
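If the format succeeds, the output should end with a line similar to the following (the exact path follows from the hadoop.tmp.dir value set in section 1.3.3):
common.Storage: Storage directory /tmp/hadoop/app/dfs/name has been successfully formatted.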
1.4 Running Hadoop
After setting all of the needed configuration, we can finally start Hadoop. There are several alternatives for running the Hadoop daemons.
1.4.1 Starting all daemons at once
To start all of the Hadoop services, I use the following command:
tid@dbubuntu:/usr/local/hadoop$ bin/start-all.sh
Then I check whether all of the daemons are running using the following command:
tid@dbubuntu:/usr/local/hadoop$ jps
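If everything started correctly, jps should list the five Hadoop 1.x daemons in addition to Jps itself; the process IDs below are illustrative only:
12305 NameNode
12512 DataNode
12714 SecondaryNameNode
12798 JobTracker
13002 TaskTracker
13111 Jps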
1.4.2 Stopping all daemons
To stop all of the daemons, we can use the following command:
tid@dbubuntu:/usr/local/hadoop$ bin/stop-all.sh
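As an alternative to start-all.sh and stop-all.sh, the HDFS and MapReduce daemons can also be started (and stopped) separately:
tid@dbubuntu:/usr/local/hadoop$ bin/start-dfs.sh
tid@dbubuntu:/usr/local/hadoop$ bin/start-mapred.sh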
2 Hadoop MapReduce Program (KMeans Clustering in MapReduce)
In this part I explain how I run the MapReduce program, following the reference from Thomas's blog on KMeans clustering with MapReduce (http://codingwiththomas.blogspot.kr/2011/05/k-means-clustering-with-mapreduce.html) with some modifications.
2.1 Eclipse IDE Setup
There are several ways to set up the IDE environment so we can easily create MapReduce programs in Eclipse. For ease of development and setup, I use a self-built Eclipse plugin, built as a jar from the Hadoop libraries. The following are the step-by-step instructions for setting up the Eclipse IDE.
Step 1: Copy the pre-built Eclipse plugin for Hadoop into the plugins directory of Eclipse.
Figure 1 Eclipse hadoop plugin inside plugins directory of eclipse
Step 2: Restart Eclipse; then, in the perspectives area in the upper right corner, choose the MapReduce perspective.
Figure 2 MapReduce perspective in eclipse IDE
Step 3: Add the server in the MapReduce panel.
In my case, because the server runs on the local machine as localhost, the details look like the following.
Figure 3 Configuration for adding Hadoop Location
After adding the server, on the right side along with the project explorer we will see the HDFS file explorer, and Eclipse is now ready for developing MapReduce applications.
Figure 4 HDFS Directory explorer
2.2 Source Code Preparation
In this part, I describe how to prepare the project, the packages, and the classes for the KMeans clustering MapReduce program.
Create new project
1. First, we need to create a new MapReduce project by clicking New Project in the upper left corner of Eclipse; when the following window pops up, choose MapReduce Project.
Figure 5 Choosing MapReduce project
2. Fill in the project name and check the referenced Hadoop location. By default, when using the Eclipse plugin for Hadoop, the folder points to our Hadoop installation directory. Then click "Finish".
Create Package and Class
For the KMeans Clustering MapReduce program, based on Thomas's reference, we need to create two packages: one for the clustering model, consisting of the model classes for the vector, the distance measure, and the cluster center (Vector.java, DistanceMeasurer.java, and ClusterCenter.java), and the other for the Main, Mapper, and Reducer classes (KMeansClusteringJob.java, KMeansMapper.java, and KMeansReducer.java).
1. com.clustering.model package Model Class (Vector.java)
Model Class : Vector.java
package com.clustering.model;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.io.WritableComparable;

public class Vector implements WritableComparable<Vector> {

    private double[] vector;

    public Vector() {
        super();
    }

    public Vector(Vector v) {
        super();
        int l = v.vector.length;
        this.vector = new double[l];
        System.arraycopy(v.vector, 0, this.vector, 0, l);
    }

    public Vector(double x, double y) {
        super();
        this.vector = new double[] { x, y };
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Deserialize: read the length first, then each component.
        int size = in.readInt();
        vector = new double[size];
        for (int i = 0; i < size; i++)
            vector[i] = in.readDouble();
    }

    @Override
    public void write(DataOutput out) throws IOException {
        // Serialize: write the length first, then each component.
        out.writeInt(vector.length);
        for (int i = 0; i < vector.length; i++)
            out.writeDouble(vector[i]);
    }

    @Override
    public int compareTo(Vector o) {
        // Note: this only distinguishes "equal" (0) from "not equal" (1);
        // it is not a total order, which suffices here because it is only
        // used to test whether two centers are identical.
        boolean equals = true;
        for (int i = 0; i < vector.length; i++) {
            if (vector[i] != o.vector[i]) {
                equals = false;
                break;
            }
        }
        return equals ? 0 : 1;
    }

    public double[] getVector() {
        return vector;
    }

    public void setVector(double[] vector) {
        this.vector = vector;
    }

    public String toString() {
        return "Vector [vector=" + Arrays.toString(vector) + "]";
    }
}
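Since Vector implements Hadoop's WritableComparable, its write/readFields pair must round-trip exactly. The following is a minimal standalone check of that contract; it is an illustrative sketch, not part of the project:
import java.io.*;
import com.clustering.model.Vector;

public class VectorRoundTrip {
    public static void main(String[] args) throws IOException {
        Vector v = new Vector(1.0, 2.0);
        // Serialize with write(), then deserialize with readFields().
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        v.write(new DataOutputStream(bos));
        Vector copy = new Vector();
        copy.readFields(new DataInputStream(
                new ByteArrayInputStream(bos.toByteArray())));
        // Prints "Vector [vector=[1.0, 2.0]] / 0" (0 means equal).
        System.out.println(copy + " / " + v.compareTo(copy));
    }
}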
2. Distance Measurement Class
Distance Measurement class : DistanceMeasurer.java
package com.clustering.model;

public class DistanceMeasurer {

    // Computes the Manhattan (L1) distance between a cluster center and
    // a vector by summing the absolute component differences.
    public static final double measureDistance(ClusterCenter center,
            Vector v) {
        double sum = 0;
        int length = v.getVector().length;
        for (int i = 0; i < length; i++) {
            sum += Math.abs(center.getCenter().getVector()[i]
                    - v.getVector()[i]);
        }
        return sum;
    }
}
3. ClusterCenter
Cluster Center definition : ClusterCenter.java
package com.clustering.model;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class ClusterCenter implements WritableComparable<ClusterCenter> {

    private Vector center;

    public ClusterCenter() {
        super();
        this.center = null;
    }

    public ClusterCenter(ClusterCenter center) {
        super();
        this.center = new Vector(center.center);
    }

    public ClusterCenter(Vector center) {
        super();
        this.center = center;
    }

    // Despite its name, this returns true when the two centers differ,
    // i.e. when the center has moved and the clustering has NOT yet
    // converged. The driver keeps iterating while this count is > 0.
    public boolean converged(ClusterCenter c) {
        return compareTo(c) == 0 ? false : true;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        center.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.center = new Vector();
        center.readFields(in);
    }

    @Override
    public int compareTo(ClusterCenter o) {
        return center.compareTo(o.getCenter());
    }

    /**
     * @return the center
     */
    public Vector getCenter() {
        return center;
    }

    @Override
    public String toString() {
        return "ClusterCenter [center=" + center + "]";
    }
}
After implementing the model classes, the next ones are the MapReduce classes, which consist of the Mapper, the Reducer, and finally the Main class.
1. Mapper class
Mapper class : KMeansMapper.java
package com.clustering.mapreduce;

import java.io.IOException;
import java.util.LinkedList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.mapreduce.Mapper;

import com.clustering.model.ClusterCenter;
import com.clustering.model.DistanceMeasurer;
import com.clustering.model.Vector;

public class KMeansMapper extends
        Mapper<ClusterCenter, Vector, ClusterCenter, Vector> {

    List<ClusterCenter> centers = new LinkedList<ClusterCenter>();

    @Override
    protected void setup(Context context) throws IOException,
            InterruptedException {
        // Load the current cluster centers from the centroid sequence
        // file before any map() call runs.
        super.setup(context);
        Configuration conf = context.getConfiguration();
        Path centroids = new Path(conf.get("centroid.path"));
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, centroids,
                conf);
        ClusterCenter key = new ClusterCenter();
        IntWritable value = new IntWritable();
        while (reader.next(key, value)) {
            centers.add(new ClusterCenter(key));
        }
        reader.close();
    }

    @Override
    protected void map(ClusterCenter key, Vector value, Context context)
            throws IOException, InterruptedException {
        // Assign the input vector to its nearest cluster center.
        ClusterCenter nearest = null;
        double nearestDistance = Double.MAX_VALUE;
        for (ClusterCenter c : centers) {
            double dist = DistanceMeasurer.measureDistance(c, value);
            if (nearest == null || nearestDistance > dist) {
                nearest = c;
                nearestDistance = dist;
            }
        }
        context.write(nearest, value);
    }
}
2. Reducer class
Reducer class : KMeansReducer.java
package com.clustering.mapreduce;

import java.io.IOException;
import java.util.LinkedList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.mapreduce.Reducer;

import com.clustering.model.ClusterCenter;
import com.clustering.model.Vector;

public class KMeansReducer extends
        Reducer<ClusterCenter, Vector, ClusterCenter, Vector> {

    public static enum Counter {
        CONVERGED
    }

    List<ClusterCenter> centers = new LinkedList<ClusterCenter>();

    protected void reduce(ClusterCenter key, Iterable<Vector> values,
            Context context) throws IOException, InterruptedException {
        // Recompute the center as the component-wise mean of all vectors
        // assigned to this cluster.
        Vector newCenter = new Vector();
        List<Vector> vectorList = new LinkedList<Vector>();
        int vectorSize = key.getCenter().getVector().length;
        newCenter.setVector(new double[vectorSize]);
        for (Vector value : values) {
            vectorList.add(new Vector(value));
            for (int i = 0; i < value.getVector().length; i++) {
                newCenter.getVector()[i] += value.getVector()[i];
            }
        }
        for (int i = 0; i < newCenter.getVector().length; i++) {
            newCenter.getVector()[i] = newCenter.getVector()[i]
                    / vectorList.size();
        }
        ClusterCenter center = new ClusterCenter(newCenter);
        centers.add(center);
        for (Vector vector : vectorList) {
            context.write(center, vector);
        }
        // Increment the counter when the center moved, i.e. this cluster
        // has not converged yet (see ClusterCenter.converged).
        if (center.converged(key))
            context.getCounter(Counter.CONVERGED).increment(1);
    }

    protected void cleanup(Context context) throws IOException,
            InterruptedException {
        // Overwrite the centroid file with the updated centers so the
        // next iteration's mappers read the new values.
        super.cleanup(context);
        Configuration conf = context.getConfiguration();
        Path outPath = new Path(conf.get("centroid.path"));
        FileSystem fs = FileSystem.get(conf);
        fs.delete(outPath, true);
        final SequenceFile.Writer out = SequenceFile.createWriter(fs,
                context.getConfiguration(), outPath, ClusterCenter.class,
                IntWritable.class);
        final IntWritable value = new IntWritable(0);
        for (ClusterCenter center : centers) {
            out.append(center, value);
        }
        out.close();
    }
}
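Note that this job runs with a single reduce task (the Hadoop default), so all cluster centers pass through one reducer and rewriting the centroid file in cleanup() is safe; with multiple reducers, each task would delete and overwrite the shared centroid file.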
3. Main class
Main class : KMeansClusteringJob.java
package com.clustering.mapreduce;

import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

import com.clustering.model.ClusterCenter;
import com.clustering.model.Vector;

public class KMeansClusteringJob {

    private static final Log LOG = LogFactory.getLog(KMeansClusteringJob.class);

    public static void main(String[] args) throws IOException,
            InterruptedException, ClassNotFoundException {
        int iteration = 1;
        Configuration conf = new Configuration();
        conf.set("num.iteration", iteration + "");
        Path in = new Path("files/clustering/import/data");
        Path center = new Path("files/clustering/import/center/cen.seq");
        conf.set("centroid.path", center.toString());
        Path out = new Path("files/clustering/depth_1");

        Job job = new Job(conf);
        job.setJobName("KMeans Clustering");
        job.setMapperClass(KMeansMapper.class);
        job.setReducerClass(KMeansReducer.class);
        job.setJarByClass(KMeansMapper.class);

        SequenceFileInputFormat.addInputPath(job, in);
        FileSystem fs = FileSystem.get(conf);
        // Clean up any leftovers from a previous run so the writers
        // below can recreate the input and centroid files.
        if (fs.exists(out))
            fs.delete(out, true);
        if (fs.exists(center))
            fs.delete(center, true);
        if (fs.exists(in))
            fs.delete(in, true);

        // Write the two initial cluster centers (the K-center vectors).
        final SequenceFile.Writer centerWriter = SequenceFile.createWriter(fs,
                conf, center, ClusterCenter.class, IntWritable.class);
        final IntWritable value = new IntWritable(0);
        centerWriter.append(new ClusterCenter(new Vector(1, 1)), value);
        centerWriter.append(new ClusterCenter(new Vector(5, 5)), value);
        centerWriter.close();

        // Write the input vectors; the key is a dummy (0, 0) center.
        final SequenceFile.Writer dataWriter = SequenceFile.createWriter(fs,
                conf, in, ClusterCenter.class, Vector.class);
        dataWriter.append(new ClusterCenter(new Vector(0, 0)), new Vector(1, 2));
        dataWriter.append(new ClusterCenter(new Vector(0, 0)), new Vector(16, 3));
        dataWriter.append(new ClusterCenter(new Vector(0, 0)), new Vector(3, 3));
        dataWriter.append(new ClusterCenter(new Vector(0, 0)), new Vector(2, 2));
        dataWriter.append(new ClusterCenter(new Vector(0, 0)), new Vector(2, 3));
        dataWriter.append(new ClusterCenter(new Vector(0, 0)), new Vector(25, 1));
        dataWriter.append(new ClusterCenter(new Vector(0, 0)), new Vector(7, 6));
        dataWriter.append(new ClusterCenter(new Vector(0, 0)), new Vector(6, 5));
        dataWriter.append(new ClusterCenter(new Vector(0, 0)), new Vector(-1, -23));
        dataWriter.close();

        SequenceFileOutputFormat.setOutputPath(job, out);
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setOutputKeyClass(ClusterCenter.class);
        job.setOutputValueClass(Vector.class);
        job.waitForCompletion(true);

        // Keep iterating until no center moved in the last pass.
        long counter = job.getCounters()
                .findCounter(KMeansReducer.Counter.CONVERGED).getValue();
        iteration++;
        while (counter > 0) {
            conf = new Configuration();
            conf.set("centroid.path", center.toString());
            conf.set("num.iteration", iteration + "");
            job = new Job(conf);
            job.setJobName("KMeans Clustering " + iteration);
            job.setMapperClass(KMeansMapper.class);
            job.setReducerClass(KMeansReducer.class);
            job.setJarByClass(KMeansMapper.class);
            in = new Path("files/clustering/depth_" + (iteration - 1) + "/");
            out = new Path("files/clustering/depth_" + iteration);
            SequenceFileInputFormat.addInputPath(job, in);
            if (fs.exists(out))
                fs.delete(out, true);
            SequenceFileOutputFormat.setOutputPath(job, out);
            job.setInputFormatClass(SequenceFileInputFormat.class);
            job.setOutputFormatClass(SequenceFileOutputFormat.class);
            job.setOutputKeyClass(ClusterCenter.class);
            job.setOutputValueClass(Vector.class);
            job.waitForCompletion(true);
            iteration++;
            counter = job.getCounters()
                    .findCounter(KMeansReducer.Counter.CONVERGED).getValue();
        }

        // Read back and log the final cluster assignments.
        Path result = new Path("files/clustering/depth_" + (iteration - 1) + "/");
        FileStatus[] stati = fs.listStatus(result);
        for (FileStatus status : stati) {
            if (!status.isDir() && !status.getPath().toString().contains("/_")) {
                Path path = status.getPath();
                LOG.info("FOUND " + path.toString());
                SequenceFile.Reader reader = new SequenceFile.Reader(fs, path,
                        conf);
                ClusterCenter key = new ClusterCenter();
                Vector v = new Vector();
                while (reader.next(key, v)) {
                    LOG.info(key + " / " + v);
                }
                reader.close();
            }
        }
    }
}
The final project listing will look like this:
Figure 6 File Listing for KMeansMapReduce Program
2.3 Run the program
Unlike the wordcount program, for which we have to prepare the input files, in the KMeansClustering program the input is defined inside the KMeansClusteringJob class.
To run the KMeansClustering job, since we have already configured Eclipse, we can run the program natively inside Eclipse: by pointing at the Main class (KMeansClusteringJob.java), we can run the project as a Hadoop Application.
Figure 7 Run Project as hadoop application
The input (defined inside the KMeansClusteringJob class):
Vector [vector=[16.0, 3.0]]
Vector [vector=[7.0, 6.0]]
Vector [vector=[6.0, 5.0]]
Vector [vector=[25.0, 1.0]]
Vector [vector=[1.0, 2.0]]
Vector [vector=[3.0, 3.0]]
Vector [vector=[2.0, 2.0]]
Vector [vector=[2.0, 3.0]]
Vector [vector=[-1.0, -23.0]]
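As a sanity check of the first assignment step: with the initial centers (1, 1) and (5, 5), the point (16, 3) has Manhattan distance |16-1| + |3-1| = 17 to the first center and |16-5| + |3-5| = 13 to the second, so the mapper assigns it to the (5, 5) cluster.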
Output from Thomas’s Blog :
ClusterCenter [center=Vector [vector=[13.5, 3.75]]] / Vector [vector=[16.0, 3.0]]
ClusterCenter [center=Vector [vector=[13.5, 3.75]]] / Vector [vector=[7.0, 6.0]]
ClusterCenter [center=Vector [vector=[13.5, 3.75]]] / Vector [vector=[6.0, 5.0]]
ClusterCenter [center=Vector [vector=[13.5, 3.75]]] / Vector [vector=[25.0, 1.0]]
ClusterCenter [center=Vector [vector=[1.4, -2.6]]] / Vector [vector=[1.0, 2.0]]
ClusterCenter [center=Vector [vector=[1.4, -2.6]]] / Vector [vector=[3.0, 3.0]]
ClusterCenter [center=Vector [vector=[1.4, -2.6]]] / Vector [vector=[2.0, 2.0]]
ClusterCenter [center=Vector [vector=[1.4, -2.6]]] / Vector [vector=[2.0, 3.0]]
ClusterCenter [center=Vector [vector=[1.4, -2.6]]] / Vector [vector=[-1.0, -23.0]]
Output of my KMeansClusteringJob :
file:/home/tid/eclipse/workspace/MRClustering/files/clustering/depth_3/part-r-00000
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector
[vector=[13.5, 3.75]]] / Vector [vector=[16.0, 3.0]]
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector
[vector=[13.5, 3.75]]] / Vector [vector=[7.0, 6.0]]
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector
[vector=[13.5, 3.75]]] / Vector [vector=[6.0, 5.0]]
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector
[vector=[13.5, 3.75]]] / Vector [vector=[25.0, 1.0]]
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector
[vector=[1.4, -2.6]]] / Vector [vector=[1.0, 2.0]]
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector
[vector=[1.4, -2.6]]] / Vector [vector=[3.0, 3.0]]
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector
[vector=[1.4, -2.6]]] / Vector [vector=[2.0, 2.0]]
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector
[vector=[1.4, -2.6]]] / Vector [vector=[2.0, 3.0]]
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector
[vector=[1.4, -2.6]]] / Vector [vector=[-1.0, -23.0]]
Complete Log Result of My KMeansClusteringJob
14/04/08 15:50:35 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your
platform... using builtin-java classes where applicable
14/04/08 15:50:35 INFO compress.CodecPool: Got brand-new compressor
14/04/08 15:50:35 WARN mapred.JobClient: Use GenericOptionsParser for parsing the
arguments. Applications should implement Tool for the same.
14/04/08 15:50:35 WARN mapred.JobClient: No job jar file set. User classes may not be
found. See JobConf(Class) or JobConf#setJar(String).
14/04/08 15:50:35 INFO input.FileInputFormat: Total input paths to process : 1
14/04/08 15:50:35 INFO mapred.JobClient: Running job: job_local1343624176_0001
14/04/08 15:50:35 INFO mapred.LocalJobRunner: Waiting for map tasks
14/04/08 15:50:35 INFO mapred.LocalJobRunner: Starting task:
attempt_local1343624176_0001_m_000000_0
14/04/08 15:50:36 INFO util.ProcessTree: setsid exited with exit code 0
14/04/08 15:50:36 INFO mapred.Task: Using ResourceCalculatorPlugin :
org.apache.hadoop.util.LinuxResourceCalculatorPlugin@24bec229
14/04/08 15:50:36 INFO mapred.MapTask: Processing split:
file:/home/tid/eclipse/workspace/MRClustering/files/clustering/import/data:0+558
14/04/08 15:50:36 INFO mapred.MapTask: io.sort.mb = 100
14/04/08 15:50:36 INFO mapred.MapTask: data buffer = 79691776/99614720
14/04/08 15:50:36 INFO mapred.MapTask: record buffer = 262144/327680
14/04/08 15:50:36 INFO compress.CodecPool: Got brand-new decompressor
14/04/08 15:50:36 INFO compress.CodecPool: Got brand-new decompressor
14/04/08 15:50:36 INFO mapred.MapTask: Starting flush of map output
14/04/08 15:50:36 INFO mapred.MapTask: Finished spill 0
14/04/08 15:50:36 INFO mapred.Task: Task:attempt_local1343624176_0001_m_000000_0 is done.
And is in the process of commiting
14/04/08 15:50:36 INFO mapred.LocalJobRunner:
14/04/08 15:50:36 INFO mapred.Task: Task 'attempt_local1343624176_0001_m_000000_0' done.
14/04/08 15:50:36 INFO mapred.LocalJobRunner: Finishing task:
attempt_local1343624176_0001_m_000000_0
14/04/08 15:50:36 INFO mapred.LocalJobRunner: Map task executor complete.
14/04/08 15:50:36 INFO mapred.Task: Using ResourceCalculatorPlugin :
org.apache.hadoop.util.LinuxResourceCalculatorPlugin@7b4b3d0e
14/04/08 15:50:36 INFO mapred.LocalJobRunner:
14/04/08 15:50:36 INFO mapred.Merger: Merging 1 sorted segments
14/04/08 15:50:36 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of
total size: 380 bytes
14/04/08 15:50:36 INFO mapred.LocalJobRunner:
14/04/08 15:50:36 INFO mapred.Task: Task:attempt_local1343624176_0001_r_000000_0 is done.
And is in the process of commiting
14/04/08 15:50:36 INFO mapred.LocalJobRunner:
14/04/08 15:50:36 INFO mapred.Task: Task attempt_local1343624176_0001_r_000000_0 is allowed
to commit now
14/04/08 15:50:36 INFO output.FileOutputCommitter: Saved output of task
'attempt_local1343624176_0001_r_000000_0' to files/clustering/depth_1
14/04/08 15:50:36 INFO mapred.LocalJobRunner: reduce > reduce
14/04/08 15:50:36 INFO mapred.Task: Task 'attempt_local1343624176_0001_r_000000_0' done.
14/04/08 15:50:36 INFO mapred.JobClient: map 100% reduce 100%
14/04/08 15:50:36 INFO mapred.JobClient: Job complete: job_local1343624176_0001
14/04/08 15:50:36 INFO mapred.JobClient: Counters: 21
14/04/08 15:50:36 INFO mapred.JobClient: File Output Format Counters
14/04/08 15:50:36 INFO mapred.JobClient: Bytes Written=537
14/04/08 15:50:36 INFO mapred.JobClient: File Input Format Counters
14/04/08 15:50:36 INFO mapred.JobClient: Bytes Read=574
14/04/08 15:50:36 INFO mapred.JobClient: FileSystemCounters
14/04/08 15:50:36 INFO mapred.JobClient: FILE_BYTES_READ=2380
14/04/08 15:50:36 INFO mapred.JobClient: FILE_BYTES_WRITTEN=106876
14/04/08 15:50:36 INFO mapred.JobClient: com.clustering.mapreduce.KMeansReducer$Counter
14/04/08 15:50:36 INFO mapred.JobClient: CONVERGED=2
14/04/08 15:50:36 INFO mapred.JobClient: Map-Reduce Framework
14/04/08 15:50:36 INFO mapred.JobClient: Reduce input groups=2
14/04/08 15:50:36 INFO mapred.JobClient: Map output materialized bytes=384
14/04/08 15:50:36 INFO mapred.JobClient: Combine output records=0
14/04/08 15:50:36 INFO mapred.JobClient: Map input records=9
14/04/08 15:50:36 INFO mapred.JobClient: Reduce shuffle bytes=0
14/04/08 15:50:36 INFO mapred.JobClient: Physical memory (bytes) snapshot=0
14/04/08 15:50:36 INFO mapred.JobClient: Reduce output records=9
14/04/08 15:50:36 INFO mapred.JobClient: Spilled Records=18
14/04/08 15:50:36 INFO mapred.JobClient: Map output bytes=360
14/04/08 15:50:36 INFO mapred.JobClient: Total committed heap usage (bytes)=355991552
14/04/08 15:50:36 INFO mapred.JobClient: CPU time spent (ms)=0
14/04/08 15:50:36 INFO mapred.JobClient: Virtual memory (bytes) snapshot=0
14/04/08 15:50:36 INFO mapred.JobClient: SPLIT_RAW_BYTES=139
14/04/08 15:50:36 INFO mapred.JobClient: Map output records=9
14/04/08 15:50:36 INFO mapred.JobClient: Combine input records=0
14/04/08 15:50:36 INFO mapred.JobClient: Reduce input records=9
14/04/08 15:50:36 WARN mapred.JobClient: Use GenericOptionsParser for parsing the
arguments. Applications should implement Tool for the same.
14/04/08 15:50:36 WARN mapred.JobClient: No job jar file set. User classes may not be
found. See JobConf(Class) or JobConf#setJar(String).
14/04/08 15:50:36 INFO input.FileInputFormat: Total input paths to process : 1
14/04/08 15:50:37 INFO mapred.JobClient: Running job: job_local1426850290_0002
14/04/08 15:50:37 INFO mapred.LocalJobRunner: Waiting for map tasks
14/04/08 15:50:37 INFO mapred.LocalJobRunner: Starting task:
attempt_local1426850290_0002_m_000000_0
14/04/08 15:50:37 INFO mapred.Task: Using ResourceCalculatorPlugin :
org.apache.hadoop.util.LinuxResourceCalculatorPlugin@26adcd34
14/04/08 15:50:37 INFO mapred.MapTask: Processing split:
file:/home/tid/eclipse/workspace/MRClustering/files/clustering/depth_1/part-r-00000:0+521
14/04/08 15:50:37 INFO mapred.MapTask: io.sort.mb = 100
14/04/08 15:50:37 INFO mapred.MapTask: data buffer = 79691776/99614720
14/04/08 15:50:37 INFO mapred.MapTask: record buffer = 262144/327680
14/04/08 15:50:37 INFO mapred.MapTask: Starting flush of map output
14/04/08 15:50:37 INFO mapred.MapTask: Finished spill 0
14/04/08 15:50:37 INFO mapred.Task: Task:attempt_local1426850290_0002_m_000000_0 is done.
And is in the process of commiting
14/04/08 15:50:37 INFO mapred.LocalJobRunner:
14/04/08 15:50:37 INFO mapred.Task: Task 'attempt_local1426850290_0002_m_000000_0' done.
14/04/08 15:50:37 INFO mapred.LocalJobRunner: Finishing task:
attempt_local1426850290_0002_m_000000_0
14/04/08 15:50:37 INFO mapred.LocalJobRunner: Map task executor complete.
14/04/08 15:50:37 INFO mapred.Task: Using ResourceCalculatorPlugin :
org.apache.hadoop.util.LinuxResourceCalculatorPlugin@3740f768
14/04/08 15:50:37 INFO mapred.LocalJobRunner:
14/04/08 15:50:37 INFO mapred.Merger: Merging 1 sorted segments
14/04/08 15:50:37 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of
total size: 380 bytes
14/04/08 15:50:37 INFO mapred.LocalJobRunner:
14/04/08 15:50:37 INFO mapred.Task: Task:attempt_local1426850290_0002_r_000000_0 is done.
And is in the process of commiting
14/04/08 15:50:37 INFO mapred.LocalJobRunner:
14/04/08 15:50:37 INFO mapred.Task: Task attempt_local1426850290_0002_r_000000_0 is allowed
to commit now
14/04/08 15:50:37 INFO output.FileOutputCommitter: Saved output of task
'attempt_local1426850290_0002_r_000000_0' to files/clustering/depth_2
14/04/08 15:50:37 INFO mapred.LocalJobRunner: reduce > reduce
14/04/08 15:50:37 INFO mapred.Task: Task 'attempt_local1426850290_0002_r_000000_0' done.
14/04/08 15:50:38 INFO mapred.JobClient: map 100% reduce 100%
14/04/08 15:50:38 INFO mapred.JobClient: Job complete: job_local1426850290_0002
14/04/08 15:50:38 INFO mapred.JobClient: Counters: 21
14/04/08 15:50:38 INFO mapred.JobClient: File Output Format Counters
14/04/08 15:50:38 INFO mapred.JobClient: Bytes Written=537
14/04/08 15:50:38 INFO mapred.JobClient: File Input Format Counters
14/04/08 15:50:38 INFO mapred.JobClient: Bytes Read=537
14/04/08 15:50:38 INFO mapred.JobClient: FileSystemCounters
14/04/08 15:50:38 INFO mapred.JobClient: FILE_BYTES_READ=5088
14/04/08 15:50:38 INFO mapred.JobClient: FILE_BYTES_WRITTEN=212938
14/04/08 15:50:38 INFO mapred.JobClient: com.clustering.mapreduce.KMeansReducer$Counter
14/04/08 15:50:38 INFO mapred.JobClient: CONVERGED=2
14/04/08 15:50:38 INFO mapred.JobClient: Map-Reduce Framework
14/04/08 15:50:38 INFO mapred.JobClient: Reduce input groups=2
14/04/08 15:50:38 INFO mapred.JobClient: Map output materialized bytes=384
14/04/08 15:50:38 INFO mapred.JobClient: Combine output records=0
14/04/08 15:50:38 INFO mapred.JobClient: Map input records=9
14/04/08 15:50:38 INFO mapred.JobClient: Reduce shuffle bytes=0
14/04/08 15:50:38 INFO mapred.JobClient: Physical memory (bytes) snapshot=0
14/04/08 15:50:38 INFO mapred.JobClient: Reduce output records=9
14/04/08 15:50:38 INFO mapred.JobClient: Spilled Records=18
14/04/08 15:50:38 INFO mapred.JobClient: Map output bytes=360
14/04/08 15:50:38 INFO mapred.JobClient: Total committed heap usage (bytes)=555352064
14/04/08 15:50:38 INFO mapred.JobClient: CPU time spent (ms)=0
14/04/08 15:50:38 INFO mapred.JobClient: Virtual memory (bytes) snapshot=0
14/04/08 15:50:38 INFO mapred.JobClient: SPLIT_RAW_BYTES=148
14/04/08 15:50:38 INFO mapred.JobClient: Map output records=9
14/04/08 15:50:38 INFO mapred.JobClient: Combine input records=0
14/04/08 15:50:38 INFO mapred.JobClient: Reduce input records=9
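The two WARN lines repeated at every job submission are advisory, not errors. A hedged sketch of one way to address both in the driver (the KMeansClusteringJob class name follows the log output; the tutorial's actual driver may be structured differently): implementing Tool lets ToolRunner parse the generic Hadoop options, and setJarByClass tells the framework which jar carries the user classes.

// Sketch only: addresses "Use GenericOptionsParser ..." via Tool/ToolRunner
// and "No job jar file set" via setJarByClass.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class KMeansClusteringJob extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Job job = new Job(getConf(), "KMeans Clustering");
        job.setJarByClass(KMeansClusteringJob.class); // silences the job-jar warning
        // ... mapper/reducer/input/output wiring elided ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner parses -D, -files, etc., addressing the first warning.
        System.exit(ToolRunner.run(new Configuration(), new KMeansClusteringJob(), args));
    }
}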
14/04/08 15:50:38 WARN mapred.JobClient: Use GenericOptionsParser for parsing the
arguments. Applications should implement Tool for the same.
14/04/08 15:50:38 WARN mapred.JobClient: No job jar file set. User classes may not be
found. See JobConf(Class) or JobConf#setJar(String).
14/04/08 15:50:38 INFO input.FileInputFormat: Total input paths to process : 1
14/04/08 15:50:38 INFO mapred.JobClient: Running job: job_local466621791_0003
14/04/08 15:50:38 INFO mapred.LocalJobRunner: Waiting for map tasks
14/04/08 15:50:38 INFO mapred.LocalJobRunner: Starting task:
attempt_local466621791_0003_m_000000_0
14/04/08 15:50:38 INFO mapred.Task: Using ResourceCalculatorPlugin :
org.apache.hadoop.util.LinuxResourceCalculatorPlugin@373fdd1a
14/04/08 15:50:38 INFO mapred.MapTask: Processing split:
file:/home/tid/eclipse/workspace/MRClustering/files/clustering/depth_2/part-r-00000:0+521
14/04/08 15:50:38 INFO mapred.MapTask: io.sort.mb = 100
14/04/08 15:50:38 INFO mapred.MapTask: data buffer = 79691776/99614720
14/04/08 15:50:38 INFO mapred.MapTask: record buffer = 262144/327680
14/04/08 15:50:38 INFO mapred.MapTask: Starting flush of map output
14/04/08 15:50:38 INFO mapred.MapTask: Finished spill 0
14/04/08 15:50:38 INFO mapred.Task: Task:attempt_local466621791_0003_m_000000_0 is done.
And is in the process of commiting
14/04/08 15:50:38 INFO mapred.LocalJobRunner:
14/04/08 15:50:38 INFO mapred.Task: Task 'attempt_local466621791_0003_m_000000_0' done.
14/04/08 15:50:38 INFO mapred.LocalJobRunner: Finishing task:
attempt_local466621791_0003_m_000000_0
14/04/08 15:50:38 INFO mapred.LocalJobRunner: Map task executor complete.
14/04/08 15:50:38 INFO mapred.Task: Using ResourceCalculatorPlugin :
org.apache.hadoop.util.LinuxResourceCalculatorPlugin@78d3cdb9
14/04/08 15:50:38 INFO mapred.LocalJobRunner:
14/04/08 15:50:38 INFO mapred.Merger: Merging 1 sorted segments
14/04/08 15:50:38 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of
total size: 380 bytes
14/04/08 15:50:38 INFO mapred.LocalJobRunner:
14/04/08 15:50:38 INFO mapred.Task: Task:attempt_local466621791_0003_r_000000_0 is done.
And is in the process of commiting
14/04/08 15:50:38 INFO mapred.LocalJobRunner:
14/04/08 15:50:38 INFO mapred.Task: Task attempt_local466621791_0003_r_000000_0 is allowed
to commit now
14/04/08 15:50:38 INFO output.FileOutputCommitter: Saved output of task
'attempt_local466621791_0003_r_000000_0' to files/clustering/depth_3
14/04/08 15:50:38 INFO mapred.LocalJobRunner: reduce > reduce
14/04/08 15:50:38 INFO mapred.Task: Task 'attempt_local466621791_0003_r_000000_0' done.
14/04/08 15:50:39 INFO mapred.JobClient: map 100% reduce 100%
14/04/08 15:50:39 INFO mapred.JobClient: Job complete: job_local466621791_0003
14/04/08 15:50:39 INFO mapred.JobClient: Counters: 20
14/04/08 15:50:39 INFO mapred.JobClient: File Output Format Counters
14/04/08 15:50:39 INFO mapred.JobClient: Bytes Written=537
14/04/08 15:50:39 INFO mapred.JobClient: File Input Format Counters
14/04/08 15:50:39 INFO mapred.JobClient: Bytes Read=537
14/04/08 15:50:39 INFO mapred.JobClient: FileSystemCounters
14/04/08 15:50:39 INFO mapred.JobClient: FILE_BYTES_READ=7796
14/04/08 15:50:39 INFO mapred.JobClient: FILE_BYTES_WRITTEN=318992
14/04/08 15:50:39 INFO mapred.JobClient: Map-Reduce Framework
14/04/08 15:50:39 INFO mapred.JobClient: Reduce input groups=2
14/04/08 15:50:39 INFO mapred.JobClient: Map output materialized bytes=384
14/04/08 15:50:39 INFO mapred.JobClient: Combine output records=0
14/04/08 15:50:39 INFO mapred.JobClient: Map input records=9
14/04/08 15:50:39 INFO mapred.JobClient: Reduce shuffle bytes=0
14/04/08 15:50:39 INFO mapred.JobClient: Physical memory (bytes) snapshot=0
14/04/08 15:50:39 INFO mapred.JobClient: Reduce output records=9
14/04/08 15:50:39 INFO mapred.JobClient: Spilled Records=18
14/04/08 15:50:39 INFO mapred.JobClient: Map output bytes=360
14/04/08 15:50:39 INFO mapred.JobClient: Total committed heap usage (bytes)=754712576
14/04/08 15:50:39 INFO mapred.JobClient: CPU time spent (ms)=0
14/04/08 15:50:39 INFO mapred.JobClient: Virtual memory (bytes) snapshot=0
14/04/08 15:50:39 INFO mapred.JobClient: SPLIT_RAW_BYTES=148
14/04/08 15:50:39 INFO mapred.JobClient: Map output records=9
14/04/08 15:50:39 INFO mapred.JobClient: Combine input records=0
14/04/08 15:50:39 INFO mapred.JobClient: Reduce input records=9
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: FOUND
file:/home/tid/eclipse/workspace/MRClustering/files/clustering/depth_3/part-r-00000
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector
[vector=[13.5, 3.75]]] / Vector [vector=[16.0, 3.0]]
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector
[vector=[13.5, 3.75]]] / Vector [vector=[7.0, 6.0]]
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector
[vector=[13.5, 3.75]]] / Vector [vector=[6.0, 5.0]]
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector
[vector=[13.5, 3.75]]] / Vector [vector=[25.0, 1.0]]
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector
[vector=[1.4, -2.6]]] / Vector [vector=[1.0, 2.0]]
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector
[vector=[1.4, -2.6]]] / Vector [vector=[3.0, 3.0]]
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector
[vector=[1.4, -2.6]]] / Vector [vector=[2.0, 2.0]]
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector
[vector=[1.4, -2.6]]] / Vector [vector=[2.0, 3.0]]
14/04/08 15:50:39 INFO mapreduce.KMeansClusteringJob: ClusterCenter [center=Vector
[vector=[1.4, -2.6]]] / Vector [vector=[-1.0, -23.0]]
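Reading the final output: the driver found the centers computed at depth_3 and printed each input vector with its assigned center, so the algorithm converged to two clusters, one around [13.5, 3.75] (four vectors) and one around [1.4, -2.6] (five vectors). Note how the iteration stops: the first two jobs report CONVERGED=2 under com.clustering.mapreduce.KMeansReducer$Counter, while the last job's dump has no such group (20 counters instead of 21), meaning no center moved in that pass. A minimal sketch of this counter-driven loop, assuming the enum and class names from the log (createJobForDepth is a hypothetical helper, not the tutorial's verbatim code):

// Sketch: the reducer bumps a counter whenever a recomputed center still
// moved, and the driver schedules another depth while the count is non-zero.
import org.apache.hadoop.mapreduce.Job;

public class KMeansReducer /* extends Reducer<ClusterCenter, Vector, ...> */ {
    // Appears in the log as com.clustering.mapreduce.KMeansReducer$Counter
    public enum Counter { CONVERGED }
    // In reduce(), after averaging a group's vectors into a new center:
    //   if (!newCenter.equals(oldCenter))
    //       context.getCounter(Counter.CONVERGED).increment(1);
}

class Driver {
    static void iterate() throws Exception {
        long moved = 1;
        int depth = 0;
        while (moved > 0) {
            depth++;
            Job job = createJobForDepth(depth); // hypothetical helper: wires mapper,
            job.waitForCompletion(true);        // reducer, and depth_N input/output paths
            moved = job.getCounters()
                       .findCounter(KMeansReducer.Counter.CONVERGED).getValue();
        }
    }

    static Job createJobForDepth(int depth) throws Exception {
        throw new UnsupportedOperationException("illustrative stub");
    }
}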
References
Hadoop installation on a single-node cluster - http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
KMeans clustering with MapReduce - http://codingwiththomas.blogspot.kr/2011/05/k-means-clustering-with-mapreduce.html