SlideShare a Scribd company logo
1 of 22
Download to read offline
k-means algorithm implementation on Hadoop
Athens University of Economics and Business
Dpt. Of Management Science and Technology
Prof. Damianos Chatziantoniou
| lkoutsokera@gmail.com
| stratos.gounidellis@gmail.com
Lamprini Koutsokera (8130074)
Stratos Gounidellis (8130029)
BDSMasters
Tools
2
Running Hadoop on Ubuntu Linux (Single-Node Cluster) [1]
Prerequisites
• jdk-8uversion-linux-x64.tar.gz (Download)
• Directory modification
• % tar zxvf jdk-8uversion-linux-x64.tar.gz
Hadoop system user
$ sudo addgroup hadoop $ sudo adduser --ingroup hadoop
hduser
Disabling IPv6
/etc/sysctl.conf
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
Configuring SSH
$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp
$ sudo chmod 750 /app/hadoop/tmp
$ ssh-keygen -t rsa -P ""
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
$ ssh localhost
Generate an SSH key for the hduser
hduser@ubuntu:~$ cat $HOME/.ssh/id_rsa.pub >>
$HOME/.ssh/authorized_keys
Create an RSA key pair with an empty password
References: [1], [2]
3
Running Hadoop on Ubuntu Linux (Single-Node Cluster) [2]
Hadoop configuration
$ cd /usr/local
$ sudo tar xzf hadoop-2.7.3.tar.gz
$ sudo mv hadoop-2.7.3 hadoop
$ sudo chown -R hduser:hadoop hadoop
Configuration
conf/core-site.xml
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp
$ sudo chmod 750 /app/hadoop/tmp
ownerships and permissions
export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64“
hadoop-env.sh
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
conf/mapred-site.xml
References: [1], [2]
4
Running Hadoop on Ubuntu Linux (Single-Node Cluster) [3]
hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when
the file is created.
The default is used if replication is not specified in create
time.
</description>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary
directories.</description>
</property>
</configuration>
<configuration>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>128</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>2048</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-vcores</name>
<value>1</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>2</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>4096</value>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>4</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
yarn-site.xml
References: [1], [2]
5
Running Hadoop on Ubuntu Linux (Single-Node Cluster) [4]
conf/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
</property>
</configuration>
hduser@ubuntu:~$ /usr/local/hadoop/bin/hadoop
namenode -format
Formatting the HDFS filesystem via the
NameNode
hduser@ubuntu:~$ /usr/local/hadoop/bin/start-
all.sh
Starting your single-node cluster
hduser@ubuntu:~$ /usr/local/hadoop/bin/stop-all.sh
Stopping your single-node cluster
References: [1], [2]
6
Setting up a Single Node Cluster - Hadoop Distributed File System (HDFS)
To access Hadoop WEB UI , we need to type
http://localhost:50070/ though our core-site.xml
that has as value the http://localhost:9000.
Our HDFS cluster consists of
a single NameNode, a master
server that manages the file system
namespace and regulates access
to files by clients. In addition, there are
a number of DataNodes, usually one
per node in the cluster, which manage
storage attached to the nodes that
they run on.
7
From hardware/software to algorithms
8
Data Generation
• Isotropic Gaussian blobs
• 2.000.000 points
• centers = [[25, 25], [-1, -1], [-25, -25]]
• cluster_std = 3.5
9
K-means++ : Calculation of initial centroids
Reference: [3] 10
K-means : Clustering algorithm
References: [4], [5] 11
K-means using Map Reduce
Do
Map
Input is a data point and k centers are broadcasted
Finds the closest center among k centers for the input point
Reduce
Input is one of k centers and all data points having this center as their closest center.
Calculates the new center using data points
Until all of new centers are not changed
Reference: [6] 12
K-means using Map Reduce [1]
k = 3
map
map
map
.
.
.
Centroids are broadcasted to
every map function
key value
[centroid1, point1]
[centroid2, point3]
[centroid3, point10]
[centroid2, point45]
. . . . . . . . . . . . . . . .
[centroidx, pointx]
Mapper
map
map
map
.
.
.
Reference: [6] 13
K-means using Map Reduce [2]
combine
combine
combine
.
.
.
key values
[centroid1, partialsum1]
[centroid2, partialsum1]
[centroid3, partialsum1]
[centroid2, partialsum2]
. . . . . . . . . . . . . . . .
[centroidx, partialsumx]
Combiner
key values
[centroid1, point1, point4]
[centroid2, point3, point7, point9]
[centroid3, point10]
[centroid2, point45, point 73]
. . . . . . . . . . . . . . . .
[centroid3, pointx, . . . ]
Reference: [6] 14
K-means using Map Reduce [3]
reduce
reduce
reduce
.
.
.
key value
[centroid1, centroid1’]
[centroid2, centroid2’]
[centroid3, centroid3’]
Reducer
key values
[centroid1, partialsum1, . . . ]
[centroid2, partialsum1, . . . ]
[centroid3, partialsum1, . . . ]
Reference: [6] 15
From algorithms to coding
16
Python Coding [1]
The initial task of the project is to generate a set of more than one million data points to
be used later as an input for the k-means clustering algorithm. Using this python script
three isotropic Gaussian blobs for clustering are generated. More specifically, the centers
are the following data points [25, 25], [-1, -1], [-25, -25].
The silhouette score constitutes a useful criterion for determining the proper number of
clusters. A silhouette close to 1 implies the datum is in an appropriate cluster, while a
silhouette close to −1 implies the datum is in the wrong cluster. The specific python script
calculates the silhouette score for different numbers of clusters ranging from 2 to 6.
17
Python Coding [2]
18
In order to implement k-means algorithm on hadoop mrjob is used. Mrjob is a python
package, which allows to write multi-step MapReduce jobs in pure Python and run them on
a hadoop cluster. In our case mrjob run on a single-node cluster.
• The mapper function returns each data point and the cluster, to which it belongs.
• The combiner function returns partial sums of batches of data points belonging to the
same cluster.
• The reducer returns the new centroids of each cluster.
• If the centroids remain unchanged the algorithm terminates. Otherwise, the steps are
repeated from the beginning.
This python script calls the k-means algorithm implemented on hadoop. However, before
implementing k-means the initial centroids are computed using the k-means++ algorithm
proposed in 2007 by Arthur and Vassilvitskii.
Coding Running [1]
19
Coding Running [2]
kmeans
20
Coding Running [3]
21
References
[1]: Noll, M. Running Hadoop On Ubuntu Linux (Single-Node Cluster) - Michael G. Noll. [online]
Available at: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/ [Accessed
29 Mar. 2017].
[2]: Stackoverflow.com. (2017). Error launching job using mrjob on Hadoop. [online]
Available at: http://stackoverflow.com/questions/25358793/error-launching-job-using-mrjob-on-hadoop [Accessed
29 Mar. 2017].
[3]: David Arthur, and Sergei Vassilvitskii, (2007). k-means++: the advantages of careful seeding - Proceedings of
the eighteenth annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, LA, January 7-9, 2007. 1st
ed. New York: ACM, pp.1027–1035.
[4]: Nlp.stanford.edu. K-means. Available at: http://nlp.stanford.edu/IR-book/html/htmledition/k-means-1.html
[Accessed 15 Mar. 2017]
[5]: Home.deib.polimi.it. (n.d.). Clustering - K-means. [online] Available at:
http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/kmeans.html [Accessed 8 Mar. 2017].
[6]: Kyuseok Shim, "MapReduce Algorithms for Big Data Analysis", VLDB Conference, 2012
| lkoutsokera@gmail.com
| stratos.gounidellis@gmail.com
Lamprini Koutsokera (8130074)
Stratos Gounidellis (8130029)
BDSMasters 22

More Related Content

What's hot

Meet the Experts: Visualize Your Time-Stamped Data Using the React-Based Gira...
Meet the Experts: Visualize Your Time-Stamped Data Using the React-Based Gira...Meet the Experts: Visualize Your Time-Stamped Data Using the React-Based Gira...
Meet the Experts: Visualize Your Time-Stamped Data Using the React-Based Gira...InfluxData
 
RDO hangout on gnocchi
RDO hangout on gnocchiRDO hangout on gnocchi
RDO hangout on gnocchiEoghan Glynn
 
CloudClustering: Toward a scalable machine learning toolkit for Windows Azure
CloudClustering: Toward a scalable machine learning toolkit for Windows AzureCloudClustering: Toward a scalable machine learning toolkit for Windows Azure
CloudClustering: Toward a scalable machine learning toolkit for Windows AzureAnkur Dave
 
Weather of the Century: Design and Performance
Weather of the Century: Design and PerformanceWeather of the Century: Design and Performance
Weather of the Century: Design and PerformanceMongoDB
 
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...Spark Summit
 
The Weather of the Century
The Weather of the CenturyThe Weather of the Century
The Weather of the CenturyMongoDB
 
The Weather of the Century Part 3: Visualization
The Weather of the Century Part 3: VisualizationThe Weather of the Century Part 3: Visualization
The Weather of the Century Part 3: VisualizationMongoDB
 
Anais Dotis-Georgiou [InfluxData] | Learn Flux by Example | InfluxDays NA 2021
Anais Dotis-Georgiou [InfluxData] | Learn Flux by Example | InfluxDays NA 2021Anais Dotis-Georgiou [InfluxData] | Learn Flux by Example | InfluxDays NA 2021
Anais Dotis-Georgiou [InfluxData] | Learn Flux by Example | InfluxDays NA 2021InfluxData
 
Monitoring Postgres at Scale | PostgresConf US 2018 | Lukas Fittl
Monitoring Postgres at Scale | PostgresConf US 2018 | Lukas FittlMonitoring Postgres at Scale | PostgresConf US 2018 | Lukas Fittl
Monitoring Postgres at Scale | PostgresConf US 2018 | Lukas FittlCitus Data
 
[241]large scale search with polysemous codes
[241]large scale search with polysemous codes[241]large scale search with polysemous codes
[241]large scale search with polysemous codesNAVER D2
 
Big Data Solutions for the Climate Community
Big Data Solutions for the Climate CommunityBig Data Solutions for the Climate Community
Big Data Solutions for the Climate CommunityEUDAT
 
Vasia Kalavri – Training: Gelly School
Vasia Kalavri – Training: Gelly School Vasia Kalavri – Training: Gelly School
Vasia Kalavri – Training: Gelly School Flink Forward
 
From Trill to Quill: Pushing the Envelope of Functionality and Scale
From Trill to Quill: Pushing the Envelope of Functionality and ScaleFrom Trill to Quill: Pushing the Envelope of Functionality and Scale
From Trill to Quill: Pushing the Envelope of Functionality and ScaleBadrish Chandramouli
 
Storing metrics at scale with Gnocchi
Storing metrics at scale with GnocchiStoring metrics at scale with Gnocchi
Storing metrics at scale with GnocchiGordon Chung
 
Distributing Queries the Citus Way | PostgresConf US 2018 | Marco Slot
Distributing Queries the Citus Way | PostgresConf US 2018 | Marco SlotDistributing Queries the Citus Way | PostgresConf US 2018 | Marco Slot
Distributing Queries the Citus Way | PostgresConf US 2018 | Marco SlotCitus Data
 
Dangerous on ClickHouse in 30 minutes, by Robert Hodges, Altinity CEO
Dangerous on ClickHouse in 30 minutes, by Robert Hodges, Altinity CEODangerous on ClickHouse in 30 minutes, by Robert Hodges, Altinity CEO
Dangerous on ClickHouse in 30 minutes, by Robert Hodges, Altinity CEOAltinity Ltd
 
Virtual Knowledge Graphs for Federated Log Analysis
Virtual Knowledge Graphs for Federated Log AnalysisVirtual Knowledge Graphs for Federated Log Analysis
Virtual Knowledge Graphs for Federated Log AnalysisKabul Kurniawan
 
A Deeper Dive into EXPLAIN
A Deeper Dive into EXPLAINA Deeper Dive into EXPLAIN
A Deeper Dive into EXPLAINEDB
 

What's hot (20)

Meet the Experts: Visualize Your Time-Stamped Data Using the React-Based Gira...
Meet the Experts: Visualize Your Time-Stamped Data Using the React-Based Gira...Meet the Experts: Visualize Your Time-Stamped Data Using the React-Based Gira...
Meet the Experts: Visualize Your Time-Stamped Data Using the React-Based Gira...
 
RDO hangout on gnocchi
RDO hangout on gnocchiRDO hangout on gnocchi
RDO hangout on gnocchi
 
CloudClustering: Toward a scalable machine learning toolkit for Windows Azure
CloudClustering: Toward a scalable machine learning toolkit for Windows AzureCloudClustering: Toward a scalable machine learning toolkit for Windows Azure
CloudClustering: Toward a scalable machine learning toolkit for Windows Azure
 
Weather of the Century: Design and Performance
Weather of the Century: Design and PerformanceWeather of the Century: Design and Performance
Weather of the Century: Design and Performance
 
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
 
The Weather of the Century
The Weather of the CenturyThe Weather of the Century
The Weather of the Century
 
The Weather of the Century Part 3: Visualization
The Weather of the Century Part 3: VisualizationThe Weather of the Century Part 3: Visualization
The Weather of the Century Part 3: Visualization
 
Anais Dotis-Georgiou [InfluxData] | Learn Flux by Example | InfluxDays NA 2021
Anais Dotis-Georgiou [InfluxData] | Learn Flux by Example | InfluxDays NA 2021Anais Dotis-Georgiou [InfluxData] | Learn Flux by Example | InfluxDays NA 2021
Anais Dotis-Georgiou [InfluxData] | Learn Flux by Example | InfluxDays NA 2021
 
Monitoring Postgres at Scale | PostgresConf US 2018 | Lukas Fittl
Monitoring Postgres at Scale | PostgresConf US 2018 | Lukas FittlMonitoring Postgres at Scale | PostgresConf US 2018 | Lukas Fittl
Monitoring Postgres at Scale | PostgresConf US 2018 | Lukas Fittl
 
[241]large scale search with polysemous codes
[241]large scale search with polysemous codes[241]large scale search with polysemous codes
[241]large scale search with polysemous codes
 
Big Data Solutions for the Climate Community
Big Data Solutions for the Climate CommunityBig Data Solutions for the Climate Community
Big Data Solutions for the Climate Community
 
Vasia Kalavri – Training: Gelly School
Vasia Kalavri – Training: Gelly School Vasia Kalavri – Training: Gelly School
Vasia Kalavri – Training: Gelly School
 
From Trill to Quill: Pushing the Envelope of Functionality and Scale
From Trill to Quill: Pushing the Envelope of Functionality and ScaleFrom Trill to Quill: Pushing the Envelope of Functionality and Scale
From Trill to Quill: Pushing the Envelope of Functionality and Scale
 
Ac cuda c_5
Ac cuda c_5Ac cuda c_5
Ac cuda c_5
 
Storing metrics at scale with Gnocchi
Storing metrics at scale with GnocchiStoring metrics at scale with Gnocchi
Storing metrics at scale with Gnocchi
 
Distributing Queries the Citus Way | PostgresConf US 2018 | Marco Slot
Distributing Queries the Citus Way | PostgresConf US 2018 | Marco SlotDistributing Queries the Citus Way | PostgresConf US 2018 | Marco Slot
Distributing Queries the Citus Way | PostgresConf US 2018 | Marco Slot
 
Gnocchi v3
Gnocchi v3Gnocchi v3
Gnocchi v3
 
Dangerous on ClickHouse in 30 minutes, by Robert Hodges, Altinity CEO
Dangerous on ClickHouse in 30 minutes, by Robert Hodges, Altinity CEODangerous on ClickHouse in 30 minutes, by Robert Hodges, Altinity CEO
Dangerous on ClickHouse in 30 minutes, by Robert Hodges, Altinity CEO
 
Virtual Knowledge Graphs for Federated Log Analysis
Virtual Knowledge Graphs for Federated Log AnalysisVirtual Knowledge Graphs for Federated Log Analysis
Virtual Knowledge Graphs for Federated Log Analysis
 
A Deeper Dive into EXPLAIN
A Deeper Dive into EXPLAINA Deeper Dive into EXPLAIN
A Deeper Dive into EXPLAIN
 

Similar to k-means algorithm implementation on Hadoop

Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...Codemotion
 
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...Codemotion
 
Hadoop installation with an example
Hadoop installation with an exampleHadoop installation with an example
Hadoop installation with an exampleNikita Kesharwani
 
Hadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesHadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesKelly Technologies
 
Cascading talk in Etsy (http://www.meetup.com/cascading/events/169390262/)
Cascading talk in Etsy (http://www.meetup.com/cascading/events/169390262/)Cascading talk in Etsy (http://www.meetup.com/cascading/events/169390262/)
Cascading talk in Etsy (http://www.meetup.com/cascading/events/169390262/)Jyotirmoy Sundi
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersXiao Qin
 
Enhancing Performance and Fault Tolerance of Hadoop Cluster
Enhancing Performance and Fault Tolerance of Hadoop ClusterEnhancing Performance and Fault Tolerance of Hadoop Cluster
Enhancing Performance and Fault Tolerance of Hadoop ClusterIRJET Journal
 
Presentation sreenu dwh-services
Presentation sreenu dwh-servicesPresentation sreenu dwh-services
Presentation sreenu dwh-servicesSreenu Musham
 
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDSDistributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDSPeterAndreasEntschev
 
A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptSanket Shikhar
 
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on KubernetesApache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on KubernetesDataWorks Summit
 
Hadoop trainting-in-hyderabad@kelly technologies
Hadoop trainting-in-hyderabad@kelly technologiesHadoop trainting-in-hyderabad@kelly technologies
Hadoop trainting-in-hyderabad@kelly technologiesKelly Technologies
 
NYAI - Scaling Machine Learning Applications by Braxton McKee
NYAI - Scaling Machine Learning Applications by Braxton McKeeNYAI - Scaling Machine Learning Applications by Braxton McKee
NYAI - Scaling Machine Learning Applications by Braxton McKeeRizwan Habib
 
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16MLconf
 
Hadoop institutes-in-bangalore
Hadoop institutes-in-bangaloreHadoop institutes-in-bangalore
Hadoop institutes-in-bangaloreKelly Technologies
 

Similar to k-means algorithm implementation on Hadoop (20)

Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
 
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
 
Hadoop installation with an example
Hadoop installation with an exampleHadoop installation with an example
Hadoop installation with an example
 
Data Science
Data ScienceData Science
Data Science
 
Hadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesHadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologies
 
Cascading talk in Etsy (http://www.meetup.com/cascading/events/169390262/)
Cascading talk in Etsy (http://www.meetup.com/cascading/events/169390262/)Cascading talk in Etsy (http://www.meetup.com/cascading/events/169390262/)
Cascading talk in Etsy (http://www.meetup.com/cascading/events/169390262/)
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
 
Enhancing Performance and Fault Tolerance of Hadoop Cluster
Enhancing Performance and Fault Tolerance of Hadoop ClusterEnhancing Performance and Fault Tolerance of Hadoop Cluster
Enhancing Performance and Fault Tolerance of Hadoop Cluster
 
H04502048051
H04502048051H04502048051
H04502048051
 
Presentation sreenu dwh-services
Presentation sreenu dwh-servicesPresentation sreenu dwh-services
Presentation sreenu dwh-services
 
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDSDistributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDS
 
A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.ppt
 
Hadoop
HadoopHadoop
Hadoop
 
Handout3o
Handout3oHandout3o
Handout3o
 
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on KubernetesApache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
 
Hadoop trainting-in-hyderabad@kelly technologies
Hadoop trainting-in-hyderabad@kelly technologiesHadoop trainting-in-hyderabad@kelly technologies
Hadoop trainting-in-hyderabad@kelly technologies
 
NYAI - Scaling Machine Learning Applications by Braxton McKee
NYAI - Scaling Machine Learning Applications by Braxton McKeeNYAI - Scaling Machine Learning Applications by Braxton McKee
NYAI - Scaling Machine Learning Applications by Braxton McKee
 
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16
 
NodeJS for Beginner
NodeJS for BeginnerNodeJS for Beginner
NodeJS for Beginner
 
Hadoop institutes-in-bangalore
Hadoop institutes-in-bangaloreHadoop institutes-in-bangalore
Hadoop institutes-in-bangalore
 

Recently uploaded

B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...Suhani Kapoor
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
Data Warehouse , Data Cube Computation
Data Warehouse   , Data Cube ComputationData Warehouse   , Data Cube Computation
Data Warehouse , Data Cube Computationsit20ad004
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Service
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts ServiceCall Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Service
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Servicejennyeacort
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...Pooja Nehwal
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxFurkanTasci3
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 

Recently uploaded (20)

B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
Data Warehouse , Data Cube Computation
Data Warehouse   , Data Cube ComputationData Warehouse   , Data Cube Computation
Data Warehouse , Data Cube Computation
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Service
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts ServiceCall Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Service
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Service
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptx
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 

k-means algorithm implementation on Hadoop

  • 1. k-means algorithm implementation on Hadoop Athens University of Economics and Business Dpt. Of Management Science and Technology Prof. Damianos Chatziantoniou | lkoutsokera@gmail.com | stratos.gounidellis@gmail.com Lamprini Koutsokera (8130074) Stratos Gounidellis (8130029) BDSMasters
  • 3. Running Hadoop on Ubuntu Linux (Single-Node Cluster) [1] Prerequisites • jdk-8uversion-linux-x64.tar.gz (Download) • Directory modification • % tar zxvf jdk-8uversion-linux-x64.tar.gz Hadoop system user $ sudo addgroup hadoop $ sudo adduser --ingroup hadoop hduser Disabling IPv6 /etc/sysctl.conf net.ipv6.conf.all.disable_ipv6 = 1 net.ipv6.conf.default.disable_ipv6 = 1 net.ipv6.conf.lo.disable_ipv6 = 1 Configuring SSH $ sudo mkdir -p /app/hadoop/tmp $ sudo chown hduser:hadoop /app/hadoop/tmp $ sudo chmod 750 /app/hadoop/tmp $ ssh-keygen -t rsa -P "" $ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys $ ssh localhost Generate an SSH key for the hduser hduser@ubuntu:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys Create an RSA key pair with an empty password References: [1], [2] 3
  • 4. Running Hadoop on Ubuntu Linux (Single-Node Cluster) [2] Hadoop configuration $ cd /usr/local $ sudo tar xzf hadoop-2.7.3.tar.gz $ sudo mv hadoop-2.7.3 hadoop $ sudo chown -R hduser:hadoop hadoop Configuration conf/core-site.xml <configuration> <property> <name>hadoop.tmp.dir</name> <value>/app/hadoop/tmp</value> </property> <property> <name>fs.defaultFS</name> <value>hdfs://localhost:9000</value> </property> </configuration> $ sudo mkdir -p /app/hadoop/tmp $ sudo chown hduser:hadoop /app/hadoop/tmp $ sudo chmod 750 /app/hadoop/tmp ownerships and permissions export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64“ hadoop-env.sh <configuration> <property> <name>mapreduce.framework.name</name> <value>yarn</value> </property> </configuration> conf/mapred-site.xml References: [1], [2] 4
  • 5. Running Hadoop on Ubuntu Linux (Single-Node Cluster) [3] hdfs-site.xml <configuration> <property> <name>dfs.replication</name> <value>1</value> <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time. </description> </property> <property> <name>hadoop.tmp.dir</name> <value>/app/hadoop/tmp</value> <description>A base for other temporary directories.</description> </property> </configuration> <configuration> <property> <name>yarn.scheduler.minimum-allocation-mb</name> <value>128</value> </property> <property> <name>yarn.scheduler.maximum-allocation-mb</name> <value>2048</value> </property> <property> <name>yarn.scheduler.minimum-allocation-vcores</name> <value>1</value> </property> <property> <name>yarn.scheduler.maximum-allocation-vcores</name> <value>2</value> </property> <property> <name>yarn.nodemanager.resource.memory-mb</name> <value>4096</value> </property> <property> <name>yarn.nodemanager.resource.cpu-vcores</name> <value>4</value> </property> <property> <name>yarn.nodemanager.aux-services</name> <value>mapreduce_shuffle</value> </property> </configuration> yarn-site.xml References: [1], [2] 5
  • 6. Running Hadoop on Ubuntu Linux (Single-Node Cluster) [4] conf/hdfs-site.xml <configuration> <property> <name>dfs.replication</name> <value>1</value> </property> <property> <name>hadoop.tmp.dir</name> <value>/app/hadoop/tmp</value> </property> </configuration> hduser@ubuntu:~$ /usr/local/hadoop/bin/hadoop namenode -format Formatting the HDFS filesystem via the NameNode hduser@ubuntu:~$ /usr/local/hadoop/bin/start- all.sh Starting your single-node cluster hduser@ubuntu:~$ /usr/local/hadoop/bin/stop-all.sh Stopping your single-node cluster References: [1], [2] 6
  • 7. Setting up a Single Node Cluster - Hadoop Distributed File System (HDFS) To access Hadoop WEB UI , we need to type http://localhost:50070/ though our core-site.xml that has as value the http://localhost:9000. Our HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. 7
  • 9. Data Generation • Isotropic Gaussian blobs • 2.000.000 points • centers = [[25, 25], [-1, -1], [-25, -25]] • cluster_std = 3.5 9
  • 10. K-means++ : Calculation of initial centroids Reference: [3] 10
  • 11. K-means : Clustering algorithm References: [4], [5] 11
  • 12. K-means using Map Reduce Do Map Input is a data point and k centers are broadcasted Finds the closest center among k centers for the input point Reduce Input is one of k centers and all data points having this center as their closest center. Calculates the new center using data points Until all of new centers are not changed Reference: [6] 12
  • 13. K-means using Map Reduce [1] k = 3 map map map . . . Centroids are broadcasted to every map function key value [centroid1, point1] [centroid2, point3] [centroid3, point10] [centroid2, point45] . . . . . . . . . . . . . . . . [centroidx, pointx] Mapper map map map . . . Reference: [6] 13
  • 14. K-means using Map Reduce [2] combine combine combine . . . key values [centroid1, partialsum1] [centroid2, partialsum1] [centroid3, partialsum1] [centroid2, partialsum2] . . . . . . . . . . . . . . . . [centroidx, partialsumx] Combiner key values [centroid1, point1, point4] [centroid2, point3, point7, point9] [centroid3, point10] [centroid2, point45, point 73] . . . . . . . . . . . . . . . . [centroid3, pointx, . . . ] Reference: [6] 14
  • 15. K-means using Map Reduce [3] reduce reduce reduce . . . key value [centroid1, centroid1’] [centroid2, centroid2’] [centroid3, centroid3’] Reducer key values [centroid1, partialsum1, . . . ] [centroid2, partialsum1, . . . ] [centroid3, partialsum1, . . . ] Reference: [6] 15
  • 16. From algorithms to coding 16
  • 17. Python Coding [1] The initial task of the project is to generate a set of more than one million data points to be used later as an input for the k-means clustering algorithm. Using this python script three isotropic Gaussian blobs for clustering are generated. More specifically, the centers are the following data points [25, 25], [-1, -1], [-25, -25]. The silhouette score constitutes a useful criterion for determining the proper number of clusters. A silhouette close to 1 implies the datum is in an appropriate cluster, while a silhouette close to −1 implies the datum is in the wrong cluster. The specific python script calculates the silhouette score for different numbers of clusters ranging from 2 to 6. 17
  • 18. Python Coding [2] 18 In order to implement k-means algorithm on hadoop mrjob is used. Mrjob is a python package, which allows to write multi-step MapReduce jobs in pure Python and run them on a hadoop cluster. In our case mrjob run on a single-node cluster. • The mapper function returns each data point and the cluster, to which it belongs. • The combiner function returns partial sums of batches of data points belonging to the same cluster. • The reducer returns the new centroids of each cluster. • If the centroids remain unchanged the algorithm terminates. Otherwise, the steps are repeated from the beginning. This python script calls the k-means algorithm implemented on hadoop. However, before implementing k-means the initial centroids are computed using the k-means++ algorithm proposed in 2007 by Arthur and Vassilvitskii.
  • 22. References [1]: Noll, M. Running Hadoop On Ubuntu Linux (Single-Node Cluster) - Michael G. Noll. [online] Available at: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/ [Accessed 29 Mar. 2017]. [2]: Stackoverflow.com. (2017). Error launching job using mrjob on Hadoop. [online] Available at: http://stackoverflow.com/questions/25358793/error-launching-job-using-mrjob-on-hadoop [Accessed 29 Mar. 2017]. [3]: David Arthur, and Sergei Vassilvitskii, (2007). k-means++: the advantages of careful seeding - Proceedings of the eighteenth annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, LA, January 7-9, 2007. 1st ed. New York: ACM, pp.1027–1035. [4]: Nlp.stanford.edu. K-means. Available at: http://nlp.stanford.edu/IR-book/html/htmledition/k-means-1.html [Accessed 15 Mar. 2017] [5]: Home.deib.polimi.it. (n.d.). Clustering - K-means. [online] Available at: http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/kmeans.html [Accessed 8 Mar. 2017]. [6]: Kyuseok Shim, "MapReduce Algorithms for Big Data Analysis", VLDB Conference, 2012 | lkoutsokera@gmail.com | stratos.gounidellis@gmail.com Lamprini Koutsokera (8130074) Stratos Gounidellis (8130029) BDSMasters 22