SlideShare a Scribd company logo
1 of 22
Download to read offline
k-means algorithm implementation on Hadoop
Athens University of Economics and Business
Dpt. Of Management Science and Technology
Prof. Damianos Chatziantoniou
| lkoutsokera@gmail.com
| stratos.gounidellis@gmail.com
Lamprini Koutsokera
Stratos Gounidellis
BDSMasters
Tools
2
Running Hadoop on Ubuntu Linux (Single-Node Cluster) [1]
Prerequisites
• jdk-8uversion-linux-x64.tar.gz (Download)
• Directory modification
• % tar zxvf jdk-8uversion-linux-x64.tar.gz
Hadoop system user
$ sudo addgroup hadoop $ sudo adduser --ingroup hadoop
hduser
Disabling IPv6
/etc/sysctl.conf
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
Configuring SSH
$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp
$ sudo chmod 750 /app/hadoop/tmp
$ ssh-keygen -t rsa -P ""
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
$ ssh localhost
Generate an SSH key for the hduser
hduser@ubuntu:~$ cat $HOME/.ssh/id_rsa.pub >>
$HOME/.ssh/authorized_keys
Create an RSA key pair with an empty password
References: [1], [2]
3
Running Hadoop on Ubuntu Linux (Single-Node Cluster) [2]
Hadoop configuration
$ cd /usr/local
$ sudo tar xzf hadoop-2.7.3.tar.gz
$ sudo mv hadoop-2.7.3 hadoop
$ sudo chown -R hduser:hadoop hadoop
Configuration
conf/core-site.xml
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp
$ sudo chmod 750 /app/hadoop/tmp
ownerships and permissions
export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64“
hadoop-env.sh
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
conf/mapred-site.xml
References: [1], [2]
4
Running Hadoop on Ubuntu Linux (Single-Node Cluster) [3]
hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when
the file is created.
The default is used if replication is not specified in create
time.
</description>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary
directories.</description>
</property>
</configuration>
<configuration>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>128</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>2048</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-vcores</name>
<value>1</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>2</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>4096</value>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>4</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
yarn-site.xml
References: [1], [2]
5
Running Hadoop on Ubuntu Linux (Single-Node Cluster) [4]
conf/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
</property>
</configuration>
hduser@ubuntu:~$ /usr/local/hadoop/bin/hadoop
namenode -format
Formatting the HDFS filesystem via the
NameNode
hduser@ubuntu:~$ /usr/local/hadoop/bin/start-
all.sh
Starting your single-node cluster
hduser@ubuntu:~$ /usr/local/hadoop/bin/stop-all.sh
Stopping your single-node cluster
References: [1], [2]
6
Setting up a Single Node Cluster - Hadoop Distributed File System (HDFS)
To access Hadoop WEB UI , we need to type
http://localhost:50070/ though our core-site.xml
that has as value the http://localhost:9000.
Our HDFS cluster consists of
a single NameNode, a master
server that manages the file system
namespace and regulates access
to files by clients. In addition, there are
a number of DataNodes, usually one
per node in the cluster, which manage
storage attached to the nodes that
they run on.
7
From hardware/software to algorithms
8
Data Generation
• Isotropic Gaussian blobs
• 2.000.000 points
• centers = [[25, 25], [-1, -1], [-25, -25]]
• cluster_std = 3.5
9
K-means++ : Calculation of initial centroids
Reference: [3] 10
K-means : Clustering algorithm
References: [4], [5] 11
K-means using Map Reduce
Do
Map
Input is a data point and k centers are broadcasted
Finds the closest center among k centers for the input point
Reduce
Input is one of k centers and all data points having this center as their closest center.
Calculates the new center using data points
Until all of new centers are not changed
Reference: [6] 12
K-means using Map Reduce [1]
k = 3
map
map
map
.
.
.
Centroids are broadcasted to
every map function
key value
[centroid1, point1]
[centroid2, point3]
[centroid3, point10]
[centroid2, point45]
. . . . . . . . . . . . . . . .
[centroidx, pointx]
Mapper
map
map
map
.
.
.
Reference: [6] 13
K-means using Map Reduce [2]
combine
combine
combine
.
.
.
key values
[centroid1, partialsum1]
[centroid2, partialsum1]
[centroid3, partialsum1]
[centroid2, partialsum2]
. . . . . . . . . . . . . . . .
[centroidx, partialsumx]
Combiner
key values
[centroid1, point1, point4]
[centroid2, point3, point7, point9]
[centroid3, point10]
[centroid2, point45, point 73]
. . . . . . . . . . . . . . . .
[centroid3, pointx, . . . ]
Reference: [6] 14
K-means using Map Reduce [3]
reduce
reduce
reduce
.
.
.
key value
[centroid1, centroid1’]
[centroid2, centroid2’]
[centroid3, centroid3’]
Reducer
key values
[centroid1, partialsum1, . . . ]
[centroid2, partialsum1, . . . ]
[centroid3, partialsum1, . . . ]
Reference: [6] 15
From algorithms to coding
16
Python Coding [1]
The initial task of the project is to generate a set of more than one million data points to
be used later as an input for the k-means clustering algorithm. Using this python script
three isotropic Gaussian blobs for clustering are generated. More specifically, the centers
are the following data points [25, 25], [-1, -1], [-25, -25].
The silhouette score constitutes a useful criterion for determining the proper number of
clusters. A silhouette close to 1 implies the datum is in an appropriate cluster, while a
silhouette close to −1 implies the datum is in the wrong cluster. The specific python script
calculates the silhouette score for different numbers of clusters ranging from 2 to 6.
17
Python Coding [2]
18
In order to implement k-means algorithm on hadoop mrjob is used. Mrjob is a python
package, which allows to write multi-step MapReduce jobs in pure Python and run them on
a hadoop cluster. In our case mrjob run on a single-node cluster.
• The mapper function returns each data point and the cluster, to which it belongs.
• The combiner function returns partial sums of batches of data points belonging to the
same cluster.
• The reducer returns the new centroids of each cluster.
• If the centroids remain unchanged the algorithm terminates. Otherwise, the steps are
repeated from the beginning.
This python script calls the k-means algorithm implemented on hadoop. However, before
implementing k-means the initial centroids are computed using the k-means++ algorithm
proposed in 2007 by Arthur and Vassilvitskii.
Coding Running [1]
19
Coding Running [2]
kmeans
20
Coding Running [3]
21
References
[1]: Noll, M. Running Hadoop On Ubuntu Linux (Single-Node Cluster) - Michael G. Noll. [online]
Available at: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/ [Accessed
29 Mar. 2017].
[2]: Stackoverflow.com. (2017). Error launching job using mrjob on Hadoop. [online]
Available at: http://stackoverflow.com/questions/25358793/error-launching-job-using-mrjob-on-hadoop [Accessed
29 Mar. 2017].
[3]: David Arthur, and Sergei Vassilvitskii, (2007). k-means++: the advantages of careful seeding - Proceedings of
the eighteenth annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, LA, January 7-9, 2007. 1st
ed. New York: ACM, pp.1027–1035.
[4]: Nlp.stanford.edu. K-means. Available at: http://nlp.stanford.edu/IR-book/html/htmledition/k-means-1.html
[Accessed 15 Mar. 2017]
[5]: Home.deib.polimi.it. (n.d.). Clustering - K-means. [online] Available at:
http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/kmeans.html [Accessed 8 Mar. 2017].
[6]: Kyuseok Shim, "MapReduce Algorithms for Big Data Analysis", VLDB Conference, 2012
| lkoutsokera@gmail.com
| stratos.gounidellis@gmail.com
Lamprini Koutsokera
Stratos Gounidellis
BDSMasters 22

More Related Content

What's hot

Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...
Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...
Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...Edureka!
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 
Hadoop HDFS Architeture and Design
Hadoop HDFS Architeture and DesignHadoop HDFS Architeture and Design
Hadoop HDFS Architeture and Designsudhakara st
 
Keep me in the Loop: INotify in HDFS
Keep me in the Loop: INotify in HDFSKeep me in the Loop: INotify in HDFS
Keep me in the Loop: INotify in HDFSDataWorks Summit
 
Introduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-SystemIntroduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-SystemMd. Hasan Basri (Angel)
 
Chapter1 introduction
Chapter1 introductionChapter1 introduction
Chapter1 introductionDinesh K
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsLynn Langit
 
Introduction To HBase
Introduction To HBaseIntroduction To HBase
Introduction To HBaseAnil Gupta
 
Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Rohit Agrawal
 

What's hot (20)

Mongo db workshop # 01
Mongo db workshop # 01Mongo db workshop # 01
Mongo db workshop # 01
 
Apache Flume
Apache FlumeApache Flume
Apache Flume
 
Hadoop Architecture
Hadoop ArchitectureHadoop Architecture
Hadoop Architecture
 
Sqoop
SqoopSqoop
Sqoop
 
Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...
Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...
Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Hadoop HDFS Architeture and Design
Hadoop HDFS Architeture and DesignHadoop HDFS Architeture and Design
Hadoop HDFS Architeture and Design
 
Introduction to HDFS
Introduction to HDFSIntroduction to HDFS
Introduction to HDFS
 
Hive(ppt)
Hive(ppt)Hive(ppt)
Hive(ppt)
 
Unit-3_BDA.ppt
Unit-3_BDA.pptUnit-3_BDA.ppt
Unit-3_BDA.ppt
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Keep me in the Loop: INotify in HDFS
Keep me in the Loop: INotify in HDFSKeep me in the Loop: INotify in HDFS
Keep me in the Loop: INotify in HDFS
 
Hadoop
HadoopHadoop
Hadoop
 
Introduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-SystemIntroduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-System
 
Apache Zookeeper
Apache ZookeeperApache Zookeeper
Apache Zookeeper
 
Chapter1 introduction
Chapter1 introductionChapter1 introduction
Chapter1 introduction
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 
Introduction To HBase
Introduction To HBaseIntroduction To HBase
Introduction To HBase
 
Hadoop Map Reduce
Hadoop Map ReduceHadoop Map Reduce
Hadoop Map Reduce
 
Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1
 

Similar to Implementation of k means algorithm on Hadoop

Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...Codemotion
 
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...Codemotion
 
Hadoop installation with an example
Hadoop installation with an exampleHadoop installation with an example
Hadoop installation with an exampleNikita Kesharwani
 
Hadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesHadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesKelly Technologies
 
Enhancing Performance and Fault Tolerance of Hadoop Cluster
Enhancing Performance and Fault Tolerance of Hadoop ClusterEnhancing Performance and Fault Tolerance of Hadoop Cluster
Enhancing Performance and Fault Tolerance of Hadoop ClusterIRJET Journal
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersXiao Qin
 
Cascading talk in Etsy (http://www.meetup.com/cascading/events/169390262/)
Cascading talk in Etsy (http://www.meetup.com/cascading/events/169390262/)Cascading talk in Etsy (http://www.meetup.com/cascading/events/169390262/)
Cascading talk in Etsy (http://www.meetup.com/cascading/events/169390262/)Jyotirmoy Sundi
 
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDSDistributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDSPeterAndreasEntschev
 
Presentation sreenu dwh-services
Presentation sreenu dwh-servicesPresentation sreenu dwh-services
Presentation sreenu dwh-servicesSreenu Musham
 
A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptSanket Shikhar
 
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on KubernetesApache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on KubernetesDataWorks Summit
 
Using R on High Performance Computers
Using R on High Performance ComputersUsing R on High Performance Computers
Using R on High Performance ComputersDave Hiltbrand
 
Cloud Meetup - Automation in the Cloud
Cloud Meetup - Automation in the CloudCloud Meetup - Automation in the Cloud
Cloud Meetup - Automation in the Cloudpetriojala123
 
NYAI - Scaling Machine Learning Applications by Braxton McKee
NYAI - Scaling Machine Learning Applications by Braxton McKeeNYAI - Scaling Machine Learning Applications by Braxton McKee
NYAI - Scaling Machine Learning Applications by Braxton McKeeRizwan Habib
 
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16MLconf
 

Similar to Implementation of k means algorithm on Hadoop (20)

Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
 
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
 
Hadoop installation with an example
Hadoop installation with an exampleHadoop installation with an example
Hadoop installation with an example
 
Data Science
Data ScienceData Science
Data Science
 
Hadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesHadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologies
 
Enhancing Performance and Fault Tolerance of Hadoop Cluster
Enhancing Performance and Fault Tolerance of Hadoop ClusterEnhancing Performance and Fault Tolerance of Hadoop Cluster
Enhancing Performance and Fault Tolerance of Hadoop Cluster
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
 
Cascading talk in Etsy (http://www.meetup.com/cascading/events/169390262/)
Cascading talk in Etsy (http://www.meetup.com/cascading/events/169390262/)Cascading talk in Etsy (http://www.meetup.com/cascading/events/169390262/)
Cascading talk in Etsy (http://www.meetup.com/cascading/events/169390262/)
 
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDSDistributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDS
 
Presentation sreenu dwh-services
Presentation sreenu dwh-servicesPresentation sreenu dwh-services
Presentation sreenu dwh-services
 
A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.ppt
 
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on KubernetesApache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
 
H04502048051
H04502048051H04502048051
H04502048051
 
Hadoop
HadoopHadoop
Hadoop
 
Handout3o
Handout3oHandout3o
Handout3o
 
NodeJS for Beginner
NodeJS for BeginnerNodeJS for Beginner
NodeJS for Beginner
 
Using R on High Performance Computers
Using R on High Performance ComputersUsing R on High Performance Computers
Using R on High Performance Computers
 
Cloud Meetup - Automation in the Cloud
Cloud Meetup - Automation in the CloudCloud Meetup - Automation in the Cloud
Cloud Meetup - Automation in the Cloud
 
NYAI - Scaling Machine Learning Applications by Braxton McKee
NYAI - Scaling Machine Learning Applications by Braxton McKeeNYAI - Scaling Machine Learning Applications by Braxton McKee
NYAI - Scaling Machine Learning Applications by Braxton McKee
 
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16
 

Recently uploaded

Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一F La
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Bookvip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Bookmanojkuma9823
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...ThinkInnovation
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAbdelrhman abooda
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home ServiceSapana Sha
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxFurkanTasci3
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 

Recently uploaded (20)

Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Bookvip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptx
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 

Implementation of k means algorithm on Hadoop

  • 1. k-means algorithm implementation on Hadoop Athens University of Economics and Business Dpt. Of Management Science and Technology Prof. Damianos Chatziantoniou | lkoutsokera@gmail.com | stratos.gounidellis@gmail.com Lamprini Koutsokera Stratos Gounidellis BDSMasters
  • 3. Running Hadoop on Ubuntu Linux (Single-Node Cluster) [1] Prerequisites • jdk-8uversion-linux-x64.tar.gz (Download) • Directory modification • % tar zxvf jdk-8uversion-linux-x64.tar.gz Hadoop system user $ sudo addgroup hadoop $ sudo adduser --ingroup hadoop hduser Disabling IPv6 /etc/sysctl.conf net.ipv6.conf.all.disable_ipv6 = 1 net.ipv6.conf.default.disable_ipv6 = 1 net.ipv6.conf.lo.disable_ipv6 = 1 Configuring SSH $ sudo mkdir -p /app/hadoop/tmp $ sudo chown hduser:hadoop /app/hadoop/tmp $ sudo chmod 750 /app/hadoop/tmp $ ssh-keygen -t rsa -P "" $ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys $ ssh localhost Generate an SSH key for the hduser hduser@ubuntu:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys Create an RSA key pair with an empty password References: [1], [2] 3
  • 4. Running Hadoop on Ubuntu Linux (Single-Node Cluster) [2] Hadoop configuration $ cd /usr/local $ sudo tar xzf hadoop-2.7.3.tar.gz $ sudo mv hadoop-2.7.3 hadoop $ sudo chown -R hduser:hadoop hadoop Configuration conf/core-site.xml <configuration> <property> <name>hadoop.tmp.dir</name> <value>/app/hadoop/tmp</value> </property> <property> <name>fs.defaultFS</name> <value>hdfs://localhost:9000</value> </property> </configuration> $ sudo mkdir -p /app/hadoop/tmp $ sudo chown hduser:hadoop /app/hadoop/tmp $ sudo chmod 750 /app/hadoop/tmp ownerships and permissions export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64“ hadoop-env.sh <configuration> <property> <name>mapreduce.framework.name</name> <value>yarn</value> </property> </configuration> conf/mapred-site.xml References: [1], [2] 4
  • 5. Running Hadoop on Ubuntu Linux (Single-Node Cluster) [3] hdfs-site.xml <configuration> <property> <name>dfs.replication</name> <value>1</value> <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time. </description> </property> <property> <name>hadoop.tmp.dir</name> <value>/app/hadoop/tmp</value> <description>A base for other temporary directories.</description> </property> </configuration> <configuration> <property> <name>yarn.scheduler.minimum-allocation-mb</name> <value>128</value> </property> <property> <name>yarn.scheduler.maximum-allocation-mb</name> <value>2048</value> </property> <property> <name>yarn.scheduler.minimum-allocation-vcores</name> <value>1</value> </property> <property> <name>yarn.scheduler.maximum-allocation-vcores</name> <value>2</value> </property> <property> <name>yarn.nodemanager.resource.memory-mb</name> <value>4096</value> </property> <property> <name>yarn.nodemanager.resource.cpu-vcores</name> <value>4</value> </property> <property> <name>yarn.nodemanager.aux-services</name> <value>mapreduce_shuffle</value> </property> </configuration> yarn-site.xml References: [1], [2] 5
  • 6. Running Hadoop on Ubuntu Linux (Single-Node Cluster) [4] conf/hdfs-site.xml <configuration> <property> <name>dfs.replication</name> <value>1</value> </property> <property> <name>hadoop.tmp.dir</name> <value>/app/hadoop/tmp</value> </property> </configuration> hduser@ubuntu:~$ /usr/local/hadoop/bin/hadoop namenode -format Formatting the HDFS filesystem via the NameNode hduser@ubuntu:~$ /usr/local/hadoop/bin/start- all.sh Starting your single-node cluster hduser@ubuntu:~$ /usr/local/hadoop/bin/stop-all.sh Stopping your single-node cluster References: [1], [2] 6
  • 7. Setting up a Single Node Cluster - Hadoop Distributed File System (HDFS) To access Hadoop WEB UI , we need to type http://localhost:50070/ though our core-site.xml that has as value the http://localhost:9000. Our HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. 7
  • 9. Data Generation • Isotropic Gaussian blobs • 2.000.000 points • centers = [[25, 25], [-1, -1], [-25, -25]] • cluster_std = 3.5 9
  • 10. K-means++ : Calculation of initial centroids Reference: [3] 10
  • 11. K-means : Clustering algorithm References: [4], [5] 11
  • 12. K-means using Map Reduce Do Map Input is a data point and k centers are broadcasted Finds the closest center among k centers for the input point Reduce Input is one of k centers and all data points having this center as their closest center. Calculates the new center using data points Until all of new centers are not changed Reference: [6] 12
  • 13. K-means using Map Reduce [1] k = 3 map map map . . . Centroids are broadcasted to every map function key value [centroid1, point1] [centroid2, point3] [centroid3, point10] [centroid2, point45] . . . . . . . . . . . . . . . . [centroidx, pointx] Mapper map map map . . . Reference: [6] 13
  • 14. K-means using Map Reduce [2] combine combine combine . . . key values [centroid1, partialsum1] [centroid2, partialsum1] [centroid3, partialsum1] [centroid2, partialsum2] . . . . . . . . . . . . . . . . [centroidx, partialsumx] Combiner key values [centroid1, point1, point4] [centroid2, point3, point7, point9] [centroid3, point10] [centroid2, point45, point 73] . . . . . . . . . . . . . . . . [centroid3, pointx, . . . ] Reference: [6] 14
  • 15. K-means using Map Reduce [3] reduce reduce reduce . . . key value [centroid1, centroid1’] [centroid2, centroid2’] [centroid3, centroid3’] Reducer key values [centroid1, partialsum1, . . . ] [centroid2, partialsum1, . . . ] [centroid3, partialsum1, . . . ] Reference: [6] 15
  • 16. From algorithms to coding 16
  • 17. Python Coding [1] The initial task of the project is to generate a set of more than one million data points to be used later as an input for the k-means clustering algorithm. Using this python script three isotropic Gaussian blobs for clustering are generated. More specifically, the centers are the following data points [25, 25], [-1, -1], [-25, -25]. The silhouette score constitutes a useful criterion for determining the proper number of clusters. A silhouette close to 1 implies the datum is in an appropriate cluster, while a silhouette close to −1 implies the datum is in the wrong cluster. The specific python script calculates the silhouette score for different numbers of clusters ranging from 2 to 6. 17
  • 18. Python Coding [2] 18 In order to implement k-means algorithm on hadoop mrjob is used. Mrjob is a python package, which allows to write multi-step MapReduce jobs in pure Python and run them on a hadoop cluster. In our case mrjob run on a single-node cluster. • The mapper function returns each data point and the cluster, to which it belongs. • The combiner function returns partial sums of batches of data points belonging to the same cluster. • The reducer returns the new centroids of each cluster. • If the centroids remain unchanged the algorithm terminates. Otherwise, the steps are repeated from the beginning. This python script calls the k-means algorithm implemented on hadoop. However, before implementing k-means the initial centroids are computed using the k-means++ algorithm proposed in 2007 by Arthur and Vassilvitskii.
  • 22. References [1]: Noll, M. Running Hadoop On Ubuntu Linux (Single-Node Cluster) - Michael G. Noll. [online] Available at: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/ [Accessed 29 Mar. 2017]. [2]: Stackoverflow.com. (2017). Error launching job using mrjob on Hadoop. [online] Available at: http://stackoverflow.com/questions/25358793/error-launching-job-using-mrjob-on-hadoop [Accessed 29 Mar. 2017]. [3]: David Arthur, and Sergei Vassilvitskii, (2007). k-means++: the advantages of careful seeding - Proceedings of the eighteenth annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, LA, January 7-9, 2007. 1st ed. New York: ACM, pp.1027–1035. [4]: Nlp.stanford.edu. K-means. Available at: http://nlp.stanford.edu/IR-book/html/htmledition/k-means-1.html [Accessed 15 Mar. 2017] [5]: Home.deib.polimi.it. (n.d.). Clustering - K-means. [online] Available at: http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/kmeans.html [Accessed 8 Mar. 2017]. [6]: Kyuseok Shim, "MapReduce Algorithms for Big Data Analysis", VLDB Conference, 2012 | lkoutsokera@gmail.com | stratos.gounidellis@gmail.com Lamprini Koutsokera Stratos Gounidellis BDSMasters 22