SlideShare a Scribd company logo
k-means algorithm implementation on Hadoop
Athens University of Economics and Business
Dpt. Of Management Science and Technology
Prof. Damianos Chatziantoniou
| lkoutsokera@gmail.com
| stratos.gounidellis@gmail.com
Lamprini Koutsokera
Stratos Gounidellis
BDSMasters
Tools
2
Running Hadoop on Ubuntu Linux (Single-Node Cluster) [1]
Prerequisites
• jdk-8uversion-linux-x64.tar.gz (Download)
• Directory modification
• % tar zxvf jdk-8uversion-linux-x64.tar.gz
Hadoop system user
$ sudo addgroup hadoop $ sudo adduser --ingroup hadoop
hduser
Disabling IPv6
/etc/sysctl.conf
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
Configuring SSH
$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp
$ sudo chmod 750 /app/hadoop/tmp
$ ssh-keygen -t rsa -P ""
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
$ ssh localhost
Generate an SSH key for the hduser
hduser@ubuntu:~$ cat $HOME/.ssh/id_rsa.pub >>
$HOME/.ssh/authorized_keys
Create an RSA key pair with an empty password
References: [1], [2]
3
Running Hadoop on Ubuntu Linux (Single-Node Cluster) [2]
Hadoop configuration
$ cd /usr/local
$ sudo tar xzf hadoop-2.7.3.tar.gz
$ sudo mv hadoop-2.7.3 hadoop
$ sudo chown -R hduser:hadoop hadoop
Configuration
conf/core-site.xml
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp
$ sudo chmod 750 /app/hadoop/tmp
ownerships and permissions
export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64“
hadoop-env.sh
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
conf/mapred-site.xml
References: [1], [2]
4
Running Hadoop on Ubuntu Linux (Single-Node Cluster) [3]
hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when
the file is created.
The default is used if replication is not specified in create
time.
</description>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary
directories.</description>
</property>
</configuration>
<configuration>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>128</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>2048</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-vcores</name>
<value>1</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>2</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>4096</value>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>4</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
yarn-site.xml
References: [1], [2]
5
Running Hadoop on Ubuntu Linux (Single-Node Cluster) [4]
conf/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
</property>
</configuration>
hduser@ubuntu:~$ /usr/local/hadoop/bin/hadoop
namenode -format
Formatting the HDFS filesystem via the
NameNode
hduser@ubuntu:~$ /usr/local/hadoop/bin/start-
all.sh
Starting your single-node cluster
hduser@ubuntu:~$ /usr/local/hadoop/bin/stop-all.sh
Stopping your single-node cluster
References: [1], [2]
6
Setting up a Single Node Cluster - Hadoop Distributed File System (HDFS)
To access Hadoop WEB UI , we need to type
http://localhost:50070/ though our core-site.xml
that has as value the http://localhost:9000.
Our HDFS cluster consists of
a single NameNode, a master
server that manages the file system
namespace and regulates access
to files by clients. In addition, there are
a number of DataNodes, usually one
per node in the cluster, which manage
storage attached to the nodes that
they run on.
7
From hardware/software to algorithms
8
Data Generation
• Isotropic Gaussian blobs
• 2.000.000 points
• centers = [[25, 25], [-1, -1], [-25, -25]]
• cluster_std = 3.5
9
K-means++ : Calculation of initial centroids
Reference: [3] 10
K-means : Clustering algorithm
References: [4], [5] 11
K-means using Map Reduce
Do
Map
Input is a data point and k centers are broadcasted
Finds the closest center among k centers for the input point
Reduce
Input is one of k centers and all data points having this center as their closest center.
Calculates the new center using data points
Until all of new centers are not changed
Reference: [6] 12
K-means using Map Reduce [1]
k = 3
map
map
map
.
.
.
Centroids are broadcasted to
every map function
key value
[centroid1, point1]
[centroid2, point3]
[centroid3, point10]
[centroid2, point45]
. . . . . . . . . . . . . . . .
[centroidx, pointx]
Mapper
map
map
map
.
.
.
Reference: [6] 13
K-means using Map Reduce [2]
combine
combine
combine
.
.
.
key values
[centroid1, partialsum1]
[centroid2, partialsum1]
[centroid3, partialsum1]
[centroid2, partialsum2]
. . . . . . . . . . . . . . . .
[centroidx, partialsumx]
Combiner
key values
[centroid1, point1, point4]
[centroid2, point3, point7, point9]
[centroid3, point10]
[centroid2, point45, point 73]
. . . . . . . . . . . . . . . .
[centroid3, pointx, . . . ]
Reference: [6] 14
K-means using Map Reduce [3]
reduce
reduce
reduce
.
.
.
key value
[centroid1, centroid1’]
[centroid2, centroid2’]
[centroid3, centroid3’]
Reducer
key values
[centroid1, partialsum1, . . . ]
[centroid2, partialsum1, . . . ]
[centroid3, partialsum1, . . . ]
Reference: [6] 15
From algorithms to coding
16
Python Coding [1]
The initial task of the project is to generate a set of more than one million data points to
be used later as an input for the k-means clustering algorithm. Using this python script
three isotropic Gaussian blobs for clustering are generated. More specifically, the centers
are the following data points [25, 25], [-1, -1], [-25, -25].
The silhouette score constitutes a useful criterion for determining the proper number of
clusters. A silhouette close to 1 implies the datum is in an appropriate cluster, while a
silhouette close to −1 implies the datum is in the wrong cluster. The specific python script
calculates the silhouette score for different numbers of clusters ranging from 2 to 6.
17
Python Coding [2]
18
In order to implement k-means algorithm on hadoop mrjob is used. Mrjob is a python
package, which allows to write multi-step MapReduce jobs in pure Python and run them on
a hadoop cluster. In our case mrjob run on a single-node cluster.
• The mapper function returns each data point and the cluster, to which it belongs.
• The combiner function returns partial sums of batches of data points belonging to the
same cluster.
• The reducer returns the new centroids of each cluster.
• If the centroids remain unchanged the algorithm terminates. Otherwise, the steps are
repeated from the beginning.
This python script calls the k-means algorithm implemented on hadoop. However, before
implementing k-means the initial centroids are computed using the k-means++ algorithm
proposed in 2007 by Arthur and Vassilvitskii.
Coding Running [1]
19
Coding Running [2]
kmeans
20
Coding Running [3]
21
References
[1]: Noll, M. Running Hadoop On Ubuntu Linux (Single-Node Cluster) - Michael G. Noll. [online]
Available at: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/ [Accessed
29 Mar. 2017].
[2]: Stackoverflow.com. (2017). Error launching job using mrjob on Hadoop. [online]
Available at: http://stackoverflow.com/questions/25358793/error-launching-job-using-mrjob-on-hadoop [Accessed
29 Mar. 2017].
[3]: David Arthur, and Sergei Vassilvitskii, (2007). k-means++: the advantages of careful seeding - Proceedings of
the eighteenth annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, LA, January 7-9, 2007. 1st
ed. New York: ACM, pp.1027–1035.
[4]: Nlp.stanford.edu. K-means. Available at: http://nlp.stanford.edu/IR-book/html/htmledition/k-means-1.html
[Accessed 15 Mar. 2017]
[5]: Home.deib.polimi.it. (n.d.). Clustering - K-means. [online] Available at:
http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/kmeans.html [Accessed 8 Mar. 2017].
[6]: Kyuseok Shim, "MapReduce Algorithms for Big Data Analysis", VLDB Conference, 2012
| lkoutsokera@gmail.com
| stratos.gounidellis@gmail.com
Lamprini Koutsokera
Stratos Gounidellis
BDSMasters 22

More Related Content

What's hot

Introduction to NoSQL
Introduction to NoSQLIntroduction to NoSQL
Introduction to NoSQL
PolarSeven Pty Ltd
 
MongoDB Cheat Sheet – Quick Reference
MongoDB Cheat Sheet – Quick ReferenceMongoDB Cheat Sheet – Quick Reference
MongoDB Cheat Sheet – Quick Reference
César Trigo
 
Redis and it's data types
Redis and it's data typesRedis and it's data types
Redis and it's data types
Aniruddha Chakrabarti
 
Singly link list
Singly link listSingly link list
Singly link list
Rojin Khadka
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
Prashant Gupta
 
Introduction to NoSQL
Introduction to NoSQLIntroduction to NoSQL
Introduction to NoSQL
Dr-Dipali Meher
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
Joud Khattab
 
introdution to SQL and SQL functions
introdution to SQL and SQL functionsintrodution to SQL and SQL functions
introdution to SQL and SQL functions
farwa waqar
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
Ravi Teja
 
SQOOP PPT
SQOOP PPTSQOOP PPT
SQOOP PPT
Dushhyant Kumar
 
Data Structures (CS8391)
Data Structures (CS8391)Data Structures (CS8391)
Data Structures (CS8391)
Elavarasi K
 
PostgreSQL
PostgreSQLPostgreSQL
Yarn.ppt
Yarn.pptYarn.ppt
Cours Big Data Chap3
Cours Big Data Chap3Cours Big Data Chap3
Cours Big Data Chap3
Amal Abid
 
Numpy tutorial
Numpy tutorialNumpy tutorial
Numpy tutorial
HarikaReddy115
 
Introduction to HBase
Introduction to HBaseIntroduction to HBase
Introduction to HBase
Avkash Chauhan
 
Recommender systems for E-commerce
Recommender systems for E-commerceRecommender systems for E-commerce
Recommender systems for E-commerce
Alexander Konduforov
 
introduction to NOSQL Database
introduction to NOSQL Databaseintroduction to NOSQL Database
introduction to NOSQL Database
nehabsairam
 
php
phpphp
Movie recommendation project
Movie recommendation projectMovie recommendation project
Movie recommendation project
Abhishek Jaisingh
 

What's hot (20)

Introduction to NoSQL
Introduction to NoSQLIntroduction to NoSQL
Introduction to NoSQL
 
MongoDB Cheat Sheet – Quick Reference
MongoDB Cheat Sheet – Quick ReferenceMongoDB Cheat Sheet – Quick Reference
MongoDB Cheat Sheet – Quick Reference
 
Redis and it's data types
Redis and it's data typesRedis and it's data types
Redis and it's data types
 
Singly link list
Singly link listSingly link list
Singly link list
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Introduction to NoSQL
Introduction to NoSQLIntroduction to NoSQL
Introduction to NoSQL
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
introdution to SQL and SQL functions
introdution to SQL and SQL functionsintrodution to SQL and SQL functions
introdution to SQL and SQL functions
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
SQOOP PPT
SQOOP PPTSQOOP PPT
SQOOP PPT
 
Data Structures (CS8391)
Data Structures (CS8391)Data Structures (CS8391)
Data Structures (CS8391)
 
PostgreSQL
PostgreSQLPostgreSQL
PostgreSQL
 
Yarn.ppt
Yarn.pptYarn.ppt
Yarn.ppt
 
Cours Big Data Chap3
Cours Big Data Chap3Cours Big Data Chap3
Cours Big Data Chap3
 
Numpy tutorial
Numpy tutorialNumpy tutorial
Numpy tutorial
 
Introduction to HBase
Introduction to HBaseIntroduction to HBase
Introduction to HBase
 
Recommender systems for E-commerce
Recommender systems for E-commerceRecommender systems for E-commerce
Recommender systems for E-commerce
 
introduction to NOSQL Database
introduction to NOSQL Databaseintroduction to NOSQL Database
introduction to NOSQL Database
 
php
phpphp
php
 
Movie recommendation project
Movie recommendation projectMovie recommendation project
Movie recommendation project
 

Similar to Implementation of k means algorithm on Hadoop

Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
Codemotion
 
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
Codemotion
 
Hadoop installation with an example
Hadoop installation with an exampleHadoop installation with an example
Hadoop installation with an example
Nikita Kesharwani
 
Data Science
Data ScienceData Science
Data Science
Subhajit75
 
Hadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesHadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologies
Kelly Technologies
 
Enhancing Performance and Fault Tolerance of Hadoop Cluster
Enhancing Performance and Fault Tolerance of Hadoop ClusterEnhancing Performance and Fault Tolerance of Hadoop Cluster
Enhancing Performance and Fault Tolerance of Hadoop Cluster
IRJET Journal
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
Xiao Qin
 
Cascading talk in Etsy (http://www.meetup.com/cascading/events/169390262/)
Cascading talk in Etsy (http://www.meetup.com/cascading/events/169390262/)Cascading talk in Etsy (http://www.meetup.com/cascading/events/169390262/)
Cascading talk in Etsy (http://www.meetup.com/cascading/events/169390262/)
Jyotirmoy Sundi
 
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDSDistributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDS
PeterAndreasEntschev
 
Presentation sreenu dwh-services
Presentation sreenu dwh-servicesPresentation sreenu dwh-services
Presentation sreenu dwh-services
Sreenu Musham
 
A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.ppt
Sanket Shikhar
 
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on KubernetesApache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
DataWorks Summit
 
H04502048051
H04502048051H04502048051
H04502048051
ijceronline
 
Hadoop
HadoopHadoop
Hadoop
Anil Reddy
 
Handout3o
Handout3oHandout3o
Handout3o
Shahbaz Sidhu
 
NodeJS for Beginner
NodeJS for BeginnerNodeJS for Beginner
NodeJS for Beginner
Apaichon Punopas
 
Using R on High Performance Computers
Using R on High Performance ComputersUsing R on High Performance Computers
Using R on High Performance Computers
Dave Hiltbrand
 
Cloud Meetup - Automation in the Cloud
Cloud Meetup - Automation in the CloudCloud Meetup - Automation in the Cloud
Cloud Meetup - Automation in the Cloud
petriojala123
 
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16
MLconf
 
NYAI - Scaling Machine Learning Applications by Braxton McKee
NYAI - Scaling Machine Learning Applications by Braxton McKeeNYAI - Scaling Machine Learning Applications by Braxton McKee
NYAI - Scaling Machine Learning Applications by Braxton McKee
Rizwan Habib
 

Similar to Implementation of k means algorithm on Hadoop (20)

Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
 
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
 
Hadoop installation with an example
Hadoop installation with an exampleHadoop installation with an example
Hadoop installation with an example
 
Data Science
Data ScienceData Science
Data Science
 
Hadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesHadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologies
 
Enhancing Performance and Fault Tolerance of Hadoop Cluster
Enhancing Performance and Fault Tolerance of Hadoop ClusterEnhancing Performance and Fault Tolerance of Hadoop Cluster
Enhancing Performance and Fault Tolerance of Hadoop Cluster
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
 
Cascading talk in Etsy (http://www.meetup.com/cascading/events/169390262/)
Cascading talk in Etsy (http://www.meetup.com/cascading/events/169390262/)Cascading talk in Etsy (http://www.meetup.com/cascading/events/169390262/)
Cascading talk in Etsy (http://www.meetup.com/cascading/events/169390262/)
 
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDSDistributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDS
 
Presentation sreenu dwh-services
Presentation sreenu dwh-servicesPresentation sreenu dwh-services
Presentation sreenu dwh-services
 
A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.ppt
 
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on KubernetesApache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
 
H04502048051
H04502048051H04502048051
H04502048051
 
Hadoop
HadoopHadoop
Hadoop
 
Handout3o
Handout3oHandout3o
Handout3o
 
NodeJS for Beginner
NodeJS for BeginnerNodeJS for Beginner
NodeJS for Beginner
 
Using R on High Performance Computers
Using R on High Performance ComputersUsing R on High Performance Computers
Using R on High Performance Computers
 
Cloud Meetup - Automation in the Cloud
Cloud Meetup - Automation in the CloudCloud Meetup - Automation in the Cloud
Cloud Meetup - Automation in the Cloud
 
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16
 
NYAI - Scaling Machine Learning Applications by Braxton McKee
NYAI - Scaling Machine Learning Applications by Braxton McKeeNYAI - Scaling Machine Learning Applications by Braxton McKee
NYAI - Scaling Machine Learning Applications by Braxton McKee
 

Recently uploaded

The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
javier ramirez
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
sameer shah
 
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
74nqk8xf
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
74nqk8xf
 
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
nuttdpt
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
vikram sood
 
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
zsjl4mimo
 
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdfUdemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Fernanda Palhano
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
Sachin Paul
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
soxrziqu
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
manishkhaire30
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
jitskeb
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
nuttdpt
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
Social Samosa
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
AlessioFois2
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
kuntobimo2016
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
jerlynmaetalle
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
nyfuhyz
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
Roger Valdez
 

Recently uploaded (20)

The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
 
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
 
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
 
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
 
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdfUdemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
 

Implementation of k means algorithm on Hadoop

  • 1. k-means algorithm implementation on Hadoop Athens University of Economics and Business Dpt. Of Management Science and Technology Prof. Damianos Chatziantoniou | lkoutsokera@gmail.com | stratos.gounidellis@gmail.com Lamprini Koutsokera Stratos Gounidellis BDSMasters
  • 3. Running Hadoop on Ubuntu Linux (Single-Node Cluster) [1] Prerequisites • jdk-8uversion-linux-x64.tar.gz (Download) • Directory modification • % tar zxvf jdk-8uversion-linux-x64.tar.gz Hadoop system user $ sudo addgroup hadoop $ sudo adduser --ingroup hadoop hduser Disabling IPv6 /etc/sysctl.conf net.ipv6.conf.all.disable_ipv6 = 1 net.ipv6.conf.default.disable_ipv6 = 1 net.ipv6.conf.lo.disable_ipv6 = 1 Configuring SSH $ sudo mkdir -p /app/hadoop/tmp $ sudo chown hduser:hadoop /app/hadoop/tmp $ sudo chmod 750 /app/hadoop/tmp $ ssh-keygen -t rsa -P "" $ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys $ ssh localhost Generate an SSH key for the hduser hduser@ubuntu:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys Create an RSA key pair with an empty password References: [1], [2] 3
  • 4. Running Hadoop on Ubuntu Linux (Single-Node Cluster) [2] Hadoop configuration $ cd /usr/local $ sudo tar xzf hadoop-2.7.3.tar.gz $ sudo mv hadoop-2.7.3 hadoop $ sudo chown -R hduser:hadoop hadoop Configuration conf/core-site.xml <configuration> <property> <name>hadoop.tmp.dir</name> <value>/app/hadoop/tmp</value> </property> <property> <name>fs.defaultFS</name> <value>hdfs://localhost:9000</value> </property> </configuration> $ sudo mkdir -p /app/hadoop/tmp $ sudo chown hduser:hadoop /app/hadoop/tmp $ sudo chmod 750 /app/hadoop/tmp ownerships and permissions export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64“ hadoop-env.sh <configuration> <property> <name>mapreduce.framework.name</name> <value>yarn</value> </property> </configuration> conf/mapred-site.xml References: [1], [2] 4
  • 5. Running Hadoop on Ubuntu Linux (Single-Node Cluster) [3] hdfs-site.xml <configuration> <property> <name>dfs.replication</name> <value>1</value> <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time. </description> </property> <property> <name>hadoop.tmp.dir</name> <value>/app/hadoop/tmp</value> <description>A base for other temporary directories.</description> </property> </configuration> <configuration> <property> <name>yarn.scheduler.minimum-allocation-mb</name> <value>128</value> </property> <property> <name>yarn.scheduler.maximum-allocation-mb</name> <value>2048</value> </property> <property> <name>yarn.scheduler.minimum-allocation-vcores</name> <value>1</value> </property> <property> <name>yarn.scheduler.maximum-allocation-vcores</name> <value>2</value> </property> <property> <name>yarn.nodemanager.resource.memory-mb</name> <value>4096</value> </property> <property> <name>yarn.nodemanager.resource.cpu-vcores</name> <value>4</value> </property> <property> <name>yarn.nodemanager.aux-services</name> <value>mapreduce_shuffle</value> </property> </configuration> yarn-site.xml References: [1], [2] 5
  • 6. Running Hadoop on Ubuntu Linux (Single-Node Cluster) [4] conf/hdfs-site.xml <configuration> <property> <name>dfs.replication</name> <value>1</value> </property> <property> <name>hadoop.tmp.dir</name> <value>/app/hadoop/tmp</value> </property> </configuration> hduser@ubuntu:~$ /usr/local/hadoop/bin/hadoop namenode -format Formatting the HDFS filesystem via the NameNode hduser@ubuntu:~$ /usr/local/hadoop/bin/start- all.sh Starting your single-node cluster hduser@ubuntu:~$ /usr/local/hadoop/bin/stop-all.sh Stopping your single-node cluster References: [1], [2] 6
  • 7. Setting up a Single Node Cluster - Hadoop Distributed File System (HDFS) To access Hadoop WEB UI , we need to type http://localhost:50070/ though our core-site.xml that has as value the http://localhost:9000. Our HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. 7
  • 9. Data Generation • Isotropic Gaussian blobs • 2.000.000 points • centers = [[25, 25], [-1, -1], [-25, -25]] • cluster_std = 3.5 9
  • 10. K-means++ : Calculation of initial centroids Reference: [3] 10
  • 11. K-means : Clustering algorithm References: [4], [5] 11
  • 12. K-means using Map Reduce Do Map Input is a data point and k centers are broadcasted Finds the closest center among k centers for the input point Reduce Input is one of k centers and all data points having this center as their closest center. Calculates the new center using data points Until all of new centers are not changed Reference: [6] 12
  • 13. K-means using Map Reduce [1] k = 3 map map map . . . Centroids are broadcasted to every map function key value [centroid1, point1] [centroid2, point3] [centroid3, point10] [centroid2, point45] . . . . . . . . . . . . . . . . [centroidx, pointx] Mapper map map map . . . Reference: [6] 13
  • 14. K-means using Map Reduce [2] combine combine combine . . . key values [centroid1, partialsum1] [centroid2, partialsum1] [centroid3, partialsum1] [centroid2, partialsum2] . . . . . . . . . . . . . . . . [centroidx, partialsumx] Combiner key values [centroid1, point1, point4] [centroid2, point3, point7, point9] [centroid3, point10] [centroid2, point45, point 73] . . . . . . . . . . . . . . . . [centroid3, pointx, . . . ] Reference: [6] 14
  • 15. K-means using Map Reduce [3] reduce reduce reduce . . . key value [centroid1, centroid1’] [centroid2, centroid2’] [centroid3, centroid3’] Reducer key values [centroid1, partialsum1, . . . ] [centroid2, partialsum1, . . . ] [centroid3, partialsum1, . . . ] Reference: [6] 15
  • 16. From algorithms to coding 16
  • 17. Python Coding [1] The initial task of the project is to generate a set of more than one million data points to be used later as an input for the k-means clustering algorithm. Using this python script three isotropic Gaussian blobs for clustering are generated. More specifically, the centers are the following data points [25, 25], [-1, -1], [-25, -25]. The silhouette score constitutes a useful criterion for determining the proper number of clusters. A silhouette close to 1 implies the datum is in an appropriate cluster, while a silhouette close to −1 implies the datum is in the wrong cluster. The specific python script calculates the silhouette score for different numbers of clusters ranging from 2 to 6. 17
  • 18. Python Coding [2] 18 In order to implement k-means algorithm on hadoop mrjob is used. Mrjob is a python package, which allows to write multi-step MapReduce jobs in pure Python and run them on a hadoop cluster. In our case mrjob run on a single-node cluster. • The mapper function returns each data point and the cluster, to which it belongs. • The combiner function returns partial sums of batches of data points belonging to the same cluster. • The reducer returns the new centroids of each cluster. • If the centroids remain unchanged the algorithm terminates. Otherwise, the steps are repeated from the beginning. This python script calls the k-means algorithm implemented on hadoop. However, before implementing k-means the initial centroids are computed using the k-means++ algorithm proposed in 2007 by Arthur and Vassilvitskii.
  • 22. References [1]: Noll, M. Running Hadoop On Ubuntu Linux (Single-Node Cluster) - Michael G. Noll. [online] Available at: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/ [Accessed 29 Mar. 2017]. [2]: Stackoverflow.com. (2017). Error launching job using mrjob on Hadoop. [online] Available at: http://stackoverflow.com/questions/25358793/error-launching-job-using-mrjob-on-hadoop [Accessed 29 Mar. 2017]. [3]: David Arthur, and Sergei Vassilvitskii, (2007). k-means++: the advantages of careful seeding - Proceedings of the eighteenth annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, LA, January 7-9, 2007. 1st ed. New York: ACM, pp.1027–1035. [4]: Nlp.stanford.edu. K-means. Available at: http://nlp.stanford.edu/IR-book/html/htmledition/k-means-1.html [Accessed 15 Mar. 2017] [5]: Home.deib.polimi.it. (n.d.). Clustering - K-means. [online] Available at: http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/kmeans.html [Accessed 8 Mar. 2017]. [6]: Kyuseok Shim, "MapReduce Algorithms for Big Data Analysis", VLDB Conference, 2012 | lkoutsokera@gmail.com | stratos.gounidellis@gmail.com Lamprini Koutsokera Stratos Gounidellis BDSMasters 22