Abhijit Kumar Behera 
M.Tech (CSE) 
Roll No. 1350001 
School of Computer Engineering 
Guided By : Dr. Laxman Sahoo
Contents 
•Introduction
•Apache Hadoop related projects
•Application of Mahout
•Literature Survey
•Plan of Action
•Conclusion
•References
Introduction 
•The K-means algorithm is one of the best-known clustering
algorithms and has been applied to a wide variety of problems.
•MapReduce, a widely used parallel framework for cloud computing,
handles massive data effectively, so K-means clustering based on
MapReduce has become a focus of research.
Components of Hadoop 
HDFS 
•Name Node 
•Data Node 
•Secondary Name Node
MapReduce
•Map() 
•Combine() 
•Reduce() 
YARN
•ResourceManager (replaces the Job Tracker of MapReduce v1)
•NodeManager (replaces the TaskTracker of MapReduce v1)
HBase
MapReduce Word count process
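The word count flow named in this slide title can be sketched in plain Python. This is a minimal single-process simulation, not Hadoop code: the function names (`map_phase`, `combine_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical and only mirror the Map(), Combine(), and Reduce() stages listed under Components of Hadoop.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map(): emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def combine_phase(pairs):
    """Combine(): pre-aggregate counts locally, per mapper, before the shuffle."""
    local = defaultdict(int)
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def shuffle_phase(pairs):
    """Shuffle: group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups


def reduce_phase(word, counts):
    """Reduce(): sum the partial counts for one word."""
    return word, sum(counts)


def word_count(lines):
    # Each input line plays the role of one input split handled by one mapper.
    mapped = [combine_phase(map_phase(line)) for line in lines]
    grouped = shuffle_phase(chain.from_iterable(mapped))
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    print(word_count(["Deer Bear River", "Car Car River", "Deer Car Bear"]))
```

The combiner is optional in this sketch: it only reduces the number of pairs passed to the shuffle, and the final counts are the same without it.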
Apache Hadoop Projects
•Hadoop (HDFS and MapReduce)
•HBase
•Mahout
•Spark
•Hive
•ZooKeeper
•Sqoop
•Pig
Application of Mahout 
•Collaborative Filtering
  •Matrix factorization based recommenders
  •A user based Recommender
•Clustering
  •Canopy Clustering
  •K-Means Clustering
  •Fuzzy K-Means
  •Affinity Propagation Clustering
•Classification
  •Naive Bayes
  •Random forest classifier
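As an illustration of the user based recommender listed above, here is a minimal sketch in plain Python. It does not use Mahout's API: the functions `cosine` and `recommend`, the nested-dictionary ratings layout, and the choice of cosine similarity are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity over the items that both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in common))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def recommend(ratings, user, top_n=2):
    """Rank items the user has not rated by similarity-weighted average rating."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        if sim <= 0:
            continue  # ignore users with no overlap or no similarity
        for item, rating in other_ratings.items():
            if item in ratings[user]:
                continue  # only recommend unseen items
            scores[item] = scores.get(item, 0.0) + sim * rating
            weights[item] = weights.get(item, 0.0) + sim
    ranked = sorted(scores, key=lambda i: scores[i] / weights[i], reverse=True)
    return ranked[:top_n]


if __name__ == "__main__":
    # Hypothetical ratings: user -> {item: rating}
    ratings = {
        "alice": {"a": 5, "b": 3},
        "bob": {"a": 5, "b": 3, "c": 4},
        "carol": {"a": 1, "b": 5, "d": 2},
    }
    print(recommend(ratings, "alice"))
```

On a Hadoop cluster the similarity computation would be distributed across nodes; this sketch keeps everything in memory to show only the scoring logic.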
Literature Survey 
An Improved Parallel K-means Clustering Algorithm with MapReduce
Authors: Qing Liao, Fan Yang, Jingming Zhao
Journal: Communication Technology (ICCT), IEEE
Year of Publication: 2014
Parallel K-means Algorithm
1) Initialization
2) Mapper
3) Reducer
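The three steps above can be sketched in plain Python. This is a minimal single-process illustration of standard parallel K-means, not the improved algorithm of the surveyed paper, whose details are not given on this slide. The names `kmeans_map`, `kmeans_reduce`, `kmeans_iteration`, and `kmeans` are hypothetical, and the initial centroids are assumed to be supplied by the caller.

```python
import math
from collections import defaultdict


def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans_map(point, centroids):
    """Mapper: emit (cluster id, (point, 1)) for one input point."""
    return nearest(point, centroids), (point, 1)


def kmeans_reduce(values):
    """Reducer: average all points assigned to one cluster."""
    total, count = None, 0
    for point, n in values:
        total = list(point) if total is None else [a + b for a, b in zip(total, point)]
        count += n
    return tuple(x / count for x in total)


def kmeans_iteration(points, centroids):
    """One MapReduce job: map every point, group by cluster id, reduce each group."""
    groups = defaultdict(list)
    for p in points:
        cluster_id, value = kmeans_map(p, centroids)
        groups[cluster_id].append(value)
    # A cluster that receives no points keeps its old centroid.
    return [kmeans_reduce(groups[i]) if i in groups else centroids[i]
            for i in range(len(centroids))]


def kmeans(points, centroids, max_iter=20, tol=1e-9):
    """Initialization is the caller's choice of centroids; iterate until they stop moving."""
    for _ in range(max_iter):
        new_centroids = kmeans_iteration(points, centroids)
        shift = max(math.dist(a, b) for a, b in zip(new_centroids, centroids))
        centroids = new_centroids
        if shift < tol:
            break
    return centroids


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
    print(kmeans(data, [(0.0, 0.0), (10.0, 0.0)]))
```

Each call to `kmeans_iteration` corresponds to one MapReduce job, so the number of jobs equals the number of iterations until the centroids converge.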
Literature Survey 
Clouds for Scalable Big Data Analytics
Authors: Domenico Talia
Journal: IEEE Computer Society
Year of Publication: 2013
In this paper, the author describes how cloud computing enhances the development and
functionality of big data analytics when the analytics are deployed in the cloud.
Cloud Service Model | Features | Users
Data analytics software as a service | A single and complete data mining application or task (including data sources) offered as a service | End users, analytics managers, data analysts
Data analytics platform as a service | A data analysis suite or framework for programming or developing high-level applications, hiding the cloud infrastructure and data storage | Data mining application developers, data scientists
Data analytics infrastructure as a service | A set of virtualized resources provided to a programmer or data mining researcher for developing, configuring, and running data analysis frameworks or applications | Data mining programmers, data management developers, data mining researchers
Plan of Action 
August - October 2014 | Literature survey (done)
November 2014 | Problem definition is done; the problem-solving outline is yet to be done
December 2014 - January 2015 | Find an appropriate solution to the problem (yet to be formulated)
February - May 2015 | Final implementation of the solution, with results (yet to be done)
Conclusion 
Large-scale data mining has become a new challenge in recent years.
Big data analytics can be accomplished with the MapReduce framework.
The K-means algorithm is one of the best-known clustering algorithms,
but its performance usually hits a bottleneck when it is applied to
massive data. A parallel K-means algorithm implemented with MapReduce
shows a clear advantage in handling massive data.
References 
[1] Walisa Romsaiyud, Wichian Premchaiswadi, "An Adaptive Machine Learning on Map-Reduce Framework for Improving Performance of Large-Scale Data Analysis on EC", Eleventh IEEE Int'l Conf. on ICT and Knowledge Engineering, 2014
[2] Domenico Talia, "Clouds for Scalable Big Data Analytics", IEEE Computer Society, 2013
[3] Feng Ye, Zhijan Wang, "Cloud-based Big Data Mining & Analyzing Services Platform integrating R", IEEE International Conference on Advanced Cloud and Big Data, 2013
[4] "Apache Hadoop", http://hadoop.apache.org/#What+Is+Apache+Hadoop%3F
MACHINE LEARNING ON MAPREDUCE FRAMEWORK
