Master's Thesis
UNIVERSITY OF GREENWICH
Machine Learning with Hadoop in the
Cloud
by
Sridhar Mamella
A thesis submitted in partial fulfilment for the degree of
Master of Science
in Big Data and Business Intelligence
in the
Faculty of Architecture, Computing, and Humanities
Department of Computing and Information Systems
September 2015
Declaration of Authorship
I, SRIDHAR MAMELLA, declare that this thesis titled, ‘MACHINE LEARNING WITH
HADOOP IN THE CLOUD’ and the work presented in it are my own. I confirm that:
This work was done mainly while in candidature for a postgraduate degree at this
University.
Where any part of this thesis has previously been submitted for a degree or any
other qualification at this University or any other institution, this has been clearly
stated.
Where I have consulted the published work of others, this is always clearly at-
tributed.
Where I have quoted from the work of others, the source is always given. With
the exception of such quotations, this thesis is entirely my own work.
I have acknowledged all main sources of help.
Where external databases have been used, their attributes were thoroughly
checked; such datasets are open-source in nature.
Signed:
Sridhar Mamella
Date:
15 September 2015
“Design is not just what it looks like and feels like. Design is how it works.”
Steve Jobs (1955 - 2011)
UNIVERSITY OF GREENWICH
Abstract
Faculty of Architecture, Computing, and Humanities
Department of Computing and Information Systems
Master of Science
by Sridhar Mamella
The work presented in this thesis describes a new infrastructure for scalable machine
learning in the cloud with visualisation support. A learner-agnostic, data-parallel
approach to cloud-based distributed learning is implemented, which reuses existing
single-machine algorithms without any dependence on distributed file systems or
shared memory between instances. The product designed, implemented, and released
as a web service is an asynchronous, decentralised peer-discovery protocol that
learns and configures a distributed network of learners. The resulting end models
are then filtered, fused, and rendered graphically to produce the predictions. This
framework is built entirely in the cloud and is evaluated on a real-world problem
drawn from a large government database: the relationship between immigration and
Gross Domestic Product (GDP). These datasets were made available by the Office for
National Statistics (ONS) in the United Kingdom. The results obtained demonstrate
the reliability and robustness of the system, and show how it can be scaled, with
just a few clicks, to handle petabytes of data if required. Finally, the end results
are compared against a traditional Machine Learning approach, which uses a
programming language as the vital tool to effectively leverage the datasets stored
in the cloud.
Acknowledgements
The research and ideas presented in this thesis were carried out under the supervision
of Dr. Mohammed Hassouna and Dr. Asif Malik.
Mohammed has not only provided invaluable guidance and support during this research,
but has also mentored me. Furthermore, his expertise, understanding, and patience
contributed considerably to my postgraduate experience. I appreciate his vast
knowledge and skill set, which sparked a greater interest in Machine Learning.
Finally, his insights and dedication were critical to formulating much of the work
presented here.
Asif, with his knowledge, suggestions, and ideas, cultivated a strategic interest in
the field of Big Data. He was the one who convinced me to consider Big Data as a
research topic, and was always there with useful criticism when improvements were
needed. Furthermore, he once advised me to ‘tell a story’ whenever I write technical
reports, as everyone likes to read a great story. . .
I would also like to thank all my colleagues from the Big Data research group and
the Department of Computing and Information Systems for their invaluable comments,
support and all the shared experiences during my period of study.
In addition, I would like to thank Dr. Tatiana Simmonds for all the helpful meetings,
swift email replies, and for helping me secure funding, in order to present at the Tableau
Conference in Atlanta, USA in March 2015, and the Hadoop Summit in San Jose, USA
in June 2015. I would also like to thank Dr. Wim Melis, under whose guidance I
wrote my Bachelor of Engineering thesis. Wim still continues to provide support and
guidance, and he was the one who introduced me to the wonderful world of LaTeX.
Moreover, I would like to thank my parents, Kotaiah and Vijayalakshmi, to whom I
dedicate this thesis. Without their motivation and support, I would not have made it
this far. . .
Abbreviations
API Application Program Interface
AWS Amazon Web Services
DFS Distributed File System
EEA European Economic Area
ETL Extract Transform Load
EMR Elastic MapReduce
GB Gigabyte
GDP Gross Domestic Product
GUI Graphical User Interface
GVA Gross Value Added
HDFS Hadoop Distributed File System
HiveQL Hive Query Language
IaaS Infrastructure as a Service
MB Megabyte
ML Machine Learning
PaaS Platform as a Service
PB Petabyte
RDS Relational Database Service
RDBMS Relational Database Management System
RFID Radio Frequency IDentification
SaaS Software as a Service
SQL Structured Query Language
TB Terabyte
UDF User Defined Function
ZB Zettabyte
Keywords:
Big Data, Hadoop, HCatalog, Hive, MapReduce, Machine Learning,
Amazon Web Services, Azure Machine Learning, Tableau,
Visualisation, Business Intelligence, Python, R, Elastic MapReduce
Chapter 1
Introduction
“Begin at the beginning,” the King said gravely, “and go on till you come
to the end: then stop.”
Lewis Carroll, Alice in Wonderland
1.1 Motivation
Over the last few decades, numerous improvements in cloud computing and distributed
computing have significantly improved data-flow capabilities [1]. Unfortunately,
this advancement has not been exploited by combining open-source resources to
analyse GDP and immigration together, resulting in a growing performance gap [2]
between the two approaches. This growing gap can be attributed to the complexity of
such advanced distributed frameworks and their architectures. Many economists make
predictions based on pure Machine Learning [3] rather than taking advantage of Big
Data technologies [4]. Specifically, an optimal solution using the latter must relate
closely to the application being implemented.
For example, Machine Learning can be combined with the Hadoop stack [5], and this
entire infrastructure can be bundled into cloud-based services. This would help
derive much better results when predicting the economic impact of immigration [6]
on GDP in the United Kingdom. Such an architecture would provide solutions that
handle issues of scalability, flexibility, and speed, are resilient to failure, and
are highly cost-effective compared to traditional approaches.
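The learner-agnostic, data-parallel idea outlined in the abstract can be illustrated
with a minimal sketch: train independent single-machine learners on disjoint data
shards, then fuse their predictions. This is a conceptual toy only, not the thesis
implementation; the `fit_shard` and `fuse` functions and the toy dataset are
hypothetical stand-ins for the distributed learners and the fusion step.

```python
def fit_shard(shard):
    """Fit one single-machine 'learner' (ordinary least squares) on a data shard."""
    xs, ys = zip(*shard)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in shard) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx  # (slope, intercept) end model

def fuse(models, x):
    """Fuse the end models by averaging their individual predictions."""
    preds = [a * x + b for a, b in models]
    return sum(preds) / len(preds)

# Toy linear data (y = 2x + 1), split round-robin into three disjoint shards;
# in the thesis setting the shards would live on separate cloud instances.
data = [(i, 2 * i + 1) for i in range(12)]
shards = [data[i::3] for i in range(3)]
models = [fit_shard(s) for s in shards]  # each shard is learned independently
print(fuse(models, 5.0))  # → 11.0, since every learner recovers y = 2x + 1
```

Because the learners never share memory or a file system, any single-machine
algorithm could be substituted for `fit_shard` without changing the fusion step.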
In addition, the market is currently seeing an increase in the use of fault-tolerant
[7] and low-cost cloud-based solutions. These services are often offered as a free service