UNIVERSITY OF GREENWICH
Machine Learning with Hadoop in the
Cloud
by
Sridhar Mamella
A thesis submitted in partial fulfilment for the degree of
Master of Science
in Big Data and Business Intelligence
in the
Faculty of Architecture, Computing, and Humanities
Department of Computing and Information Systems
September 2015
Declaration of Authorship
I, SRIDHAR MAMELLA, declare that this thesis titled, ‘MACHINE LEARNING WITH
HADOOP IN THE CLOUD’ and the work presented in it are my own. I confirm that:
This work was done mainly while in candidature for a postgraduate degree at this
University.
Where any part of this thesis has previously been submitted for a degree or any
other qualification at this University or any other institution, this has been clearly
stated.
Where I have consulted the published work of others, this is always clearly
attributed.
Where I have quoted from the work of others, the source is always given. With
the exception of such quotations, this thesis is entirely my own work.
I have acknowledged all main sources of help.
Where external databases have been used, their attributes were thoroughly
checked; all such datasets are open-source.
Signed:
Sridhar Mamella
Date:
15 September 2015
“Design is not just what it looks like and feels like. Design is how it works.”
Steve Jobs (1955 - 2011)
UNIVERSITY OF GREENWICH
Abstract
Faculty of Architecture, Computing, and Humanities
Department of Computing and Information Systems
Master of Science
by Sridhar Mamella
The work presented in this thesis describes a new infrastructure for scalable machine
learning in the cloud with visualisation support. Here, a learner-agnostic and
data-parallel approach towards cloud-based distributed learning is implemented, which makes
use of existing single-machine algorithms, without any dependence on distributed
file systems or shared memory between instances. The product designed,
implemented, and released as a web service is therefore an asynchronous and decentralised
peer discovery protocol. This model learns and configures a distributed network of learners.
The end models are then filtered, fused, and presented graphically to produce the
predictions. The framework is built entirely in the cloud, and is applied to a
real-world problem, the relationship between Immigration and Gross Domestic Product
(GDP), using a large government database. The datasets were made available by the
Office for National Statistics (ONS) in the United Kingdom. The obtained
results demonstrate the reliability and robustness of the system built, and show how it
can be scaled to handle petabytes of data, if required, with just a few clicks. Finally, the
end results of the proposed cloud-based solution are compared with a traditional Machine
Learning approach, which uses a programming language as the main tool to leverage the
datasets stored in the cloud.
Acknowledgements
The research and ideas presented in this thesis were carried out under the supervision of
Dr. Mohammed Hassouna and Dr. Asif Malik.
Mohammed has not only provided invaluable guidance and support during this
research, but also mentored me. Furthermore, his expertise, understanding, and patience
contributed considerably towards my postgraduate experience. I appreciate his vast
knowledge and skill-set, which have sparked a greater interest in Machine Learning.
Finally, his insights and dedication were critical to formulating much of the work
presented here.
Asif, with his knowledge, suggestions, and ideas, developed my strategic interest in the field of
Big Data. He was the one who convinced me to consider Big Data as a research topic,
and was always there with useful criticism when there was need for improvement.
Furthermore, he was the one who once advised me to ‘tell a story’ whenever I am
writing technical reports, as everyone likes to read a great story...
I would also like to thank all my colleagues from the Big Data research group and
the Department of Computing and Information Systems for their invaluable comments,
support and all the shared experiences during my period of study.
In addition, I would like to thank Dr. Tatiana Simmonds for all the helpful meetings,
swift email replies, and for helping me secure funding to present at the Tableau
Conference in Atlanta, USA in March 2015, and the Hadoop Summit in San Jose, USA
in June 2015. I would also like to thank Dr. Wim Melis, under whose guidance I
wrote my Bachelor of Engineering thesis. Wim still continues to provide support and
guidance, and he was the one who introduced me to the wonderful
world of LaTeX.
Moreover, I would like to thank my parents, Kotaiah and Vijayalakshmi, to whom I
dedicate this thesis. Without their motivation and support, I would not have made it
this far...
Contents
Declaration of Authorship
Abstract
Acknowledgements
List of Figures
List of Tables
Abbreviations
1 Introduction
1.1 Motivation
1.2 Research Aim and Objective
1.3 Research Approach
1.4 Thesis Structure
2 Research Approach
2.1 Introduction
2.2 Case Study
2.2.1 Gross Domestic Product
2.2.2 Immigration
2.3 Data Flow and Technology Stack
2.4 Data Sources
2.4.1 Introduction
2.4.2 Source Website
2.4.3 Features in Datasets
2.5 Legal, Social, and Ethical Issues with Open Datasets
3 Research Background
3.1 Machine Learning
3.1.1 Introduction
3.1.2 Concepts
3.1.3 Definition of Machine Learning Terms
3.2 Hadoop
3.2.1 Introduction
3.2.2 MapReduce
3.2.3 Hadoop Distributed File System
3.2.4 Apache Hive
3.2.5 HCatalog
3.2.6 Apache ZooKeeper
3.2.7 Apache Sqoop
3.3 Cloud Based Services
3.3.1 Introduction
3.3.2 Public and Private Clouds
3.3.3 Amazon Elastic Compute Cloud (Amazon EC2)
3.3.4 Microsoft Azure Studio
4 Case Studies
4.1 Introduction
4.2 Machine Learning with Hadoop in the Cloud
4.2.1 Infrastructure Implementation
4.2.2 Testing
4.2.3 Infrastructure Evaluation
4.2.4 Results
4.3 Machine Learning with Python
4.3.1 Infrastructure Development
4.3.2 Implementation
4.3.3 Infrastructure Evaluation
4.3.4 Results
4.4 Comparing Case 1 and Case 2
5 Conclusions and Future Work
5.1 Conclusions
5.2 Future Work
5.3 The Final Word...
A Links to External Sources
A.1 Dropbox
A.2 Tableau Public Profile
B Code
B.1 Python Code
B.2 R Code
Bibliography
List of Figures
2.1 Technology Stack
3.1 Hadoop Ecosystem
3.2 Map Reduce
3.3 HCatalog
3.4 Apache Zookeeper
3.5 AzureML Pipeline
4.1 Azure ML
4.2 Machine Learning Implementation
4.3 EMR Instance
List of Tables
3.1 On-Demand Instance Prices
Abbreviations
API Application Program Interface
AWS Amazon Web Services
DFS Distributed File System
EEA European Economic Area
ETL Extract Transform Load
EMR Elastic MapReduce
GB Gigabyte
GDP Gross Domestic Product
GUI Graphical User Interface
GVA Gross Value Added
HDFS Hadoop Distributed File System
HiveQL Hive Query Language
IaaS Infrastructure as a Service
MB Megabyte
ML Machine Learning
PaaS Platform as a Service
PB Petabyte
RDS Relational Database Service
RDBMS Relational Database Management System
RFID Radio Frequency IDentification
SaaS Software as a Service
SQL Structured Query Language
TB Terabyte
UDF User-Defined Function
ZB Zettabyte
Keywords:
Big Data, Hadoop, HCatalog, Hive, MapReduce, Machine Learning,
Amazon Web Services, Azure Machine Learning, Tableau,
Visualisation, Business Intelligence, Python, R, Elastic MapReduce
Chapter 1
Introduction
“Begin at the beginning,” the King said gravely, “and go on till you come
to the end: then stop.”
Lewis Carroll, Alice in Wonderland
1.1 Motivation
Over the last decades, numerous improvements in cloud computing and distributed
computing have significantly increased data-flow capabilities [1]. Unfortunately,
this advancement has not been put to use by combining various open source resources
to deal with GDP and Immigration, which results in a growing performance gap [2]
between the two. This growing gap can be attributed to the complexity of such advanced
distributed framework systems and their architectures. Many economists generally make
predictions based on pure Machine Learning [3], rather than taking advantage of Big
Data technologies [4]. Specifically, an optimal solution to the latter must relate closely
to the application being implemented.
For example, Machine Learning can be combined with the Hadoop stack [5], and all of
this infrastructure can be bundled into cloud-based services. This would help derive
much better results with regard to predicting the economic impact of immigration [6]
on GDP in the United Kingdom. Such an architecture would provide solutions that handle
issues of scalability, flexibility, and speed, are resilient to failure, and are highly cost
effective when compared to traditional approaches.
In addition, the market is currently seeing an increase in the use of fault-tolerant [7]
and low-cost cloud-based solutions. These services are often offered as a free service
Master's Thesis

  • 1. UNIVERISTY OF GREENWICH Machine Learning with Hadoop in the Cloud by Sridhar Mamella A thesis submitted in partial fulfilment for the degree of Master of Science in Big Data and Business Intelligence in the Faculty of Architecture, Computing, and Humanities Department of Computing and Information Systems September 2015
  • 2. Declaration of Authorship I, SRIDHAR MAMELLA, declare that this thesis titled, ‘MACHINE LEARNING WITH HADOOP IN THE CLOUD’ and the work presented in it are my own. I confirm that: This work was done mainly while in candidature for a postgraduate degree at this University. Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated. Where I have consulted the published work of others, this is always clearly at- tributed. Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work. I have acknowledged all main sources of help. Where the the use of external databases has been availed, there the attributes were throughly checked - the nature of such datasets are open-sourced. Signed: Sridhar Mamella Date: 15 September 2015 i
  • 3. “Design is not just what it looks like and feels like. Design is how it works.” Steve Jobs (1955 - 2011)
  • 4. UNIVERISTY OF GREENWICH Abstract Faculty of Architecture, Computing, and Humanities Department of Computing and Information Systems Master of Science by Sridhar Mamella The work presented in this thesis describes a new infrastructure for scalable machine learning in the cloud with visualisation support. Here, a learner-agnostic and data- parallel approach towards cloud-based distributed learning is implemented, which makes use of the existing single-machine algorithms, without any dependence on distributed file systems or shared memory between instances. Therefore, the product designed, implemented, and released, as a web service is an asynchronous and decentralised peer discovery protocol. This model learns and configures a distributed network of learners. These end models are then filtered, fused, and graphically produced to produce the predictions. This thoughtfully designed framework is built entirely in the cloud, and makes use of a real-world Immigration and Gross Domestic Product (GDP) relationship- problem from a large government database. These datasets were made available from the National Office of National Statistics (ONS), within the United Kingdom. The obtained results demonstrate the reliability and robustness of the system built, and shows how it can be scaled to handle petabytes of data ? if required, with just a few clicks. Finally, the end results are compared with a traditional Machine Learning approach, which makes use of a programming language as the vital tool to effectively leverage the datasets stored in the cloud and compares this to the proposed cloud-based solution. . .
  • 5. Acknowledgements The research presented in this thesis was carried out under the supervision of Dr. Mohammed Hassouna and Dr. Asif Malik. Mohammed has not only provided invaluable guidance and support during this research, but also mentored me. Furthermore, his expertise, understanding, and patience contributed considerably to my postgraduate experience. I appreciate his vast knowledge and skill-set, which have sparked a greater interest in Machine Learning. Finally, his insights and dedication were critical to formulating much of the work presented here. Asif, with his knowledge, suggestions, and ideas, developed my interest in the field of Big Data. He was the one who convinced me to consider Big Data as a research topic, and was always there with useful criticism when improvements were needed. Furthermore, he was the one who once advised me to ‘tell a story’ whenever I am writing technical reports, as everyone likes to read a great story. I would also like to thank all my colleagues from the Big Data research group and the Department of Computing and Information Systems for their invaluable comments, support, and all the shared experiences during my period of study. In addition, I would like to thank Dr. Tatiana Simmonds for all the helpful meetings, swift email replies, and for helping me secure funding to present at the Tableau Conference in Atlanta, USA in March 2015, and the Hadoop Summit in San Jose, USA in June 2015. I would also like to thank Dr. Wim Melis, under whose guidance I wrote my Bachelor of Engineering thesis. Wim continues to provide support and guidance, and he was the one who introduced me to the wonderful world of LaTeX. Moreover, I would like to thank my parents, Kotaiah and Vijayalakshmi, to whom I dedicate this thesis. Without their motivation and support, I would not have made it this far.
  • 6. Contents
      Declaration of Authorship
      Abstract
      Acknowledgements
      List of Figures
      List of Tables
      Abbreviations
      1 Introduction
        1.1 Motivation
        1.2 Research Aim and Objective
        1.3 Research Approach
        1.4 Thesis Structure
      2 Research Approach
        2.1 Introduction
        2.2 Case Study
          2.2.1 Gross Domestic Product
          2.2.2 Immigration
        2.3 Data Flow and Technology Stack
        2.4 Data Sources
          2.4.1 Introduction
          2.4.2 Source Website
          2.4.3 Features in Datasets
        2.5 Legal, Social, and Ethical Issues with Open Datasets
      3 Research Background
        3.1 Machine Learning
          3.1.1 Introduction
          3.1.2 Concepts
          3.1.3 Definition of Machine Learning Terms
  • 7. Contents (continued)
        3.2 Hadoop
          3.2.1 Introduction
          3.2.2 MapReduce
          3.2.3 Hadoop Distributed File System
          3.2.4 Apache Hive
          3.2.5 HCatalog
          3.2.6 Apache ZooKeeper
          3.2.7 Apache Sqoop
        3.3 Cloud Based Services
          3.3.1 Introduction
          3.3.2 Public and Private Clouds
          3.3.3 Amazon Elastic Compute Cloud (Amazon EC2)
          3.3.4 Microsoft Azure Studio
      4 Case Studies
        4.1 Introduction
        4.2 Machine Learning with Hadoop in the Cloud
          4.2.1 Infrastructure Implementation
          4.2.2 Testing
          4.2.3 Infrastructure Evaluation
          4.2.4 Results
        4.3 Machine Learning with Python
          4.3.1 Infrastructure Development
          4.3.2 Implementation
          4.3.3 Infrastructure Evaluation
          4.3.4 Results
        4.4 Comparing Case 1 and Case 2
      5 Conclusions and Future Work
        5.1 Conclusions
        5.2 Future Work
        5.3 The Final Word...
      A Links to External Sources
        A.1 Dropbox
        A.2 Tableau Public Profile
      B Code
        B.1 Python Code
        B.2 R Code
      Bibliography
  • 8. List of Figures
      2.1 Technology Stack
      3.1 Hadoop Ecosystem
      3.2 Map Reduce
      3.3 HCatalog
      3.4 Apache Zookeeper
      3.5 AzureML Pipeline
      4.1 Azure ML
      4.2 Machine Learning Implementation
      4.3 EMR Instance
  • 9. List of Tables
      3.1 On-Demand Instance Prices
  • 10. Abbreviations
      API     Application Programming Interface
      AWS     Amazon Web Services
      DFS     Distributed File System
      EEA     European Economic Area
      EMR     Elastic MapReduce
      ETL     Extract Transform Load
      GB      Gigabyte
      GDP     Gross Domestic Product
      GUI     Graphical User Interface
      GVA     Gross Value Added
      HDFS    Hadoop Distributed File System
      HiveQL  Hive Query Language
      IaaS    Infrastructure as a Service
      MB      Megabyte
      ML      Machine Learning
      PaaS    Platform as a Service
      PB      Petabyte
      RDS     Relational Database Service
      RDBMS   Relational Database Management System
      RFID    Radio-Frequency Identification
      SaaS    Software as a Service
      SQL     Structured Query Language
      TB      Terabyte
      UDF     Universal Disk Format
      ZB      Zettabyte
  • 11. Keywords: Big Data, Hadoop, HCatalog, Hive, MapReduce, Machine Learning, Amazon Web Services, Azure Machine Learning, Tableau, Visualisation, Business Intelligence, Python, R, Elastic MapReduce
  • 12. Chapter 1 Introduction
      “Begin at the beginning,” the King said gravely, “and go on till you come to the end: then stop.” Lewis Carroll, Alice in Wonderland
      1.1 Motivation
      Over the last decades, numerous improvements in cloud computing and distributed computing have significantly increased data-flow capabilities [1]. Unfortunately, these advances have not yet been put to use by combining open-source resources to analyse GDP and immigration, which results in a growing performance gap [2] between the available technology and its application. This growing gap can be attributed to the complexity of such advanced distributed frameworks and their architectures. Many economists make predictions based on pure Machine Learning [3], rather than taking advantage of Big Data technologies [4]. Specifically, an optimal Big Data solution is one that is closely tailored to the application being implemented. For example, Machine Learning can be combined with the Hadoop stack [5], and all of this infrastructure can be bundled into cloud-based services. This would help derive much better results when predicting the economic impact of immigration [6] on GDP in the United Kingdom. Such an architecture would provide solutions that address scalability, flexibility, and speed, are resilient to failure, and are highly cost-effective compared with traditional approaches.
      In addition, the market is currently seeing an increase in the use of fault-tolerant [7] and low-cost cloud-based solutions. These services are often offered as a free service