Digital Prosumer - Identification of Personas through Intelligent Data Mining (Data Science)

Department of Information Systems and Computing
BSc (Hons) Information Systems (Business)
Academic Year 2013 – 2014
Digital Prosumer - Identification of Personas through Intelligent
Data Mining (Clustering)
Adebowale Nadi
1008089
A report submitted in partial fulfilment of the requirements for the degree of
Bachelor of Science
Brunel University
Department of Information Systems and Computing
Uxbridge
Middlesex
UB8 3PH
United Kingdom
T: +44 1895 203397
F: +44 (0) 1895 251686

Abstract
The main objective of this paper is to explore the idea of prosumption and how the digital
personhood data that we produce can be extracted, filtered, analysed and given back to
us [prosumers] in a form that is commodifiable, subsequently empowering citizens to utilize
the data that they produce. One aspect of this hypothesis is the identification of personas through
clustering, which is a facet of intelligent data analysis. The sole aim is to build a
Persona Identification Application (PIA) whose purpose is to deduce personas
from data stores.
In 2011 it was estimated that 274.2 million Americans were connected to the internet,
leading to 81 billion minutes being spent on social networking sites and blogs. In the same
year 117.6 million people accessed the internet via a mobile phone, accounting for $246 billion
being spent on online purchases (Palis, 2012). The renowned management consultancy
firm Boston Consulting Group projects that the Internet economy will contribute $4.2 trillion
to total G20 GDP by 2016. This led co-author David Dean to emphasise that “If it were a
national economy [internet economy], it would rank in the world’s top five, behind only the U.S.,
China, India, and Japan, and ahead of Germany,” (Dein, 2012). With the rise of the internet
economy, coupled with the growing number of mobile devices connected to the internet
facilitating an unprecedented amount of data being held, intelligent data analysis needs to be
used to isolate the key information, thus producing personas that can later be
traded on a futures market.
This paper will look at the rise of the internet economy coupled with the emergence of the
digital prosumer. In addition, clustering will be looked at in fine detail, covering the various
clustering techniques that could be used in the proposed application and the
advantages and disadvantages of each, before deciding which is the appropriate method
for this project. Furthermore, this paper will detail the step-by-step implementation of the
application, including all the design and requirements analysis that took place beforehand.
Finally, a detailed evaluation will be explained and executed, relaying the findings from the
application and determining whether the application in fact meets the aim in a coherent and
comprehensible manner.

Acknowledgements
First and foremost I would like to take this opportunity to thank my Lord Jesus Christ for
guiding me through this project and giving me the strength to be able to conclude this
dissertation. I would also like to thank my Mum & Dad for their indubitable and
unconditional support given to me throughout my time working on this project. In addition,
I would like to personally thank, and extend my sincere gratitude to, all the people who helped,
supported and assisted me in any way, shape or form in putting this dissertation together
(there are too many to name personally, but they know who they are). Last but
certainly not least, I would like to personally thank my supervisor Panos Louvieris and his
assistant Natalie Clewley for all their support rendered to me throughout this project. This
dissertation was, no doubt, the biggest challenge I have faced in all my 19 years in education,
but definitely the most rewarding, learning a highly complex topic (data mining) and learning
to code in a completely new software environment with no prior experience. I truly wouldn’t
have been able to complete it without their guidance, assistance and motivation. In closing I
would like to wish Panos and his team the best of luck in completing their EPSRC sponsored
project Digital Personhood: Digital Prosumer.
Total Words: 15,500
I certify that the work presented in the dissertation is my own unless referenced.
Signature Adebowale Olatunde Nadi
Date 24/03/2014

Table of Contents
Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures
1 Introduction
1.1 Problem Definition
1.2 Aims and Objectives
1.3 Project Approach
1.4 Dissertation Outline
2 Literature Review
2.1 Personal Data
2.2 Value of Personal Data
2.3 The Internet [Digital] Economy
2.3.1 Midata
2.3.2 Information Economy Strategy (IES)
2.4 What is a Persona?
2.5 What is a Prosumer?
2.5.1 The Rise of the Digital Prosumer
2.6 Data Mining
2.6.1 Knowledge Discovery from Data [KDD]
2.7 Cluster Analysis
2.7.1 Partitioning Technique
2.7.2 Advantages and Disadvantages
2.7.3 Hierarchical Technique
2.8 Critical Discussion
2.9 Summary
3 Methodology
3.1 Design Science
3.2 Positivist Approach (Positivism)
3.3 Interpretive Approach
3.5 Software Development Lifecycle Models
3.5.1 Rapid Application Development (RAD)
3.5.2 Analysis
3.6 Waterfall Model
3.7 Analysis
3.8 User Interface Evaluation
3.8.1 Nielsen Heuristics
3.9.1 Cognitive Walkthrough
3.11 Summary
4 Requirements Analysis and Design
4.1 Customer Requirements
4.2 Functional Requirements
4.3 Non-Functional Requirements
4.4 Requirements Summary
4.5 Design
4.6 Activity Diagram
4.7 Use Case
Summary
5 Implementation
5.1 Software Environment - R
5.2 Software Environment - MatLab
5.3 Persona Identification Application Implementation
5.3.1 Application Coding Screenshots
5.3.2 Application Interface Screenshots
5.4 Assumptions
5.5 Summary
6 Results and Evaluation
6.1 Data Pre-Processing
6.2 Results Summary
6.3 Evaluation
6.3.1 Participant selection
6.4 Black-Box Testing
6.5 Evaluation Results
6.6 Black Box Testing Results
6.7 Evaluation Summary
7 Conclusion
7.1.1 Aim - Identify individual personas from prosumers personal information.
7.1.2 Objective 1 - Undertake a state-of-the-art literature review to inform, create a design specification for an identifying personas/Investigate in greater detail the pros and cons of clustering with reference to appropriate literature
7.1.3 Objective 2 - Build a persona identification application
7.1.4 Objective 3 - Evaluate the application
7.2 Future Development
Appendix A Personal Reflection
A.1 Reflection on Project
A.2 Personal Reflection
Bibliography
A.3 Appendices
A.4 Appendices

List of Tables
Table 1 - User Requirements
Table 2 - Functional Requirements
Table 3 - Non-Functional Requirements
Table 4 - Use Case Narrative
List of Figures
Figure 1 - Fayyad KDD representation
Figure 2 - Example of a word sorting dendrogram output from: http://www.macs.hw.ac.uk/texturelab/people/thomas-methven/
Figure 3 - Design Science Guideline from MIS Quarterly Research Essay
Figure 4 - The Engineering Cycle
Figure 5 - Epistemological Assumptions for Qualitative and Quantitative Research from http://dstraub.cis.gsu.edu:88/quant/2philo.asp
Figure 6 - RAD Diagram
Figure 7 - Waterfall Model
Figure 8 - Activity Diagram of Persona Identification Application
Figure 9 - Use Case Diagram of Persona Identification Application
Figure 10 - Import csv file plus description
Figure 11 - Choose variables plus description
Figure 12 - Standardize data and run k-means plus description
Figure 13 - Choose K function plus description
Figure 14 - Show analysis results plus description
Figure 15 - Download results csv file plus description
Figure 16 - Screenshot of Persona Application Interface 1.0
Figure 17 - Screenshot of Persona Identification Application 2.0
Figure 18 - Evidence of data pre-processing Results
Figure 19 - Screenshot of results out CSV file
Figure 20 - Identifying Personas Breakdown
Figure 21 - Percentage Calculator Example
Figure 22 - Persona Percentage Results (Test 1)
Figure 23 - Persona Percentage Results (Test 2)
Figure 24 - System Usability Questionnaire
Figure 25 - Graph showing the optimum number of evaluators
Figure 26 - Functional Test Questionnaire
Figure 27 - Table of Usability Questionnaire Results
Figure 28 - Bar Chart of Usability Questionnaire Results
Figure 29 - Bar Chart showing average usability questionnaire results
Figure 30 - Results of System Functionality Questionnaire

1 Introduction
This dissertation will be looking at the digital prosumer; in particular, concentrating on the
identification of personas gained from wholesome prosumer data stores which can be used
as valuable commodities to sell on the ‘futures’ market. I plan to execute this by identifying
specific personas from a digital vault of prosumer personal information by using intelligent
data analysis, in this case, clustering. During the course of this dissertation I expect to isolate,
analyze and categorize raw prosumer data and present it in a way where I can link it to a
persona. I also expect to find the best clustering technique through an extensive literature
review, analyzing both the advantages and disadvantages of each selected method before
coming to a conclusion on the best technique to use. I will also develop a persona
identification application, which will be used to analyze the data and group it into clusters
that can then be classified into personas. Finally, I will undertake a
comprehensive evaluation of the app to gauge the overall effectiveness of the application.
1.1 Problem Definition
Personal data can generate unprecedented economic and social value for governments,
organizations and individuals in many ways. By 2020 it is estimated that more than 50 billion
devices may be connected to the Internet (Nagel, 2013) and more than 40 times as many
personal data records stored. With the large amounts of data collected from prosumers,
smarter data mining techniques need to be employed to efficiently analyze the data and
identify personas for which data can be traded on a data exchange.
Data mining is the search for valuable information within large volumes of data by
systematically exploring underlying patterns, trends, and relationships hidden in available
data. Data mining techniques can generally be categorized into: (i) classification and
prediction; (ii) clustering; (iii) outlier prediction; (iv) association rules; (v) sequence
analysis; (vi) time series analysis; and (vii) text mining.
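As a minimal illustration of category (ii), clustering, the sketch below standardizes two numeric variables and then groups six records with k-means. It is written in Python purely for illustration and is not the application described in this dissertation, which is coded in R. The records, the meaning of the two variables and the function names are all hypothetical assumptions.

```python
import math
import random


def standardize(rows):
    """Rescale every column to zero mean and unit variance (z-scores)."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # Population standard deviation; fall back to 1.0 for constant columns.
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def kmeans(rows, k, iterations=100, seed=0):
    """Basic k-means (Lloyd's algorithm); returns one cluster label per row."""
    rng = random.Random(seed)
    centroids = rng.sample(rows, k)  # start from k randomly chosen records
    labels = None
    for _ in range(iterations):
        # Assignment step: each record joins its nearest centroid.
        new_labels = [min(range(k), key=lambda j: math.dist(row, centroids[j]))
                      for row in rows]
        if new_labels == labels:
            break  # assignments no longer change, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(col) for col in zip(*members)]
    return labels


# Hypothetical records: (hours online per week, online purchases per month)
records = [[2, 1], [3, 1], [2, 2], [30, 12], [28, 15], [32, 14]]
labels = kmeans(standardize(records), k=2)
print(labels)
```

In this toy data the three light users and the three heavy users end up in separate clusters; each cluster would then be a candidate persona to be interpreted and named by the analyst.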
1.2 Aims and Objectives
The aim of this project is to identify individual personas from prosumers’ personal
information stored in a digital vault using an intelligent data analysis technique, clustering.
To aid me in achieving this aim within this project I have set out a list of objectives that will
help develop the body of this dissertation as well as assist me in determining whether the
project aim has been successfully satisfied.
• Undertake a state-of-the-art literature review to inform and create a design specification
for identifying personas from digital personhood data using intelligent data
analysis techniques (Clustering).
• Investigate in greater detail the pros and cons of clustering with reference to
appropriate literature.
• Build a persona identification application (e.g. using MatLab or R).
• Evaluate the application.
1.3 Project Approach
In order to successfully complete this project I have adopted a five-step approach. At each
stage I will set deliverables that will help me achieve my aims and
objectives and complete this project on time.
The first step will be to conduct a state-of-the-art literature review. This review will look at
different cluster analysis techniques from a variety of physical and online sources.
This will inform the design of my application, which is the cornerstone of this
project. In addition I will look at what has been done in terms of cluster analysis and try to
synthesize that information and relate it back to my project. The second step will be to
look at different methodological principles and models, picking the most appropriate
method for this project with appropriate reference to the literature. Selecting the right
methodology is pivotal to the success of this project. The third step will be to analyse the
user requirements, discuss the design of my application and evaluate the GUI. Once
this has been discussed and illustrated I will proceed to code my application, which
will be done in R-Studio. The fourth step will be ascertaining the results of the application
and trying to find personas in the clustered dataset. The way I went about deciphering
the information and deducing personas will be shown and explained at this stage. The final
step of this project will involve evaluating the application and the project as a whole. This
will be coupled with personal reflection on my experiences of putting this project together.

1.4 Dissertation Outline
Chapter 2: Literature Review – This chapter will look into previous literature that will
help me gain a deeper understanding of my research problem. Subsequently it will
help inform the design of my application.
Chapter 3: Methodology – This chapter will look at different methodological principles as
well as software development lifecycle (SDLC) models, critically discussing the
strengths and weaknesses of each before isolating the principle and SDLC model that are the most
appropriate for my project.
Chapter 4: Requirements Analysis and Design – This chapter will look at the requirements of
the application set out by the user and analyze the functional and non-functional
requirements. In addition I will go through the design process of my application and
how I intend to put it all together.
Chapter 5: Implementation – This chapter will demonstrate the coding of the logic of my
application in R and the coding of the interface using R-Shiny. I will be including fully
annotated screenshots depicting evidence of implementation.
Chapter 6: Results and Evaluation – This chapter will be showing the results of the
application as well as showing how I went about deducing personas from the application. I
will also be looking into evaluating the app and seeing if it has met the aims and objectives
set out at the beginning.
Chapter 7: Conclusion – This chapter will be drawing conclusions to all the findings brought
about in this project. I will be concluding my aims as well as all 3 of my objectives. In addition
I will be evaluating my application from a subjective point of view as well as the project in
its entirety. I will also be suggesting future work to make my application even better.

2 Literature Review
In this chapter I will be discussing and reviewing the different clustering methodologies
available, analyzing the advantages and disadvantages of each technique with reference to
the appropriate literature. This, along with personal evaluation, will support me in concluding
which technique is the most appropriate for executing this project and give me
adequate justification for the chosen method. In addition, I will look in further
detail at what personal data is and how it has metamorphosed into an
increasingly important aspect of economic growth and corporate supremacy, consequently
delivering a new breed of prosumer: the digital prosumer.
2.1 Personal Data
If we look at the European Data Protection Directive [Article 2] we see that personal data is
defined “by reference to whether information relates to an identified or identifiable individual”
(Information Commissioner Office, 2010). In other words, personal data is any piece of
information that can be used to identify an individual or an individual characteristic. The
Data Protection Act 1998 adds a different dimension to the EDPD definition of ‘data’ by
taking into account the way the information is processed before it can be regarded as data,
e.g. processed automatically or processed non-automatically. The EDPD and the Data Protection
Act share a common understanding of what personal data/information is:
- Information processed, or intended to be processed, wholly or partly by
automatic means (that is, information in electronic form) (ICO, 2010)
- Information processed in a non-automated manner which forms part of, or is
intended to form part of, a ‘filing system’ (that is, manual information in a
filing system) (ICO, 2010)
2.2 Value of Personal Data
Personal information is an increasingly important asset in the twenty-first century, both in
terms of corporate monetary value and government efficiency as well as economic prowess.
Accordingly, companies around the world have begun
investing heavily in software that facilitates the collation of consumer data (Schwartz,
2003). It is estimated that people across the world send 10 billion text messages
every day and make 1 billion posts to blogs or social media sites, leading to a
new type of economy emerging: the Internet economy. It is estimated that the Internet
economy within the G20 amounted to $2.3 trillion, or 4.1% of total GDP, in 2010 (Group, 2012).

2.3 The Internet [Digital] Economy
Sometimes called the digital or web economy, the Internet economy is a concept based on
digital technologies fusing with the traditional economy. The concept was first established by Don Tapscott in
his critically acclaimed book The Digital Economy: Promise and Peril in the Age of
Networked Intelligence, and it is widely believed that the Internet economy is positioning itself
as the new cornerstone for any emerging or established economy (Tapscott, 1997). This is
evident from the recent figures released by the Boston Consulting Group in their Digital Manifesto
report, which states that the value of the Internet economy is currently larger than that of
countries like Brazil and Italy and that by 2016 the value of the Internet economy is
expected to double to $4.2 trillion. The report also goes on to say that “no company or country
can afford to ignore this [Internet economy] phenomenon” (David Dean, 2012). The rise in
the amount of data being produced is strongly linked to innovation in mobile technology
since the turn of the millennium, which has allowed more devices than ever to make a
connection with the cyber-world that is the Internet. Steve Wojtowecz, Vice President of
storage software development at IBM, stated that by the year 2015 over a trillion devices
would be connected to the internet (King, 2011). As a consequence the UK government has
started up two initiatives, Midata and the Information Economy Strategy (IES), to give prosumers
improved and sufficient access to the personal data that companies hold about
them (BIS, 2011).
2.3.1 Midata
These are the key principles [aims] of the Midata initiative outlined in its government report
(Department for Business, Innovation & Skills, 2013):
- Get more private sector businesses to release personal data to consumers
electronically
- Make sure consumers can access their own data securely
- Encourage businesses to develop applications (apps) that will help
consumers make effective use of their data
2.3.2 Information Economy Strategy (IES)
These are the key principles [aims] of the IES project outlined in its government report
(Department for Business, Innovation and Skills, 2013):
- A strong, innovative, information economy sector exporting UK excellence to the
world
- UK businesses and organizations, especially small and medium enterprises
(SMEs), confidently using technology, able to trade online, seizing technological
opportunities and increasing revenues in domestic and international markets
- Citizens with the capability and confidence to make the most of the digital age
and benefiting from excellent digital services.
Long-term success will be underpinned by:
- A highly skilled digital workforce (whether specialists who create and develop
information technologies, or non-specialists who use them)
- The digital infrastructure (both physical and regulatory) and the framework for
cyber security and privacy necessary to support growth, innovation and
excellence (Department for Business, Innovation and Skills, 2013).
It’s important to remember that both these government initiatives are being reinforced by
reviews of and changes to legislation such as the Data Protection Act, the Consumer Rights Bill
[at both UK and EU level] and the Enterprise and Regulatory Reform Act 2013. The reason is
that this legislation will require companies to disclose customers’ personal data to them if they opt
not to do so voluntarily (Department for Business, Innovation & Skills, 2013).
2.4 What is a Persona?
Typically used as a marketing and human-centered design [HCD] tool, personas are
hypothesized groups of users that exhibit similar behavioral patterns in their use of
technology, lifestyle decisions, customer service preferences and purchasing
decisions. Angus Jenkinson first came up with a top-down analytical approach that works by
‘grouping’, focusing on a synthetic clustering process that leads to ‘customer communities’ and
the creation and preservation of loyalty within these communities, in his 1994 journal article
Beyond Segmentation (Jenkinson, 1994). This concept was refined five years later by Alan
Cooper in his pioneering book The Inmates Are Running the Asylum, in which Cooper introduces
the concept of the ‘persona’ that is used today to identify customer behavior
and consumption patterns (Cooper, 1998).
2.5 What is a Prosumer?
It is widely considered that Alvin Toffler is the creator of the concept of prosumption. He
defines prosumers in his book ‘The Third Wave’ as people who “produce some of the goods and
services entering their own consumption” (Toffler, 1980) (Kotler, 1986). In other words,
people who produce and consume their own products and services are prosumers. In the 21st
century the prosumer has become more and more prominent, replacing the traditional
consumers of the Industrial Age. This lends credence to Toffler’s own prediction that, as society
moves towards the Post-Industrial Age, the number of pure consumers will decline, being
replaced with “prosumers” (Toffler, 1980).
2.5.1 The Rise of the Digital Prosumer
Consequently, as we delve deeper into the Information Age and the Internet Economy
continues to evolve into an economic juggernaut, a new type of prosumer has emerged: the
digital prosumer. The digital prosumer is a person who creates and consumes his or her own
data. As of today the biggest beneficiaries of the personal data produced are the so-called
big 3 data companies, Google, Facebook and Twitter, which make upwards of $1200
from a user profile (Madrigal, 2012).
2.6 Data Mining
Data mining is the iterative process of extracting or “mining” knowledge from large
data stores, which can then be put into perspective and turned into useful
information. Data mining is thought to involve six common classes of task that lead to prediction
and description, which are the primary goals of data mining (Wikipedia, 2011)
(Kamber, 2006):
• Classification – is learning a function that classifies a single data item into one of
several predefined classes. Examples of classification techniques:
- Bayesian classifiers
- K-nearest neighbor
- Linear classifiers
• Regression – is learning a function that maps a data item to a prediction variable.
In other words regression estimates the relationship between any two variables.
Some examples of regression models are:
- Percentage regression
- Bayesian linear regression
- Nonparametric regression
• Clustering – is a descriptive task that aims to identify clusters or
categories that describe the data. Examples of clustering techniques are:
- Hierarchical
- Partitioning
- Density-Based
- Centroid-Based
• Summarization – is a method for finding a cohesive description of a data set, this
includes analytical representation such as visualization and report generation
• Dependency modeling – is a method that consists of finding a model that depicts
significant dependencies between variables
• Change and deviation detection – is a method that focuses on finding the most
significant changes from previously measured data. (Usama Fayyad, 2008)
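To make the difference between classification and clustering more concrete, the sketch below shows a very small K-nearest neighbor classifier in Python. It is only an illustration under stated assumptions: the training data, the class labels "light" and "heavy" and the function name knn_classify are invented for this example and do not come from the project dataset. The point it illustrates is that classification needs predefined, labelled classes, whereas clustering (used later in this report) does not.

```python
import math
from collections import Counter


def knn_classify(sample, labelled, k=3):
    """Assign `sample` the majority label among its k nearest neighbours."""
    ranked = sorted(labelled, key=lambda item: math.dist(sample, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Invented training data: (weekly spend, items bought) with a predefined label.
training = [((10.0, 2.0), "light"), ((12.0, 3.0), "light"),
            ((11.0, 4.0), "light"), ((80.0, 40.0), "heavy"),
            ((85.0, 42.0), "heavy"), ((78.0, 39.0), "heavy")]

# A new, unlabelled household is assigned to one of the predefined classes.
prediction = knn_classify((75.0, 37.0), training)
```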
2.6.1 Knowledge Discovery from Data [KDD]
KDD is often confused with data mining itself; however, it’s safe to say that data
mining is one essential part of the knowledge discovery process. Usama Fayyad proposed the
methodology of KDD in 1995 with the purpose of making data produced by companies useful
to their business needs (Deutsch, 2010).
Figure 1 - Fayyad KDD representation
Knowledge discovery is an iterative sequence of the following steps
(Kamber, 2006):
• Data Cleaning – to remove noise and inconsistent data
• Data Integration – where multiple data sources may be combined
• Data Selection - where data relevant to the analysis task are retrieved from the
database
• Data Transformation - where data are transformed or consolidated into forms
appropriate for mining
• Data Mining – an essential process where intelligent methods are applied in order to
extract data patterns
• Pattern Evaluation – to identify the truly interesting patterns representing
knowledge
• Knowledge Presentation – where visualization and knowledge representation are
used to present the finished knowledge to the user
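The sequence of KDD steps listed above can be sketched as a chain of small Python functions. This is a minimal illustration and not the design of the PIA: the two "store" record lists, the function names and the deliberately simple mining step (splitting households around the mean spend) are all assumptions made for the example.

```python
from statistics import mean

# Hypothetical raw records: (household_id, weekly_spend, items_bought).
# None marks a missing value; all figures are invented for illustration.
store_a = [(1, 48.0, 30), (2, None, 12), (3, 75.5, 41)]
store_b = [(4, 20.0, 9), (5, 61.0, 35), (3, 75.5, 41)]


def clean(records):
    """Data cleaning: drop records that contain missing values."""
    return [r for r in records if None not in r]


def integrate(*sources):
    """Data integration: combine sources, keeping one record per household."""
    merged = {}
    for source in sources:
        for record in source:
            merged[record[0]] = record
    return list(merged.values())


def select(records):
    """Data selection: keep only the attributes relevant to the analysis."""
    return [(spend, items) for _, spend, items in records]


def transform(rows):
    """Data transformation: rescale each attribute to the range 0..1."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]


def mine(rows):
    """Data mining (placeholder): split rows around the mean of attribute 0."""
    cutoff = mean(r[0] for r in rows)
    return {"low": [r for r in rows if r[0] < cutoff],
            "high": [r for r in rows if r[0] >= cutoff]}


def run_kdd():
    records = integrate(clean(store_a), clean(store_b))
    patterns = mine(transform(select(records)))
    # Pattern evaluation and presentation are reduced to a size summary here.
    return {label: len(group) for label, group in patterns.items()}


summary = run_kdd()
```

In a real application the placeholder mine function would be replaced by a proper data mining method such as clustering, and the evaluation and presentation steps would be considerably richer than a count per group.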

2.7 Cluster Analysis
Cluster analysis can be defined as the process of grouping a set of physical or abstract objects
into classes of similar objects. In other words, a cluster is a collection of
data objects that are similar to one another within the same cluster and dissimilar to objects in other
clusters. An advantage of clustering or cluster analysis is that it can single out useful features
that define characteristics within different groups, which, in turn, will help me in my aim of
identifying personas from prosumer data (Kamber, 2006). There are various
cluster analysis techniques, such as Partitioning, Hierarchical (Agglomerative and Divisive)
and the Single Link Method (Raza Ali, 2004).
2.7.1 Partitioning Technique
Partitioning methods relocate data objects from one cluster to another, starting
from an initial partitioning. The method also requires the number of clusters to
be pre-set by the user. It is also commonly cited that achieving global optimality in this type
of clustering requires an exhaustive enumeration of all possible partitions. Because
this is impractical, most applications choose one of two popular algorithms, K-means and
K-medoids (Kamber, 2006):
• K-Means Algorithm
K-means represents each of the K clusters
by the mean value (centroid) of the objects present in the cluster
• K-Medoids Algorithm
K-medoids, on the other hand, represents each cluster
by one of its objects located nearest to the center of the cluster.
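The two representations described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the PIA: the household figures, the function names and the choice to seed the centroids with the first K points are assumptions made for the example. The k_means function repeats two steps (assign each object to its nearest centroid, then recompute each centroid as the mean of its cluster) until the centroids stop changing, while medoid picks the actual member of a cluster that is closest to all the others.

```python
import math


def mean_point(points):
    """Return the coordinate-wise mean (centroid) of a list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def nearest(point, centres):
    """Index of the centre closest to point (Euclidean distance)."""
    return min(range(len(centres)), key=lambda i: math.dist(point, centres[i]))


def k_means(points, k, max_iter=100):
    """Basic k-means; the first k points seed the centroids (an assumption)."""
    centroids = list(points[:k])
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Keep the old centroid if a cluster ends up empty.
        updated = [mean_point(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters


def medoid(cluster):
    """K-medoids style representative: the member with the smallest
    total distance to the other members of its cluster."""
    return min(cluster, key=lambda c: sum(math.dist(c, o) for o in cluster))


# Invented data: (weekly spend, number of items) for six households.
households = [(10.0, 2.0), (80.0, 40.0), (12.0, 3.0),
              (85.0, 42.0), (11.0, 2.5), (78.0, 38.0)]
centroids, clusters = k_means(households, k=2)
```

Note that a centroid is a computed average and need not be a real household, whereas a medoid is always one of the original data objects.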
2.7.2 Advantages and Disadvantages
The K-means technique has advantages as well as disadvantages. One of the main
advantages is that k-means works well for finding spherical-shaped clusters within small
to medium-sized data stores. Another advantage of k-means is that the method tends to
produce tighter, more compact clusters than, say, hierarchical clustering (Lior Rokach,
2010).
However, there are also disadvantages to this technique, one of them being that it is
limited in the type of cluster structure it can model. The effectiveness of the
algorithm is predicated on spherical-shaped clusters, sometimes called globular, as this
enables the mean value to be positioned close to the center of the cluster. This
consequently means that clusters that are not of a similar size, or that belong to large datasets, tend not to work
well with this algorithm. Another disadvantage of this algorithm is that it is very sensitive to
noisy data and outliers, which can increase the squared error significantly. In addition,
the user is required to know the number of clusters beforehand, which can be a very tedious task
(Improved Outcomes Software (ios), 2009).
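The sensitivity to outliers mentioned above can be shown with a small worked example. The sketch below computes the squared error of a single cluster (the sum of squared distances from each object to the cluster mean) with and without one outlying point. The data points and function names are invented for illustration.

```python
def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def squared_error(points):
    """Sum of squared Euclidean distances from each point to the centroid."""
    centre = centroid(points)
    return sum(sum((v - c) ** 2 for v, c in zip(p, centre)) for p in points)


compact = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
with_outlier = compact + [(30.0, 30.0)]

sse_compact = squared_error(compact)        # 4.0, centroid at (2.0, 2.0)
sse_outlier = squared_error(with_outlier)   # 1180.0, centroid at (9.0, 9.0)
```

A single outlier drags the mean from (2, 2) to (9, 9), so that the centroid no longer sits near any of the original objects, and the squared error rises from 4 to 1180.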
2.7.3 Hierarchical Technique
Hierarchical methods aim to create a hierarchical decomposition of the given set of data
objects. This method can be subdivided into two techniques: Agglomerative and
Divisive. The agglomerative method, also called the bottom-up approach, starts with
each data object forming a separate group; the clusters are then successively
merged until the desired cluster structure is achieved. The divisive method, also
called the top-down approach, starts with all the data objects in the same cluster, which is then
partitioned into sub-clusters, which in turn are partitioned into further sub-clusters. This
sequential process is repeated until the desired cluster structure is obtained. One of the
intriguing things about hierarchical clustering is that it provides a readable visual representation of
how the algorithm groups the data; this is called a dendrogram. This is a useful summarization tool
that makes hierarchical clustering extremely popular (Lior Rokach, 2010).
Figure 2 - Example of a word sorting dendrogram output from:
http://www.macs.hw.ac.uk/texturelab/people/thomas-methven/
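The agglomerative (bottom-up) method can be sketched as follows. This is a simplified illustration that assumes single-link distance and a fixed target number of clusters as the stopping rule; the points and function names are invented for the example. The merges list records the order in which clusters were joined, which is the information a dendrogram displays.

```python
import math
from itertools import combinations


def single_link(a, b):
    """Single-link distance: the closest pair of points across two clusters."""
    return min(math.dist(p, q) for p in a for q in b)


def agglomerate(points, target):
    """Bottom-up clustering: start with one cluster per point and keep
    merging the two closest clusters until `target` clusters remain."""
    clusters = [[p] for p in points]
    merges = []  # merge history, the information a dendrogram draws
    while len(clusters) > target:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    return clusters, merges


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 7.0), (20.0, 20.0)]
clusters, merges = agglomerate(points, target=2)
```

Each merge is final: once two clusters have been joined, the loop never splits them again. Comparing every pair of clusters at every step is also what makes this approach expensive on large datasets.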
It’s important to remember that hierarchical techniques have advantages as well as
disadvantages. One of the advantages is versatility; methods like single-link
maintain a strong performance on datasets containing well-separated, chainlike and
concentric clusters. Another advantage of hierarchical methods is the fact that they produce
multiple partitions. This is particularly useful for users who want to choose different
partitions from those nested in the overall hierarchy, according to the desired similarity
level.
On the other hand, the disadvantages of this particular technique are quite evident.
Hierarchical algorithms are notorious for their inability to scale well; they are also
known to cause high I/O costs when clustering a large number of objects. Another
disadvantage of the hierarchical technique is its rigidity: simply put, once a step is
done in the sequence it can never be undone or modified (Lior Rokach, 2010).
2.8 Critical Discussion
Having reviewed the advantages and disadvantages of hierarchical and partitioning techniques,
it’s important to analyse both techniques in relation to this project in order to
distinguish the most appropriate technique for clustering. From my research I can
see that partitioning clustering works better on small data sets than on bigger data
sets; the dataset used in this project is fairly large, containing data from 2,500 households’
weekly shops. Partitioning clustering also produces tighter, more cohesive clusters
through its k-means algorithm, which makes it easier to identify the key features within each
cluster, which in turn define persona characteristics. On the other hand, to limit the effect of
noisy data while clustering it is advantageous for users to know the number of
clusters in advance; this is nearly impossible with the size of the database in question.
Looking at the other side of the coin, the hierarchical technique is very versatile,
offering different methods such as single link, complete link and average link, which
consequently deliver well-separated clusters. I believe this will work well in this project, as it will
aid in presenting personas from the dataset provided. In addition, hierarchical
algorithms such as Chameleon are designed to ensure the quality of the clusters produced,
which will be good for ensuring that the personas defined are validated. On the
other hand, the hierarchical technique is very rigid, so if erroneous decisions occur it is nearly
impossible for them to be corrected. This is a big disadvantage for this project, as
identifying personas will need a great deal of flexibility because parameters for personas can
change at any given time.
In light of all the information reviewed, it’s fair to say that both techniques offer a number of advantages and
disadvantages; in order to obtain the best and most concise results I
believe consensus clustering would be the best option. However, due to time constraints and
lack of expertise in coding, I have decided to use the K-Means algorithm to provide the logic
for my application. I intend to then build an interface which simplifies the steps of the
K-Means algorithm and presents them in a way that is easy for the user to administer. The choice of
which software environment I will use to code the interface, as well as the justification for it,
will be made in Chapter 5.
2.9 Summary
In this chapter I have discussed personal data and its value, and I have looked into the
definition of personas, coupled with the rise of the prosumer and the Internet economy.
Furthermore, I have discussed in detail what cluster analysis is, looking in particular at two
clustering techniques (Hierarchical and Partitioning) and offering an in-depth critical discussion
of my chosen technique to take forward into my application. The findings of the chapter
will further equip me to meet the aims and objectives set out for this project. In addition,
they will assist me in constructing a design specification for my application.

3 Methodology
This chapter explores different research methodologies and justifies
the methodology chosen for this project. The three
methods in question are Design Science, Positivist and Interpretive. The methodology I
have decided to use is the design science approach. The justification is supported by
appropriate reference to the literature, as well as a personal analysis of the different
approaches.
3.1 Design Science
As previously mentioned, the design science approach is my chosen methodology for this
project. Design science, simply put, is a methodical approach to designing, or to research design.
First established by American inventor Richard Buckminster Fuller in 1963, the concept of
design science was further developed by Gregory in his 1966 book “The Design
Method”, in which he demarcates the relationship between design method and scientific
method. He further accentuates his view that design is not inherently a science and that the
term design science pertains to the scientific study of design. As technology continued
to evolve at the turn of the century, design science became more integrated into
information systems research and software design projects. In 2004 Alan Hevner produced a
seven-guideline framework with the aim of assisting information systems researchers to
conduct, evaluate and present design-science research (Alan R. Hevner, 2004).
Figure 3 - Design Science Guideline from MIS Quarterly Research Essay.
This framework was later refined by Peffers in order to explain how the
regulative cycle fits into the design science research framework.

Figure 4 - The Engineering Cycle
This framework is widely used today by information systems researchers, as it provides
a means to analyze and decipher an existing problem and offer a solution
design or solution hypothesis. They can then examine whether their solution or
hypothesis is effective or meets the specified criteria; this can be done through a pilot
scheme or prototyping, after which the full implementation can take place (Roel Wieringa,
2010). In my opinion this principle suits my project the most, as I aim to
design a software solution (clustering program), build it, and then evaluate the
effectiveness of the solution.
3.2 Positivist Approach (Positivism)
The positivist approach is a methodology in which an objective hypothesis, often arising from
introspection or intuition, is validated or disproved by scientific testing and experimentation
(Sage Publications, 2009). In other words, a positivist approach starts with a hypothesis
about a subject area and then goes on to prove or disprove the hypothesis by
experimentation or by building a solution (University of the West of England, 2007). The
origins of the method lie with sociologist Auguste Comte, who coined and developed the term
in the early 19th century. Today the positivist approach is used increasingly in IS and
software engineering projects (Sociology Guide, 2008). One advantage of the
positivist approach is that it relies heavily on quantitative rather than qualitative
data; quantitative data is seen as more scientific and thus a more reliable basis for hypotheses.
Another advantage of the positivist approach is that it follows a very stringent
structure, as positivists believe that there are guidelines in place that need to
be adhered to, which as a consequence should minimize room for error. Positivists
believe that this reduced room for error makes the whole approach more
accurate when it comes to experiments and applications. On the other hand, there
are drawbacks to the approach, one of them being human behavior. Positivists strongly
believe in objective assumptions; however, there is no guarantee that bias or subjective
analysis won’t corrupt the study (Johnson, 2010) (Wikipedia, 2014).
Figure 5- Epistemological Assumptions for Qualitative and Quantitative Research from
http://dstraub.cis.gsu.edu:88/quant/2philo.asp
3.3 Interpretive Approach
The interpretive approach is a qualitative research method based on subjective
assumptions, with knowledge derived from value-laden, socially constructed
interpretations (Packer, 2007). In stark contrast to the positivist approach, interpretivist
researchers aim to understand and interpret human behavior as opposed to generalizing and
predicting cause and effect. The impact this has on information systems and software design
projects is that the researcher will ask several open-ended questions, generally
through questionnaires or unstructured / semi-structured interviews and sometimes
observations, to gather as much primary information as possible once the scope of the project
has been defined (WordPress, 2012). This approach also enables the researcher
to remain open to new ideas throughout the duration of the project, unlike
positivists, who believe in pre-ordained rules and guidelines. With that being said,
there are advantages as well as disadvantages to this approach. One advantage is that
the research methodology is highly qualitative, meaning that the data gathered will be
more in-depth. However, a drawback is that interpretivists take a subjective view of
the project, which can lead to bias getting in the way of ascertaining the correct
results or the best methods to apply in completing the project (Institute of Public &
International Affairs, 2009) (Slideshare, 2013).
Having looked at all three research approaches in appropriate detail, highlighting the
advantages and disadvantages of each, it’s safe to say that all have adequate potential as
the framework for an information systems project. However, I believe that the best
approach to adopt for this particular project is the Design Science approach, as it
offers the closest fit between what I am trying to achieve in this project and the
design science approach itself (design, build, evaluate). With that being said, I
believe that I can still look at this project from a positivist point of view. The reason I say this
is that using data mining to develop ‘personas’ is a relatively novel idea, so I am starting from a
hypothesis and trying to prove that it is possible and can be done.
3.5 Software Development Lifecycle Models
There are many models that can be used to develop a software project. All of these models
follow the design science principle of design, build, evaluate. What I aim to achieve in this
section is to identify and describe two common models, offering adequate analysis of
each, after which I will identify the best model to adopt for my project.
3.5.1 Rapid Application Development (RAD)
Rapid Application Development is an iterative model that favors rapid, early software
prototyping as opposed to traditional planning. This approach consequently allows the
development of software to take place much sooner. It also keeps stakeholders at the heart of
the development process and allows requirement changes to take place easily. RAD typically
follows four phases in its model: the Requirements Planning Phase, User Design Phase,
Construction Phase and Cutover Phase (Wikipedia, 2014) (David C. Yen, 1999).
1. Requirements Planning Phase – The inaugural phase of the project, where the
project team meets with the stakeholders to go over the business needs of the client,
the project scope, system requirements and constraints. This is followed by an
agreement on the key issues that need to be addressed, after which the relevant
authorization needs to be obtained in order to proceed
2. User Design Phase – In the second phase of the project the stakeholders
maintain a dialogue with the project analysts to develop prototype models of the
system that clearly represent all system input and output features plus
all the processes within the system. This phase of RAD is a continuous,
interactive process that allows the stakeholders to play an active role in
understanding, modifying and eventually approving a working prototype model
once they see a model that caters to their business needs
3. Construction Phase – The penultimate phase of the project focuses on
program and application development. Stakeholders continue to participate by suggesting
changes and improvements to any user interfaces or reports that are
developed at this phase. Programming, application development, unit and integration testing
and system testing are done at this phase of RAD.
4. Cutover Phase – The final phase of RAD is when the whole project is
brought to a head. Tasks such as testing, data conversion, user training and system
changeover are done at this stage. Compressing all these tasks into the final stage
enables the new system to be delivered to the stakeholders in a much shorter
timeframe.
Figure 6 - RAD Diagram
3.5.2 Analysis
The RAD model comes with advantages as well as disadvantages. The key is to
synthesize them and relate them back to my project. One of the common advantages
of the RAD model is that it drastically reduces the time needed for requirements analysis.
Also, all prototypes created can be stored for
future use, which consequently speeds up the software development of the product.
However, heavy prototyping is not necessary for my project, as it’s a fairly short,
small project with strict user requirements (Rouse, 2007) (ISTQB Exam Certification,
2012).
3.6 Waterfall Model
The waterfall model is a sequential design model in which software development
flows downwards through several phases of tasks/activities (reminiscent of an actual
waterfall). It differs from agile development models in that it seeks to fully describe
the application through written documents before actual software development commences.
Originally developed by Royce in 1970, the waterfall model follows seven sequential phases
(The Waterfall Development Methodology, 2012):
1. Requirements Specification – The requirements are gathered from the
stakeholders and agreed on in principle with the development team.
2. Design – The blueprint of the project is drawn up and given to the developers to
commence coding and start implementation
3. Implementation – The actual system is developed at this stage; all coding is
completed, resulting in the actual program
4. Integration – The system created is integrated into the environment agreed on in the
preliminary phase
5. Testing – Full testing of the integrated system is performed at this stage; debugging
also happens at this stage, with a view to finding any bugs and working on
potential fixes and patches
6. Installation – The system is installed, and any old system removed,
at this stage. This stage also includes training for all stakeholders and staff members
7. Maintenance – The installed system is maintained through continuous updates and
patches being developed and installed.
The waterfall model follows a strict principle that you can only move forward to the next
phase once the current phase has been completed and perfected, meaning that
once a phase is completed it is not revisited (ISTQB Exam Certification, 2012).
Figure 7 - Waterfall Model
3.7 Analysis
The waterfall model comes with many advantages. One of the most commonly cited is the
sequential nature of the model, which makes it very easy to understand and execute. Another
advantage is that it works well on projects that are fairly small with strict, set-in-stone
requirements, which suits my project. Another reason I favor this SDLC is that it
goes hand in hand with the design science approach (design, build and evaluate)
(Select Business Solutions, Inc., 2010).
3.8 User Interface Evaluation
One of the most integral parts of any software project is being able to coherently evaluate the
design of the artefact. As previously stated, the user requirements are used to inform the
design of the application; once this is done, a framework or set of principles needs to be
applied in order to evaluate it. One of the most popular techniques for usability
evaluation is the Nielsen Heuristics. In this section of the report I discuss the
Nielsen Heuristics in detail, as well as another usability inspection method, the Cognitive
Walkthrough, in order to draw qualitative comparisons between the two methods. This in turn will
help me decide on the most suitable approach for evaluating the usability of the Persona
Identification Application.
3.8.1 Nielsen Heuristics
As previously stated, the Nielsen Heuristics is one of the most popular usability evaluation
techniques in use today. It’s important to remember that heuristic
evaluation complements conventional user testing. It does this by providing a
template or set of principles that help uncover problems a user is likely to come across.
Jakob Nielsen’s work with Rolf Molich in 1990
originated the heuristics that are widely used today. However, it was in his 1994 publication
Usability Engineering that the actual ten heuristics were published for the first time
(Nielsen, 1994).
(Some of the heuristics have been shortened for brevity)
1. Simple and Natural Dialogue – The dialogue should not contain information that is
irrelevant or rarely needed
2. Speak the User’s Language – The dialogue should be expressed clearly in words,
phrases, and concepts familiar to users rather than in system oriented terms
3. Minimize the User’s Memory Load – The user should not have to remember
information from one part of the dialogue to another
4. Consistency – Users should not have to wonder whether different words, situations
or actions mean the same thing
5. Feedback – The system should always keep users informed about what is going on,
through appropriate feedback within reasonable time.
6. Clearly Marked Exits – Users often choose system functions by mistake and would
need a clearly marked ’emergency exit’
7. Shortcuts (Accelerators) – Unseen by novice users, but often speed up the
interaction for expert users.
8. Good Error Messages – They should be expressed in plain language (no code) to
precisely indicate the problem
9. Prevent Errors – Even better than good error messages is a careful design that
prevents a problem from occurring in the first place
10. Help and Documentation – Even though it is better if the system can be used
without documentation, it may be necessary to provide help and documentation. Any
such information should be easy to search, be focused on the user’s tasks, list
concrete steps to be carried out and not be too large
Nielsen heuristics come with advantages as well as disadvantages. One
advantage is that they are a very useful and relatively inexpensive way of
providing quick feedback to designers, which can reduce the overall time that
a product spends in the usability evaluation stage. Furthermore, they can be a good way of obtaining
qualitative feedback early in the design process. Another advantage of heuristic
evaluation is that it can help immensely in suggesting the best corrective measures for
designers, provided that the correct heuristic has been assigned in the first place. This would
prove helpful when designing the user interface for the Persona Identification
Application (PIA). There are also a few disadvantages to this
evaluation principle. One is that it requires specialist knowledge and
experience for the application of the heuristics to be effective. Moreover, usability experts
trained to administer the heuristics effectively are hard to come by and can be relatively
expensive to source. Another disadvantage is that heuristic evaluation can be
misleading, in that it can identify more of the minor issues and fewer of the major issues
with the design (Usability.Gov, 2010) (Nielsen, 1994).
It is important to remember that heuristic evaluation does not replace
conventional usability testing and should not be seen as an alternative to it. Having weighed the
benefits and drawbacks highlighted above, I am in no
doubt that the Nielsen Heuristics are the right evaluation method for the user interface
of the application. The reason is that, in essence, they cover all the basic requirements set
by the stakeholders; they also give me things to consider while designing the app (e.g.
accelerators and consistency) as well as things to evaluate at the end of the design
process.
The way I intend to go about this heuristic
evaluation is to construct a usability questionnaire as well as system functionality test in
order to be able to coherently ascertain the usability of the system, also to be able to test
the functionality of the system, thus validating the user requirements.
3.9.1 Cognitive Walkthrough
In order to balance the argument for which evaluation technique to use, it is imperative to
draw a comparison. One direct comparison to the Nielsen heuristics is the
Cognitive Walkthrough approach, which was developed as an additional
tool in usability engineering. The technique involves a group of evaluators undertaking a set
of tasks on the interface to evaluate its ease of learning and understandability. Lewis and
Polson first set out the concept of the cognitive walkthrough, which works by tasking the
evaluators with four questions (usabilityfirst, 2011) (Cathleen Wharton, 1994):
• Will the user try to achieve the right effect?
• Will the user notice that the correct action is available?
• Will the user associate the correct action with the effect to be achieved?
• If the correct action is performed will the user see that the progress is being made
toward solution of the task?
After all these questions are answered, the evaluator attempts to construct a ‘success story’ for
each incremental step of the process. If this turns out to be impossible, the evaluator will
then create a ‘failure story’, which aims to assess why the user cannot accomplish the task
based on the GUI. The findings from the walkthrough are later aggregated and used to make
improvements to the application, in this case the Persona Identification App. Like the
heuristics discussed earlier, cognitive walkthrough has advantages as well as
disadvantages. One of the main advantages is that it is useful for identifying problems early in
the design phase, and it helps define users' goals and assumptions with fewer resources
than, say, full user testing would demand. This technique fits well with the scope of my project
as it provides a short and concise evaluation of the user interface I will be designing; it also
provides a user-centred perspective similar to what the heuristics offer.
However, one of the main issues with cognitive walkthrough is that it is more susceptible to subjective
bias from the evaluators, which may result in the main issues not being covered. Another issue
is that it can be very difficult for a seasoned evaluator to assume the perspective of an
inexperienced user of the system (Lewis, 1997).
3.11 Summary
In this chapter I have looked in depth at three design principles, evaluating each of them
and choosing the most appropriate one for my project. In addition, I looked into the software
development lifecycle and picked the waterfall model as the most efficient lifecycle for
this project. Finally, I looked into user interface evaluation, choosing Nielsen's heuristics as
my way of evaluating the application interface. The findings of this chapter have helped me
choose the appropriate methodology and evaluation method for this project.

4 Requirements Analysis and Design
In this chapter I will review and discuss the fundamental requirements of this
project. There are many categories of requirements that can be used. In this project I
will be using three: customer requirements, functional requirements and non-functional requirements.
In addition, I will discuss the design process of my project, making use of
activity diagrams, use case diagrams and a narrative to help illustrate the design of my
application.
4.1 Customer Requirements
Customer requirements are direct statements or expectations that come from the principal
stakeholders or the prime actors of the project being developed. They directly shape the scope
of the project and have clear ramifications for the key features of the system being
developed. In this particular case I spoke directly to some of the principal stakeholders of
the Persona Identification Application, who told me that their requirements were the
following:
1. To be able to use a whole dataset (Excel)
2. To be able to cluster the dataset through an application interface
3. To be given back a visual representation of the clustering results through the application interface
4. To be able to download a CSV table that shows the clustering results, which can help facilitate the identification of personas
Table 1 – User Requirements
4.2 Functional Requirements
Functional requirements are the mandatory tasks and activities that need to be fulfilled in
order to deliver the full functionality of the app. In other words, they describe what the
system should do and the features it should provide to its users. The table below shows the
functional requirements for the Persona Identification Application.

Table 2 - Functional Requirements
4.3 Non-Functional Requirements
Non-functional requirements describe how well the system should perform its functions, for
example in terms of usability, performance and reliability, rather than what it does. The table
below shows the non-functional requirements for the Persona Identification Application.
Table 3 - Non-Functional Requirements
4.4 Requirements Summary
Thus far, the key thing to remember is that requirements gathering and analysis
plays a crucial role in informing the design of the software solution. The
requirements, along with the research conducted in the literature review, will assist me in
putting together an adequate design of the system, which will be shown in the second half
of this chapter.
4.5 Design
In this part of the chapter I will be concentrating on the design aspect of the Persona
Identification Application. As previously stated, the outcomes of my literature review
coupled with the results from the requirements analysis have helped put this part of the
chapter together. I will draw up different diagrams to clearly show the interaction
between the user and the system. I will also provide the reasoning behind why each method
was chosen.
4.6 Activity Diagram
One of the important UML models, an activity diagram illustrates the workflow of a
business process. In this case the diagram below shows the set of incremental steps that an
end user would need to complete to attain his or her end goal. Along the way there are
different decision points that a customer will face, which will ultimately lead them to the
same main deliverable. One of the reasons I opted to construct an activity diagram is that it is one
of the most comprehensible diagrams, offering a clear understanding of the business flow
within the system not only to the developers but to the stakeholders as well (Wang
Linzhang, 2004).

Figure 8 - Activity Diagram of Persona Identification Application
4.7 Use Case
Another important UML model, the use case diagram aims to offer the simplest way of demonstrating
the user's interaction with the proposed system. The diagram below shows the user's
interactions with the Persona Identification App. In addition to the diagram I put together
a use case narrative, which provides a more in-depth description of the use case
diagram. The reason I chose to implement a use case diagram and narrative is that they
provide an abstract view of the application from the user's perspective (Elenburg, 2005).
Figure 9 - Use Case Diagram of Persona Identification Application

Table 4 - Use Case Narrative

Summary
This chapter has looked at the requirements set out by the users, setting out the functional and
non-functional requirements of the application. This chapter has also shown how I went about designing
the application; in addition, I have been able to discuss different techniques for
evaluating the usability of the application interface and its functionality. The findings in this
chapter will help me greatly in implementing the application, taking into consideration the
requirements from the users; equally, they will help me evaluate the application as a whole. This
will be explained further in Chapter 6.

5 Implementation
In this chapter I will be discussing the implementation of the Persona Identification App. In
particular I will look into the software environment in which I chose to implement the
application, which in this project is R, and justify why this
environment was chosen. In addition, I will detail the full functionality
of the application by way of screenshots, with a description of each point.
5.1 Software Environment – R
R is a free, command-line-based programming language designed specifically for statistical computing
and data mining. Its software environment enables its users to construct statistical software
as well as graphical user interfaces. As a command-line-based
language, R runs through an MS-DOS-style console; however, several GUI
platforms have been developed to use alongside R, such as RStudio. One of the main reasons
I decided to use R to implement this system is that it is free, meaning that I could use it at
will as opposed to having to obtain a license. Another reason I chose to use it was that I
felt quite comfortable using a command-line-based system due to my prior experience with
MS-DOS. In addition, R offers a good and easy-to-understand package for developing
interactive web-based interfaces (Shiny), which I used to develop the interface.
5.2 Software Environment - MATLAB
MATLAB is a high-level, interactive programming environment written in several
programming languages, including Java, C and C++. One of the advantages of MATLAB is that it
gives its users access to a wide range of features, such as plotting and mapping functions
and data, implementing algorithms and using built-in math functions. Furthermore, MATLAB
allows its users to create graphical user interfaces to work hand in hand with the programs
coded in its environment. The main reason I chose not to use MATLAB to develop and
implement the Persona Identification App was that I was unable to obtain a license from the
university to use it at home, meaning that every time I wanted to work on development I
would have had to come on site, which was neither feasible nor efficient.
5.3 Persona Identification Application Implementation
As previously stated, I developed the persona identification program in R and subsequently
developed the interface using R's Shiny package. In order to do this I had to code
different functions and then put them together in a Shiny-based application. I have enclosed below
screenshots of the code for the most important functions, with annotations to help explain
what each function is doing. For convenience, I have also listed the functions below:
1. Import CSV file and convert to data matrix
2. Choose variables
3. Standardize data option and cluster data
4. Show within-groups sum of squared errors (number of clusters)
5. Show results

5.3.1 Application Coding Screenshots
1. Import CSV File
Figure 10 - Import csv file plus description
2. Choose variables
Figure 11 – Choose variables plus description
3. Standardize data and run K-Means algorithm
Figure 12 – Standardize data and run k-means plus description
4. Show within-groups sum of squared errors (Number of clusters)
Figure 13 – Choose K function plus description

5. Show Analysis Results
Figure 14 – Show analysis results plus description
6. Download cluster results CSV file
Figure 15 – Download results csv file plus description
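Because the code itself only appears in the screenshots, the core of steps 3 and 4 can be illustrated with a minimal sketch. It is written in Python rather than R, purely for illustration, and it is not the application's own code: the function names (standardize, kmeans, within_groups_ss), the choice of the first K rows as starting centroids and the sample rows are all assumptions.

```python
import math
from collections import defaultdict

def standardize(rows):
    """Scale every column to mean 0 and sample standard deviation 1."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = []
    for c, m in zip(cols, means):
        var = sum((x - m) ** 2 for x in c) / (len(c) - 1)
        sds.append(math.sqrt(var) or 1.0)  # constant column: avoid dividing by zero
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(rows, k, max_iter=100):
    """Basic K-Means (Lloyd's algorithm); returns 1-based labels and centroids."""
    centroids = [list(r) for r in rows[:k]]  # assumption: seed with the first k rows
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(r, centroids[j]))
                      for r in rows]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return [lab + 1 for lab in labels], centroids

def within_groups_ss(rows, labels):
    """Sum of squared distances from each row to its own cluster mean."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    total = 0.0
    for members in groups.values():
        centre = [sum(c) / len(members) for c in zip(*members)]
        total += sum(sq_dist(row, centre) for row in members)
    return total

rows = [[1.0, 1.0], [1.2, 0.8], [9.0, 9.5], [8.8, 9.1]]  # hypothetical data
for k in (1, 2):
    labels, _ = kmeans(rows, k)
    print(k, within_groups_ss(rows, labels))
```

Printing the within-groups sum of squares for increasing K is the idea behind the "number of clusters" plot: the value of K after which the figure stops falling sharply is usually taken as a reasonable choice.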

5.3.2 Application Interface Screenshots
In this part of the chapter I present screenshots of the actual interface of the
application. These add a visual impression to the lines of code explained earlier. The
screenshots are annotated to provide more in-depth descriptions of what is
happening within the application.
Figure 16 - Screenshot of Persona Application Interface 1.0
Figure 17 – Screenshot of Persona Identification Application 2.0

5.4 Assumptions
In order to run the application successfully, some prerequisites need to
be met. One of them is that all the data in the CSV file needs to be numeric, otherwise
the K-Means algorithm will throw errors. In addition, the data inputted has to be pre-processed
in order to gain meaningful results. This will be further discussed in Chapter 6. Finally,
when running this application in R, the shiny library needs to be installed and loaded; after this is
done, the command runApp(".") needs to be entered to run the application.
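The numeric-only prerequisite can be checked before any clustering is attempted. The following is a minimal sketch in Python (not the application's R code); the function name non_numeric_cells and the sample CSV text are assumptions made for illustration.

```python
import csv
import io

def non_numeric_cells(csv_text):
    """Return (line_number, column, value) for every cell that is not a number."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for line_number, record in enumerate(reader, start=2):  # line 1 is the header
        for column, value in record.items():
            try:
                float(value)
            except (TypeError, ValueError):
                problems.append((line_number, column, value))
    return problems

# Hypothetical file contents: the second data row still holds a category name.
sample = "hkey,prodcatID\n1,1001\n2,GROCERY\n"
print(non_numeric_cells(sample))
```

An empty list means that every cell can be read as a number, so the file satisfies the first assumption above.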
5.5 Summary
This chapter has shown the implementation of the application as well as the reasoning
behind why I chose the software environment to code it in. I have also discussed the
prerequisites that need to be fulfilled in order for the application to work. The findings in
this chapter have demonstrated my ability to code an application and present it in a user-
friendly manner.

6 Results and Evaluation
In this chapter I will be looking at the results gained from the application developed. I will
also detail how I went about deriving personas from the results data. It is important to
remember that this application can work with any dataset as long as it is numeric; for the
purposes of this project I have focused on a dataset containing the weekly shopping of 500 families over
a two-month period. Furthermore, I will evaluate the application's usability through the
Nielsen heuristics principle and conduct black-box testing to test the system's functionality.
6.1 Data Pre-Processing
As previously stated, data pre-processing is an essential part of the data mining process as it
helps lay the foundation for more concise result analysis. It also helps clear up the so-called
‘garbage’ data that may skew the results. To pre-process the data used for this project I first
chose the two most important variables that would help me identify personas from the
Dunnhumby dataset, which in this case were household key (hkey) and product category
(prodcatID). I used a technique called “Quota Sampling” to select which data I wanted to use
for this analysis (Riley, 2012), after which I created my own data subset containing only the
two variables in the CSV file. Finally, to adhere to the rule of K-Means that all data must be
numeric, I assigned each of the 22 product categories a numeric value and inputted them into the
data subset, keeping a reference of each category and the numeric value it is assigned to, which can be seen below.
For ease of understanding I used the product categories as the “personas”, e.g. GROCERY will be
a grocery persona.
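The category-to-number step can be sketched as follows. This is an illustrative Python sketch rather than the procedure actually used, which was carried out by hand in the CSV file. The starting code 1001 follows the "1001 (Grocery)" example used in this chapter; the function name encode_categories and the rule of numbering categories in order of first appearance are assumptions, so the codes for the other categories are hypothetical.

```python
def encode_categories(categories, first_code=1001):
    """Map each distinct category name to a numeric code and encode the column.

    Codes are handed out in order of first appearance (an assumption).
    Returns the reference table and the encoded column.
    """
    codes = {}
    encoded = []
    for name in categories:
        if name not in codes:
            codes[name] = first_code + len(codes)
        encoded.append(codes[name])
    return codes, encoded

# Hypothetical excerpt of the product category column.
reference, column = encode_categories(["GROCERY", "DRUG GM", "GROCERY", "PRODUCE"])
print(reference)
print(column)
```

Keeping the reference table alongside the encoded column is what allows a cluster's numeric codes to be translated back into persona names later.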
Figure 18 – Evidence of data pre-processing Results

Once the results CSV file is downloaded, its contents show four columns: kclust, which shows
how many clusters there are; hkey and prodcatID, the two variables chosen for
analysis; and finally fit.cluster, which shows which cluster each row has been assigned
to.
Figure 19 - Screenshot of results out CSV file
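The layout of the results file can be sketched in a few lines. This Python sketch is illustrative only: the four column names come from the description above, while the function name results_csv, the sample record and the assumption that kclust repeats the chosen K on every row are hypothetical.

```python
import csv
import io

def results_csv(k, records):
    """Build the results file as text; records holds (hkey, prodcatID, cluster)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["kclust", "hkey", "prodcatID", "fit.cluster"])
    for hkey, prodcat_id, cluster in records:
        # Assumption: kclust carries the number of clusters on every row.
        writer.writerow([k, hkey, prodcat_id, cluster])
    return buffer.getvalue()

print(results_csv(3, [(1, 1001, 2)]))  # hypothetical single row
```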
I can see from here that the prodcatID and hkey have been assigned to a fit.cluster, which has
been set by the user already (see. From this I can then filter the rows in the csv file to see how
many numeric variables e.g. 1001, 1002 are in each cluster. Once I have found out how many
of each variable are in each cluster, I aggregate the total amount, which in turn helps me
work out a persona percentage on each category in each cluster. I make sure all the results
are documented which can be seen below.
Figure 20 - Identifying Personas Breakdown

The formula I used to work out the percentage was relatively straightforward. After
aggregating the total amount, I calculated the instances of each variable against the total amount
within the cluster. For example, 1001 (Grocery) has 2050 instances in cluster 1; I run that
number against the total number of instances in cluster 1 using an online percentage
calculator.
Figure 21 –Percentage Calculator Example
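The same calculation, percentage = 100 × (instances of a category in the cluster) / (total instances in the cluster), can be automated rather than entered into an online calculator. The sketch below is in Python and is not part of the application; the function name persona_percentages and the sample pairs are assumptions, and the sample counts are made up rather than taken from the results.

```python
from collections import Counter, defaultdict

def persona_percentages(pairs):
    """pairs holds (prodcatID, cluster) rows; returns {cluster: {prodcatID: percent}}."""
    counts = defaultdict(Counter)
    for code, cluster in pairs:
        counts[cluster][code] += 1
    result = {}
    for cluster, counter in counts.items():
        total = sum(counter.values())  # all instances in this cluster
        result[cluster] = {code: 100.0 * n / total for code, n in counter.items()}
    return result

# Hypothetical rows: three 1001 and one 1002 in cluster 1, one each in cluster 2.
pairs = [(1001, 1), (1001, 1), (1001, 1), (1002, 1), (1001, 2), (1003, 2)]
print(persona_percentages(pairs))
```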
6.2 Results Summary
To be able to identify personas, thus meeting my aim, I conducted some tests on my own data
sub-set (Figure 11). The first test I ran was with K (Number of Clusters) set to 3, which is the
optimum number of clusters for this dataset (see Figure 10). After mining the raw data
based on the method stated above, the following results were found:
Figure 22 - Persona Percentage Results (Test 1)

From the results found I can say that the GROCERY persona was the most consistent and
populous persona found in the dataset, averaging around 60-65% in terms of persona
percentage. The next most common persona was the DRUG GM persona, averaging around 10-11%
persona percentage. This tells me that the dataset is heavily populated with GROCERY
personas, with few other personas following. To validate this finding I ran
the application again on the same dataset, this time with K = 4. The results were as
follows:
Figure 23- Persona Percentage Results (Test 2)

From this particular test I can see that the results are consistent with the first test I conducted with
K set at 3. I can deduce that the GROCERY persona averages between 63-66% persona
percentage across the 4 clusters, which is very similar to the first test run. The DRUG GM
persona holds at around 10% persona percentage, with PRODUCE coming in at
around 9-10% on average. This indicates to me that the dataset
is densely populated with GROCERY personas.
6.3 Evaluation
As previously mentioned in chapter 3.8.1, I have chosen to use the Nielsen heuristics to
evaluate the usability of the application interface. To go about this I have used a System
Usability Scale questionnaire, which was developed by John Brooke (Brooke, 2011). The
questionnaire itself is ten questions long and is based on a Likert scale scoring system (1 = Strongly
disagree, 5 = Strongly agree); if the participant is uncertain of an answer then they will select
3. The reason for choosing this questionnaire is that the questions asked are similar to
those of Nielsen's 1994 heuristics, which is what I planned to use to evaluate the system
to begin with. In addition, using a Likert scale makes the questionnaire more coherent and easier for the
participants to complete, thus saving time (Dane Bertram, 2012). Below is an example of
the questionnaire that will be given to the participants:
Figure 24 - System Usability Questionnaire

6.3.1 Participant selection
Selecting the number of participants to evaluate the application is very important, especially
for this project. In an ideal world, the more evaluators I have the better, as
different evaluators can pick up different usability issues. However, according to Nielsen the
optimum number of evaluators for a software system is 5, or at least 3
(Nielsen, 1995).
Figure 25 - Graph showing the optimum number of evaluators
The above figure (Figure 25) plots the number of evaluators against the proportion of
usability problems found. I can see here that 5 evaluators can find 75% of usability problems.
6.4 Black-Box Testing
Black-box testing is a form of functional testing which aims to test whether the software developed
does what it is supposed to do. The way I went about this was to create a questionnaire
based on the functional requirements, which the same participants who are testing
the usability would have to fill out (Williams, 2006).

Figure 26 - Functional Test Questionnaire
The reason I chose to design the questions this way (Figure 26) was to be able to gauge
whether or not the functional requirements have been met with a straightforward yes or no
response. This has a direct knock-on effect, as the outcome of this questionnaire will
indicate to me how far I have gone in meeting the user requirements.
6.5 Evaluation Results
After the evaluation was completed I collated all the results from the questionnaire and produced a
bar chart from them to add a visual representation to the evaluation results. The first thing I did
was to put all the answers from each participant in a table, which can be seen below (Figure
27). After this I was able to construct a bar chart using Excel (Figure 28).
Figure 27 - Table of Usability Questionnaire Results
Figure 28 - Bar Chart of Usability Questionnaire Results
To make the output more meaningful to me I aggregated the results and drew up a bar chart
to give a visual representation of the average score of the usability questionnaire.
Figure 29 - Bar Chart showing average usability questionnaire results
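The aggregation behind the average-score chart is a per-question mean across participants. The evaluation itself used Excel; the following Python sketch only illustrates the arithmetic, and the function name question_averages and the sample scores are assumptions rather than the collected data.

```python
def question_averages(responses):
    """responses holds one list of Likert scores per participant.

    Returns the mean score for each question, in question order.
    """
    return [sum(scores) / len(scores) for scores in zip(*responses)]

# Hypothetical scores from two participants on three questions.
responses = [[5, 4, 3], [3, 4, 5]]
print(question_averages(responses))
```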
6.6 Black Box Testing Results
As previously stated, the system functionality testing (black box) was conducted concurrently
with the usability testing. Everyone who took part reported back that they were able to execute all the
functionalities that the system offered. The results are illustrated below in Figure 30.
Figure 30 - Results of System Functionality Questionnaire
6.7 Evaluation Summary
To conclude this chapter I can say that the usability and system evaluation was highly
successful, in particular the black-box testing. The responses from all 5 subject experts who conducted the
evaluation were highly positive, which tells me that, from an expert point of
view, the application is very usable and does what it set out to do. On the functionality side,
5/5 evaluators answered YES to all 7 functionality questions (Figure 30). This tells me that
the system functionality is fit for purpose and, crucially, it validates the customer
requirements set out in Chapter 4.

7 Conclusion
This dissertation has covered many topics as well as fresh, novel ideas, i.e. persona
identification. However, it is important to be able to competently draw conclusions from the
findings of this project, appraising the positives found and offering
constructive critique of the weaker aspects of the dissertation.
7.1.1 Aim - Identify individual personas from prosumers personal information.
To address this aim, I can say that I was able to identify individual “personas” from
prosumer data; however, there were issues that I came across in doing so.
The first issue was the strength of the persona. The main persona found in the dataset
tested was the GROCERY “persona”; however, this could be deemed by some analysts as too
vague or not in-depth enough. Through my own investigation into this perception I found
that a much deeper pre-processing method, e.g. using sub-product categories instead of
main product categories, would be required in order to draw out more ‘features’ within
the clusters. This would help facilitate more diverse and meaningful “personas”. It is important
to stress that this could have been achieved within the boundaries of this particular project;
however, at the time I believed that deriving personas from main product categories, e.g. grocery,
produce, nutrition etc., would be a better way of obtaining good individual personas.
In hindsight I believe a deeper pre-processing method would have produced
more meaningful personas. Nevertheless, I believe this should not take away from the fact that I
was able to identify individual “personas”, which was the ultimate aim of this dissertation.
7.1.2 Objective 1 - Undertake a state-of-the-art literature review to inform, create
a design specification for an identifying personas/Investigate in greater
detail the pros and cons of clustering with reference to appropriate
literature
To conclude this objective I can confidently say that a state-of-the-art literature review was
undertaken (see Chapter 2), carefully analyzing two of the main clustering methods
(hierarchical and partitioning), drawing out their advantages and disadvantages and relating them back to
how they would impact the aim of this project. In addition, I looked into the importance of
personal data and how it has risen to be the new “oil”. I also looked at the rise of the digital
prosumer, in particular how prosumption is poised to take over typical consumption, lending
credence to Toffler's prediction that prosumption would take over consumption by the
turn of the 21st century. This all provided the necessary justification for undertaking the
project and exposed the potential value in building an application that can identify personas.
In essence I believe this objective was met to a high standard, making use of a variety of
literature. This subsequently enabled me to create a design specification for my application.
7.1.3 Objective 2 - Build a persona identification application.
This particular part of the project was by far the most challenging yet the most rewarding.
First, I was tasked with choosing the appropriate software environment in which the
application would be coded; once this was decided, the code development began.
Although this was a very tedious task, involving numerous failed attempts and heavily
bugged versions, a final version was created, bringing to life all the research and personal
hypotheses set out at the beginning of the project (see Chapter 5). Overall I was hugely
satisfied with the implementation of the application. Despite the fact that it took a huge
amount of time and resources to put together, I believe it is a strong and well put
together application that is indeed fit for purpose.
7.1.4 Objective 3 - Evaluate the application.
The final part of this dissertation required me to evaluate the application, not only to provide
validation against my aim but also to validate the customer requirements defined in Chapter 4. I
went about this by first evaluating the usability of the system; this was done via a
questionnaire which was heavily influenced by the Nielsen heuristic principle. After this a
black-box test was put together to evaluate the functionality of the application. Both tests
were a huge success; as I was using experts to evaluate the system, there was a lot of extra
scrutiny on both the usability and the functionality. The feedback was highly positive, which
went a long way in validating my aim and user requirements (see Chapter 6).
7.2 Future Development
One of the most common failings in any project is to neglect the things that have not been done,
due to time or resources, and to over-emphasise the things that have been achieved. I
believe that there is a world of benefits to be unlocked once we sit back and look at what
can be developed in the future to make this project even better.
There are a number of things that can be achieved with future work/development that would
enhance the application even further. The first is a much deeper pool of personas,
which was explained earlier in this chapter. Another future development would be adding more
algorithms to the application instead of just the single K-Means algorithm. This was explained in more
detail in Chapter 2.8. Another development would be the ability to put the application on a
server and connect it to a database. This would enhance the application even more as it would
mean that data from the data lockers could be stored in the database and be called into the
application via a database query etc., making the application more robust, expanding the
