Digital Prosumer - Identification of Personas through Intelligent Data Mining (Clustering)
Department of Information Systems and Computing
BSc (Hons) Information Systems (Business)
Academic Year 2013 – 2014
Digital Prosumer - Identification of Personas through Intelligent
Data Mining (Clustering)
Adebowale Nadi
1008089
A report submitted in partial fulfilment of the requirements for the degree of
Bachelor of Science
Brunel University
Department of Information Systems and Computing
Uxbridge
Middlesex
UB8 3PH
United Kingdom
T: +44 1895 203397
F: +44 (0) 1895 251686
Abstract
The main objective of this paper is to explore the idea of prosumption and how the digital
personhood data that we produce can be extracted, filtered, analysed and given back to
us [prosumers] in a form that is commodifiable, subsequently empowering citizens to utilize
the data that they produce. One aspect of this hypothesis is the identification of personas
through clustering, which is a facet of intelligent data analysis. The aim is to build a
Persona Identification Application (PIA) whose sole purpose is to deduce personas
from data stores.
In 2011 it was estimated that 274.2 million Americans were connected to the internet,
leading to 81 billion minutes being spent on social networking sites and blogs. In the same
year 117.6 million people accessed the internet via a mobile phone, accounting for $246 billion
spent on online purchases (Palis, 2012). The renowned management consultancy
Boston Consulting Group projects that the Internet economy will contribute $4.2 trillion
to total G20 GDP by 2016. This led co-author David Dean to emphasise that “If it were a
national economy [internet economy], it would rank in the world’s top five, behind only the U.S.,
China, India, and Japan, and ahead of Germany,” (Dean, 2012). With the rise of the internet
economy, coupled with the growing number of mobile devices connected to the internet
facilitating an unprecedented amount of data being held, intelligent data analysis is needed
to isolate the key information and so produce personas that can later be
traded on a futures market.
This paper will look at the rise of the internet economy and the emergence of the
digital prosumer. In addition, clustering will be examined in fine detail, covering the various
clustering techniques that could be used in the proposed application and the
advantages and disadvantages of each, before deciding which method is the most appropriate
for this project. Furthermore, this paper will detail the step-by-step implementation of the
application, including the design and requirements analysis that took place beforehand.
Finally, a detailed evaluation will be explained and executed, relaying the findings from the
application and assessing whether, in fact, the application meets the aim in a coherent and
comprehensible manner.
Acknowledgements
First and foremost I would like to take this opportunity to thank my Lord Jesus Christ for
guiding me through this project and giving me the strength to be able to conclude this
dissertation. I would also like to thank my Mum & Dad for their indubitable and
unconditional support given to me throughout my time working on this project. In addition,
I would like to personally thank, and extend my sincere gratitude to, all the people who
helped, supported and assisted me in any way, shape or form in putting this dissertation
together. (There are too many to name personally, but they know who they are.) Last but
certainly not least, I would like to personally thank my supervisor Panos Louvieris and his
assistant Natalie Clewley for all their support rendered to me throughout this project. This
dissertation was, no doubt, the biggest challenge I have faced in all my 19 years in education,
but definitely the most rewarding, learning a highly complex topic (data mining) and learning
to code in a completely new software environment with no prior experience. I truly wouldn’t
have been able to complete it without their guidance, assistance and motivation. In closing I
would like to wish Panos and his team the best of luck in completing their EPSRC sponsored
project Digital Personhood: Digital Prosumer.
Total Words: 15,500
I certify that the work presented in the dissertation is my own unless referenced.
Signature Adebowale Olatunde Nadi
Date 24/03/2014
Table of Contents
Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures
1 Introduction
1.1 Problem Definition
1.2 Aims and Objectives
1.3 Project Approach
1.4 Dissertation Outline
2 Literature Review
2.1 Personal Data
2.2 Value of Personal Data
2.3 The Internet [Digital] Economy
2.3.1 Midata
2.3.2 Information Economy Strategy (IES)
2.4 What is a Persona?
2.5 What is a Prosumer?
2.5.1 The Rise of the Digital Prosumer
2.6 Data Mining
2.6.1 Knowledge Discovery from Data [KDD]
2.7 Cluster Analysis
2.7.1 Partitioning Technique
2.7.2 Advantages and Disadvantages
2.7.3 Hierarchical Technique
2.7.4 Advantages and Disadvantages
2.8 Critical Discussion
2.9 Summary
3 Methodology
3.1 Design Science
3.2 Positivist Approach (Positivism)
3.3 Interpretive Approach
3.4 Critical Discussion
3.5 Software Development Lifecycle Models
3.5.1 Rapid Application Development (RAD)
3.5.2 Analysis
3.6 Waterfall Model
3.7 Analysis
3.8 User Interface Evaluation
3.8.1 Nielsen Heuristics
3.8.2 Advantages and Disadvantages
3.9 Critical Discussion
3.9.1 Cognitive Walkthrough
3.10 Critical Discussion
3.11 Summary
4 Requirements Analysis and Design
4.1 Customer Requirements
4.2 Functional Requirements
4.3 Non-Functional Requirements
4.4 Requirements Summary
4.5 Design
4.6 Activity Diagram
4.7 Use Case
Summary
5 Implementation
5.1 Software Environment - R
5.2 Software Environment - MatLab
5.3 Persona Identification Application Implementation
5.3.1 Application Coding Screenshots
5.3.2 Application Interface Screenshots
5.4 Assumptions
5.5 Summary
6 Results and Evaluation
6.1 Data Pre-Processing
6.2 Results Summary
6.3 Evaluation
6.3.1 Participant selection
6.4 Black-Box Testing
6.5 Evaluation Results
6.6 Black Box Testing Results
6.7 Evaluation Summary
7 Conclusion
7.1.1 Aim - Identify individual personas from prosumers personal information.
7.1.2 Objective 1 - Undertake a state-of-the-art literature review to inform, create a design specification for an identifying personas/Investigate in greater detail the pros and cons of clustering with reference to appropriate literature
7.1.3 Objective 2 - Build a persona identification application
7.1.4 Objective 3 - Evaluate the application
7.2 Future Development
Appendix A Personal Reflection
A.1 Reflection on Project
A.2 Personal Reflection
Bibliography
A.3 Appendices
A.4 Appendices
List of Tables
Table 1 - User Requirements
Table 2 - Functional Requirements
Table 3 - Non-Functional Requirements
Table 4 - Use Case Narrative
List of Figures
Figure 1 - Fayyad KDD representation
Figure 2 - Example of a word sorting dendrogram output from: http://www.macs.hw.ac.uk/texturelab/people/thomas-methven/
Figure 3 - Design Science Guideline from MIS Quarterly Research Essay
Figure 4 - The Engineering Cycle
Figure 5 - Epistemological Assumptions for Qualitative and Quantitative Research from http://dstraub.cis.gsu.edu:88/quant/2philo.asp
Figure 6 - RAD Diagram
Figure 7 - Waterfall Model
Figure 8 - Activity Diagram of Persona Identification Application
Figure 9 - Use Case Diagram of Persona Identification Application
Figure 10 - Import csv file plus description
Figure 11 - Choose variables plus description
Figure 12 - Standardize data and run k-means plus description
Figure 13 - Choose K function plus description
Figure 14 - Show analysis results plus description
Figure 15 - Download results csv file plus description
Figure 16 - Screenshot of Persona Application Interface 1.0
Figure 17 - Screenshot of Persona Identification Application 2.0
Figure 18 - Evidence of data pre-processing Results
Figure 19 - Screenshot of results out CSV file
Figure 20 - Identifying Personas Breakdown
Figure 21 - Percentage Calculator Example
Figure 22 - Persona Percentage Results (Test 1)
Figure 23 - Persona Percentage Results (Test 2)
Figure 24 - System Usability Questionnaire
Figure 25 - Graph showing the optimum number of evaluators
Figure 26 - Functional Test Questionnaire
Figure 27 - Table of Usability Questionnaire Results
Figure 28 - Bar Chart of Usability Questionnaire Results
Figure 29 - Bar Chart showing average usability questionnaire results
Figure 30 - Results of System Functionality Questionnaire
1 Introduction
This dissertation looks at the digital prosumer, concentrating in particular on the
identification of personas from prosumer data stores, which can be used
as valuable commodities to sell on the ‘futures’ market. I plan to do this by identifying
specific personas from a digital vault of prosumer personal information using intelligent
data analysis, in this case clustering. During the course of this dissertation I expect to isolate,
analyze and categorize raw prosumer data and present it in a way where I can link it to a
persona. I also expect to find the best clustering technique through an extensive literature
review, analyzing the advantages and disadvantages of each selected method before
coming to a conclusion on the best technique to use. I will also develop a persona
identification application, which will be used to analyze the data and group it into clusters
that can then be classified into personas. Finally, I will undertake a
comprehensive evaluation of the application to assess its overall effectiveness.
1.1 Problem Definition
Personal data can generate unprecedented economic and social value for governments,
organizations and individuals in many ways. By 2020 it is estimated that more than 50 billion
devices may be connected to the Internet (Nagel, 2013) and that more than 40 times as many
personal data records will be stored. With the large amounts of data collected from prosumers,
smarter data mining techniques need to be employed to efficiently analyze the data and
identify personas for which data can be traded on a data exchange.
Data mining is the search for valuable information within large volumes of data by
systematically exploring underlying patterns, trends, and relationships hidden in available
data. Data mining techniques can generally be categorized into: (i) classification and
prediction; (ii) clustering; (iii) outlier detection; (iv) association rules; (v) sequence
analysis; (vi) time series analysis; and (vii) text mining.
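To make the clustering category concrete, the following is a minimal sketch of k-means, a partitioning technique, applied to standardized data. It is written in Python purely for illustration; the application described in this dissertation is coded in R. The function names, the two variables (hours online per week and monthly online spend) and the sample records are hypothetical assumptions, not data or code from the project.

```python
import math
import random


def standardize(rows):
    """Rescale each column to zero mean and unit variance (z-scores)."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # Population standard deviation; fall back to 1.0 for constant columns.
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def kmeans(points, k, iterations=100, seed=0):
    """Basic k-means: returns (cluster label per point, centroids)."""
    rng = random.Random(seed)
    # Start from k distinct records chosen at random as initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(v) / len(members) for v in zip(*members)]
    return labels, centroids


# Hypothetical prosumer records: [hours online per week, monthly online spend].
records = [[2, 10], [3, 15], [4, 12], [30, 200], [32, 220], [35, 210]]
labels, centroids = kmeans(standardize(records), k=2)
print(labels)  # records sharing a label are candidates for the same persona
```

Standardizing first stops the variable with the larger numeric range (here, spend) from dominating the distance calculation. Which cluster receives which label depends on the random starting centroids, so only the grouping itself is meaningful.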
1.2 Aims and Objectives
The aim of this project is to identify individual personas from prosumers’ personal
information stored in a digital vault using an intelligent data analysis technique, clustering.
To help achieve this aim I have set out a list of objectives that will
help develop the body of this dissertation as well as assist me in determining whether the
project aim has been satisfied.
• Undertake a state-of-the-art literature review to inform and create a design specification
for identifying personas from digital personhood data using intelligent data
analysis techniques (clustering).
• Investigate in greater detail the pros and cons of clustering with reference to
appropriate literature.
• Build a persona identification application (e.g. using MatLab or R).
• Evaluate the application.
1.3 Project Approach
In order to complete this project successfully I have adopted a five-step approach. At each
step I will set deliverables that will help achieve my aims and
objectives and help me complete the project on time.
The first step will be to conduct a state-of-the-art literature review. This review will look at
different cluster analysis techniques from a variety of physical and online sources.
This will inform the design of my application, which is the cornerstone of this
project. In addition I will look at what has already been done in terms of cluster analysis,
synthesize that information and relate it back to my project. The second step will be to
look at different methodology principles and models, picking the most appropriate
method for this project with appropriate reference to the literature. Selecting the right
methodology is pivotal to the success of this project. The third step will be to analyse the
user requirements, discuss the design of my application and evaluate the GUI. Once
this has been discussed and illustrated I will proceed to code my application, which
will be done in RStudio. The fourth step will be to obtain the results of the application
and try to find personas in the clustered dataset. The way I went about deciphering
the information and deducing personas will be shown and explained at this stage. The final
step will involve evaluating the application and the project as a whole. This
will be coupled with personal reflection on my experience of putting this project together.
1.4 Dissertation Outline
Chapter 2: Literature Review – This chapter will look into previous literature that will
equip me to gain a deeper understanding of my research problem and, in turn,
help inform the design of my application.
Chapter 3: Methodology – This chapter will look at different methodological principles as
well as software development lifecycle (SDLC) models, critically discussing the
strengths and weaknesses of each before selecting the principle and SDLC model that are most
appropriate for my project.
Chapter 4: Requirements Analysis and Design – This chapter will look at the requirements of
the application set out by the user and analyze the functional and non-functional
requirements. In addition I will go through the design process of my application and
how I intend to put it all together.
Chapter 5: Implementation – This chapter will demonstrate the coding of the logic of my
application in R and the coding of the interface using R-Shiny. I will include fully
annotated screenshots as evidence of implementation.
Chapter 6: Results and Evaluation – This chapter will show the results of the
application as well as how I went about deducing personas from them. I
will also evaluate the application to see whether it has met the aims and objectives
set out at the beginning.
Chapter 7: Conclusion – This chapter will draw conclusions from all the findings
of this project. I will conclude on my aim as well as all three of my objectives. In addition
I will evaluate my application from a subjective point of view, as well as the project in
its entirety. I will also suggest future work to improve my application.
2 Literature Review
In this chapter I will discuss and review the different clustering methodologies
available, analyzing the advantages and disadvantages of each technique with reference to
the appropriate literature. This, along with personal evaluation, will help me conclude
which technique is the most appropriate for this project and give me
adequate justification for the chosen method. In addition I will look in further
detail at what personal data is and how it has become an
increasingly important aspect of economic growth and corporate supremacy, consequently
giving rise to a new breed of prosumer, the digital prosumer.
2.1 Personal Data
If we look at the European Data Protection Directive [Article 2] we see that personal data is
defined “by reference to whether information relates to an identified or identifiable individual”
(Information Commissioner Office, 2010). In other words, personal data is any piece of
information that can be used to identify an individual or an individual characteristic. The
Data Protection Act 1998 adds a different dimension to the EDPD definition of ‘data’ by
taking into account the way the information was processed before it can be regarded as data,
e.g. processed automatically or non-automatically. The EDPD and the Data Protection
Act agree on what personal data/information is:
- Information processed, or intended to be processed, wholly or partly by
automatic means (that is, information in electronic form) (ICO, 2010)
- Information processed in a non-automated manner which forms part of, or is
intended to form part of, a ‘filing system’ (that is, manual information in a
filing system) (ICO, 2010)
2.2 Value of Personal Data
Personal information is an increasingly important asset in the twenty-first century, both in
terms of corporate monetary value and government efficiency as well as economic prowess.
Accordingly, companies around the world have begun
investing heavily in software that facilitates the collation of consumer data (Schwartz,
2003). It is estimated that people across the world send 10 billion text messages
every day and make 1 billion posts to blogs or social media sites, leading to a
new type of economy emerging: the Internet economy. It is estimated that the Internet
economy within the G20 amounted to $2.3 trillion, or 4.1% of total GDP, in 2010 (Group, 2012).
2.3 The Internet [Digital] Economy
Sometimes called the digital or web economy, the Internet economy is a concept based on
digital technologies fusing with the traditional economy. First described by Don Tapscott in
his critically acclaimed book “The Digital Economy: Promise and Peril in the Age of
Networked Intelligence”, the internet economy is widely believed to be positioning itself
as the new cornerstone for any emerging or established economy (Tapscott, 1997). This is
evident from the recent figures released by the Boston Consulting Group in its Digital Manifesto
report, which states that the internet economy is currently larger than the economies of
countries like Brazil and Italy and that by 2016 its value is
expected to nearly double to $4.2 trillion. The report also goes on to say that “no company or country
can afford to ignore this [Internet economy] phenomenon” (David Dean, 2012). The rise in
the amount of data being produced is strongly linked to innovation in mobile technology
since the turn of the millennium, which has allowed more devices than ever to
connect to the Internet. Steve Wojtowecz, Vice President of
storage software development at IBM, stated that by the year 2015 over a trillion devices
would be connected to the internet (King, 2011). As a consequence the UK government has
started two initiatives, Midata and the Information Economy Strategy (IES), to give prosumers
improved access to the personal data that companies hold about
them (BIS, 2011).
2.3.1 Midata
The key principles [aims] of the Midata initiative, as outlined in its government report
(Department for Business, Innovation & Skills, 2013), are:
- Get more private sector businesses to release personal data to consumers
electronically
- Make sure consumers can access their own data securely
- Encourage businesses to develop applications (apps) that will help
consumers make effective use of their data
2.3.2 Information Economy Strategy (IES)
The key principles [aims] of the IES project, as outlined in its government report
(Department for Business, Innovation and Skills, 2013), are:
- A strong, innovative, information economy sector exporting UK excellence to the
world
- UK businesses and organizations, especially small and medium enterprises
(SMEs), confidently using technology, able to trade online, seizing technological
opportunities and increasing revenues in domestic and international markets
- Citizens with the capability and confidence to make the most of the digital age
and benefiting from excellent digital services.
Long-term success will be underpinned by:
- A highly skilled digital workforce (whether specialists who create and develop
information technologies, or non-specialists who use them)
- The digital infrastructure (both physical and regulatory) and the framework for
cyber security and privacy necessary to support growth, innovation and
excellence. (Department for Business, Innovation and Skills, 2013)
It is important to remember that both of these government initiatives are being reinforced
by reviews of, and changes to, legislation such as the Data Protection Act, the Consumer
Rights Bill [at both UK and EU level] and the Enterprise and Regulatory Reform Act 2013. The
reason is that this legislation will require companies to disclose customers’ personal data
to them if the companies opt not to do so voluntarily (Department for Business, Innovation &
Skills, 2013).
2.4 What is a Persona?
Typically used as a tool in marketing and human-centered design [HCD], personas are
hypothesized groups of users that exhibit similar behavioral patterns in their use of
technology, lifestyle decisions, customer service preferences and purchasing decisions.
Angus Jenkinson first proposed a top-down analytical approach based on ‘grouping’: a
synthetic clustering process that leads to ‘customer communities’ and to the creation and
preservation of loyalty within these communities. He set this out in his 1994 journal article
Beyond Segmentation (Jenkinson, 1994). This concept was refined five years later by Alan
Cooper in his pioneering book The Inmates Are Running the Asylum, in which Cooper introduces
the concept of the ‘persona’ that is used today to identify customer behavior and
consumption patterns (Cooper, 1998).
2.5 What is a Prosumer?
Alvin Toffler is widely considered to be the creator of the concept of prosumption. He
defines prosumers in his book ‘The Third Wave’ as people who “produce some of the goods and
services entering their own consumption” (Toffler, 1980) (Kotler, 1986). In other words,
people who produce and consume their own products and services are prosumers. In the 21st
century the prosumer has become more and more prominent, replacing the traditional
consumers of the Industrial Age. This lends credence to Toffler’s own prediction that, as
society moves towards the Post-Industrial Age, the number of pure consumers will decline and
they will be replaced by “prosumers” (Toffler, 1980).
2.5.1 The Rise of the Digital Prosumer
Consequently, as we delve deeper into the Information Age and the Internet economy
continues to evolve into an economic juggernaut, a new type of prosumer has emerged: the
digital prosumer. The digital prosumer is a person who creates and consumes his or her own
data. Today the biggest beneficiaries of the personal data produced are the so-called
big three data companies, Google, Facebook and Twitter, which make upwards of $1200
from a user profile (Madrigal, 2012).
2.6 Data Mining
Data mining is the iterative process of extracting or “mining” knowledge from large
data stores, so that it can be put into perspective and turned into useful information.
Data mining is thought to involve six common classes of task that support prediction and
description, the primary goals of data mining (Wikipedia, 2011) (Kamber, 2006):
• Classification – is learning a function that classifies a single data item into one of
several predefined classes. Examples of classification techniques:
- Bayesian classifiers
- K-nearest neighbor
- Linear classifiers
• Regression – is learning a function that maps a data item to a prediction variable.
In other words, regression estimates the relationship between variables.
Some examples of regression models are:
- Percentage regression
- Bayesian linear regression
- Nonparametric regression
• Clustering – is a descriptive task that aims to identify clusters or
categories that describe the data. Examples of clustering techniques are:
- Hierarchical
- Partitioning
- Density-Based
- Centroid-Based
• Summarization – is a method for finding a cohesive description of a data set; this
includes analytical representations such as visualization and report generation
• Dependency modeling – is a method that consists of finding a model that depicts
significant dependencies between variables
• Change and deviation detection – is a method that focuses on finding the most
significant changes from previously measured data. (Usama Fayyad, 2008)
2.6.1 Knowledge Discovery from Data [KDD]
KDD is often mistaken for data mining itself; however, data mining is an essential part
of the wider knowledge discovery process. Usama Fayyad proposed the KDD methodology in
1995 with the purpose of making data produced by companies useful to their business needs
(Deutsch, 2010).
Figure 1 - Fayyad KDD representation
Knowledge discovery is an iterative sequence of the following steps (Kamber, 2006):
• Data Cleaning – to remove noise and inconsistent data
• Data Integration – where multiple data sources may be combined
• Data Selection - where data relevant to the analysis task are retrieved from the
database
• Data Transformation - where data are transformed or consolidated into forms
appropriate for mining
• Data Mining – an essential process where intelligent methods are applied in order to
extract data patterns
• Pattern Evaluation – to identify the truly interesting patterns representing
knowledge
• Knowledge Presentation – where visualization and knowledge representation are
used to present the finished knowledge to the user
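
The sequence of steps above can be sketched in code. The following is a minimal
illustration only, not the implementation used in this project: every function name, the
record layout (dictionaries of numeric fields) and the threshold-based "mining" step are
assumptions made for the example, with the mining step standing in for a real algorithm
such as clustering.

```python
from typing import Dict, List, Optional

# Hypothetical record layout: field name -> numeric value (None marks a missing value).
Record = Dict[str, Optional[float]]


def clean(records: List[Record]) -> List[Record]:
    """Data cleaning: drop records that contain missing (None) values."""
    return [r for r in records if all(v is not None for v in r.values())]


def integrate(sources: List[List[Record]]) -> List[Record]:
    """Data integration: combine several data sources into one list."""
    return [record for source in sources for record in source]


def select(records: List[Record], fields: List[str]) -> List[Record]:
    """Data selection: keep only the fields relevant to the analysis."""
    return [{f: r[f] for f in fields} for r in records]


def transform(records: List[Record]) -> List[Record]:
    """Data transformation: min-max scale every field to the range 0..1."""
    if not records:
        return []
    fields = list(records[0])
    lo = {f: min(r[f] for r in records) for f in fields}
    hi = {f: max(r[f] for r in records) for f in fields}
    return [
        {f: (r[f] - lo[f]) / (hi[f] - lo[f]) if hi[f] > lo[f] else 0.0 for f in fields}
        for r in records
    ]


def mine(records: List[Record], field: str, threshold: float = 0.5) -> Dict[str, List[Record]]:
    """Data mining (stand-in): split records into 'low' and 'high' groups."""
    groups: Dict[str, List[Record]] = {"low": [], "high": []}
    for r in records:
        groups["high" if r[field] >= threshold else "low"].append(r)
    return groups


def evaluate(groups: Dict[str, List[Record]], min_size: int = 1) -> Dict[str, List[Record]]:
    """Pattern evaluation: keep only groups with enough members to be interesting."""
    return {name: g for name, g in groups.items() if len(g) >= min_size}


def present(groups: Dict[str, List[Record]]) -> List[str]:
    """Knowledge presentation: one readable summary line per group."""
    return [f"{name}: {len(g)} record(s)" for name, g in sorted(groups.items())]


def run_kdd(sources: List[List[Record]], fields: List[str], field: str) -> Dict[str, List[Record]]:
    """Run the steps in the order listed above."""
    cleaned = [clean(source) for source in sources]
    data = transform(select(integrate(cleaned), fields))
    return evaluate(mine(data, field))


# Two made-up data sources; one record has a missing value and is removed by cleaning.
store_a = [{"spend": 10.0, "visits": 1.0}, {"spend": None, "visits": 2.0}]
store_b = [{"spend": 30.0, "visits": 3.0}, {"spend": 20.0, "visits": 5.0}]
summary = present(run_kdd([store_a, store_b], ["spend"], "spend"))
```

Keeping each step as a separate function mirrors the iterative nature of KDD: any single
step can be re-run or replaced without touching the others.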
2.7 Cluster Analysis
Cluster analysis can be defined as the process of grouping a set of physical or abstract
objects into classes of similar objects. In other words, a cluster is a collection of data
objects that are similar to one another within the same cluster and dissimilar to the
objects in other clusters. An advantage of cluster analysis is that it can single out useful
features that define the characteristics of different groups, which in turn will help me in
my aim of identifying personas from prosumer data (Kamber, 2006). There are various
cluster analysis techniques, such as Partitioning, Hierarchical (Agglomerative and Divisive)
and the Single Link Method (Raza Ali, 2004).
2.7.1 Partitioning Technique
Partitioning methods relocate data objects from one cluster to another, starting from an
initial partitioning. The method also requires the number of clusters to be pre-set by the
user. It is commonly noted that achieving global optimality in this type of clustering
would require an exhaustive enumeration of all possible partitions. Because this is
impractical, most applications choose one of two popular heuristic algorithms, K-means and
K-medoids (Kamber, 2006):
• K-Means Algorithm
K-means enables the user to mine data by representing each cluster
by the mean value (the centroid) of the objects in the cluster; K is the
number of clusters
• K-Medoids Algorithm
K-medoids, on the other hand, represents each cluster
by one of the objects located nearest to the center of the cluster.
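
To make the two representations concrete, the sketch below implements the standard
iterative K-means procedure (assign each object to its nearest mean, then recompute the
means) together with a helper that finds the medoid of a cluster. It is a minimal
illustration under stated assumptions, not the code of the application: the function names
are hypothetical, distance is assumed to be Euclidean, and the initial means are simply the
first K points, which is a simplification chosen to keep the example deterministic.

```python
from math import dist
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def mean_point(points: Sequence[Point]) -> Point:
    """Return the coordinate-wise mean (centroid) of a non-empty set of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points: Sequence[Point], k: int, max_iter: int = 100):
    """Partition points into k clusters, each represented by its mean."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centroids: List[Point] = list(points[:k])  # assumption: naive seeding
    clusters: List[List[Point]] = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest mean.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: recompute each mean; an empty cluster keeps its old mean.
        new_centroids = [
            mean_point(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # no change, so the partition is stable
            break
        centroids = new_centroids
    return centroids, clusters


def medoid(cluster: Sequence[Point]) -> Point:
    """Return the actual object with the smallest total distance to the others."""
    return min(cluster, key=lambda c: sum(dist(c, other) for other in cluster))


points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
centroids, clusters = k_means(points, k=2)
```

Note that a mean need not coincide with any object in the data, whereas a medoid is always
one of the objects in the cluster.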
2.7.2 Advantages and Disadvantages
The K-means technique has advantages as well as disadvantages. One of the main
advantages is that K-means works well for finding spherical-shaped clusters within small
to medium-sized data stores. Another advantage is that the method tends to produce
tighter, more compact clusters than, say, hierarchical clustering (Lior Rokach, 2010).
However, there are also disadvantages to this technique, one being that it is limited in
the type of cluster model it can be applied to. The effectiveness of the algorithm is
predicated on spherical-shaped clusters, sometimes called globular, as this enables the
mean value to be positioned close to the center of the cluster. This consequently means
that clusters that aren’t of a similar size, or very large datasets, won’t work
well with this algorithm. Another disadvantage of this algorithm is that it is very
sensitive to noisy data and outliers, which can increase the squared error significantly.
In addition, the user is required to know the number of clusters beforehand, which can be a
very tedious task (Improved Outcomes Software (ios), 2009).
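
The effect of an outlier on the squared error can be shown with a small worked example.
The figures below are made up purely for illustration, and the helper names are
hypothetical; the sketch only assumes that the error of a cluster is the sum of squared
Euclidean distances between each object and the cluster mean.

```python
from typing import Sequence, Tuple

Point = Tuple[float, ...]


def mean_point(points: Sequence[Point]) -> Point:
    """Coordinate-wise mean of a non-empty set of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def sum_squared_error(points: Sequence[Point], centre: Point) -> float:
    """Sum of squared Euclidean distances from every point to the centre."""
    return sum(sum((a - b) ** 2 for a, b in zip(p, centre)) for p in points)


# A tight cluster of three one-dimensional values (illustrative data only).
tight = [(1.0,), (2.0,), (3.0,)]
# The same cluster with a single outlier added.
with_outlier = tight + [(30.0,)]

tight_mean = mean_point(tight)             # (2.0,)
outlier_mean = mean_point(with_outlier)    # (9.0,)

tight_error = sum_squared_error(tight, tight_mean)              # 1 + 0 + 1 = 2
outlier_error = sum_squared_error(with_outlier, outlier_mean)   # 64 + 49 + 36 + 441 = 590
```

A single outlier moves the mean from 2 to 9, away from every genuine member of the
cluster, and raises the squared error from 2 to 590.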
2.7.3 Hierarchical Technique
Hierarchical methods aim to create a hierarchical decomposition of the given set of data
objects. The method can be divided into two techniques: Agglomerative and Divisive. The
agglomerative method, also called the bottom-up approach, starts with each data object
forming a separate cluster; the clusters are then successively merged until the desired
cluster structure is achieved. The divisive method, also called the top-down approach,
starts with all the data objects in the same cluster, which is then partitioned into
sub-clusters, which in turn are partitioned into further sub-clusters. This sequential
process is repeated until the desired cluster structure is obtained. One of the intriguing
things about hierarchical clustering is that it provides a readable visual representation
of the sequence of merges (or splits) and the data, called a dendrogram. This useful
summarization tool helps make hierarchical clustering extremely popular (Lior Rokach,
2010).
Figure 2 - Example of a word sorting dendrogram output from:
http://www.macs.hw.ac.uk/texturelab/people/thomas-methven/
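
The agglomerative (bottom-up) method described above can be sketched as follows. This is
an illustrative example under stated assumptions rather than a definitive implementation:
the function names are hypothetical, distance is assumed to be Euclidean, and the
single-link rule (distance between the two closest members) is used to decide which
clusters to merge. The recorded merge distances are the heights at which a dendrogram
would join the branches.

```python
from itertools import combinations
from math import dist
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def single_link(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Single-link distance: the distance between the closest pair of members."""
    return min(dist(p, q) for p in a for q in b)


def agglomerative(points: Sequence[Point], target_clusters: int):
    """Merge the two closest clusters repeatedly until target_clusters remain."""
    if not 1 <= target_clusters <= len(points):
        raise ValueError("target_clusters must be between 1 and the number of points")
    clusters: List[List[Point]] = [[p] for p in points]  # every object starts alone
    merge_heights: List[float] = []                       # one entry per merge
    while len(clusters) > target_clusters:
        # Find the pair of clusters with the smallest single-link distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: single_link(clusters[pair[0]], clusters[pair[1]]),
        )
        merge_heights.append(single_link(clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        # A merge is final: the two clusters are replaced and never split again.
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters, merge_heights


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (20.0, 20.0)]
clusters, merge_heights = agglomerative(points, target_clusters=2)
```

Stopping at a different value of `target_clusters` corresponds to cutting the dendrogram
at a different height, which is how one run yields several nested partitions.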
2.7.4 Advantages and Disadvantages
Hierarchical techniques have many advantages as well as disadvantages. One advantage is
versatility: methods like single-link maintain strong performance on datasets containing
well-separated, chain-like and concentric clusters. Another advantage of hierarchical
methods is that they produce multiple partitions, which is particularly useful for users
who want to choose different
partitions from those already nested in the overall cluster according to the desired similarity
level chosen by the user.
On the other hand, the disadvantages of this technique are quite evident. Hierarchical
algorithms are notorious for their inability to scale well; they are also known to incur
high I/O costs when clustering a large number of objects. Another disadvantage of the
hierarchical technique is its rigidity: simply put, once a step in the sequence is done
it can never be undone or modified (Lior Rokach, 2010).
2.8 Critical Discussion
Having reviewed the advantages and disadvantages of the hierarchical and partitioning
techniques, it is important to analyse both in relation to this project in order to
identify the most appropriate technique for clustering. From my research I can see that
partitioning clustering works better on small data sets than on bigger data sets, whereas
the dataset used in this project is fairly large, containing data from 2,500 households’
weekly shops. Partitioning clustering also produces tighter, more cohesive clusters
through its k-means algorithm, which makes it easier to identify the key features within
each cluster, which in turn define persona characteristics. On the other hand, to limit
the effect of noisy data while clustering it is advantageous for users to know the number
of clusters in advance, which is nearly impossible with a database of this size.
On the other side of the coin, the hierarchical technique is very versatile, offering
different methods such as single link, complete link and average link, each of which
delivers distinct clusters. I believe this will work well in this project, as it will
aid in presenting personas from the dataset provided. In addition, the hierarchical
technique includes algorithms such as Chameleon that help to assure the quality of the
clusters, which will be useful in ensuring that the personas defined are validated. On the
other hand, the hierarchical technique is very rigid, so if an erroneous decision occurs it
is nearly impossible to correct. This is a big disadvantage for this project, as
identifying personas will need a great deal of flexibility because the parameters for
personas can change at any given time.
In light of all the information reviewed, it is fair to say that both techniques offer a
number of advantages and disadvantages. In order to obtain the best and most concise
results, I believe consensus clustering would be the best option. However, due to time
constraints and a lack of coding expertise, I have decided to use the K-Means algorithm to
provide the logic for my application. I then intend to build an interface that simplifies
the steps of the K-Means algorithm and presents them in a way that is easy for the user to
administer. The choice of
which software environment I will use to code the interface as well as the justifications for it
will be made in Chapter 5.
2.9 Summary
In this chapter I have discussed personal data and its value, and I have looked into the
definition of personas, the rise of the prosumer and the Internet economy.
Furthermore, I have discussed in detail what cluster analysis is, looking in particular at
two clustering techniques (Hierarchical and Partitioning) and offering an in-depth critical
discussion of the technique I have chosen to take forward into my application. The findings
of this chapter will help me to meet the aims and objectives set out for this project. In
addition, they will assist me in constructing a design specification for my application.
3 Methodology
This chapter explores different research methodologies and provides the justification for
applying the chosen methodology to this project. The three methods in question are Design
Science, Positivist and Interpretive. The methodology I have decided to use is the design
science approach. The justification is supported by appropriate reference to the
literature, as well as a personal analysis of the different approaches.
3.1 Design Science
As previously mentioned, the design science approach is my chosen methodology for this
project. Design science, simply put, is the methodical form of designing, or research
design. First established by American inventor Richard Buckminster Fuller in 1963, the
concept of design science was further developed by Gregory in his 1966 book “The Design
Method”, in which he demarcates the relationship between design method and scientific
method. He further accentuates his view that design is not inherently a science and that
the term design science pertains to the scientific study of design. As technology continued
to evolve at the turn of the century, design science became more integrated into
information systems research and software design projects. In 2004 Alan Hevner produced a
seven-guideline framework with the aim of assisting information systems researchers to
conduct, evaluate and present design-science research (Alan R. Hevner, 2004).
Figure 3 - Design Science Guideline from MIS Quarterly Research Essay.
This framework was later refined by Peffers in order to explain how the
regulative cycle fits into the design science research framework.
Figure 4 - The Engineering Cycle
This framework is widely used today by information systems researchers, as it gives them
a means to analyze and decipher an existing problem and offer a solution design or
solution hypothesis. They can then examine whether their solution or hypothesis is
effective or meets the specified criteria. This can be done through a pilot scheme or
prototyping, after which the full implementation can take place (Roel Wieringa, 2010). In
my opinion this principle suits my project best, as I aim to design a software solution
(a clustering program), build it, and then evaluate the effectiveness of the solution.
3.2 Positivist Approach (Positivism)
The positivist approach is a methodology in which an objective hypothesis, based on
introspection or intuition, is validated or disproved by scientific testing and
experimentation (Sage Publications, 2009). In other words, a positivist approach starts
with a hypothesis about a subject area and then goes on to prove or discredit the
hypothesis by experimentation or by building a solution (University of the West of
England, 2007). The origins of the method lie with the sociologist Auguste Comte, who
coined and developed the term in the early 19th century. Today the positivist approach is
used increasingly in IS and software engineering projects (Sociology Guide, 2008). One
advantage of the positivist approach is that it relies heavily on quantitative data as
opposed to qualitative data; quantitative data is seen as more scientific and thus a more
reliable source on which to base hypotheses. Another advantage of the positivist approach
is that it follows a very stringent structure: positivists believe that there are
guidelines that need to be adhered to, which should minimize room for error. Positivists
believe that this reduced room for error makes the whole approach more accurate when it
comes to experiments and applications. However, on the other hand, there
are drawbacks to the approach, one of them being human behavior. Positivists strongly
believe in objective assumptions; however, there is no guarantee that bias or subjective
analysis won’t corrupt the study (Johnson, 2010) (Wikipedia, 2014).
Figure 5- Epistemological Assumptions for Qualitative and Quantitative Research from
http://dstraub.cis.gsu.edu:88/quant/2philo.asp
3.3 Interpretive Approach
The interpretive approach is a qualitative research method that is based on subjective
assumptions, with knowledge derived from value-laden, socially constructed
interpretations (Packer, 2007). In stark contrast to the positivist approach, interpretivist
researchers aim to understand and interpret human behavior as opposed to generalizing and
predicting cause and effect. The impact this has on information systems and software
design projects is that, once the scope of the project has been defined, the researcher
will ask several open-ended questions, generally through questionnaires or unstructured /
semi-structured interviews and sometimes observations, to gather as much primary
information as possible (WordPress, 2012). This approach also enables the researcher to
remain open to new ideas throughout the duration of the project, unlike positivists, who
believe in pre-ordained rules and guidelines. With that being said, there are advantages
as well as disadvantages to this approach. One advantage is that the research methodology
is highly qualitative, meaning that the data gathered will be more in-depth. A drawback,
however, is that interpretivists take a subjective view of the project, which can lead to
bias getting in the way of ascertaining the correct results or the best methods to apply
in completing the project (Institute of Public & International Affairs, 2009)
(Slideshare, 2013).
3.4 Critical Discussion
Having looked at all three research approaches in appropriate detail, highlighting the
advantages and disadvantages of each, it is safe to say that all have adequate potential
to be the framework for an information systems project. However, I believe that the best
approach to adopt for this particular project is the Design Science approach, as it
offers the strongest correlation between what I am trying to achieve in this project and the
actual design science approach itself (design, build, evaluate). With that being said, I
believe that I can still look at this project from a positivist point of view. The reason
is that using data mining to develop ‘personas’ is a relatively novel idea, so I am
starting from a hypothesis and trying to prove that it is possible and can be done.
3.5 Software Development Lifecycle Models
There are many models that can be used to develop a software project. All of these models
follow the design science principle of design, build, evaluate. In this section I aim to
identify and describe two common models, offering adequate analysis of each, after which
I will select the model best suited to my project.
3.5.1 Rapid Application Development (RAD)
Rapid Application Development is an iterative model that favors rapid, early software
prototyping over traditional up-front planning. This approach consequently allows the
development of software to begin much sooner. It also keeps stakeholders at the heart of
the development process and allows requirement changes to be made easily. RAD typically
follows four phases in its model: Requirements Planning Phase, User Design Phase,
Construction Phase and Cutover Phase (Wikipedia, 2014) (David C. Yen, 1999).
1. Requirements Planning Phase – The inaugural phase of the project, where the
project team meets with the stakeholders to go over the business needs of the client,
the project scope, system requirements and constraints. This is followed by an
agreement on the key issues that need to be addressed, after which the relevant
authorization needs to be obtained in order to proceed
2. User Design Phase – In the second phase of the project the stakeholders
maintain a dialogue with the project analysts to develop prototype models of the
system that clearly represent all system input and output features plus
all the processes within the system. This phase of RAD is a continuous,
interactive process that allows the stakeholders to play an active role in
understanding, modifying and eventually approving a working prototype model
once they see a model that caters to their business needs
3. Construction Phase – The penultimate phase of the project focuses on
program and application development. Stakeholders continue to participate by suggesting
changes and improvements to any user interfaces or reports that are typically
developed at this phase. Programming, application development, unit-integration and
system testing are done at this phase of RAD.
4. Cutover Phase – The final phase of RAD is when the whole project is
brought to a head. Tasks such as testing, data conversion, user training and system
changeover are done at this stage. Compressing all these tasks into the final stage
enables the new system to be delivered to the stakeholders in a much shorter
timeframe.
Figure 6 - RAD Diagram
3.5.2 Analysis
The RAD model comes with many advantages as well as disadvantages. The key, however, is
to synthesize them and relate them back to my project. One of the common advantages of
the RAD model is that it drastically reduces the time needed for requirement analysis and
for defining software requirements. All prototypes created can also be stored for future
use, which consequently speeds up the software development of the product. Relatively
speaking, heavy prototyping is not necessary for my project, as it is a fairly short,
small project with strict user requirements (Rouse, 2007) (ISTQB Exam Certification,
2012).
3.6 Waterfall Model
The waterfall model is a sequential design model in which software development flows
downward through several phases of tasks/activities (reminiscent of an actual
waterfall). It differs from agile development models in that it seeks to fully describe
the application through written documents before actual software development commences.
Originally developed by Royce in 1970, the waterfall model follows seven sequential phases
(The Waterfall Development Methodology, 2012):
1. Requirements Specification – The requirements are gathered from the
stakeholders and agreed on in principle with the development team.
2. Design – The blueprint of the project is drawn up and given to the developers to
commence coding and start implementation
3. Implementation - The actual system is developed at this stage, all coding is
completed resulting in the actual program being developed
4. Integration – The system created is integrated in the environment agreed on in the
preliminary phase
5. Testing – Full testing of the integrated system is performed at this stage. Debugging
also happens at this stage, with a view to identifying any bugs and working on
potential fixes and patches
6. Installation – Installing of the system including the removal of the old system is done
at this stage. This stage also includes training for all stakeholders and staff members
7. Maintenance – The installed system is maintained through continuous updates and
patches being developed and installed.
The waterfall model follows a strict principle: you can only move forward to the next
phase once the current phase has been completed and perfected, meaning that
once a phase is completed it cannot be revisited (ISTQB Exam Certification, 2012).
Figure 7 - Waterfall Model
3.7 Analysis
The waterfall model comes with many advantages. One of the most common is the
sequential nature of the model, which makes it very easy to understand and execute. Another
advantage is that it works well on projects that are fairly small with strict, set-in-stone
requirements, which suits my project well. Another reason I favor this SDLC is that it
goes hand in hand with the design science approach (design, build & evaluate)
(Select Business Solutions, Inc., 2010).
3.8 User Interface Evaluation
One of the most integral parts of any software project is being able to coherently evaluate
the design of the artefact. As previously stated, the user requirements are used to inform
the design of the application; once this is done, a framework or principle needs to be
applied in order to evaluate it. One of the most popular techniques for usability
evaluation is the Nielsen Heuristics. In this section of the report I aim to discuss the
Nielsen Heuristics in detail, as well as another usability inspection method, the Cognitive
Walkthrough, in order to draw qualitative comparisons between the two methods. This in turn
will help me decide on the most suitable approach for evaluating the usability of the
Persona Identification Application.
3.8.1 Nielsen Heuristics
As previously stated, the Nielsen Heuristics is one of the most popular usability evaluation
techniques in use today. It is important to remember that heuristic evaluation supplements
conventional user testing. It does so by providing a template or set of principles that
help uncover problems a user is likely to come across. It was Jakob Nielsen’s work with
Rolf Molich in the 1990s that originated the heuristics that are widely used today.
However, it was in his 1994 publication Usability Engineering that the ten heuristics were
published for the first time (Nielsen, 1994).
(Some of the heuristics have been shortened for brevity)
1. Simple and Natural Dialogue – The dialogue should not contain information that is
irrelevant or rarely needed
2. Speak the User’s Language – The dialogue should be expressed clearly in words,
phrases, and concepts familiar to users rather than in system oriented terms
3. Minimize the User’s Memory Load – The user should not have to remember
information from one part of the dialogue to another
4. Consistency – Users should not have to wonder whether different words, situations
or actions mean the same thing
5. Feedback – The system should always keep users informed about what is going on,
through appropriate feedback within reasonable time.
6. Clearly Marked Exits – Users often choose system functions by mistake and would
need a clearly marked ’emergency exit’
7. Shortcuts (Accelerators) – Unseen by novice users, but often speed up the
interaction for expert users.
8. Good Error Messages – They should be expressed in plain language (no code) to
precisely indicate the problem
9. Prevent Errors – Even better than good error messages is a careful design that
prevents a problem from occurring in the first place
10. Help and Documentation – Even though it is better if the system can be used
without documentation, it may be necessary to provide help and documentation. Any
such information should be easy to search, be focused on the user’s tasks, list
concrete steps to be carried out and not be too large
3.8.2 Advantages and Disadvantages
The Nielsen heuristics come with advantages as well as disadvantages. One advantage is that
heuristic evaluation is a useful and relatively inexpensive way of giving designers quick
feedback, which can reduce the overall time a product spends in the usability evaluation
stage. It is also a good way of obtaining qualitative feedback early in the design process.
Another advantage is that it can help greatly in suggesting the best corrective measures for
designers, provided that the correct heuristic has been assigned in the first place. This
would prove helpful when designing the user interface for the Persona Identification
Application (PIA). There are also a few disadvantages. One is that applying the heuristics
effectively requires specialist knowledge and experience. Moreover, usability experts trained
to administer the heuristics are hard to come by and can be relatively expensive to source.
Another disadvantage is that the technique can be misleading, in that it tends to identify
more of the minor issues and fewer of the major issues with the design. (Usability.Gov, 2010)
(Nielsen, 1994)
Heuristic evaluation does not replace conventional usability testing and should not be seen
as an alternative to it. Having weighed the benefits and drawbacks above, I am in no doubt
that the Nielsen heuristics are the most suitable method for evaluating the user interface of
the application. The reason is that, in essence, they cover all the basic requirements set by
the stakeholders, and they give me things to consider while designing the app (for example
accelerators and consistency) as well as things to evaluate at the end of the design process.
3.9 Critical Discussion
On balance, heuristic evaluation offers quick, relatively inexpensive, qualitative feedback
early in the design process and helps point designers towards corrective measures, which will
be helpful when designing the user interface for the Persona Identification Application (PIA).
Against this, it relies on specialist evaluators who are hard to come by and expensive to
source, it can surface more minor issues than major ones, and it does not replace
conventional usability testing. I nevertheless consider the Nielsen heuristics the most
suitable method for evaluating the application's user interface, because they cover the basic
requirements set by the stakeholders and give me things to consider during design (for
example accelerators and consistency) as well as things to evaluate at the end of the design
process. The way I intend to go about this heuristic evaluation is to construct a usability
questionnaire as well as a system functionality test, in order to ascertain the usability of
the system and to test its functionality, thus validating the user requirements.
3.9.1 Cognitive Walkthrough
To balance the argument over which evaluation technique to use, it is imperative to draw a
comparison. One direct comparison to the Nielsen heuristics is the Cognitive Walkthrough
approach, which was developed as an additional tool in usability engineering. The technique
involves a group of evaluators undertaking a set of tasks on the interface to evaluate its
ease of learning and understandability. Lewis and Polson first set out the concept of the
cognitive walkthrough, and it works by tasking the evaluators with four questions:
(usabilityfirst, 2011) (Cathleen Wharton, 1994)
• Will the user try to achieve the right effect?
• Will the user notice that the correct action is available?
• Will the user associate the correct action with the effect to be achieved?
• If the correct action is performed, will the user see that progress is being made
toward solution of the task?
Once these questions have been answered, the evaluator attempts to construct a ‘success
story’ for each incremental step of the process. If this turns out to be impossible, the
evaluator instead creates a ‘failure story’, which aims to assess why the user cannot
accomplish the task based on the GUI. The findings from the walkthrough are later aggregated
and used to make improvements to the application, in this case the Persona Identification
App. Like the heuristics discussed earlier, the cognitive walkthrough has advantages as well
as disadvantages. One of the main advantages is that it is useful for identifying problems
early in the design phase, and it helps define users' goals and assumptions with fewer
resources than, say, full user testing would demand. This technique fits well with the scope
of my project as it provides a short and concise evaluation of the user interface I will be
designing. It also provides a user-centred perspective similar to that offered by the
heuristics. However, one of the main issues with the cognitive walkthrough is that it is more
susceptible to subjective bias from the evaluators, which may result in the main issues not
being covered. Another issue is that it can be very difficult for a seasoned evaluator to
assume the perspective of an inexperienced user of the system. (Lewis, 1997)
3.10 Critical Discussion
In short, the cognitive walkthrough identifies problems early in the design phase with fewer
resources than full user testing, and it offers a user-centred perspective comparable to that
of the heuristics, which fits the scope of my project. Its main weaknesses are its
susceptibility to subjective bias from the evaluators, which may mean that the main issues
are missed, and the difficulty a seasoned evaluator has in assuming the perspective of an
inexperienced user of the system.
3.11 Summary
In this chapter I have looked in depth at three design principles, evaluating each of them
and choosing the most appropriate one for my project. In addition, I looked into the software
development lifecycle and picked out the waterfall model as the most efficient lifecycle for
this project. Finally, I looked into user interface evaluation, choosing the Nielsen
heuristics as my way of evaluating the application interface. The findings of this chapter
have helped me choose the appropriate methodology and evaluation for this project.
4 Requirements Analysis and Design
In this chapter I will review and discuss the fundamental requirements of this project.
There are many categories of requirements that can be used. In this project I will be using
three: customer requirements, functional requirements and non-functional requirements. In
addition, I will discuss the design process of my project, making use of activity diagrams,
use case diagrams and narrative to help illustrate the design of my application.
4.1 Customer Requirements
Customer requirements are direct statements or expectations that come from the principal
stakeholders or the prime actors of the project being developed. They directly affect the
scope of the project and have unequivocal ramifications for the key features of the system
being developed. In this case I spoke directly to some of the principal stakeholders for the
Persona Identification Application, who told me that their requirements were the following:
1. To be able to use a whole dataset (Excel)
2. To be able to cluster the dataset through an application interface
3. To be given back a visual representation of the clustering results through the application interface
4. To be able to download a CSV table that shows the clustering results, which can help facilitate the identification of personas
Table 1 – User Requirements
4.2 Functional Requirements
Functional requirements are the mandatory tasks and activities that need to be fulfilled in
order to deliver the full functionality of the app. In other words, they describe what the
system should do and the features it should provide to its users. The table below shows the
functional requirements for the Persona Identification Application.
Table 2 - Functional Requirements
4.3 Non-Functional Requirements
Non-functional requirements describe the qualities the system should have, rather than the
specific functions it performs, in this case for the Persona Identification Application. The
table below shows the non-functional requirements for this system.
Table 3 - Non-Functional Requirements
4.4 Requirements Summary
Thus far, one of the key things to remember is that requirements gathering and analysis
plays a crucial role in informing the design of the software solution. The requirements,
along with the research conducted in the literature review, will assist me in putting
together an adequate design of the system, which is shown in the second half of this
chapter.
4.5 Design
In this part of the chapter I will concentrate on the design of the Persona Identification
Application. As previously stated, the outcomes of my literature review coupled with the
results of the requirements analysis have helped put this part of the chapter together. I
will draw up different diagrams to show clearly the interaction between the user and the
system. I will also provide the reasoning behind why each method was chosen.
4.6 Activity Diagram
One of the important UML models, an activity diagram illustrates the workflow of a business
process. In this case the diagram below shows the set of incremental steps that an end user
would need to complete to attain his or her end goal. Along the way there are different
decision points that a customer will face, which will ultimately lead them to the same main
deliverable. One of the reasons I opted to construct an activity diagram is that it is one of
the most comprehensible diagrams, offering a clear understanding of the business flow within
the system not only to the developers but to the stakeholders as well. (Wang Linzhang, 2004)
Figure 8 - Activity Diagram of Persona Identification Application
4.7 Use Case
Another important UML model, the use case aims to offer the simplest way of demonstrating
the user's interaction with the proposed system. The diagram below shows the user's
interactions with the Persona Identification App. In addition to the diagram I put together
a use case narrative, which provides a more in-depth description of the use case diagram. The
reason I chose to produce a use case diagram and narrative is that they provide an abstract
view of the application from the user's perspective. (Elenburg, 2005)
Figure 9 - Use Case Diagram of Persona Identification Application
Table 4 - Use Case Narrative
Summary
This chapter has looked at the requirements set out by the user, setting out the functional
and non-functional requirements of the application. It has also shown how I went about
designing the application; in addition, I have been able to discuss different techniques for
evaluating the usability of the application interface and its functionality. The findings in
this chapter will help me greatly in implementing the application, taking into consideration
the requirements from the users; equally, they will help me evaluate the application as a
whole. This will be explained further in Chapter 6.
5 Implementation
In this chapter I will discuss the implementation of the Persona Identification App. In
particular I will look at the software environment in which I chose to implement the
application, which in this project is R, and justify why it was chosen. In addition, I will
detail the full functionality of the application by way of screenshots, with a description of
each point.
5.1 Software Environment – R
R is a free, command-line-based programming language designed specifically for statistical
computing and data mining. Its software environment enables users to construct statistical
software as well as graphical user interfaces. Because R is command-line based, it runs
through an MS-DOS-style display; however, several GUI platforms, such as RStudio, have been
developed to use alongside it. One of the main reasons I decided to use R to implement this
system is that it is free, meaning that I could use it at will as opposed to having to obtain
a license. Another reason was that I felt comfortable using a command-line-based system due
to my prior experience with MS-DOS. In addition, R offers a good and easy-to-understand
package for developing interactive web-based interfaces (Shiny), which I used to develop the
interface.
5.2 Software Environment - MATLAB
MATLAB is a high-level, interactive programming environment written in a number of
programming languages such as Java, C and C++. One of its advantages is that it gives its
users access to a wide range of features, such as plotting and mapping functions and data,
implementing algorithms and using built-in math functions. Furthermore, MATLAB allows its
users to create graphical user interfaces to work hand in hand with the programs coded in its
environment. The main reason I chose not to use MATLAB to develop and implement the Persona
Identification App was that I was unable to obtain a license from the university to use it at
home, meaning that every time I wanted to work on development I would have had to come on
site, which was neither feasible nor efficient.
5.3 Persona Identification Application Implementation
As previously stated, I developed the persona identification program in R and then developed
the interface using R's own package, Shiny. To do this I had to code different functions and
then put them together in a Shiny-based application. I have enclosed below screenshots of the
code for the most important functions, with annotations to help depict what each function is
doing. For convenience I have also listed the functions below:
1. Import CSV file and convert to data matrix
2. Choose variables
3. Standardize data option and cluster data
4. Show within-groups sum of squared errors (Number of clusters)
5. Show results
5.3.1 Application Coding Screenshots
1. Import CSV File
Figure 10 - Import csv file plus description
2. Choose variables
Figure 11 – Choose variables plus description
3. Standardize data and run K-Means algorithm
Figure 12 – Standardize data and run k-means plus description
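The step captioned above can be sketched as follows. This is an illustration only, written in Python with the standard library rather than in R; the application's own code appears only in the screenshot and is not reproduced here. The function names standardize and kmeans, the fixed random seed and the toy data are all assumptions made for the sketch.

```python
import math
import random


def standardize(rows):
    """Rescale every column to zero mean and unit standard deviation (z-scores)."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    # Population standard deviation; constant columns fall back to 1.0
    # so that they are left at zero instead of dividing by zero.
    sds = [math.sqrt(sum((x - m) ** 2 for x in col) / len(col)) or 1.0
           for col, m in zip(columns, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, iterations=100, seed=0):
    """Basic K-Means (Lloyd's algorithm). Returns (labels, centroids)."""
    rng = random.Random(seed)            # hypothetical fixed seed for repeatability
    centroids = [list(row) for row in rng.sample(rows, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each row joins its nearest centroid.
        new_labels = [min(range(k), key=lambda j: squared_distance(row, centroids[j]))
                      for row in rows]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [row for row, lab in zip(rows, new_labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:         # assignments stopped changing
            break
        labels = new_labels
    return labels, centroids


# Toy data standing in for (hkey, prodcatID) rows; not taken from the real dataset.
toy_rows = [[1.0, 1001.0], [2.0, 1001.0], [3.0, 1002.0],
            [40.0, 1020.0], [41.0, 1021.0], [42.0, 1022.0]]
toy_labels, toy_centroids = kmeans(standardize(toy_rows), k=2)
```

Standardizing first matters because K-Means relies on distances, so a column with a larger numeric range would otherwise dominate the clustering.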
4. Show within-groups sum of squared errors (Number of clusters)
Figure 13 – Choose K function plus description
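The quantity this function reports can be sketched as below: for a given clustering, the within-groups sum of squares adds up the squared distance of every row from the centre of its own cluster. Computing it for several values of K and looking for the point where the value stops dropping sharply is a common way to choose K. The sketch is in Python for illustration, and the function name within_groups_ss and the four toy points are assumptions, not the application's code or data.

```python
from collections import defaultdict


def within_groups_ss(rows, labels):
    """Sum of squared distances from each row to the centroid of its cluster."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    total = 0.0
    for members in groups.values():
        centroid = [sum(col) / len(members) for col in zip(*members)]
        total += sum((x - c) ** 2
                     for row in members
                     for x, c in zip(row, centroid))
    return total


# Four toy points forming two obvious pairs.
points = [[1.0, 1.0], [1.0, 3.0], [9.0, 1.0], [9.0, 3.0]]
wss_one_cluster = within_groups_ss(points, [0, 0, 0, 0])   # K = 1
wss_two_clusters = within_groups_ss(points, [0, 0, 1, 1])  # K = 2
```

On these toy points the value falls sharply when moving from one cluster to two, which is the kind of drop the number-of-clusters plot is meant to reveal.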
5. Show Analysis Results
Figure 14 – Show analysis results plus description
6. Download cluster results CSV file
Figure 15 – Download results csv file plus description
5.3.2 Application Interface Screenshots
In this part of the chapter I present screenshots depicting the actual interface of the
application. This adds a visual impression to the lines of code explained earlier. The
screenshots are annotated to provide more in-depth descriptions of what is happening within
the application.
Figure 16 - Screenshot of Persona Application Interface 1.0
Figure 17 – Screenshot of Persona Identification Application 2.0
5.4 Assumptions
To run the application successfully, some prerequisites need to be met. One of them is that
all the data in the CSV file needs to be numeric, otherwise the K-Means algorithm will throw
errors. In addition, the data inputted has to be pre-processed in order to gain tangible
results. This is discussed further in Chapter 6. Finally, when running this application in R
the shiny library needs to be loaded; once this is done, the command runApp(".") needs to be
entered to run the application.
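The numeric-only prerequisite could be checked before clustering with something like the following. This is a hedged Python sketch, not part of the application itself; the function name non_numeric_cells and the sample text are hypothetical.

```python
import csv
import io


def non_numeric_cells(csv_text):
    """Return (line_number, column_name, value) for every cell that is not a number."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # Line 1 of the file is the header, so data rows start at line 2.
    for line_number, row in enumerate(reader, start=2):
        for column, value in row.items():
            try:
                float(value)
            except (TypeError, ValueError):
                problems.append((line_number, column, value))
    return problems


# Hypothetical sample: the second data row still holds a category name.
sample = "hkey,prodcatID\n1,1001\n2,GROCERY\n"
issues = non_numeric_cells(sample)
```

An empty result means every cell can be read as a number; anything else lists the cells that would need recoding before clustering.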
5.5 Summary
This chapter has shown the implementation of the application as well as the reasoning
behind why I chose the software environment to code it in. I have also discussed the
prerequisites that need to be fulfilled in order for the application to work. The findings in
this chapter have demonstrated my ability to code an application and present it in a user-
friendly manner.
6 Results and Evaluation
In this chapter I will look at the results gained from the application developed. I will
also detail how I went about deriving personas from the results data. It is important to
remember that this application can work with any dataset as long as it is numeric; for the
purposes of this project I have focused on a dataset containing the weekly shopping of 500
families over a two-month period. Furthermore, I will evaluate the usability of the
application using the Nielsen heuristics and conduct black-box testing to test the system
functionality.
6.1 Data Pre-Processing
As previously stated, data pre-processing is an essential part of the data mining process as
it lays the foundation for more concise analysis of the results. It also helps clear out the
so-called ‘garbage’ data that may skew the results. To pre-process the data used for this
project I first chose the two most important variables that would help me identify personas
from the Dunnhumby dataset, which in this case were household key (hkey) and product category
(prodcatID). I used a technique called “Quota Sampling” to select which data I wanted to use
for this analysis (Riley, 2012). I then created my own data subset containing only these two
variables in a CSV file. Finally, to adhere to the K-Means requirement for numeric data, I
assigned each of the 22 product categories a numeric value and entered these into the data
subset, keeping a reference of each category and the numeric value assigned to it, which can
be seen below. For ease of understanding I used the product category as the “persona”, e.g.
GROCERY will be a grocery persona.
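The recoding of categories into numbers can be sketched as below. This is illustrative Python only: the function name encode_categories, the starting code of 1001 and the order in which codes are handed out are assumptions, and the sample records are invented, so the mapping produced here should not be read as the reference table used in the project.

```python
def encode_categories(records, first_code=1001):
    """Replace each category name with a numeric code and keep the lookup table.

    records: iterable of (hkey, category_name) pairs.
    Codes are assigned in order of first appearance (an assumption of this sketch).
    """
    codes = {}
    encoded = []
    for hkey, category in records:
        if category not in codes:
            codes[category] = first_code + len(codes)
        encoded.append((hkey, codes[category]))
    return encoded, codes


# Invented sample rows using category names that appear in the results.
sample_records = [(1, "GROCERY"), (1, "DRUG GM"), (2, "GROCERY"), (3, "PRODUCE")]
encoded_rows, reference = encode_categories(sample_records)
```

Keeping the returned lookup table alongside the encoded rows is what allows cluster contents to be translated back into named personas later.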
Figure 18 – Evidence of data pre-processing Results
Once the results CSV file is downloaded, its contents show four columns: kclust, which shows
how many clusters there are; hkey and prodcatID, the two variables chosen for analysis; and
finally fit.cluster, which shows the cluster to which each row has been assigned.
Figure 19 - Screenshot of results out CSV file
I can see from here that the prodcatID and hkey have been assigned to a fit.cluster, which has
been set by the user already (see. From this I can then filter the rows in the CSV file to see
how many instances of each numeric value, e.g. 1001 or 1002, are in each cluster. Once I have
found out how many of each value are in each cluster, I aggregate the total, which in turn
helps me work out a persona percentage for each category in each cluster. I make sure all the
results are documented, which can be seen below.
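The filtering and counting described here was done by hand on the CSV file; the same tally could be automated along the following lines. This is a Python sketch under stated assumptions: the column names follow the four columns described above, while the function name counts_per_cluster and the sample rows are invented for illustration.

```python
import csv
import io
from collections import Counter, defaultdict


def counts_per_cluster(csv_text):
    """Count how often each prodcatID value occurs within each fit.cluster."""
    counts = defaultdict(Counter)
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts[row["fit.cluster"]][row["prodcatID"]] += 1
    return counts


# Invented rows in the layout of the downloaded results file.
sample_results = (
    "kclust,hkey,prodcatID,fit.cluster\n"
    "3,1,1001,1\n"
    "3,1,1002,1\n"
    "3,2,1001,1\n"
    "3,3,1003,2\n"
)
tally = counts_per_cluster(sample_results)
```

Each entry of the result holds the per-category counts for one cluster, which is the starting point for the persona percentages.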
Figure 20 - Identifying Personas Breakdown
The formula I used to work out the percentage was relatively straightforward. After
aggregating the total, I calculated the number of instances of each value as a percentage of
the total number of instances within the cluster. For example, 1001 (Grocery) has 2050
instances in cluster 1; I ran that number against the total number of instances in cluster 1
using an online percentage calculator.
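In other words, the persona percentage is 100 multiplied by the number of instances of a category in a cluster, divided by the total number of instances in that cluster. A short Python sketch of that arithmetic follows; the function names are hypothetical, and the cluster total of 3200 and the small counts are invented for illustration, since only the figure of 2050 Grocery instances is given above.

```python
def persona_percentage(instances, cluster_total):
    """Share of a cluster taken up by one category, as a percentage."""
    if cluster_total <= 0:
        raise ValueError("cluster_total must be positive")
    return 100.0 * instances / cluster_total


def persona_percentages(category_counts):
    """Apply the same formula to every category counted in one cluster."""
    total = sum(category_counts.values())
    return {category: persona_percentage(count, total)
            for category, count in category_counts.items()}


# 2050 is the Grocery count quoted in the text; 3200 is a made-up cluster total.
grocery_share = persona_percentage(2050, 3200)

# Invented counts for a single small cluster.
shares = persona_percentages({"GROCERY": 13, "DRUG GM": 2, "PRODUCE": 5})
```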
Figure 21 –Percentage Calculator Example
6.2 Results Summary
To be able to identify personas, thus meeting my aim, I conducted some tests on my own data
subset (Figure 11). The first test I ran was with K (number of clusters) set to 3, which is
the optimum number of clusters for this dataset (see Figure 10). After mining the raw data
using the method stated above, the following results were found:
Figure 22 - Persona Percentage Results (Test 1)
From these results I can say that the GROCERY persona was the most consistent and populous
persona found in the dataset, averaging around 60-65% in terms of persona percentage. The
next most common persona was the DRUG GM persona, averaging around 10-11%. This tells me that
the dataset is heavily populated with GROCERY personas, with very few other personas
following. To validate this finding I ran the application again on the same dataset, this
time with K = 4. The results were as follows:
Figure 23- Persona Percentage Results (Test 2)
This test is consistent with the first test I conducted with K set at 3. I can deduce that
the GROCERY persona averages between 63% and 66% persona percentage across the 4 clusters,
which is very similar to the first test run. The DRUG GM persona holds at around 10% persona
percentage, with PRODUCE coming in at around 9-10%. This indicates to me that the dataset is
densely populated with GROCERY personas.
6.3 Evaluation
As previously mentioned in Section 3.8.1, I have chosen to use the Nielsen heuristics to
evaluate the usability of the application interface. To go about this I have used the System
Usability Scale questionnaire, which was developed by John Brooke (Brooke, 2011). The
questionnaire is ten questions long and is scored on a Likert scale (1 = strongly disagree,
5 = strongly agree); a participant who is uncertain of an answer selects 3. The reason for
choosing this questionnaire is that its questions are similar to Nielsen's 1994 heuristics,
which is what I planned to evaluate the system with to begin with. In addition, using a
Likert scale makes the questionnaire more coherent and easier for the participants to
complete, thus saving time (Dane Bertram, 2012). Below is an example of the questionnaire
that will be given to the participants:
Figure 24 - System Usability Questionnaire
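For reference, the standard System Usability Scale scoring procedure turns the ten responses into a single score between 0 and 100: odd-numbered items contribute the response minus 1, even-numbered items contribute 5 minus the response, and the sum is multiplied by 2.5. The Python sketch below shows that standard procedure only; it is not a description of how the results in this report were aggregated, and the function name sus_score is an assumption.

```python
def sus_score(responses):
    """Standard SUS score (0 to 100) for one participant's ten answers (each 1 to 5)."""
    if len(responses) != 10:
        raise ValueError("SUS needs exactly ten responses")
    if any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("each response must be an integer from 1 to 5")
    total = 0
    for item_number, response in enumerate(responses, start=1):
        if item_number % 2 == 1:
            total += response - 1      # odd items are positively worded
        else:
            total += 5 - response      # even items are negatively worded
    return total * 2.5


neutral = sus_score([3] * 10)                               # every answer "uncertain"
best_case = sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1])       # most favourable answers
```

Because the even-numbered items are negatively worded, averaging the raw 1 to 5 responses across all ten items would mix the two directions, which is why the adjustment above is normally applied first.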
6.3.1 Participant selection
Selecting the number of participants to evaluate the application is very important,
especially for this project. In an ideal world, the more evaluators I have the better, as
different evaluators can pick up different usability issues. However, according to Nielsen
the optimum number of evaluators for a software system is five, or at least three.
(Nielsen, 1995).
Figure 25 - Graph showing the optimum number of evaluators
The above figure (Figure 25) shows the number of evaluators against the proportion of
usability problems found. I can see here that 5 evaluators can find 75% of usability problems.
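Curves of this shape are often described with a simple model, usually attributed to Nielsen and Landauer, in which each evaluator independently finds a fixed share of the problems, so that n evaluators find 1 - (1 - rate)^n of them. The Python sketch below uses that model purely as an illustration. The single-evaluator rate is back-calculated so that five evaluators reach the 75% quoted above; it is an assumption of the sketch and not a value taken from the figure.

```python
def proportion_found(evaluators, single_rate):
    """Share of usability problems found by a group, assuming independent evaluators."""
    return 1.0 - (1.0 - single_rate) ** evaluators


def single_rate_for(target, evaluators):
    """Single-evaluator rate at which the given group size reaches the target share."""
    return 1.0 - (1.0 - target) ** (1.0 / evaluators)


# Assumed rate chosen so that five evaluators find 75% of the problems.
assumed_rate = single_rate_for(0.75, 5)
curve = {n: proportion_found(n, assumed_rate) for n in range(1, 11)}
```

Under this model each additional evaluator adds less than the one before, which is the diminishing return that motivates stopping at around five.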
6.4 Black-Box Testing
Black-box testing is a form of functional testing which aims to test whether the software
developed does what it is supposed to do. The way I went about this was to create a
questionnaire based on the functional requirements, to be filled out by the same participants
who were testing the usability. (Williams, 2006)
Figure 26 - Functional Test Questionnaire
The reason I chose to design the questions this way (Figure 26) was to be able to gauge
whether or not the functional requirements have been met with a straightforward yes or no
response. The outcome of this questionnaire will in turn indicate to me how far I have gone
in meeting the user requirements.
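Tallying yes or no answers of this kind is simple enough to sketch. The Python below is an illustration only; the function name tally_functional_test and the sample answers are hypothetical and do not reproduce the actual questionnaire responses.

```python
def tally_functional_test(responses):
    """Count YES answers per question and list the questions not passed by everyone.

    responses: one list of "YES"/"NO" answers per participant, all in the
    same question order.
    """
    question_count = len(responses[0])
    yes_counts = [
        sum(1 for answers in responses if answers[q].strip().upper() == "YES")
        for q in range(question_count)
    ]
    # Question numbers (starting at 1) where at least one participant answered NO.
    unmet = [q + 1 for q, count in enumerate(yes_counts) if count < len(responses)]
    return yes_counts, unmet


# Invented answers: three participants, four questions, one NO on question 3.
sample_answers = [
    ["YES", "YES", "YES", "YES"],
    ["YES", "YES", "NO", "YES"],
    ["YES", "YES", "YES", "YES"],
]
yes_per_question, unmet_questions = tally_functional_test(sample_answers)
```

An empty list of unmet questions corresponds to every participant confirming every functional requirement.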
6.5 Evaluation Results
After the evaluation was completed I collated all the results from the questionnaire and
produced a bar chart from them to add a visual representation of the evaluation results. The
first thing I did was to put all the answers from each participant in a table, which can be
seen below (Figure 27). After this I was able to construct a bar chart using Excel.
Figure 27 - Table of Usability Questionnaire Results
Figure 28 - Bar Chart of Usability Questionnaire Results
To make the output more meaningful I aggregated the results and drew up a bar chart to give
a visual representation of the average score of the usability questionnaire.
Figure 29 - Bar Chart showing average usability questionnaire results
6.6 Black Box Testing Results
As previously stated, the system functionality (black-box) testing was conducted concurrently
with the usability testing. Everyone who took part reported that they were able to execute
all the functionalities that the system offered. The results are illustrated below in
Figure 30.
Figure 30 - Results of System Functionality Questionnaire
6.7 Evaluation Summary
To conclude this chapter, I can say that the usability and system evaluation was highly
successful, in particular the black-box testing. The response from all 5 subject experts who
conducted the evaluation was highly positive, which tells me that, from an expert point of
view, the application is very usable and does what it set out to do. On the functionality
side, 5/5 evaluators answered YES to all 7 functionality questions (Figure 30). This tells me
that the system functionality is fit for purpose and, crucially, it validates the customer
requirements set out in Chapter 4.
7 Conclusion
This dissertation has covered many topics as well as fresh, novel ideas such as persona
identification. However, it is important to be able to draw conclusions competently from the
findings of this project, appraising the positives found and offering constructive critique
of the weaker aspects of the dissertation.
7.1.1 Aim - Identify individual personas from prosumers personal information.
To answer this question, I can say that I was able to identify individual “personas” from
prosumer data; however, there were issues that I came across in doing so. The first issue was
the strength of the persona. The main persona found in the dataset tested was the GROCERY
“persona”; however, this could be deemed by some analysts as too vague or not in-depth
enough. Through my own investigation into this perception I found that a much deeper
pre-processing method, e.g. using sub-product categories instead of main product categories,
would be required in order to draw out many more ‘features’ within the clusters. This would
help facilitate more diverse and meaningful “personas”. It is important to stress that this
could have been achieved within the boundaries of this project; however, I believed at the
time that deriving personas from main product categories, i.e. grocery, produce, nutrition
etc., would be a much better way of obtaining good individual personas. In hindsight I
believe a deeper pre-processing method would have produced more meaningful personas.
Nevertheless, I believe this should not take away from the fact that I was able to identify
individual “personas”, which was the ultimate aim of this dissertation.
7.1.2 Objective 1 - Undertake a state-of-the-art literature review to inform, create
a design specification for identifying personas/Investigate in greater
detail the pros and cons of clustering with reference to appropriate
literature
To conclude this objective, I can confidently say that a state-of-the-art literature review
was undertaken (see Chapter 2), carefully analyzing two of the main clustering methods
(hierarchical and partitioning), drawing out their advantages and disadvantages and relating
them back to how they would affect the aim of this project. In addition, I looked into the
importance of personal data and how it has risen to be the new “oil”. I also looked at the
rise of the digital prosumer, in particular how prosumption is poised to take over typical
consumption, lending credence to Toffler's prediction that prosumption would take over
consumption by the turn of the 21st century. This all provided the necessary justification
for undertaking the project and exposed the potential value in building an application that
can identify personas. In essence I believe this objective was met to a high standard, making
use of a wide range of literature. This subsequently enabled me to create a design
specification for my application.
7.1.3 Objective 2 - Build a persona identification application.
This part of the project was by far the most challenging yet the most rewarding. First I was
tasked with choosing the appropriate software environment in which the application would be
coded; once this was decided, code development began. Although this was a very tedious task,
involving numerous failed attempts and heavily bugged versions, a final version was created,
bringing to life all the research and personal hypotheses set out at the beginning of the
project (see Chapter 5). Overall I was hugely satisfied with the implementation of the
application. Despite the fact that it took a huge amount of time and resources to put
together, I believe it was a very strong and well put together application that was indeed
fit for purpose.
7.1.4 Objective 3 - Evaluate the application.
The final part of this dissertation required me to evaluate the application, not only to provide
validation against my aim but also to validate the customer requirements defined in Chapter 4. I
went about this by first evaluating the usability of the system; this was done via a
questionnaire which was heavily influenced by Nielsen’s heuristic principles. After this, a
black-box test was put together to evaluate the functionality of the application. Both tests
were a huge success. As I was using experts to evaluate the system, there was a lot of extra
scrutiny placed on both the usability and functionality. The feedback was highly positive, which
went a long way in validating my aim and user requirements (see Chapter 6).
7.2 Future Development
One of the most common failings in any project is to neglect the things that haven’t been done,
due to time or resources, and to over-emphasise the things that have been achieved. I
believe that there is a world of benefits to be unlocked once we can sit back and look at what
can be developed in the future to make this project even better.
There are a number of things that can be achieved with future work/development that would
enhance the application even further. The first is obviously a much deeper pool of personas,
as explained earlier in this chapter. Another future development would be adding more
algorithms to the application instead of just the single K-Means; this was explained in more
detail in Section 2.8. Another development would be the ability to put the application on a
server and connect it to a database. This would enhance the application even more, as it would
mean that data from the data lockers could be stored in the database and called into the
application via a database query etc., making the application more robust, expanding the
57 A.4 Appendices.............................................................................................................................................. 57
List of Tables

Table 1 - User Requirements
Table 2 - Functional Requirements
Table 3 - Non-Functional Requirements
Table 4 - Use Case Narrative

List of Figures

Figure 1 - Fayyad KDD representation
Figure 2 - Example of a word sorting dendrogram output from: http://www.macs.hw.ac.uk/texturelab/people/thomas-methven/
Figure 3 - Design Science Guideline from MIS Quarterly Research Essay
Figure 4 - The Engineering Cycle
Figure 5 - Epistemological Assumptions for Qualitative and Quantitative Research from http://dstraub.cis.gsu.edu:88/quant/2philo.asp
Figure 6 - RAD Diagram
Figure 7 - Waterfall Model
Figure 8 - Activity Diagram of Persona Identification Application
Figure 9 - Use Case Diagram of Persona Identification Application
Figure 10 - Import csv file plus description
Figure 11 - Choose variables plus description
Figure 12 - Standardize data and run k-means plus description
Figure 13 - Choose K function plus description
Figure 14 - Show analysis results plus description
Figure 15 - Download results csv file plus description
Figure 16 - Screenshot of Persona Application Interface 1.0
Figure 17 - Screenshot of Persona Identification Application 2.0
Figure 18 - Evidence of data pre-processing Results
Figure 19 - Screenshot of results out CSV file
Figure 20 - Identifying Personas Breakdown
Figure 21 - Percentage Calculator Example
Figure 22 - Persona Percentage Results (Test 1)
Figure 23 - Persona Percentage Results (Test 2)
Figure 24 - System Usability Questionnaire
Figure 25 - Graph showing the optimum number of evaluators
Figure 26 - Functional Test Questionnaire
Figure 27 - Table of Usability Questionnaire Results
Figure 28 - Bar Chart of Usability Questionnaire Results
Figure 29 - Bar Chart showing average usability questionnaire results
Figure 30 - Results of System Functionality Questionnaire
1 Introduction

This dissertation looks at the digital prosumer, concentrating in particular on the identification of personas from whole prosumer data stores, which can be used as valuable commodities to sell on the ‘futures’ market. I plan to do this by identifying specific personas from a digital vault of prosumer personal information using intelligent data analysis, in this case clustering. During the course of this dissertation I expect to isolate, analyze and categorize raw prosumer data and present it in a way where I can link it to a persona. I also expect to find the best clustering technique through an extensive literature review, analyzing the advantages and disadvantages of each selected method before concluding which technique to use. I will then develop a persona identification application, which will analyze the data and group it into clusters that can be classified into personas. Finally, I will undertake a comprehensive evaluation of the application to gauge its overall effectiveness.

1.1 Problem Definition

Personal data can generate unprecedented economic and social value for governments, organizations and individuals in many ways. By 2020 it is estimated that more than 50 billion devices may be connected to the Internet (Nagel, 2013) and more than 40 times as many personal data records stored. With such large amounts of data collected from prosumers, smarter data mining techniques need to be employed to analyze the data efficiently and identify personas whose data can be traded on a data exchange. Data mining is the search for valuable information within large volumes of data by systematically exploring the underlying patterns, trends and relationships hidden in the available data. Data mining techniques can generally be categorized into: (i) classification and prediction; (ii) clustering; (iii) outlier prediction; (iv) association rules; (v) sequence analysis; (vi) time series analysis; and (vii) text mining.

1.2 Aims and Objectives

The aim of this project is to identify individual personas from prosumers’ personal information stored in a digital vault using an intelligent data analysis technique, clustering. To help me achieve this aim I have set out a list of objectives that will shape the body of this dissertation and help me determine whether the project aim has been satisfied.
• Undertake a state-of-the-art literature review to inform and create a design specification for identifying personas from digital personhood data using intelligent data analysis techniques (clustering).
• Investigate in greater detail the pros and cons of clustering with reference to appropriate literature.
• Build a persona identification application (e.g. using MatLab or R).
• Evaluate the application.

1.3 Project Approach

In order to complete this project successfully I have adopted a five-step approach. At each stage I will set deliverables that help me achieve my aims and objectives and complete the project on time.

The first step will be to conduct a state-of-the-art literature review. This review will look at different cluster analysis techniques from a variety of physical and online sources. It will inform the design of my application, which is the cornerstone of this project. In addition I will look at what has already been done in terms of cluster analysis, synthesize that information and relate it back to my project.

The second step will be to look at different methodology principles and models, picking the most appropriate method for this project with appropriate reference to the literature. Selecting the right methodology is pivotal to the success of this project.

The third stage will be to analyze the user requirements, discuss the design of my application and evaluate the GUI. Once this has been discussed and illustrated I will proceed to code my application, which will be done in RStudio.

The fourth stage will be ascertaining the results of the application and trying to find personas in the clustered dataset. The way I went about deciphering the information and deducing personas will be shown and explained at this stage.

The final stage of this project will involve evaluating the application and the project as a whole. This will be coupled with a personal reflection on my experience of putting together this project.
1.4 Dissertation Outline

Chapter 2: Literature Review - This chapter looks into previous literature to give me a deeper understanding of my research problem. It will subsequently help inform the design of my application.

Chapter 3: Methodology - This chapter looks at different methodology principles and software development lifecycle (SDLC) models, critically discussing the strengths and weaknesses of each before isolating the principle and SDLC most appropriate for my project.

Chapter 4: Requirements Analysis and Design - This chapter looks at the requirements of the application set out by the user and analyzes the functional and non-functional requirements. In addition I will go through the design process of my application and how I intend to put it all together.

Chapter 5: Implementation - This chapter demonstrates the coding of the logic of my application in R and the coding of the interface using R-Shiny. I will include fully annotated screenshots as evidence of implementation.

Chapter 6: Results and Evaluation - This chapter shows the results of the application and how I went about deducing personas from them. I will also evaluate the application to see whether it has met the aims and objectives set out at the beginning.

Chapter 7: Conclusion - This chapter draws conclusions from all the findings of this project. I will conclude on my aim as well as all three of my objectives. In addition I will evaluate my application from a subjective point of view, as well as the project in its entirety, and suggest future work to improve the application.
2 Literature Review

In this chapter I discuss and review the different clustering methodologies available, analyzing the advantages and disadvantages of each technique with reference to the appropriate literature. This, along with personal evaluation, will allow me to conclude which technique is the most appropriate for this project and to justify that choice adequately. In addition I look in further detail at what personal data is and how it has become an increasingly important contributor to economic growth and corporate supremacy, consequently giving rise to a new breed of prosumer, the digital prosumer.

2.1 Personal Data

If we look at the European Data Protection Directive (EDPD) [Article 2] we see that personal data is defined “by reference to whether information relates to an identified or identifiable individual” (Information Commissioner Office, 2010). In other words, personal data is any piece of information that can be used to identify an individual or an individual characteristic. The Data Protection Act of 1998 adds a different dimension to the EDPD definition of ‘data’ by taking into account the way the information was processed before it can be regarded as data, e.g. processed automatically or processed non-automatically. The EDPD and the Data Protection Act share a common view of what personal data/information is:

- Information processed, or intended to be processed, wholly or partly by automatic means (that is, information in electronic form) (ICO, 2010)
- Information processed in a non-automated manner which forms part of, or is intended to form part of, a ‘filing system’ (that is, manual information in a filing system) (ICO, 2010)

2.2 Value of Personal Data

Personal information is an increasingly important asset in the twenty-first century, in terms of corporate monetary value, government efficiency and economic prowess. Accordingly, companies around the world have begun investing heavily in software that facilitates the collation of consumer data (Schwartz, 2003). It is estimated that people across the world send 10 billion text messages and make 1 billion posts to blogs or social media sites every day, leading to the emergence of a new type of economy, the Internet economy. It is estimated that the Internet economy within the G20 amounted to $2.3 trillion, or 4.1% of total GDP, in 2010 (Group, 2012).
2.3 The Internet [Digital] Economy

Sometimes called the digital or web economy, the Internet economy is a concept based on digital technologies fusing with the traditional economy. The concept was first set out by Don Tapscott in his critically acclaimed book “The Digital Economy: Promise and Peril in the Age of Networked Intelligence”, and it is widely believed that the Internet economy is positioning itself as the new cornerstone of any emerging or established economy (Tapscott, 1997). This is evident in the figures released by the Boston Consulting Group in their Digital Manifesto report, which states that the value of the Internet economy is currently larger than the economies of countries like Brazil and Italy, and that by 2016 its value is expected to almost double to $4.2 trillion. The report goes on to say that “no company or country can afford to ignore this [Internet economy] phenomenon” (David Dean, 2012).

The rise in the amount of data being produced is strongly linked to innovation in mobile technology since the turn of the millennium, which has allowed more devices than ever to connect to the Internet. Steve Wojtowecz, Vice President of storage software development at IBM, stated that by 2015 over a trillion devices would be connected to the Internet (King, 2011). As a consequence the UK government has started two initiatives, Midata and the Information Economy Strategy (IES), to give prosumers improved and sufficient access to the personal data that companies hold about them (BIS, 2011).
2.3.1 Midata

These are the key principles [aims] of the Midata initiative outlined in its government report (Department for Business, Innovation & Skills, 2013):

- Get more private sector businesses to release personal data to consumers electronically
- Make sure consumers can access their own data securely
- Encourage businesses to develop applications (apps) that will help consumers make effective use of their data

2.3.2 Information Economy Strategy (IES)

These are the key principles [aims] of the IES project outlined in its government report (Department for Business, Innovation and Skills, 2013):

- A strong, innovative information economy sector exporting UK excellence to the world
- UK businesses and organizations, especially small and medium enterprises (SMEs), confidently using technology, able to trade online, seizing technological opportunities and increasing revenues in domestic and international markets
- Citizens with the capability and confidence to make the most of the digital age and benefiting from excellent digital services

Long-term success will be underpinned by:

- A highly skilled digital workforce (whether specialists who create and develop information technologies, or non-specialists who use them)
- The digital infrastructure (both physical and regulatory) and the framework for cyber security and privacy necessary to support growth, innovation and excellence

(Department for Business, Innovation and Skills, 2013)

It is important to remember that both of these government initiatives are being reinforced by reviews of and changes to legislation such as the Data Protection Act, the Consumer Rights Bill [both UK and EU level] and the Enterprise and Regulatory Reform Act 2013. The reason is that this legislation will require companies to disclose customers’ personal data to them if the companies opt not to do so voluntarily (Department for Business, Innovation & Skills, 2013).

2.4 What is a Persona?

Typically used as a tool in marketing and human-centered design [HCD], personas are hypothesized groups of users who show similar behavioral patterns in their use of technology, lifestyle decisions, customer service preferences and purchasing decisions. Angus Jenkinson first came up with a top-down analytical approach that works by ‘grouping’, focusing on a synthetic clustering process that leads to ‘customer communities’ and the creation and preservation of loyalty within these communities, in his 1994 journal article Beyond Segmentation (Jenkinson, 1994).
This concept was refined five years later by Alan Cooper in his pioneering book The Inmates Are Running the Asylum, in which Cooper introduces the concept of the ‘persona’ that is used today to identify customers’ relative behavior and consumption patterns (Cooper, 1998).

2.5 What is a Prosumer?

Alvin Toffler is widely considered the creator of the concept of prosumption. In his book ‘The Third Wave’ he defines prosumers as people who “produce some of the goods and services entering their own consumption” (Toffler, 1980) (Kotler, 1986). In other words, people who produce and consume their own products and services are prosumers. In the 21st century the prosumer has become more and more prominent, replacing the traditional consumer of the Industrial Age. This lends credence to Toffler’s own prediction that, as society moves towards the Post-Industrial Age, the number of pure consumers will decline and they will be replaced by “prosumers” (Toffler, 1980).

2.5.1 The Rise of the Digital Prosumer

As we delve deeper into the Information Age and the Internet economy continues to evolve into an economic juggernaut, a new type of prosumer has emerged: the digital prosumer. The digital prosumer is a person who creates and consumes his or her own data. As of today the biggest beneficiaries of the personal data produced are the so-called big three data companies, Google, Facebook and Twitter, which make upwards of $1200 from a user profile (Madrigal, 2012).

2.6 Data Mining

Data mining is the iterative process of extracting or “mining” knowledge from very large data stores, so that it can be put into perspective and turned into useful information. Data mining is thought to involve six common classes of task that lead to prediction and description, the primary goals of data mining (Wikipedia, 2011) (Kamber, 2006):

• Classification - learning a function that classifies a single data item into one of several predefined classes. Examples of classification techniques:
  - Bayesian classifiers
  - K-nearest neighbor
  - Linear classifiers
• Regression - learning a function that maps a data item to a prediction variable. In other words, regression estimates the relationship between variables. Some examples of regression models are:
  - Percentage regression
  - Bayesian linear regression
  - Nonparametric regression
• Clustering - a descriptive task that aims to identify the clusters or categories that describe the data. Examples of clustering techniques are:
  - Hierarchical
  - Partitioning
  - Density-Based
  - Centroid-Based
• Summarization - a method for finding a cohesive description of a data set; this includes analytical representations such as visualization and report generation
• Dependency modeling - a method that consists of finding a model that depicts significant dependencies between variables
• Change and deviation detection - a method that focuses on finding the most significant changes from previously measured data

(Usama Fayyad, 2008)

2.6.1 Knowledge Discovery from Data [KDD]

KDD is often mistaken for data mining itself; data mining is in fact one essential step of the knowledge discovery process. Usama Fayyad proposed the KDD methodology in 1995 with the purpose of making the data produced by companies useful to their business needs (Deutsch, 2010).

Figure 1 - Fayyad KDD representation

Knowledge discovery takes an iterative, sequential approach, which consists of the following steps (Kamber, 2006):

• Data Cleaning - to remove noise and inconsistent data
• Data Integration - where multiple data sources may be combined
• Data Selection - where data relevant to the analysis task are retrieved from the database
• Data Transformation - where data are transformed or consolidated into forms appropriate for mining
• Data Mining - an essential process where intelligent methods are applied in order to extract data patterns
• Pattern Evaluation - to identify the truly interesting patterns representing knowledge
• Knowledge Presentation - where visualization and knowledge representation are used to present the finished knowledge to the user
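To make the sequence of KDD steps above more concrete, the following is a minimal Python sketch; it is not part of the application described in this dissertation, which is written in R. The household records, the field names `household` and `spend`, and every function name are hypothetical, and the mining step is a deliberately trivial placeholder (above or below average spend) standing in for a real technique such as clustering.

```python
from statistics import mean, pstdev

def clean(records):
    """Data cleaning: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, fields):
    """Data selection: keep only the fields relevant to the analysis."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records, field):
    """Data transformation: standardise one numeric field to z-scores."""
    values = [r[field] for r in records]
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]

def mine(z_scores):
    """Data mining (placeholder only): label above/below average values."""
    return ["high" if z > 0 else "low" for z in z_scores]

def evaluate(labels):
    """Pattern evaluation: count how many records fall in each group."""
    return {label: labels.count(label) for label in sorted(set(labels))}

# Hypothetical records from two made-up sources.
store_a = [{"household": 1, "spend": 40.0}, {"household": 2, "spend": None}]
store_b = [{"household": 3, "spend": 80.0}, {"household": 4, "spend": 60.0}]

combined = integrate(clean(store_a), clean(store_b))
selected = select(combined, ["spend"])
summary = evaluate(mine(transform(selected, "spend")))
print(summary)  # knowledge presentation: show the result to the user
```

Each function corresponds to one KDD step, so a real mining technique could replace `mine` without changing the rest of the pipeline.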
2.7 Cluster Analysis

Cluster analysis can be defined as the process of grouping a set of physical or abstract objects into classes of similar objects. In other words, a cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to objects in other clusters. An advantage of cluster analysis is that it can single out useful features that define the characteristics of different groups, which, in turn, will help me in my aim of identifying personas from prosumer data (Kamber, 2006). There are various cluster analysis techniques, such as Partitioning, Hierarchical (Agglomerative and Divisive) and the Single Link Method (Raza Ali, 2004).

2.7.1 Partitioning Technique

Partitioning methods aim to relocate data objects from one cluster to another, starting from an initial partitioning. These methods also require the number of clusters to be pre-set by the user. It is commonly cited that achieving global optimality in this type of clustering requires an exhaustive enumeration of all possible partitions. Because this is impractical, most applications choose one of two popular heuristic algorithms, K-means and K-medoids (Kamber, 2006):

• K-Means Algorithm
K-means represents each cluster by the mean value (centroid) of the objects in the cluster; K denotes the number of clusters.

• K-Medoids Algorithm
K-medoids, on the other hand, represents each cluster by one of the objects located near the center of the cluster.

2.7.2 Advantages and Disadvantages

The K-means technique has advantages as well as disadvantages. One of the main advantages is that K-means works well for finding spherical-shaped clusters within small to medium-sized data stores. Another advantage of K-means is that it tends to produce tighter, more compact clusters than, say, hierarchical clustering (Lior Rokach, 2010).

However, there are also disadvantages to this technique. One is that it is limited in the type of cluster structure it can be applied to. The effectiveness of the algorithm is predicated on spherical (sometimes called globular) clusters, as this allows the mean value to sit close to the center of the cluster. Consequently, clusters of dissimilar sizes, or very large datasets, do not work well with this algorithm. Another disadvantage is that the algorithm is very sensitive to noisy data and outliers, which can increase the squared error significantly. A further drawback is that the user must know the number of clusters beforehand, which can be a very tedious task (Improved Outcomes Software (ios), 2009).

2.7.3 Hierarchical Technique

Hierarchical methods aim to create a hierarchical decomposition of the given set of data objects. This approach can be divided into two techniques, Agglomerative and Divisive. In the agglomerative method, also called the bottom-up approach, each data object starts as its own cluster, and clusters are then successively merged until the desired cluster structure is achieved. In the divisive method, also called the top-down approach, all the data objects start in the same cluster, which is partitioned into sub-clusters, which are in turn partitioned into further sub-clusters. This sequential process is repeated until the desired cluster structure is obtained. One of the intriguing things about hierarchical clustering is that it provides a readable visual of the algorithm's result together with the data; this is called a dendrogram. It is a useful summarization tool that makes hierarchical clustering extremely popular (Lior Rokach, 2010).

Figure 2 - Example of a word sorting dendrogram output from: http://www.macs.hw.ac.uk/texturelab/people/thomas-methven/

2.7.4 Advantages and Disadvantages

It is important to remember that hierarchical techniques have many advantages as well as disadvantages. One of the advantages is versatility; single-link methods, for example, perform well on datasets containing well-separated, chain-like and concentric clusters.
Another advantage to hierarchical methods is the fact that they produce multiple partitions, this is particular resourceful for users that want to choose different
  • 19. Digital Prosumer - Identification of Personas through Intelligent Data Mining (Clustering) 19 partitions from those already nested in the overall cluster according to the desired similarity level chosen by the user. On the other hand the disadvantages to this particular technique are quite evident. Hierarchical algorithms are notorious for their inability to scale well; the algorithm is also accredited to causing high I/O costs when trying to cluster a large number of objects. Another disadvantage to the hierarchical technique is that its rigidity, simply put, once one step is done in the sequence it can never be undone or modified. (Lior Rokach, 2010) 2.8 Critical Discussion Having reviewed the advantages and disadvantages of hierarchal and partitioning techniques it’s important to offer an analysis of both techniques, in relation to this project, in order for to be able to distinguish the most appropriate technique for clustering. From my research I can see that partitioning clustering works well on small sized data sets as opposed to bigger data sets, the dataset used in this project is fairly large containing data from 2,500 household’s weekly shop. Partitioning clustering also goes about making tighter, more cohesive, clusters through its k-means algorithm, which makes it easier to depict the key features within the cluster, which in turn defines persona characteristics. On the other hand, for users not to encounter noisy data while clustering it is advantageous for them to know the number of clusters in advance, this is near on impossible with the size of the database in question. Looking on the other side of the coin we see that the Hierarchical technique is very versatile offering different methods such as single link, complete link and average link, which, consequently, delivers separate clusters. This I believe will work well in this project, as it will aid in presenting persona’s from the dataset provided. 
In addition, the hierarchical technique includes algorithms designed to assure cluster quality, such as Chameleon, which would help to ensure that the personas defined are valid. On the other hand, the hierarchical technique is very rigid, so if an erroneous decision is made it is nearly impossible to correct. This is a big disadvantage for this project, as identifying personas needs a great deal of flexibility because the parameters for personas can change at any time.

In light of all the information reviewed, it is fair to say that both techniques offer a number of advantages and disadvantages. To obtain the best and most concise results I believe consensus clustering would be the best option. However, due to time constraints and a lack of expertise in coding, I have decided to use the K-Means algorithm to provide the logic for my application. I then intend to build an interface which simplifies the steps of the K-Means algorithm and presents them in a way that is easy for the user to administer. The choice of software environment for coding the interface, and the justification for it, will be made in Chapter 5.

2.9 Summary

In this chapter I have discussed personal data and its value, and I have looked into the definition of personas together with the rise of the prosumer and the Internet economy. Furthermore, I have discussed in detail what cluster analysis is, looking in particular at two clustering techniques (hierarchical and partitioning) and offering an in-depth critical discussion of the technique I have chosen to take forward into my application. The findings of this chapter will further equip me to meet the aims and objectives set out for this project. In addition, they will assist me in constructing a design specification for my application.
3 Methodology

This chapter explores different research methodologies and justifies the methodology chosen for this project. The three approaches considered are Design Science, Positivist and Interpretive. The methodology I have decided to use is the design science approach. The justification is supported by reference to the literature, as well as by a personal analysis of the different approaches.

3.1 Design Science

As previously mentioned, the design science approach is my chosen methodology for this project. Design science, simply put, is a methodical form of designing, or research into design. First established by the American inventor Richard Buckminster Fuller in 1963, the concept was further developed by Gregory in his 1966 book “The Design Method”, in which he demarcates the relationship between design method and scientific method. He further stresses his view that design is not inherently a science and that the term design science refers to the scientific study of design.

As technology continued to evolve at the turn of the century, design science became more integrated into information systems research and software design projects. In 2004 Alan Hevner produced a seven-guideline framework with the aim of assisting information systems researchers to conduct, evaluate and present design-science research (Alan R. Hevner, 2004).

Figure 3 - Design Science Guideline from MIS Quarterly Research Essay

This framework was later refined by Peffers in order to explain how the regulative cycle fits into the design science research framework.
Figure 4 - The Engineering Cycle

This framework is widely used today by information systems researchers, as it gives them a means to analyze and decipher an existing problem and to offer a solution design or solution hypothesis. They can then examine whether their solution or hypothesis is effective or meets the specified criteria; this can be done through a pilot scheme or prototyping, after which full implementation can take place (Roel Wieringa, 2010). In my opinion this principle suits my project best, as I aim to design a software solution (a clustering program), build it, and then evaluate its effectiveness.

3.2 Positivist Approach (Positivism)

The positivist approach is a methodology based on an objective hypothesis, arrived at by introspection or intuition, which is validated or disproved by scientific testing and experimentation (Sage Publications, 2009). In other words, a positivist approach starts with a hypothesis that validates or discredits a subject area and then goes on to test that hypothesis by experimentation or by building a solution (University of the West of England, 2007). The origins of the method lie with the sociologist Auguste Comte, who coined and developed the term in the early 19th century. Today the positivist approach is used increasingly in IS and software engineering projects (Sociology Guide, 2008).

One advantage of the positivist approach is that it relies heavily on quantitative rather than qualitative data; quantitative data is seen as more scientific and therefore a more reliable basis for hypotheses. Another advantage is that it follows a very stringent structure: positivists believe that there are guidelines that must be adhered to, which should consequently minimize room for error.
This leads positivists to believe that the reduced room for error makes the whole approach more accurate when it comes to experiments and applications. However, there are drawbacks to the approach, one of them being human behavior. Positivists strongly believe in objective assumptions, but there is no guarantee that bias or subjective analysis will not corrupt the study (Johnson, 2010) (Wikipedia, 2014).

Figure 5 - Epistemological Assumptions for Qualitative and Quantitative Research from http://dstraub.cis.gsu.edu:88/quant/2philo.asp

3.3 Interpretive Approach

The interpretive approach is a qualitative research method based on subjective assumptions, with knowledge derived from value-laden, socially constructed interpretations (Packer, 2007). In stark contrast to the positivist approach, interpretivist researchers aim to understand and interpret human behavior rather than to generalize and predict cause and effect. The impact on information systems and software design projects is that, once the scope of the project has been defined, the researcher will ask several open-ended questions, generally through questionnaires or unstructured or semi-structured interviews and sometimes observations, to gather as much primary information as possible (WordPress, 2012). This approach also allows the researcher to remain open to new ideas throughout the project, unlike positivists, who believe in pre-ordained rules and guidelines.

With that being said, this approach has advantages as well as disadvantages. One advantage is that the methodology is highly qualitative, meaning that the data gathered will have more depth. A drawback is that interpretivists take a subjective view of the project, which can lead to bias getting in the way of ascertaining the correct results or the best methods for completing the project.
(Institute of Public & International Affairs, 2009) (Slideshare, 2013)

3.4 Critical Discussion

Having looked at all three research approaches in appropriate detail, highlighting the advantages and disadvantages of each, it is safe to say that any of them could serve as the framework for an information systems project. However, I believe that the best approach for this particular project is the design science approach, as it corresponds most closely to what I am trying to achieve (design, build, evaluate). That being said, I believe that I can still look at this project from a positivist point of view. The reason is that using data mining to develop ‘personas’ is a relatively novel idea, so I am using a hypothesis to try to prove that it is possible and can be done.

3.5 Software Development Lifecycle Models

There are many models that can be used to develop a software project. All of these models follow the design science principle of design, build, evaluate. In this section I identify and describe two common models and analyse each, after which I select the model best suited to my project.

3.5.1 Rapid Application Development (RAD)

Rapid Application Development is an iterative model that favors rapid, early software prototyping over traditional planning. This allows software development to begin much sooner. It also keeps stakeholders at the heart of the development process and allows requirement changes to be made easily. RAD typically follows four phases: the Requirements Planning Phase, User Design Phase, Construction Phase and Cutover Phase (Wikipedia, 2014) (David C. Yen, 1999).

1. Requirements Planning Phase – The inaugural phase of the project, where the project team meets with the stakeholders to go over the business needs of the client, the project scope, system requirements and constraints. This is followed by agreement on the key issues that need to be addressed, after which the relevant authorization needs to be obtained in order to proceed.
2. User Design Phase – In the second phase the stakeholders maintain a dialogue with the project analysts to develop prototype models of the system that clearly represent all system inputs and outputs, plus all the processes within the system. This phase of RAD is a continuous, interactive process that allows the stakeholders to play an active role in understanding, modifying and eventually approving a working prototype once they see a model that caters to their business needs.
3. Construction Phase – The penultimate phase of the project continues to focus on program and application development. Stakeholders continue to participate by suggesting changes and improvements to any user interfaces or reports, which are typically developed at this phase. Unit-integration and system testing, programming and application development are done at this phase of RAD.
4. Cutover Phase – The final phase of RAD is when the whole project is brought to a head. Tasks such as testing, data conversion, user training and system changeover are done at this stage. Compressing all of these tasks into the final stage enables the new system to be delivered to the stakeholders in a much shorter timeframe.

Figure 6 - RAD Diagram

3.5.2 Analysis

The RAD model comes with many advantages as well as disadvantages. The key is to be able to synthesise them and relate them back to my project. One of the common advantages of the RAD model is that it drastically reduces the time needed for requirements analysis and software requirements. Also, all prototypes created can be stored for future use, which speeds up software development. However, heavy prototyping is not necessary for my project, as it is a fairly short, small project with strict user requirements (Rouse, 2007) (ISTQB Exam Certification, 2012).

3.6 Waterfall Model

The waterfall model is a sequential design model in which software development flows downward through several phases of tasks and activities (reminiscent of an actual waterfall). It differs from agile development models in that it seeks to describe the application fully in written documents before actual software development commences. Originally developed by Royce in 1970, the waterfall model follows seven sequential phases (The Waterfall Development Methodology, 2012).

1. Requirements Specification – The requirements are gathered from the stakeholders and agreed on in principle with the development team.
2. Design – The blueprint of the project is drawn up and given to the developers so that coding and implementation can commence.
3. Implementation – The actual system is developed at this stage; all coding is completed, resulting in the finished program.
4. Integration – The system created is integrated into the environment agreed on in the preliminary phase.
5. Testing – Full testing of the integrated system is performed at this stage. Debugging also happens here, with a view to finding any bugs and working on potential fixes and patches.
6. Installation – The system is installed at this stage, including the removal of the old system. This stage also includes training for all stakeholders and staff members.
7. Maintenance – The installed system is maintained through continuous updates and patches being developed and installed.

The waterfall model follows the strict principle that you can only move on to the next phase once the current phase has been completed and perfected, meaning that once a phase is completed it cannot be revisited (ISTQB Exam Certification, 2012).

Figure 7 - Waterfall Model

3.7 Analysis

The waterfall model comes with many advantages. One of the most common is the sequential nature of the model, which makes it very easy to understand and execute. Another is that it works well on fairly small projects with strict, set-in-stone requirements, which suits my project. A further reason I favor this SDLC is that it goes hand in hand with the design science approach (design, build and evaluate) (Select Business Solutions, Inc., 2010).

3.8 User Interface Evaluation

One of the most integral parts of any software project is being able to evaluate the design of the artefact coherently. As previously stated, the user requirements inform the design of the application; once this is done, a framework or principle needs to be applied in order to evaluate it.
One of the most popular techniques for usability evaluation is the Nielsen Heuristics. In this section of the report I discuss the Nielsen Heuristics in detail, as well as another usability inspection method, the Cognitive Walkthrough, in order to compare the two methods qualitatively. This in turn will help me decide on the most suitable approach for evaluating the usability of the Persona Identification Application.

3.8.1 Nielsen Heuristics

As previously stated, the Nielsen Heuristics are one of the most popular and widely used usability evaluation techniques today. It is important to remember that heuristic evaluation complements conventional user testing by providing a template, or set of principles, that helps uncover problems a user is likely to come across. It was Jakob Nielsen's work with Rolf Molich in 1990 that originated the heuristics widely used today. However, it was in his 1994 publication Usability Engineering that the actual ten heuristics were published for the first time (Nielsen, 1994). (Some of the heuristics have been shortened for brevity.)

1. Simple and Natural Dialogue – The dialogue should not contain information that is irrelevant or rarely needed.
2. Speak the User’s Language – The dialogue should be expressed clearly in words, phrases and concepts familiar to users rather than in system-oriented terms.
3. Minimize the User’s Memory Load – The user should not have to remember information from one part of the dialogue to another.
4. Consistency – Users should not have to wonder whether different words, situations or actions mean the same thing.
5. Feedback – The system should always keep users informed about what is going on, through appropriate feedback within reasonable time.
6. Clearly Marked Exits – Users often choose system functions by mistake and need a clearly marked ‘emergency exit’.
7. Shortcuts (Accelerators) – Unseen by novice users, these often speed up the interaction for expert users.
8. Good Error Messages – They should be expressed in plain language (no code) and precisely indicate the problem.
9. Prevent Errors – Even better than good error messages is a careful design that prevents a problem from occurring in the first place.
10. Help and Documentation – Even though it is better if the system can be used without documentation, it may be necessary to provide help and documentation. Any such information should be easy to search, be focused on the user’s tasks, list concrete steps to be carried out and not be too large.

3.8.2 Advantages and Disadvantages

The Nielsen Heuristics come with advantages as well as disadvantages. One advantage is that they are a very useful and relatively inexpensive way of giving designers quick feedback, which can reduce the overall time that a product spends in the usability evaluation stage. Furthermore, they can be a good way of obtaining qualitative feedback early in the design process. Another advantage of heuristic evaluation is that it can help immensely in suggesting the best corrective measures to designers, provided that the correct heuristic has been assigned in the first place. This would prove helpful when designing the user interface for the Persona Identification Application (PIA).

Looking deeper into the Nielsen Heuristics, there are a few disadvantages to this evaluation principle. One is that applying the heuristics effectively requires specialist knowledge and competent experience. Moreover, usability experts trained to administer the heuristics effectively are hard to come by and can be relatively expensive to source. Another disadvantage is that the heuristics can be misleading, in that they tend to identify more of the minor issues and fewer of the major issues with the design (Usability.Gov, 2010) (Nielsen, 1994).

Moving forward, it is important to remember that heuristic evaluation does not replace conventional usability testing and should not be seen as an alternative to it.
Many of the benefits and drawbacks have been highlighted above, and having discussed them I am in no doubt that the Nielsen Heuristics are the most suitable means of evaluating the user interface of the application. The reason is that, in essence, they evaluate all the basic requirements set by the stakeholders, and they also give me things to consider while designing the app (accelerators, consistency and so on) as well as things to evaluate at the end of the design process.

3.9 Critical Discussion

The way I intend to go about this heuristic evaluation is to construct a usability questionnaire as well as a system functionality test. Together these will allow me to ascertain the usability of the system and to test its functionality, thus validating the user requirements.
3.9.1 Cognitive Walkthrough

To balance the argument over which evaluation technique to use, it is imperative to draw a comparison. A direct comparison to the Nielsen Heuristics is the Cognitive Walkthrough approach. Cognitive Walkthrough was developed as an additional tool in usability engineering. The technique involves a group of evaluators undertaking a set of tasks on the interface to evaluate its ease of learning and understandability. Lewis and Polson first set out the concept of the cognitive walkthrough, which works by tasking the evaluators with four questions (usabilityfirst, 2011) (Cathleen Wharton, 1994):

• Will the user try to achieve the right effect?
• Will the user notice that the correct action is available?
• Will the user associate the correct action with the effect to be achieved?
• If the correct action is performed, will the user see that progress is being made toward solution of the task?

Once these questions have been answered, the evaluator attempts to construct a ‘success story’ for each incremental step of the process. If this turns out to be impossible, the evaluator creates a ‘failure story’, which aims to assess why the user cannot accomplish the task using the GUI. The findings from the walkthrough are later aggregated and used to make improvements to the application, in this case the Persona Identification App.

3.10 Critical Discussion

Like the heuristics discussed earlier, the cognitive walkthrough has advantages as well as disadvantages. One of the main advantages is that it is useful for identifying problems early in the design phase, and it helps define users’ goals and assumptions with fewer resources than, say, full user testing would demand. This technique fits well with the scope of my project, as it provides a short and concise evaluation of the user interface I will be designing; it also provides a user-centered perspective similar to that offered by the heuristics. However, one of the main issues with the cognitive walkthrough is that it is more susceptible to subjective bias from the evaluators, which may result in the main issues not being covered. Another issue is that it can be very difficult for a seasoned evaluator to assume the perspective of an inexperienced user of the system (Lewis, 1997).

3.11 Summary

In this chapter I have looked in depth at three research approaches, evaluating each of them and choosing the most appropriate one for my project. In addition, I looked into software development lifecycle models and picked the waterfall model as the most efficient lifecycle for this project. Finally, I looked into user interface evaluation, choosing the Nielsen Heuristics as my way of evaluating the application interface. The findings of this chapter have helped me choose the appropriate methodology and evaluation for this project.
4 Requirements Analysis and Design

In this chapter I review and discuss the fundamental requirements of this project. There are many categories of requirements that can be used; in this project I use three: customer requirements, functional requirements and non-functional requirements. In addition, I discuss the design process of my project, making use of activity diagrams, use case diagrams and narrative to help illustrate the design of my application.

4.1 Customer Requirements

Customer requirements are direct statements or expectations that come from the principal stakeholders or prime actors of the project being developed. They directly affect the scope of the project and have unequivocal ramifications for the key features of the system being developed. In this case I spoke directly to some of the principal stakeholders for the Persona Identification Application, who told me that their requirements were the following:

1. To be able to use a complete dataset (Excel)
2. To be able to cluster the dataset through an application interface
3. To be given back a visual representation of the clustering results through the application interface
4. To be able to download a CSV table that shows the clustering results, which can help facilitate the identification of personas

Table 1 – User Requirements

4.2 Functional Requirements

Functional requirements are the mandatory tasks and activities that must be fulfilled for the app to deliver its full functionality. In other words, they describe what the system should do and the features it should provide to its users. The table below shows the functional requirements for the Persona Identification Application.
Table 2 - Functional Requirements

4.3 Non-Functional Requirements

Non-functional requirements describe the qualities of the system, in this case the Persona Identification Application: how well it should perform its functions rather than what those functions are. The table below shows the non-functional requirements for this system.

Table 3 - Non-Functional Requirements

4.4 Requirements Summary

One of the key things to remember is that requirements gathering and analysis plays a crucial role in informing the design of the software solution. The requirements, along with the research conducted in the literature review, will assist me in putting together an adequate design of the system, which is shown in the second half of this chapter.

4.5 Design

In this part of the chapter I concentrate on the design of the Persona Identification Application. As previously stated, the outcomes of my literature review, coupled with the results of the requirements analysis, have helped put this part of the chapter together. I will draw up different diagrams to show clearly the interaction between the user and the system. I will also explain why each method was chosen.

4.6 Activity Diagram

One of the important UML models, an activity diagram illustrates the workflow of a business process. In this case the diagram below shows the set of incremental steps that an end user needs to complete to attain his or her end goal. Along the way there are different decision points that the user will face, all of which ultimately lead to the same main deliverable. One of the reasons I opted to construct an activity diagram is that it is one of the most comprehensible diagrams, offering a clear understanding of the business flow within the system not only to the developers but to the stakeholders as well (Wang Linzhang, 2004).
Figure 8 - Activity Diagram of Persona Identification Application

4.7 Use Case

Another important UML model, the use case diagram aims to offer the simplest way of demonstrating the user’s interaction with the proposed system. The diagram below shows the user interactions with the Persona Identification App. In addition to the diagram I put together a use case narrative, which provides a more in-depth description of the use case diagram. The reason I chose to produce a use case diagram and narrative is that they provide an abstract view of the application from the user’s perspective (Elenburg, 2005).

Figure 9 - Use Case Diagram of Persona Identification Application
Table 4 - Use Case Narrative
Summary

This chapter has looked at the requirements set out by the user, setting out the functional and non-functional requirements of the application. It has also shown how I went about designing the application; in addition, I have been able to discuss different techniques for evaluating the usability of the application interface and its functionality. The findings in this chapter will help me greatly in implementing the application, taking into consideration the requirements from the users; equally, they will help me evaluate the application as a whole. This will be explained further in Chapter 6.
5 Implementation

In this chapter I discuss the implementation of the Persona Identification App. In particular, I look at the software environment in which I chose to implement the application, which in this project is R, and justify why it was chosen. In addition, I detail the full functionality of the application by way of screenshots, with a description of each point.

5.1 Software Environment – R

R is a free, command-line-based programming language designed specifically for statistical computing and data mining. Its software environment enables its users to construct statistical software as well as graphical user interfaces. As previously stated, R is a command-line-based language, meaning that it runs through an MS-DOS style display; however, several GUI platforms have been developed for use alongside R, such as RStudio. One of the main reasons I decided to use R to implement this system is that it is free, meaning that I could use it at will rather than having to obtain a license. Another reason was that I felt quite comfortable using a command-line-based system due to my prior experience with MS-DOS. In addition, R offers a good, easy-to-understand package for developing interactive web-based interfaces (Shiny), which I used to develop the interface.

5.2 Software Environment - MATLAB

MATLAB is a high-level, interactive programming environment written in a number of programming languages such as Java, C and C++. One of the advantages of MATLAB is that it gives its users access to a wide range of features, such as plotting and mapping functions and data, implementing algorithms and using built-in math functions.
Furthermore, MATLAB allows its users to create graphical user interfaces that work hand in hand with the programs coded in its environment. One of the main reasons I chose not to use MATLAB to develop and implement the Persona Identification App was that I was unable to obtain a license from the university to use it at home, meaning that every time I wanted to work on development I would have had to come onsite, which is neither feasible nor efficient. 5.3 Persona Identification Application Implementation As previously stated, I developed the persona identification program in R and then developed the interface using R’s own package, Shiny. To do this I had to code a number of different functions and then put them together in a Shiny-based application. I have enclosed below screenshots of the code for the most important functions, with annotations to help depict what each function is doing. For convenience I have also listed the functions below:
  • 38. 1. Import CSV file and convert to data matrix 2. Choose variables 3. Standardize data option and cluster data 4. Show within-groups sum of squared errors (number of clusters) 5. Show results 5.3.1 Application Coding Screenshots 1. Import CSV File Figure 10 - Import csv file plus description 2. Choose variables Figure 11 – Choose variables plus description
  • 39. 3. Standardize data and run K-Means algorithm Figure 12 – Standardize data and run k-means plus description 4. Show within-groups sum of squared errors (number of clusters) Figure 13 – Choose K function plus description
  • 40. 5. Show Analysis Results Figure 14 – Show analysis results plus description 6. Download cluster results CSV file Figure 15 – Download results csv file plus description
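The standardisation, K-Means and within-groups sum of squares steps listed in 5.3.1 can be sketched as follows. This is an illustration only and is not the code shown in the screenshots: the application itself was written in R, whereas this sketch uses plain Python, and the function names (`standardize`, `kmeans`, `wss_by_k`), the fixed random seed and the iteration limit are all assumptions made for the example.

```python
import random
from statistics import mean, stdev

def standardize(rows):
    """Z-score every column: subtract its mean, divide by its sample standard deviation."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [stdev(c) or 1.0 for c in cols]  # constant column: avoid dividing by zero
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(rows, k, iters=100, seed=0):
    """Basic K-Means; returns (labels, centroids, within-groups sum of squares)."""
    rng = random.Random(seed)
    centroids = [list(r) for r in rng.sample(rows, k)]  # k random rows as starting centres
    labels = None
    for _ in range(iters):
        # Assignment step: each row joins its nearest centroid.
        new_labels = [min(range(k), key=lambda j: sq_dist(r, centroids[j])) for r in rows]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:
                centroids[j] = [mean(col) for col in zip(*members)]
    wss = sum(sq_dist(r, centroids[lab]) for r, lab in zip(rows, labels))
    return labels, centroids, wss

def wss_by_k(rows, max_k):
    """Within-groups sum of squares for k = 1..max_k, used to choose the number of clusters."""
    return {k: kmeans(rows, k)[2] for k in range(1, max_k + 1)}
```

Plotting the values returned by `wss_by_k` against k gives the curve used to pick the number of clusters: the point where the curve stops dropping sharply is usually taken as a sensible K.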
  • 41. 5.3.2 Application Interface Screenshots In this part of the chapter I will present screenshots depicting the actual interface of the application. This will add a visual impression to the lines of code explained earlier. The screenshots are annotated to provide more in-depth descriptions of what is happening within the application. Figure 16 - Screenshot of Persona Application Interface 1.0 Figure 17 – Screenshot of Persona Identification Application 2.0
  • 42. 5.4 Assumptions In order to run the application successfully there are some prerequisites that need to be met. One of them is that all the data in the csv file needs to be numeric, otherwise the K-Means algorithm will throw errors. In addition, the data that is input has to be pre-processed in order to gain tangible results. This will be discussed further in Chapter 6. Finally, when running this application in R the shiny library needs to be installed and loaded; once this is done the command runApp(".") needs to be entered to run the application. 5.5 Summary This chapter has shown the implementation of the application as well as the reasoning behind my choice of software environment. I have also discussed the prerequisites that need to be fulfilled in order for the application to work. The work in this chapter has demonstrated my ability to code an application and present it in a user-friendly manner.
  • 43. 6 Results and Evaluation In this chapter I will look at the results gained from the application developed. I will also detail how I went about deriving personas from the results data. It’s important to remember that this application can work with any dataset as long as it is numeric; for the purposes of this project I have focused on a dataset containing 500 families’ weekly shopping over a 2 month period. Furthermore, I will evaluate the application’s usability using Nielsen’s heuristic principles and conduct black-box testing to test the system functionality. 6.1 Data Pre-Processing As previously stated, data pre-processing is an essential part of the data mining process as it helps lay the foundation for more concise result analysis. It also helps clear up the so-called ‘garbage’ data that may skew the results. To pre-process the data used for this project I first chose the two most important variables that would help me identify personas from the dunnhumby dataset, which in this case were household key (hkey) and product category (prodcatID). I used a technique called “Quota Sampling” to select which data I wanted to use for this analysis (Riley, 2012). After this I created my own data subset, a CSV file containing only the two variables. Finally, to adhere to the rule of K-Means that all data must be numeric, I assigned each of the 22 product categories a numeric value and entered these values into the data subset, keeping a reference of each category and the numeric value it is assigned to, which can be seen below. For ease of understanding I used the product category as the “persona”, e.g. GROCERY will be a grocery persona etc. Figure 18 – Evidence of data pre-processing Results
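The category-to-number encoding described in 6.1 could be sketched along the following lines. This is only an illustration in Python (the project itself used R and a manually prepared CSV file); the function name `encode_categories`, the starting code of 1001 and the sample household keys are assumptions made for the example, and codes are simply handed out in order of first appearance.

```python
def encode_categories(records, start=1001):
    """Replace each category label with a numeric code and keep a reference table.

    records: iterable of (hkey, category) pairs.
    Returns (encoded_records, codebook) where codebook maps category -> code.
    """
    codebook = {}
    encoded = []
    for hkey, category in records:
        if category not in codebook:
            # New category: give it the next unused numeric code.
            codebook[category] = start + len(codebook)
        encoded.append((hkey, codebook[category]))
    return encoded, codebook

# Hypothetical sample rows; the household keys are made up for illustration.
sample = [(1, "GROCERY"), (1, "DRUG GM"), (2, "GROCERY"), (3, "PRODUCE")]
encoded, codebook = encode_categories(sample)
```

Keeping `codebook` alongside the encoded data plays the same role as the reference of categories and numeric values: it allows cluster results expressed as numbers to be translated back into named personas.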
  • 44. Once the results CSV file is downloaded, its contents show four columns: kclust, which shows how many clusters there are; hkey and prodcatID, which are the two variables chosen for analysis; and finally fit.cluster, which shows which cluster each row has been assigned to. Figure 19 - Screenshot of results out CSV file I can see from here that each prodcatID and hkey has been assigned to a fit.cluster, the number of which has already been set by the user (see. From this I can then filter the rows in the csv file to see how many instances of each numeric category code, e.g. 1001, 1002, are in each cluster. Once I have found out how many of each code are in each cluster, I aggregate the total amount, which in turn helps me work out a persona percentage for each category in each cluster. I make sure all the results are documented, which can be seen below. Figure 20 - Identifying Personas Breakdown
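The counting and percentage steps just described can be expressed in a few lines. The sketch below is an illustration in Python and not part of the application (the percentages in this project were worked out by hand from the filtered CSV file); the function name `persona_percentages` and the sample rows are assumptions made for the example.

```python
from collections import Counter, defaultdict

def persona_percentages(rows):
    """rows: iterable of (prodcat_id, cluster) pairs taken from the results file.

    Returns {cluster: {prodcat_id: percentage of that cluster's rows}}.
    """
    counts = defaultdict(Counter)
    for prodcat_id, cluster in rows:
        counts[cluster][prodcat_id] += 1  # instances of each category per cluster
    result = {}
    for cluster, counter in counts.items():
        total = sum(counter.values())  # aggregate total for the cluster
        result[cluster] = {cat: 100.0 * n / total for cat, n in counter.items()}
    return result

# Hypothetical rows: three 1001s and one 1002 in cluster 1, two 1003s in cluster 2.
rows = [(1001, 1), (1001, 1), (1001, 1), (1002, 1), (1003, 2), (1003, 2)]
percentages = persona_percentages(rows)
```

Each percentage is the number of instances of a category in a cluster divided by the total number of instances in that cluster, multiplied by 100.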
  • 45. The formula I used to work out the percentage was relatively straightforward. After aggregating the total amount, I calculated the instances of each variable as a percentage of the total amount within the cluster. For example, 1001 (Grocery) has 2050 instances in cluster 1; I run that number against the total number of instances in cluster 1 using an online percentage calculator. Figure 21 – Percentage Calculator Example 6.2 Results Summary To be able to identify personas, thus meeting my aim, I conducted some tests on my own data subset (Figure 11). The first test I ran was with K (number of clusters) set to 3, which is the optimum number of clusters for this dataset (see Figure 10). After mining the raw data based on the method stated above, the following results were found: Figure 22 - Persona Percentage Results (Test 1)
  • 46. From the results found I can say that the GROCERY persona was the most consistent and populous persona in the dataset, averaging around 60-65% in terms of persona percentage. The next most common persona was the DRUG GM persona, averaging around 10-11% persona percentage. This tells me that the dataset is heavily populated with GROCERY personas, with few other personas represented to any great extent. To validate this finding I ran the application again on the same dataset, this time with K = 4. The results were as follows: Figure 23 - Persona Percentage Results (Test 2)
  • 47. From this particular test I can see a clear similarity with the first test I conducted with K set at 3. I can deduce that the GROCERY persona is averaging between 63-66% persona percentage across the 4 clusters, which is very similar to the first test run. The DRUG GM persona keeps its mark with around 10% persona percentage, with PRODUCE coming in at around 9-10% on average. This indicates to me that the dataset is densely populated with GROCERY personas. 6.3 Evaluation As previously mentioned in chapter 3.8.1, I have chosen to use the Nielsen heuristics to evaluate the usability of the application interface. To go about this I have used a System Usability Scale questionnaire, which was developed by John Brooke (Brooke, 2011). The questionnaire itself is ten questions long and is scored on a Likert scale (1 = Strongly disagree, 5 = Strongly agree); if the participant is uncertain of an answer then they select 3. The reason for choosing this questionnaire is that the questions asked are similar to Nielsen’s ’94 heuristics, which is what I had planned to use to evaluate the system to begin with. In addition, using a Likert scale makes the questionnaire more coherent and easier for the participants to complete, thus saving time (Dane Bertram, 2012). Below is an example of the questionnaire that will be given to the participants: Figure 24 - System Usability Questionnaire
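For reference, a System Usability Scale questionnaire is conventionally reduced to a single score between 0 and 100. The sketch below shows that standard scoring rule in Python; it is an illustration only and is not necessarily how the responses in this project were summarised. The function name `sus_score` is an assumption made for the example.

```python
def sus_score(responses):
    """Standard SUS score (0-100) from ten Likert responses, each an integer 1..5."""
    if len(responses) != 10 or any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("expected ten responses, each an integer from 1 to 5")
    total = 0
    for item, r in enumerate(responses, start=1):
        # Odd-numbered items are positively worded: contribution is (response - 1).
        # Even-numbered items are negatively worded: contribution is (5 - response).
        total += (r - 1) if item % 2 == 1 else (5 - r)
    return total * 2.5  # scale the 0-40 total up to 0-100

# A participant who selects the neutral option (3) for every question.
neutral = sus_score([3] * 10)
```

Because the even-numbered questions are negatively worded, a raw average of the ten responses is harder to interpret than this combined score.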
  • 48. 6.3.1 Participant selection Selecting the number of participants to evaluate the application is very important, especially for this project. In an ideal world, the more evaluators I have the better, as different evaluators can pick up different usability issues. However, according to Nielsen the optimum number of evaluators for a software system is 5, or at least 3 (Nielsen, 1995). Figure 25 - Graph showing the optimum number of evaluators The above figure (Figure 25) shows the number of evaluators against the proportion of usability problems found. I can see here that 5 evaluators can find 75% of usability problems. 6.4 Black-Box Testing Black-box testing is a form of functional testing which aims to test whether the software developed does what it is supposed to do (Williams, 2006). The way I went about this was to create a questionnaire based on the functional requirements, which the same participants who are testing the usability would have to fill out.
  • 49. Figure 26 - Functional Test Questionnaire The reason I chose to design the questions this way (Figure 26) was to be able to gauge whether or not the functional requirements have been met with a straightforward yes or no response. This has a direct knock-on effect, as the outcome of this questionnaire will indicate to me how far I have gone in meeting the user requirements. 6.5 Evaluation Results After the evaluation was completed I collated all the results from the questionnaire and produced a bar chart from them to give a visual representation of the evaluation results. The first thing I did was to put all the answers from each participant in a table, which can be seen below (Figure 27). Figure 27 - Table of Usability Questionnaire Results After this I was able to construct a bar chart using Excel. Figure 28 - Bar Chart of Usability Questionnaire Results To make the output more meaningful I aggregated the results and drew up a bar chart to give a visual representation of the average score of the usability questionnaire.
  • 50. Figure 29 - Bar Chart showing average usability questionnaire results 6.6 Black Box Testing Results As previously stated, the system functionality testing (black box) was conducted concurrently with the usability testing. Everyone who took part reported that they were able to execute all the functionality that the system offered. The results are illustrated below in Figure 30. Figure 30 - Results of System Functionality Questionnaire 6.7 Evaluation Summary To conclude this chapter I can say that the usability and system evaluation was highly successful, in particular the black-box testing. The response from all 5 subject experts who conducted the evaluation was highly positive, which tells me that, from an expert point of view, the application is very usable and does what it set out to do. On the functionality side, 5/5 evaluators answered YES to all 7 functionality questions (Figure 30). This tells me that the system functionality is fit for purpose and, crucially, it validates the customer requirements set out in Chapter 4.
  • 51. 7 Conclusion This dissertation has covered a lot of topics as well as fresh, novel ideas such as persona identification. However, it’s important to be able to competently draw conclusions from the findings of this project, appraising the positives found and offering constructive critique of the weaker aspects of the dissertation. 7.1.1 Aim - Identify individual personas from prosumers personal information. To answer this question, I can say that I was able to identify individual “personas” from prosumer data; however, there were issues that I came across in doing so. The first issue was the strength of the persona. The main persona found in the dataset tested was the GROCERY “persona”, which could be deemed by some analysts as too vague or not in-depth enough. Through my own investigation into this perception I found that a much deeper pre-processing method, e.g. using sub-product categories instead of main product categories, would be required in order to draw out more ‘features’ within the clusters. This would help facilitate more diverse and meaningful “personas”. It’s important to stress that this could have been achieved within the boundaries of this particular project; however, at the time I believed that deriving personas from main product categories, i.e. grocery, produce, nutrition etc., would be a better way of obtaining good individual personas. In hindsight, I believe a deeper pre-processing method would have produced more meaningful personas. Nevertheless, I believe this shouldn’t take away from the fact that I was able to identify individual “personas”, which was the ultimate aim of this dissertation.
7.1.2 Objective 1 - Undertake a state-of-the-art literature review to inform, create a design specification for an identifying personas/Investigate in greater detail the pros and cons of clustering with reference to appropriate literature To conclude this objective, I can confidently say that a state-of-the-art literature review was undertaken (see Chapter 2), carefully analyzing two of the main clustering methods (hierarchical and partitioning), drawing out their advantages and disadvantages and relating them back to how they would impact the aim of this project. In addition, I looked into the importance of personal data and how it has risen to be the new “oil”. I also looked at the rise of the digital prosumer, in particular how prosumption is poised to take over typical consumption, lending credence to Toffler’s prediction that prosumption would take over consumption by the turn of the 21st century. This all provided the necessary justification for undertaking the project and exposed the potential value in building an application that can identify personas.
  • 52. In essence I believe this objective was met to a high standard, making use of various white literature. This subsequently enabled me to create a design specification for my application. 7.1.3 Objective 2 - Build a persona identification application. This particular part of the project was by far the most challenging yet the most rewarding. First, I was tasked with choosing the appropriate software environment in which the application would be coded; once this was decided, the code development began. Although this was a very tedious task, involving numerous failed attempts and heavily bugged versions, a final version was created, bringing to life all the research and personal hypotheses set out at the beginning of the project (see Chapter 5). Overall I was hugely satisfied with the implementation of the application. Despite the fact that it took a huge amount of time and resources to put together, I believe it was a very strong and well put together application that was indeed fit for purpose. 7.1.4 Objective 3 - Evaluate the application. The final part of this dissertation required me to evaluate the application, not only to provide validation against my aim but also to validate the customer requirements defined in Chapter 4. I went about this by first evaluating the usability of the system; this was done via a questionnaire which was very heavily influenced by the Nielsen heuristic principles. After this a black-box test was put together to evaluate the functionality of the application. Both tests were a huge success; as I was using experts to evaluate the system, there was a lot of extra scrutiny placed on both the usability and the functionality. The feedback was highly positive, which went a long way in validating my aim and user requirements.
(See Chapter 6) 7.2 Future Development One of the most common pitfalls in any project is to neglect the things that haven’t been done, due to time or resources, and to over-emphasise the things that have been achieved. I believe that there is a world of benefits to be unlocked once we can sit back and look at what can be developed in the future to make this project even better. There are a number of things that can be achieved with future work/development that would enhance the application even further. The first is obviously a much deeper pool of personas, which was explained in the chapter. Another future development would be adding more algorithms to the application instead of just the single K-Means. This was explained in more detail in Chapter 2.8. Another development would be the ability to put the application on a server and connect it to a database. This would enhance the application even more, as it would mean that data from the data lockers could be stored in the database and be called into the application via a database query etc., making the application more robust, expanding the