This document summarizes a dissertation that aims to identify personas from personal data through clustering analysis. The dissertation will conduct a literature review on clustering techniques, develop a persona identification application using R or Matlab, and evaluate the application. Key areas covered include digital prosumers, data mining, clustering methods (partitioning and hierarchical), and software development lifecycles. The application will analyze a dataset to cluster the data and deduce personas. The evaluation will assess if the application successfully meets the aim of identifying personas from personal information.
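Although the dissertation proposes R or Matlab for the application, the core idea of clustering personal data into personas can be sketched in Python. The features and data below are hypothetical stand-ins for real personal data, and the k-means routine is a minimal illustration rather than the dissertation's actual implementation:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: assign each point to its nearest centroid,
    then recompute centroids, for a fixed number of iterations."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Index of the centroid closest to p (squared Euclidean distance)
            i = min(range(k), key=lambda c: sum((a - b) ** 2
                                                for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # New centroid = mean of its cluster (keep old centroid if cluster is empty)
        centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical personal-data features: (hours online per day, purchases per month)
users = [(1, 0), (2, 1), (1.5, 0.5), (8, 10), (9, 12), (7.5, 9)]
centroids, clusters = kmeans(users, k=2)
# Each resulting cluster can be read as a candidate persona,
# e.g. "light browsers" vs. "heavy shoppers".
```

In a real evaluation, the deduced personas would be validated against the dataset's known characteristics rather than eyeballed.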
The document discusses defining and conceptualizing data journalism. It explores what journalism and data are, and considers data journalism as a process-oriented form of journalism where data is a key source. The document suggests defining data journalism as a journalistic practice influenced by the need to engage with data as a source. It examines how the data journalism process may differ from traditional forms of journalism.
How Data Loss Prevention End-Point Agents Use HPE IDOL’s Comprehensive Data C... (Dana Gardner)
Transcript of a discussion on how cybersecurity attacks are on the rise but new capabilities are being brought to the edge to provide for better data loss prevention.
This document discusses data mining techniques for big data. It defines big data as large, complex collections of data from various sources that contain both structured and unstructured data. Big data is growing rapidly due to data from sources like social media, sensors, and digital content. Data mining can extract useful insights from big data by discovering patterns and relationships. The document outlines common data mining techniques like classification, prediction, clustering and association rule mining that can be applied to big data. It also discusses challenges of big data like its huge volume, variety of data types, and rapid growth that require new data management approaches.
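As a concrete illustration of one technique listed above, association rule mining reduces to computing support and confidence over a set of transactions. The baskets below are invented for the example; real big-data pipelines would compute the same quantities at scale:

```python
# Hypothetical transactions (e.g. shopping baskets extracted from big data)
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction
    that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

# Rule {bread} -> {milk}: bread appears in 3 of 4 baskets,
# and 2 of those 3 also contain milk.
print(support({"bread"}))               # 0.75
print(confidence({"bread"}, {"milk"}))  # 0.666...
```

Algorithms such as Apriori make this tractable on large datasets by pruning itemsets below a minimum support threshold.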
The document discusses various aspects of managing virtual teams and doing business in the digital age. It begins by describing how virtual teams rely heavily on communication technologies and how building trust is important given the lack of in-person interaction. It then discusses some of the challenges virtual team leaders face, such as developing trust, effective communication patterns, and managing distributed teams. The document also provides tips for how managers can lead virtual teams effectively, such as picking the right people, focusing on communication, building trust, motivating team members, understanding challenges, and providing support.
The document discusses the concept of "guerrilla research" through exploring various definitions and examples. It considers guerrilla research as potentially unconventional research that challenges accepted norms or paradigms through novel hypotheses or untried techniques. Small grants are discussed as a traditional form of smaller-scale research that still requires considerable overhead to submit bids. The etymology of "guerrilla" suggests guerrilla research may refer to "little research" on a smaller, resistance-like scale compared to larger conventional projects.
Three key points:
1. There are three emerging capability areas for cognitive computing: engagement, decision making, and discovery. Engagement systems change human-computer interaction, decision systems make evidence-based decisions, and discovery systems find new insights.
2. Case studies show how cognitive computing is being used by organizations like USAA, WellPoint, and Baylor College of Medicine to improve customer service, clinical decision making, and medical research.
3. The future evolution of cognitive computing will be influenced by six forces: technology advances, societal acceptance, information growth, perceptions, skills availability, and policies. Balancing these forces will impact adoption rates.
Semantic Web Mining of Un-structured Data: Challenges and Opportunities (CSCJournals)
The management of unstructured data is widely acknowledged as one of the most critical unsolved problems in data management and business intelligence, largely because the methods, systems and tools that have proved so successful at converting structured information into business intelligence are ineffective when applied to unstructured information. New methods and approaches are needed. Organizations across the world share huge amounts of information over the web, and this information explosion has opened many new avenues for building data management and business intelligence tools focused on unstructured data. In this paper, we explore the challenges faced by information system developers when mining unstructured data in the context of the semantic web and web mining. Opportunities arising from these challenges are discussed towards the end of the paper.
How are machine learning and artificial intelligence revolutionizing insurance?
This presentation explains it briefly, including current trends and effects on the business.
This document discusses huge data and data mining. It defines huge data and notes that huge amounts of data are being created daily from sources like social media, sensors, and digital content. It discusses some key aspects of huge data including that it can be structured or unstructured, comes from decentralized sources, and has complexity in relationships within the data. The 3Vs of huge data are also defined as volume, variety, and velocity. The document states that data mining techniques can be used to extract useful insights from huge data by discovering patterns and relationships within large datasets.
The Pew Research Center’s Internet & American Life Project and Elon University’s Imagining the Internet Center asked digital stakeholders to weigh two scenarios for 2020, select the one most likely to evolve, and elaborate on the choice. One sketched out a relatively positive future where Big Data are drawn together in ways that will improve social, political, and economic intelligence. The other expressed the view that Big Data could cause more problems than it solves between now and 2020.
This document discusses making data more accessible to society through open data, communication, and technology. It begins by introducing an online discussion on opportunities and challenges of using open data, data visualization, and other technology approaches.
It then discusses three main ways of making data more accessible: open data, which freely shares data for public use; communication, where data is explained through storytelling and visualization to broad audiences; and interactive technology, like apps and crowdsourcing, that enable public participation as data producers. Examples like Mappiness and OpenStreetMap demonstrate how crowdsourced data can benefit society.
The document provides context for an online discussion on these topics from June 11-24, 2014 and invites participation from both experts
I delivered this shorter version of my Gov. Transformation Through Public Data presentation at the Personal Democracy Forum 2008 in June.
(watch in full screen mode to read the narration). While this version concentrates on government, IMHO the same tools are valid for corporations, with similar benefits, as part of an Enterprise 2.0 strategy.
Overview of data collection methods and a deep dive on data (primary vs. secondary, qualitative and quantitative). Bias. Data processing and structured, unstructured, semi-structured data. Database jargon.
Transforming policy skepticism into policy co-makership (Thei Geurts)
This document discusses the increasing complexity of society and government policies and regulations. It argues that the complexity has led to problems like citizens and businesses losing track of which rules apply to them, increased administrative burden, and difficulties for politicians developing effective policies. The document proposes that using technology to convert laws and regulations into formal models could help increase transparency and allow citizens, businesses, and government agencies to better understand how policies apply in specific situations. This may help reduce skepticism about government and engage citizens and businesses in policy development. Four research tracks are proposed to further explore using policy modeling.
The document discusses the promise and perils of big data. It describes how large datasets and advanced correlation techniques are now being used by companies like Google, credit card companies, and medical researchers to gain insights. However, critics worry that big data could be misused and threaten privacy. The Aspen Institute convened experts to discuss these issues, including how to interpret big data, whether scientific models are still needed, and the implications for business, government and society. While big data offers opportunities, its rise also poses new challenges around privacy, ethics and fair competition that need to be addressed.
Linked Data and examples, why they matter. Data driven strategies. Data mining: laws and applications. Data aggregation and fundamentals of data representation (table, bar chart, histogram, pie chart, line graph, scatter plot). Data science definition and job roles (who does what).
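The representation fundamentals listed above all start from simple frequency aggregation. A minimal text histogram, using hypothetical survey responses, shows the idea:

```python
from collections import Counter

# Hypothetical survey responses to aggregate and chart
responses = ["yes", "no", "yes", "yes", "unsure", "no", "yes"]

counts = Counter(responses)
for answer, n in counts.most_common():
    # A bar chart in its most basic form: one symbol per observation
    print(f"{answer:<7}{'#' * n}  ({n})")
```

The same counts feed directly into a bar chart or pie chart in any plotting library; the choice of representation depends on whether the data are categorical, ordered, or continuous.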
Einstein published his ideas and became a pivotal element in shifting the way we think about physics - from the Newtonian model to the Quantum - in turn this changed the way we think about the world and allowed us to develop new ways of engaging with the world.
We are at a similar juncture. The development of computational technologies allows us to think about astronomical volumes of data and to make meaning of that data.
The mindshift that occurs is that “the machine is our friend”. The computer, like all machines, extends our capabilities. As a consequence the types of thinking now required in industry are those that get away from thinking like a computer and shift towards creative engagement with possibilities. Logical thinking is still necessary but it starts to be driven by imagination.
Computational thinking and data science change the way we think about defining and solving problems.
The age of creativity - which increasingly extends its impact from arts applications to business, scientific, technological, entrepreneurship, political, and other contexts.
The document summarizes James LoBuono's interview about the growing demand for data scientists. Some key points:
- Data science skills are in high demand across industries due to increased data availability from sensors and cloud computing.
- Data scientists are needed to extract useful information from messy, unstructured data sources to aid decision making.
- Programming languages like Python and tools like machine learning are commonly used in data science roles.
- Data science can help solve business problems and unlock opportunities by making decisions based on data analysis rather than intuition alone.
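The point above about extracting useful information from messy, unstructured sources can be sketched briefly. The log lines, sensor names, and pattern below are hypothetical; the technique is simply regex-based parsing into structured records followed by an aggregate that supports a decision:

```python
import re
from collections import defaultdict

# Hypothetical messy sensor readings, as free text rather than a clean table
raw = [
    "sensor-A: temp=21.5C  2014-06-11",
    "sensor-B temp = 19.0 C (2014-06-11)",
    "sensor-A:temp=22.1C 2014-06-12",
]

# Tolerate varying spacing and punctuation around the temperature field
pattern = re.compile(r"(sensor-\w+).*?temp\s*=\s*([\d.]+)")

readings = defaultdict(list)
for line in raw:
    m = pattern.search(line)
    if m:
        readings[m.group(1)].append(float(m.group(2)))

# Structured per-sensor averages to support a data-driven decision
averages = {sensor: sum(v) / len(v) for sensor, v in readings.items()}
print(averages)
```

Real data science work layers statistics and machine learning on top, but this parse-then-aggregate step is where messy sources become analyzable.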
ST&I National Information System Platform: the Brazilian case of Lattes (Roberto C. S. Pacheco)
In this presentation we address why innovation funding data are generally too poor to support strategic studies: they typically come from information systems designed to support only part of the processes of a national (or regional) innovation system. We present the main lessons learned from the Brazilian ST&I system projects, particularly Lattes and Portal Inovação.
The research design adopted for this study is descriptive. The population includes employees from Nomura Research Institute, TCS and Indigo, and the sampling frame covers employees across different departments and designations of these three companies. The sample size is 60 employees, 20 from each company, selected by convenience sampling.
The document discusses a survey conducted on the topic of "Internet & E-Learning" to gain a clear understanding of the internet industry and e-learning platforms. It includes an introduction to the internet and e-learning, overview of major internet companies, recent trends in the internet industry, and the research methodology used for the survey including data collection methods, sampling techniques, and scope of the research.
Infrastructure as Destiny — How Purdue Builds a Support Fabric for Big Data E... (Dana Gardner)
Transcript of a discussion on how Purdue University provides IT as a service, using big data and the IoT technologies, to support such worthy goals as student retention analysis.
The document summarizes the emerging opportunities and challenges around personal data as a new asset class. It outlines how personal data is being generated at unprecedented scales from various sources. However, the current personal data ecosystem remains fragmented without common standards or principles. The summary identifies key stakeholders in the ecosystem, including individuals, private sector companies, and governments, and notes they each have different and sometimes conflicting needs and interests. It argues a balanced ecosystem can be achieved by adopting an end-user centric approach that empowers individuals and aligns all stakeholders around common goals of trust, transparency and value creation.
The ICONTEC technical standard 5854 establishes three conformance levels (A, AA and AAA) for the accessibility requirements that websites in Colombia must meet in order to remove barriers to information access for people with disabilities. The principles governing the standard are that information be perceivable, operable, understandable and robust. The document also describes some assistive software for people with visual and hearing disabilities.
The document presents a weekly school timetable, with activities such as Spanish, Mathematics, Exploration of Nature, and Civic and Ethical Education in the mornings, and Artistic Education and "Activities to Start the Day Well" in the afternoons.
This document is a quotation for products to improve a pet's health. The quotation includes a list of items with codes, quantities, descriptions and unit prices, followed by the subtotal, VAT, discounts and total. It is provided by the veterinarian Benji, located in Fusagasugá, Colombia.
To create an icon, right click on an image and select "Create Icon". To create a folder, right click on the desktop, select "New", and then select "New Folder".
Haiku Deck is a presentation tool for creating haiku-style slideshows. In just a few sentences, it pitches the idea of using Haiku Deck to easily create visually engaging slideshows that can be shared on SlideShare.
The document describes an evaluation of a research workshop. The workshop aimed to teach students basic research skills. The evaluation found that the workshop met its objectives of introducing students to the research process and developing their skills in formulating research questions and finding relevant information.
"Pon de tu parte" campaign to be launched to fight climate change (Perú 2021)
Andina.com is a Peruvian news website. It covers political, economic and social topics in Peru and Latin America, offering the latest news, analysis and reporting to keep readers informed about the most important events in the region.
This document presents information about computers, including the main computer brands, their advantages as communication and information tools, and their potential disadvantages, such as health effects and the facilitation of illegal activities.
A new general manager was hired to make a company more productive. During a warehouse inspection, he gave a month's salary to an idle young man and sent him away, not knowing that the young man was delivering pizzas and did not work at the company. The workers then revealed the truth to the manager.
Colegio Nacional Pomasqui is a school in Ecuador; the document gives the names and grade level of two 10th-grade students, Fernanda Intriago and Nicole Gavilanez, enrolled in class "B".
This document describes the services of Logisfashion, a leading logistics company for the fashion industry. Logisfashion handles more than 50 million garments per year in its logistics centers and offers services in Spain, Mexico, Chile and China. The company uses advanced technology such as WMS software and RFID systems to improve the efficiency of its clients' logistics processes in the fashion industry.
This document appears to be a presentation on big data and analytics. It includes slides on topics like how big data is measured, where it comes from, how it will impact learning systems, and examples of big data in areas like social networks, wikis, and recommendations. It also includes slides on techniques like linear regression, stochastic gradient descent, and responses from students on big data and their interest in seeing it incorporated into courses.
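The stochastic gradient descent technique named above can be shown in a few lines for linear regression. The data and learning rate below are illustrative, not taken from the presentation:

```python
# Fit y ≈ w*x + b by stochastic gradient descent on squared error,
# updating on one example at a time.
# Hypothetical noiseless data generated from y = 2x + 1.
data = [(x, 2 * x + 1) for x in [0.0, 0.5, 1.0, 1.5, 2.0]]

w, b = 0.0, 0.0
lr = 0.1  # learning rate
for epoch in range(200):
    for x, y in data:
        err = (w * x + b) - y
        # Gradients of 0.5 * err**2 with respect to w and b
        w -= lr * err * x
        b -= lr * err

print(round(w, 3), round(b, 3))  # converges toward w = 2, b = 1
```

Unlike batch gradient descent, each update here uses a single example, which is what makes the method scale to datasets too large to fit in memory.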
A Guide to Data Innovation for Development - From idea to proof-of-concept (UN Global Pulse)
‘A Guide to Data Innovation for Development - From idea to proof-of-concept,’ provides step-by-step guidance for development practitioners to leverage new sources of data. It is a result of a collaboration of UNDP and UN Global Pulse with support from UN Volunteers.
The publication builds on successful case trials of six UNDP offices and on the expertise of data innovators from UNDP and UN Global Pulse who managed the design and development of those projects.
The guide is structured into three sections - (I) Explore the Problem & System, (II) Assemble the Team and (III) Create the Workplan. Each of the sections comprises of a series of tools for completing the steps needed to initiate and design a data innovation project, to engage the right partners and to make sure that adequate privacy and protection mechanisms are applied.
Data for Impact Fellowship - SocialCops CareersSocialCops
The Data for Impact Fellowship is a unique opportunity where fellows partner with leaders in government, bilateral organizations, foundations and nonprofits — ranging from Ministers, CEOs and District Collectors — to implement a scalable data intelligence solution. The Fellowship seeks to bring together young, enterprising future leaders with experienced leaders in the development sphere to use the power of data to solve some of India's most critical problems.
For more details about the Fellowship and to get started on your application, visit http://soco.ps/2BHK6Ba!
Crowdsourcing is an online, distributed problem solving and production model that revolutionized the internet and mobile market at present. It turns the customers into designer and marketers. The practice of Crowdsourcing is transforming the web and giving rise to a new field. Today the leading enterprises are embracing the next paradigm shift in the distribution of work by outsourcing to the crowd in the cloud. Everyday millions of people make all kind of voluntary online contribution. With the number of people online approaching 3 billion by 2016 and projected to reach 5 billion by 2020, new workforce has emerged that are now used for different purposes. Available on-demand this workforce has abundant capacity and the expertise knowledge to perform work from simple to complex and solve problems and grand challenges. This paper gives an introduction to Crowdsourcing, its theoretical grounding, model and examples with case study. In this paper we show that Crowdsourcing can be applied to wide variety of problems and that it raises numerous interesting technical and social challenges. Finally this paper proposes an agenda for using Crowdsourcing in NLP.
A STUDY- KNOWLEDGE DISCOVERY APPROACHESAND ITS IMPACT WITH REFERENCE TO COGNI...ijistjournal
As we all know, in the current era, Internet of Things (IOT) word is very booming in technological market and everyone is talking about the term Smart city especially in India and with reference to keyword smart city, IOT comes with it. The Small word IOT but very big responsibility comes on the shoulders of the technical person to Play with it and extract the data from the IOT . IoT its connecting the multiple things this interconnection is in between living as well as non living things and in that communication huge amount of data is generated so tools and technique which are used for knowledge discover we discuss in this paper.
Internet of Things (IOT) and knowledge discovery are the two sides of the coin and both go together. In the absence of one, there is no use of other. This Paper also focuses on types of the data and data generative sources, Knowledge discovery from that data, tools which are useful for the discovery of the knowledge. Technique, which are to be followed for the purpose of discovering meaningful data from the huge amount of data and its impact.
This document discusses how human beings can play an important role in making sense of big data beyond just visualization. It presents a case study where students transformed a large dataset into a visual language and "text" that could be interpreted. The document argues that current sense-making models are too technology-centric and that meaningful interpretation emerges from collaboration between algorithms, data, and human beings. Human perceptual abilities allow them to recognize patterns where computers see only numbers.
This document discusses the opportunities presented by big data for international development. It notes that innovations in technology have led to an explosion in the quantity and diversity of digital data being generated in real-time. This data holds potential to track development progress and understand how policies impact vulnerable populations. However, turning large and complex digital datasets into actionable information requires using computational techniques to identify trends and patterns. While big data presents opportunities, questions also remain regarding its analytical value, policy relevance, and privacy implications when used in development contexts. Overall, big data could complement traditional data sources and help narrow information gaps, but human expertise is still needed to properly analyze and interpret digital data.
The document discusses how big data and analytics tools like Explora can provide valuable insights for organizations. Explora analyzes structured and unstructured data using natural language processing to understand concepts, sentiment, and relationships. It can analyze brand perception, customer attitudes, and competitive intelligence across various data sources. Explora produces insights that help organizations better understand their stakeholders and make more informed business decisions.
THE EMERGENCE OF COGNITIVE DIGITAL PHYSICAL TWINS
AS THE 21ST CENTURY ICONS AND BEACONS
AN IN-PROGRESS VISION, KEY CATEGORIES, APPLICATIONS
AND
A REFERENCE ARCHITECTURE FRAMEWORK
Published in Nov. 2016. However, it evolved over time using my own practical experience as well as the incorporated the different technological advances we achieved since then.
I added the concept of Cognitive Digital Thread as a framework to collect and manage data and knowledge required for the twins. Also, the concept of Cognitive Digital Swarm has been developed to be the HM & MM collaboration framework.
The Matchbox is an accelerator program that provides strategic and material support to advocacy organizations working at the intersection of technology and transparency. It offers idea refinement, project planning, matching with external experts, and fundraising support. The Matchbox focuses on Latin America and Southern Africa and emphasizes a thoughtful process for technology projects that includes assessing assumptions, exploring existing solutions, developing a pilot plan, user testing, and iterative development. It aims to strengthen projects through preparation, an emphasis on local needs over international support, and consideration of both human and technical perspectives.
This document discusses the importance of data fluency skills in the 21st century. It defines key terms like data science, machine learning, data literacy, and statistical literacy. While these fields require extensive training, the document argues that domain expertise combined with basic data analysis skills can solve many problems. These basic skills include understanding data structures, using programming to interact with data, and exploratory data analysis through visualization. The data analysis process involves defining problems, collecting and preparing data, visualization and modeling, and communicating results. RStudio is presented as a tool that can support the entire data analysis process within a single integrated development environment.
Transcript of Webinar: Data management plans (DMPs) - audioARDC
Video and slides available via: http://www.ands.org.au/news-and-events/presentations/2017
Have you implemented a Data Mangement Plan (DMP) tool at your institution or are you currently involved in discussions to implement one? Woudl you like to connect with others who are involved in implementing DMPs? Then this webinar is for you!
This webinar brings together those involved in planning or implementing DMP to exchange information and explore ideas around DMP.
Information System Essay
Essay about Information Is Power
Information Literacy
Essay Information Management
Information and Knowledge Essay
Information Literacy Essay
The Importance of Information Literacy Essay
Reliable Information
Information Technology Essay
Information Technology Essay
Importance Of Information Literacy Essay
Information Literacy
Information Literacy
Information Literacy Examples
Information Literacy Paper
Information Security Essay
Explain The Sources Of Information Literacy
Information Based Decision Making Essay
The document provides a summary of talks and trends from the SXSW 2015 conference related to technology and its impact on society. Key topics discussed include the effect of technology on cognition and memory, quantified self-tracking and health data, extreme bionics, pursuing computer immortality, the future of cybercrime, and debates around transhumanism and human augmentation. Overall the document aims to capture major discussions, innovations, challenges, and implications that emerged around human-technology interactions from the conference events and speakers.
Implementation of Mobile Information Systems in Organizations: Practical StudyVinícius Caixeta
The document discusses mobile information systems, including their definition, evolution using mobile banking as an example, and advantages and disadvantages. It then presents a practical study conducted with a Chinese sports goods company that implemented mobile systems. Questionnaires with employees found the systems increased efficiency by providing real-time access to sales and inventory data from any location. While some employees initially resisted change, training helped adoption and now the systems are relied upon. In conclusion, mobile systems are challenging to implement but provide indispensable benefits by streamlining organization and interpretation of data.
Big Data for International DevelopmentAlex Rascanu
Alex Rascanu delivered the "Big Data for International Development" presentation at the International Development Conference that took place on February 7, 2015 at University of Toronto Scarborough.
Similar to Digital Prosumer - Identification of Personas through Intelligent Data Mining (Data Science) (20)
1. Digital Prosumer - Identification of Personas through Intelligent Data Mining (Clustering)
1
Department of Information Systems and Computing
BSc (Hons) Information Systems (Business)
Academic Year 2013 – 2014
Digital Prosumer - Identification of Personas through Intelligent
Data Mining (Clustering)
Adebowale Nadi
1008089
A report submitted in partial fulfilment of the requirements for the degree of
Bachelor of Science
Brunel University
Department of Information Systems and Computing
Uxbridge
Middlesex
UB8 3PH
United Kingdom
T: +44 1895 203397
F: +44 (0) 1895 251686
Abstract
The main objective of this paper is to explore the idea of prosumption and how the digital
personhood data that we produce can be extracted, filtered, analysed and given back to
us [prosumers] in a commodifiable form, subsequently empowering citizens to utilize the
data that they produce. One aspect of this hypothesis is the identification of personas through
clustering, which is a facet of intelligent data analysis. The end product of this work is a
Persona Identification Application (PIA) whose sole purpose is to deduce personas
from data stores.
In 2011 it was estimated that 274.2 million Americans were connected to the internet,
leading to 81 billion minutes being spent on social networking sites and blogs. In the same
year 117.6 million people accessed the internet via a mobile phone, accounting for $246 billion
spent making online purchases (Palis, 2012). The well-renowned management consultancy
firm Boston Consulting Group projects that the internet economy will contribute $4.2 trillion
to total G20 GDP by 2016. This led co-author David Dean to emphasise that “If it were a
national economy [internet economy], it would rank in the world’s top five, behind only the U.S.,
China, India, and Japan, and ahead of Germany” (Dean, 2012). With the rise of the internet
economy, coupled with the growing number of mobile devices connected to the internet
facilitating an unprecedented amount of data being held, intelligent data analysis needs to be
used to isolate the key information, thus producing personas that can later be
traded on a futures market.
This paper will look at the rise of the internet economy coupled with the emergence of the
digital prosumer. In addition, clustering will be examined in fine detail, looking at the various
clustering techniques that could be used in the proposed application and weighing the
advantages and disadvantages of each before deciding on the appropriate method
for this project. Furthermore, this paper will detail the step-by-step implementation of the
application, including all the design and requirements analysis that took place beforehand.
Finally, a detailed evaluation will be explained and executed, relaying the findings from the
application and assessing whether, in fact, the application meets the aim in a coherent and
comprehensible manner.
Acknowledgements
First and foremost I would like to take this opportunity to thank my Lord Jesus Christ for
guiding me through this project and giving me the strength to be able to conclude this
dissertation. I would also like to thank my Mum & Dad for their indubitable and
unconditional support given to me throughout my time working on this project. In addition,
to all the people who helped, supported and assisted me in any way, shape or form in putting this
dissertation together, I would like to extend my personal thanks and sincere gratitude.
(There are too many to name personally, but they know who they are.) Last but
certainly not least, I would like to personally thank my supervisor Panos Louvieris and his
assistant Natalie Clewley for all their support rendered to me throughout this project. This
dissertation was, no doubt, the biggest challenge I have faced in all my 19 years in education,
but definitely the most rewarding, learning a highly complex topic (data mining) and learning
to code in a completely new software environment with no prior experience. I truly wouldn’t
have been able to complete it without their guidance, assistance and motivation. In closing I
would like to wish Panos and his team the best of luck in completing their EPSRC sponsored
project Digital Personhood: Digital Prosumer.
Total Words: 15,500
I certify that the work presented in the dissertation is my own unless referenced.
Signature Adebowale Olatunde Nadi
Date 24/03/2014
Table of Contents
Abstract...........................................................................................................................................................................2
Acknowledgements.................................................................................................................................................... 3
Table of Contents........................................................................................................................................................ 4
List of Tables.................................................................................................................................................................7
List of Figures............................................................................................................................................................... 7
1 Introduction ........................................................................................................................................................ 9
1.1 Problem Definition..................................................................................................................................9
1.2 Aims and Objectives............................................................................................................................... 9
1.3 Project Approach.................................................................................................................................. 10
1.4 Dissertation Outline ............................................................................................................................ 11
2 Literature Review .......................................................................................................................................... 12
2.1 Personal Data......................................................................................................................................... 12
2.2 Value of Personal Data ....................................................................................................................... 12
2.3 The Internet [Digital] Economy...................................................................................................... 13
2.3.1 Midata .................................................................................................................................... 13
2.3.2 Information Economy Strategy (IES)........................................................................ 13
2.4 What is a Persona?............................................................................................................................... 14
2.5 What is a Prosumer? ........................................................................................................................... 14
2.5.1 The Rise of the Digital Prosumer................................................................................ 15
2.6 Data Mining............................................................................................................................................. 15
2.6.1 Knowledge Discovery from Data [KDD] .................................................................. 16
2.7 Cluster Analysis..................................................................................................................................... 17
2.7.1 Partitioning Technique................................................................................................... 17
2.7.2 Advantages and Disadvantages................................................................................... 17
2.7.3 Hierarchical Technique................................................................................................... 18
2.7.4 Advantages and Disadvantages................................................................................... 18
2.8 Critical Discussion................................................................................................................................ 19
2.9 Summary.................................................................................................................................................. 20
3 Methodology..................................................................................................................................................... 21
3.1 Design Science ....................................................................................................................................... 21
3.2 Positivist Approach (Positivism)................................................................................................... 22
3.3 Interpretive Approach........................................................................................................................ 23
3.4 Critical Discussion................................................................................................................................ 23
3.5 Software Development Lifecycle Models.................................................................................... 24
6.7 Evaluation Summary........................................................................................................................... 48
7 Conclusion......................................................................................................................................................... 49
7.1.1 Aim - Identify individual personas from prosumers' personal information ............. 49
7.1.2 Objective 1 - Undertake a state-of-the-art literature review to inform and
create a design specification for identifying personas / Investigate in greater detail the
pros and cons of clustering with reference to appropriate literature ..................................... 49
7.1.3 Objective 2 - Build a persona identification application................................... 50
7.1.4 Objective 3 - Evaluate the application...................................................................... 50
7.2 Future Development ........................................................................................................................... 50
Appendix A Personal Reflection........................................................................................................... 51
A.1 Reflection on Project........................................................................................................................... 51
A.2 Personal Reflection.............................................................................................................................. 51
Bibliography............................................................................................................................................................... 53
A.3 Appendices.............................................................................................................................................. 57
A.4 Appendices.............................................................................................................................................. 57
List of Tables
Table 1 – User Requirements.............................................................................................................................. 31
Table 2 - Functional Requirements.................................................................................................................. 32
Table 3 - Non-Functional Requirements........................................................................................................ 32
Table 4 - Use Case Narrative ............................................................................................................................... 33
List of Figures
Figure 1 - Fayyad KDD representation ........................................................................................................... 16
Figure 2 - Example of a word sorting dendrogram output from:
http://www.macs.hw.ac.uk/texturelab/people/thomas-methven/ ....................................... 18
Figure 3 - Design Science Guideline from MIS Quarterly Research Essay. ...................................... 21
Figure 4 - The Engineering Cycle ...................................................................................................................... 22
Figure 5- Epistemological Assumptions for Qualitative and Quantitative Research from
http://dstraub.cis.gsu.edu:88/quant/2philo.asp............................................................................. 23
Figure 6 - RAD Diagram......................................................................................................................................... 25
Figure 7 - Waterfall Model ................................................................................................................................... 26
Figure 8 - Activity Diagram of Persona Identification Application..................................................... 34
Figure 9 - Use Case Diagram of Persona Identification Application................................................... 34
Figure 10 - Import csv file plus description.................................................................................................. 36
Figure 11 – Choose variables plus description............................................................................................ 36
Figure 12 – Standardize data and run k-means plus description........................................................ 37
Figure 13 – Choose K function plus description ......................................................................................... 37
Figure 14 – Show analysis results plus description .................................................................................. 38
Figure 15 – Download results csv file plus description........................................................................... 38
Figure 16 - Screenshot of Persona Application Interface 1.0................................................................ 39
Figure 17 – Screenshot of Persona Identification Application 2.0...................................................... 39
Figure 18 – Evidence of data pre-processing Results............................................................................... 41
Figure 19 - Screenshot of results out CSV file.............................................................................................. 42
Figure 20 - Identifying Personas Breakdown .............................................................................................. 42
Figure 21 –Percentage Calculator Example.................................................................................................. 43
Figure 22 - Persona Percentage Results (Test 1) ....................................................................................... 43
Figure 23- Persona Percentage Results (Test 2) ........................................................................................ 44
Figure 24 - System Usability Questionnaire................................................................................................. 45
Figure 25 - Graph showing the optimum number of evaluators.......................................................... 46
Figure 26 - Functional Test Questionnaire.................................................................................................... 47
Figure 27 - Table of Usability Questionnaire Results ............................................................................... 47
Figure 28 - Bar Chart of Usability Questionnaire Results....................................................................... 47
Figure 29 Bar Chart showing average usability questionnaire results............................................. 48
Figure 30 - Results of System Functionality Questionnaire................................................................... 48
1 Introduction
This dissertation will be looking at the digital prosumer; in particular, concentrating on the
identification of personas gained from wholesome prosumer data stores which can be used
as valuable commodities to sell on the ‘futures’ market. I plan to execute this by identifying
specific personas from a digital vault of prosumer personal information by using intelligent
data analysis, in this case, clustering. During the course of this dissertation I expect to isolate,
analyze and categorize raw prosumer data and present it in a way where I can link it to a
persona. I also expect to find the best clustering technique through an extensive literature
review, analyzing both the advantages and disadvantages of each selected method before
coming to a conclusion on the best technique to use. I will also develop a persona
identification application, which will be used to analyze the data and sort it into clusters
that can then be classified into personas. Finally, I will undertake a
comprehensive evaluation to gauge the overall effectiveness of the application.
1.1 Problem Definition
Personal data can generate unprecedented economic and social value for governments,
organizations and individuals in many ways. By 2020 it is estimated that more than 50 billion
devices may be connected to the Internet (Nagel, 2013) and more than 40 times as many
personal data records stored. With the large amounts of data collected from prosumers,
smarter data mining techniques need to be employed to efficiently analyze the data and
identify personas for which data can be traded on a data exchange.
Data mining is the search for valuable information within large volumes of data by
systematically exploring underlying patterns, trends, and relationships hidden in available
data. Data mining techniques can generally be categorized into: (i) classification and
prediction; (ii) clustering; (iii) outlier prediction; (iv) association rules; (v) sequence
analysis; (vi) time series analysis; and (vii) text mining.
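To make the clustering category concrete, the sketch below implements a minimal k-means loop on a hypothetical two-variable prosumer dataset. Note that the dissertation's own application is built in R with R-Shiny; this plain-Python toy, including its invented "hours online / purchases" columns, is an illustrative assumption rather than the project code.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two records."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(cluster):
    """Component-wise mean of a non-empty list of records."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))

def kmeans(points, k, iterations=20, seed=0):
    """Minimal k-means: repeatedly assign each record to its nearest
    centroid, then move each centroid to the mean of its members."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k distinct records
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        # Empty clusters keep their old centroid rather than crashing.
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical records: (hours online per day, purchases per month).
# Two behavioural groups are planted, so k=2 should recover them.
data = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8),
        (8.0, 8.1), (7.9, 8.3), (8.2, 7.8)]
centroids, clusters = kmeans(data, k=2)
```

With behaviour this well separated, the two recovered centroids settle near the planted group means; each resulting cluster is then a candidate persona. The real application applies the same idea via R's `kmeans` function.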
1.2 Aims and Objectives
The aim of this project is to identify individual personas from prosumers' personal
information stored in a digital vault using an intelligent data analysis technique, clustering.
To aid me in achieving this aim within this project I have set out a list of objectives that will
help develop the body of this dissertation as well as assist me in determining whether the
project aim has been successfully satisfied.
• Undertake a state-of-the-art literature review to inform and create a design specification
for identifying personas from digital personhood data using intelligent data
analysis techniques (Clustering).
• Investigate in greater detail the pros and cons of clustering with reference to
appropriate literature
• Build a persona identification application (e.g. using MatLab or R).
• Evaluate the application.
1.3 Project Approach
In order to successfully complete this project I have adopted a five-step approach. At each
stage there will be a set of deliverables I will set that will help achieve my aims and
objectives and also to assist me in completing this project on time.
The first step will be to conduct a state-of-the-art literature review. This review will look at
different cluster analysis techniques from a variety of different physical and online sources.
This will enable me to inform the design of my application, which is the cornerstone of this
project. In addition I will look at what has been done in terms of cluster analysis and try to
synthesize that information and relate it back to my project. The second step will be to
look at different methodological principles and models, picking the most appropriate
method for this project with appropriate reference to the literature. Selecting the right
methodology is pivotal to the success of this project. The third stage will be to analyze the
user requirements, discuss the design of my application and evaluate the GUI. Once
this has been discussed and illustrated, I will proceed to code my application, which
will be done in R-Studio. The fourth stage will be ascertaining the results of the application
and trying to find personas in the clustered dataset. The way I went about deciphering
the information and deducing personas will be shown and explained at this stage. The final
stage of this project will involve evaluating the application and the project as a whole. This
will be coupled with personal reflection on my experiences of putting this project together.
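Deducing personas from the clustered dataset rests on two small computations: standardizing each variable so that distances are comparable across different scales, and expressing each resulting cluster as a percentage of all records. The sketch below is a hedged Python illustration of both steps with invented numbers; the application itself performs them in R.

```python
def standardize(rows):
    """Z-score each column (subtract the mean, divide by the population
    standard deviation) so variables on different scales weigh equally
    in the distance calculation. Assumes no column is constant."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [(sum((v - m) ** 2 for v in c) / len(c)) ** 0.5
           for c, m in zip(cols, means)]
    return [tuple((v - m) / s for v, m, s in zip(row, means, sds))
            for row in rows]

def persona_percentages(labels):
    """Given one cluster label per record, return each cluster's share
    of the dataset as a percentage -- the raw material for reading a
    cluster as a persona."""
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: 100.0 * n / len(labels)
            for lab, n in sorted(counts.items())}

# Hypothetical example: two records whose columns sit on very
# different scales (e.g. height in cm, weight in kg).
scaled = standardize([(170.0, 60.0), (180.0, 80.0)])

# Hypothetical cluster labels for seven records across two clusters.
shares = persona_percentages([0, 0, 0, 1, 1, 1, 1])
```

Without the standardization step, the variable with the largest numeric range would dominate the k-means distance metric, which is why the application standardizes before clustering; the percentage breakdown then summarises how prevalent each persona is in the data.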
1.4 Dissertation Outline
Chapter 2: Literature Review – This chapter will look into previous literature that will
equip me to gain a deeper understanding of my research problem. Subsequently, it will
help inform the design of my application.
Chapter 3: Methodology - This chapter will look at different methodology principles as
well as software development lifecycle models, critically discussing each of their
strengths and weaknesses before isolating the principle and SDLC that will be the most
appropriate for my project.
Chapter 4: Requirement Analysis and Design – This chapter will look at the requirements of
the application set out by the user, analyzing the functional and non-functional
requirements. In addition I will be going through the design process of my application and
how I intend to put it all together.
Chapter 5: Implementation – This chapter will demonstrate the coding of the logic of my
application in R and the coding of the interface using R-Shiny. I will be including fully
annotated screenshots depicting evidence of implementation.
Chapter 6: Results and Evaluation – This chapter will show the results of the
application as well as how I went about deducing personas from it. I will also
evaluate the app to see if it has met the aims and objectives set out at the
beginning.
Chapter 7: Conclusion – This chapter will draw conclusions from all the findings brought
about in this project. I will conclude on my aim as well as all three of my objectives. In
addition I will evaluate my application from a subjective point of view, as well as the
project in its entirety. I will also suggest future work to make my application even better.
2 Literature Review
In this chapter I will discuss and review the different clustering methodologies
available, analyzing the advantages and disadvantages of each technique with reference to
the appropriate literature. This, along with personal evaluation, will support me in
concluding which technique is the most appropriate for executing this project by giving me
adequate justification for the chosen method. In addition I will look in further detail into
what personal data is, as well as how it has metamorphosed into an increasingly important
aspect of economic growth and corporate supremacy, consequently delivering a new breed
of prosumer: the digital prosumer.
2.1 Personal Data
If we look at the European Data Protection Directive [Article 2] we see that personal data is
defined “by reference to whether information relates to an identified or identifiable individual”
(Information Commissioner Office, 2010). In other words, personal data is any given piece of
information that can be used to identify an individual or an individual characteristic. The
Data Protection Act of 1998 adds a different dimension to the EDPD definition of ‘data’ by
taking into account the way the information was processed before it can be regarded as data,
e.g. processed automatically or non-automatically. The EDPD and Data Protection
Act have a common consensus on what personal data/information is:
- Information processed, or intended to be processed, wholly or partly by
automatic means (that is, information in electronic form) (ICO, 2010)
- Information processed in a non-automated manner which forms part of, or is
intended to form part of, a ‘filing system’ (that is, manual information in a
filing system) (ICO, 2010)
2.2 Value of Personal Data
Personal information is an increasingly important asset in the twenty-first century, both in
terms of corporate monetary value and government efficiency as well as economic prowess.
Consequently, corporate companies around the world have begun investing heavily in
software that helps facilitate the collation of consumer data (Schwartz, 2003). It is
estimated that people across the world send 10 billion text messages daily and make 1
billion posts to blogs or social media sites, leading to the emergence of a new type of
economy: the Internet economy. It is estimated that the Internet economy within the G20
amounted to $2.3 trillion, or 4.1% of total GDP, in 2010 (Group, 2012).
2.3 The Internet [Digital] Economy
Sometimes called the digital or web economy, the Internet economy is a concept based on
digital technologies fusing with the traditional economy. First established by Don Tapscott in
his critically acclaimed book “The Digital Economy: Promise and Peril in the Age of
Networked Intelligence”, it is widely believed that the Internet economy is positioning itself
as the new cornerstone for any emerging or established economy (Tapscott, 1997). This is
evident from recent figures released by the Boston Consulting Group in their Digital Manifesto
report, which states that the value of the Internet economy is currently larger than that of
countries like Brazil and Italy, and that by the year 2016 the Internet economy’s value is
expected to double to $4.2 trillion. The report also goes on to say that ‘’no company or country
can afford to ignore this [Internet economy] phenomenon’’ (David Dean, 2012). The rise in
the amount of data being produced is strongly linked to the innovation of mobile technology,
from the turn of the millennium, allowing more devices than ever to be able to make a
connection with the cyber-world that is the Internet. Steve Wojtowecz, Vice President of
storage software development at IBM, stated that by the year 2015 over a trillion devices
would be connected to the internet (King, 2011). As a consequence, the UK government has
started two initiatives, Midata and the Information Economy Strategy (IES), to give
prosumers improved access to the personal data that companies hold about
them (BIS, 2011).
2.3.1 Midata
These are the key principles [aims] of the Midata initiative outlined in its government report:
(Department for Business, Innovation & Skills , 2013)
- Get more private sector businesses to release personal data to consumers
electronically
- Make sure consumers can access their own data securely
- Encourage businesses to develop applications (apps) that will help
consumers make effective use of their data
2.3.2 Information Economy Strategy (IES)
These are the key principles [aims] of the IES project outlined in its government report:
(Department for Business, Innovation and Skills, 2013)
- A strong, innovative, information economy sector exporting UK excellence to the
world
- UK businesses and organizations, especially small and medium enterprises
(SMEs), confidently using technology, able to trade online, seizing technological
opportunities and increasing revenues in domestic and international markets
- Citizens with the capability and confidence to make the most of the digital age
and benefiting from excellent digital services.
Long-term success will be underpinned by:
- A highly skilled digital workforce (whether specialists who create and develop
information technologies, or non-specialists who use them)
- The digital infrastructure (both physical and regulatory) and the framework for
cyber security and privacy necessary to support growth, innovation and
excellence. (Department for Business, Innovation and Skills, 2013)
It’s important to remember that both these government initiatives are being reinforced by
reviews and changes to legislation such as the Data Protection Act, the Consumer Rights Bill
[at both UK and EU level] and the Enterprise and Regulatory Reform Act 2013. The reason
is that this will require companies to disclose customers’ personal data to them if they opt
not to do so voluntarily. (Department for Business, Innovation & Skills, 2013)
2.4 What is a Persona?
Typically used in marketing and human-centered design [HCD], personas are
hypothesized groups of users that illustrate similar behavioral patterns in their use of
technology, lifestyle decisions, customer service preferences, as well as their purchasing
decisions. Angus Jenkinson first came up with a top-down analytical approach that works by
‘grouping’: focusing on a synthetic, clustering process leading to ‘customer communities’ and
the creation and preservation of loyalty within these communities, in his 1994 article
Beyond Segmentation (Jenkinson, 1994). This concept was refined five years later by Alan
Cooper in his pioneering book The Inmates Are Running the Asylum, in which Cooper creates
the actual concept of the ‘persona’ that is used today to identify customers’ relative behavior
and consumption patterns. (Cooper, 1998)
2.5 What is a Prosumer?
It is widely considered that Alvin Toffler is the creator of the concept of prosumption; he
defines it in his book ‘The Third Wave’ as people who “produce some of the goods and
services entering their own consumption” (Toffler, 1980) (Kotler, 1986). In other words,
people who produce and consume their own products and services are prosumers. In the 21st
century the prosumer has become more and more prominent, replacing the traditional
consumer of the Industrial Age. This lends credence to Toffler’s own prediction that, as
society moves toward the Post-Industrial Age, the number of pure consumers will decline,
replaced by “prosumers” (Toffler, 1980).
2.5.1 The Rise of the Digital Prosumer
Consequently, as we delve deeper into the Information Age and the Internet economy
continues to evolve into an economic juggernaut, a new type of prosumer has emerged: the
digital prosumer. The digital prosumer is a person who creates and consumes his or her own
data. As of today, the biggest beneficiaries of the personal data produced are depicted as the
big three data companies, Google, Facebook and Twitter, which make upwards of $1,200
from a user profile. (Madrigal, 2012)
2.6 Data Mining
Data mining is the iterative process of extracting or “mining” knowledge from vast
data stores, which can then be put into perspective and exported as useful
information. Data mining is thought to involve six common classes of tasks that lead to
prediction and description, which are among the primary goals of data mining: (Wikipedia,
2011) (Kamber, 2006)
• Classification – is learning a function that classifies a single data item into one of
several predefined classes. Examples of classification techniques:
- Bayesian classifiers
- K-nearest neighbor
- Linear classifiers
• Regression – is learning a function that maps a data item to a prediction variable.
In other words regression estimates the relationship between any two variables.
Some examples of regression models are:
- Percentage regression
- Bayesian linear regression
- Nonparametric regression
• Clustering – is a descriptive task that aims to identify clusters or
categories that describe the data. Examples of clustering techniques are:
- Hierarchical
- Partitioning
- Density-Based
- Centroid-Based
• Summarization – is a method for finding a cohesive description of a data set; this
includes analytical representations such as visualization and report generation
• Dependency modeling – is a method that consists of finding a model that depicts
significant dependencies between variables
• Change and deviation detection – is a method that focuses on finding the most
significant changes from previously measured data. (Usama Fayyad, 2008)
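To make the classification task above concrete, the sketch below implements a 1-nearest-neighbor classifier, the simplest case of the K-nearest neighbor technique listed. It uses only the Python standard library for brevity (the dissertation itself works in R), and the tiny (weekly spend, store visits) → shopper-label dataset is invented purely for illustration.

```python
import math

def nearest_neighbor(training, query):
    """Return the label of the training example closest to the query point."""
    point, label = min(training, key=lambda row: math.dist(row[0], query))
    return label

# Toy (weekly spend, store visits) -> shopper label examples, invented here.
training = [
    ((10.0, 1.0), "occasional"),
    ((120.0, 4.0), "regular"),
    ((300.0, 7.0), "frequent"),
]

print(nearest_neighbor(training, (110.0, 3.0)))  # regular
```

A query point is simply assigned the label of whichever training example lies nearest to it in feature space, which is why the technique needs no explicit training step.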
2.6.1 Knowledge Discovery from Data [KDD]
KDD is often misconstrued as data mining itself; however, it is safer to say that data
mining is an essential part of the knowledge discovery process. Usama Fayyad proposed the
KDD methodology in 1995 with the purpose of making the data produced by companies useful
to their business needs. (Deutsch, 2010)
Figure 1 - Fayyad KDD representation
Knowledge discovery takes an iterative, sequential approach, which consists
of: (Kamber, 2006)
• Data Cleaning – to remove noise and inconsistent data
• Data Integration – where multiple data sources may be combined
• Data Selection - where data relevant to the analysis task are retrieved from the
database
• Data Transformation - where data are transformed or consolidated into forms
appropriate for mining
• Data Mining – an essential process where intelligent methods are applied in order to
extract data patterns
• Pattern Evaluation – to identify the truly interesting patterns representing
knowledge
• Knowledge Presentation – where visualization and knowledge representation are
used to present the finished knowledge to the user
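The KDD sequence above can be sketched as a chain of small functions. This is only an illustrative Python skeleton under assumed record formats and trivial helper names (`clean`, `select`, `transform`, `mine`); it is not Fayyad's formal definition, nor the R implementation used later in this dissertation.

```python
def clean(rows):
    """Data cleaning: drop records with missing values."""
    return [r for r in rows if None not in r]

def select(rows, columns):
    """Data selection: keep only the columns relevant to the analysis task."""
    return [tuple(r[c] for c in columns) for r in rows]

def transform(rows):
    """Data transformation: rescale each column to the range [0, 1]."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [tuple((v - l) / (h - l) if h > l else 0.0
                  for v, l, h in zip(r, lo, hi)) for r in rows]

def mine(rows, threshold=0.5):
    """Data mining: a trivial 'pattern' - split records on the first feature."""
    return {"low": [r for r in rows if r[0] < threshold],
            "high": [r for r in rows if r[0] >= threshold]}

raw = [(10, 1, None), (10, 1, 2), (120, 4, 3), (300, 7, 5)]
patterns = mine(transform(select(clean(raw), [0, 1])))
print(len(patterns["low"]), len(patterns["high"]))  # 2 1
```

Each function corresponds to one KDD step; pattern evaluation and knowledge presentation would follow by inspecting and visualizing `patterns`.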
2.7 Cluster Analysis
Cluster analysis can be defined as the process of grouping a set of physical or abstract objects
into classes of similar objects. In other words, a cluster can be depicted as a collection of
data objects that are similar to objects within the same cluster and dissimilar to objects in
other clusters. An advantage of clustering, or cluster analysis, is that it can single out useful
features that define characteristics within different groups, which, in turn, will help me in
my aim of identifying personas from prosumer data (Kamber, 2006). There are various
different cluster analysis techniques, such as Partitioning, Hierarchical (Agglomerative and
Divisive) and the Single Link Method (Raza Ali, 2004).
2.7.1 Partitioning Technique
Partitioning methods aim to relocate data from one cluster to another, usually starting
from an initial partitioning. The method also requires the number of clusters to
be pre-set by the user. It is also commonly cited that achieving global optimality in this type
of clustering requires an exhaustive enumeration of all possible partitions; because
of this, most applications choose one of two popular algorithms, the K-means and K-
medoids algorithms (Kamber, 2006):
• K-Means Algorithm
K-means enables the user to mine data by representing each of the k
clusters by the mean value of the objects present in the cluster
• K-Medoids Algorithm
K-medoids, on the other hand, enables each cluster to be represented
by one of the objects located nearest to the center of the cluster.
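As a concrete illustration of the k-means idea described above, here is a minimal pure-Python sketch (the dissertation implements its logic in R; Python is used here only for compactness). Seeding the centers from the first k points, the toy 2-D dataset, and the fixed iteration count are all assumptions of this sketch, not part of the standard algorithm definition.

```python
import math

def k_means(points, k, iterations=10):
    centers = list(points[:k])  # seed: first k points (assumption of this sketch)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [tuple(sum(d) / len(c) for d in zip(*c)) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centers, clusters = k_means(points, k=2)
print(sorted(len(c) for c in clusters))  # [2, 2]
```

The two alternating steps (assign, then recompute means) are exactly the contrast with k-medoids: a k-medoids variant would instead pick an actual member object of each cluster as its representative.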
2.7.2 Advantages and Disadvantages
The K-means technique has advantages as well as disadvantages. One of the main
advantages is that k-means works well for finding spherical-shaped clusters within small
to medium-sized data stores. Another advantage of k-means is that the method tends to
produce tighter, more compact clusters than, say, hierarchical clustering. (Lior Rokach,
2010)
However, there are also disadvantages to this technique, one of them being that it is very
limited in the type of cluster model the algorithm can be applied to. The effectiveness of the
algorithm is predicated on spherical-shaped clusters, sometimes called globular, as this
enables the mean value to be positioned closer to the center of the cluster. This
consequently means that clusters of dissimilar sizes, or very large datasets, won’t work
well with this algorithm. Another disadvantage of this algorithm is that it is very sensitive to
noisy data and outliers, which can increase the squared error significantly. In addition,
the user is required to know the number of clusters beforehand, which is a very tedious task.
(Improved Outcomes Software (ios), 2009)
2.7.3 Hierarchical Technique
Hierarchical methods aim to create a hierarchical decomposition of the given set of data
objects. This method can be sub-partitioned into two techniques: Agglomerative and
Divisive. The agglomerative method, also called the bottom-up approach, works by having
each data object form a separate group; the clusters are then successively
merged until the desired cluster structure is achieved. The divisive method, also
called the top-down approach, starts with all the data objects in the same cluster, which is
then partitioned into sub-clusters, which in turn are partitioned into further sub-clusters.
This sequential process is repeated until the desired cluster structure is obtained. One of the
intriguing things about hierarchical clustering is that it provides a decipherable visual of the
algorithm plus data; this is called a dendrogram. This is a resourceful summarization tool
that makes hierarchical clustering extremely popular. (Lior Rokach, 2010)
Figure 2 - Example of a word sorting dendrogram output from:
http://www.macs.hw.ac.uk/texturelab/people/thomas-methven/
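The agglomerative (bottom-up) process described above can be sketched in a few lines: every object starts as its own cluster, and the closest pair of clusters is merged until the desired structure remains. This Python sketch assumes single-link distance and an invented toy 2-D dataset; it is illustrative only, not the algorithm used in the dissertation's application.

```python
import math

def single_link(a, b):
    """Single-link distance: closest pair of members across two clusters."""
    return min(math.dist(p, q) for p in a for q in b)

def agglomerative(points, target):
    clusters = [[p] for p in points]  # each data object starts as its own group
    while len(clusters) > target:
        # Find the closest pair of clusters and merge them.
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] += clusters.pop(j)
    return clusters

points = [(1.0, 1.0), (1.2, 1.1), (8.0, 8.0), (8.3, 8.2)]
result = agglomerative(points, target=2)
print(sorted(len(c) for c in result))  # [2, 2]
```

Recording the order and distance of each merge is exactly what a dendrogram visualizes; the divisive variant would run the same idea in reverse, splitting one all-encompassing cluster.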
2.7.4 Advantages and Disadvantages
It’s important to remember that hierarchical techniques have many advantages as well as
disadvantages. One of the advantages is versatility; methods like single-link
maintain a strong performance on datasets, delivering well-separated, chain-like and
concentric clusters. Another advantage of hierarchical methods is the fact that they produce
multiple partitions; this is particularly resourceful for users who want to choose different
partitions from those already nested in the overall cluster according to the desired similarity
level chosen by the user.
On the other hand, the disadvantages of this particular technique are quite evident.
Hierarchical algorithms are notorious for their inability to scale well; the algorithm is also
known to cause high I/O costs when trying to cluster a large number of objects. Another
disadvantage of the hierarchical technique is its rigidity: simply put, once one step in the
sequence is done it can never be undone or modified. (Lior Rokach, 2010)
2.8 Critical Discussion
Having reviewed the advantages and disadvantages of the hierarchical and partitioning
techniques, it is important to offer an analysis of both, in relation to this project, in order to
distinguish the most appropriate technique for clustering. From my research I can
see that partitioning clustering works well on small data sets as opposed to bigger ones;
the dataset used in this project is fairly large, containing data from 2,500 households’
weekly shops. Partitioning clustering also produces tighter, more cohesive clusters
through its k-means algorithm, which makes it easier to depict the key features within a
cluster, which in turn define persona characteristics. On the other hand, for users not to
encounter noisy data while clustering, it is advantageous for them to know the number of
clusters in advance; this is nearly impossible with the size of the database in question.
Looking at the other side of the coin, we see that the hierarchical technique is very versatile,
offering different methods such as single link, complete link and average link, which,
consequently, deliver separate clusters. This I believe will work well in this project, as it will
aid in presenting personas from the dataset provided. In addition, the hierarchical
technique has good quality-assurance algorithms to ensure cluster quality, such
as Chameleon, which will be good for ensuring that the personas defined are validated. On the
other hand, the hierarchical technique is very rigid, so if erroneous decisions occur they are
nearly impossible to correct, which is a big disadvantage for this project, as
identifying personas will need a great deal of flexibility since the parameters for personas can
change at any given time.
In light of all the information reviewed, it is fair to say there are a number of advantages and
disadvantages on both sides; however, in order to obtain the best and most concise results, I
believe consensus clustering would be the best option. However, due to time constraints and
a lack of expertise in coding, I have decided to use the K-means algorithm to provide the logic
for my application. I intend then to build an interface that simplifies the steps of the K-
means algorithm and presents them in a way that is easy for the user to administer. The choice of
which software environment I will use to code the interface as well as the justifications for it
will be made in Chapter 5.
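One judgment above is that knowing the number of clusters in advance is nearly impossible for a large dataset. A common heuristic for choosing k (not taken from this dissertation) is the elbow method: compute the within-cluster sum of squares (WCSS) for candidate values of k and look for the point where the improvement levels off. The candidate centers below are hand-picked for brevity and the dataset is invented; in practice each set of centers would come from actually running k-means with that k.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def wcss(points, centers):
    """Within-cluster sum of squared distances to the nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]

# Hand-picked candidate centers for k = 1 and k = 2 (illustration only);
# in practice each set would come from running k-means with that k.
candidates = {
    1: [(4.875, 5.125)],            # the overall mean of the points
    2: [(1.25, 1.5), (8.5, 8.75)],  # one center per natural group
}
for k, centers in candidates.items():
    print(k, wcss(points, centers))
# prints: 1 107.375
#         2 2.25
```

The sharp drop in WCSS from k = 1 to k = 2, followed by diminishing returns for larger k, is the "elbow" that suggests a reasonable cluster count.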
2.9 Summary
In this chapter I have spoken about personal data and its value; I have also looked into the
definition of personas, coupled with the rise of the prosumer and the Internet economy.
Furthermore, I have discussed in detail what cluster analysis is, looking in particular at two
clustering techniques (Hierarchical and Partitioning), and offered an in-depth critical
discussion about my chosen technique to take forward into my application. The findings of
this chapter will further equip me to meet the aims and objectives set out for this project. In
addition it will assist me in constructing a design specification for my application.
3 Methodology
This chapter will explore different research methodologies and come up with the
appropriate justification for applying the chosen methodology to this project. The three
methods in question will be Design Science, Positivist and Interpretive. The methodology I
have decided to use is the design science approach. The justification will be validated through
appropriate reference to the literature sourced, as well as a personal analysis of the different
approaches.
3.1 Design Science
As previously mentioned, the design science approach is my chosen methodology for this
project. Design science, simply put, is the methodical form of designing, or research into design.
First established by American inventor Richard Buckminster Fuller in 1963, the concept of
design science was further developed by Gregory in his 1966 book “The Design
Method”, in which he demarcates the relationship between design method and scientific
method. He further accentuates his view that design is not inherently a science and that the
term design science pertains to the scientific study of design. As technology continued
to evolve at the turn of the century, design science became more integrated into
information systems research and software design projects. Alan Hevner in 2004 produced a
seven-guideline framework with the aim of assisting information systems researchers to
conduct, evaluate and present design-science research. (Alan R. Hevner, 2004)
Figure 3 - Design Science Guideline from MIS Quarterly Research Essay.
This framework was later refined by Peffers in order to explain how the
regulative cycle fits into the design science research framework.
Figure 4 - The Engineering Cycle
This framework is widely used today by information systems researchers as it provides
them a medium to analyze and decipher an existing problem and offer a solution
design or solution hypothesis. They can then look at whether their solution or
hypothesis is effective or meets the specified criteria; this can be executed through a pilot
scheme or prototyping, after which the full implementation can take place. (Roel Wieringa,
2010). This principle in particular would suit my project the most in my opinion, as I aim to
design a software solution (a clustering program), build it, and then evaluate the
effectiveness of the solution.
3.2 Positivist Approach (Positivism)
The positivist approach is a methodology based on objective hypotheses, derived from
introspection or intuition, that are validated or disproved by scientific testing and experimentation
(Sage Publications, 2009). In other words, a positivist approach will have a hypothesis
validating or discrediting a subject area, then go on to prove the hypothesis by
experimentation or by building a solution (University of the West of England, 2007). The
origins of the method lie with sociologist Auguste Comte, who coined and developed the term
in the early 19th century. Today the positivist approach is used increasingly in IS and
software engineering projects (Sociology Guide, 2008). One of the advantages of the
positivist approach is that it relies heavily on quantitative data as opposed to qualitative
data, which is seen as more scientific and thus a more reliable source on which to base hypotheses.
Another advantage of the positivist approach is the fact that it follows a very stringent
structure, as positivism holds that there are guidelines in place that need to
be adhered to, which as a consequence should minimize room for error. This ideology leads
positivists to believe that the reduced room for error makes the whole approach more
accurate when it comes to experiments and applications. However, on the other hand, there
are drawbacks to the approach, one of them being human behavior. Positivists strongly
believe in objective-based assumptions; however, there is no guarantee that bias or subjective
analysis won’t corrupt the study. (Johnson, 2010) (Wikipedia, 2014)
Figure 5- Epistemological Assumptions for Qualitative and Quantitative Research from
http://dstraub.cis.gsu.edu:88/quant/2philo.asp
3.3 Interpretive Approach
The Interpretive approach is a qualitative research method that is based on subjective
assumptions with the knowledge derived from value-laden socially constructed
interpretations (Packer, 2007). In a stark contrast to the positivist approach interpretivist
researchers aim to understand and interpret human behavior as opposed to generalizing and
predicting cause and effect. The impact this has on information system and software design
projects is that the researcher will aim to ask several open ended questions generally
through questionnaires or unstructured / semi-structured interviews and sometimes
observations to gather as much primary information as possible once the scope of the project
has been defined (WordPress, 2012). This particular approach also enables the researcher
to remain open to new ideas throughout the duration of the project, as opposed to the
positivist approach, which believes in pre-ordained rules and guidelines. With that being said,
there are many advantages as well as disadvantages to this approach. One advantage is that
the research methodology is highly qualitative, meaning that the data gathered will be
more in-depth. However, a drawback is that interpretivists have a subjective view of
the project, which will lead to bias getting in the way of ascertaining the correct
results or the best methods to apply in completing the project. (Institute of Public &
International Affairs, 2009) (Slideshare, 2013)
3.4 Critical Discussion
Having looked at all three research approaches in appropriate detail, highlighting the
advantages and disadvantages of each, it is safe to say that all have adequate potential to
be the framework for any information systems project. However, I believe that the best
approach to adopt for this particular project will be the Design Science approach, as this
offers the strongest correlation between what I am trying to achieve in this project and the
actual design science approach itself (design, build, evaluate). However, with that being said, I
believe that I can still look at this project from a positivist point of view. The reason I say this
is that the idea of using data mining to develop ‘personas’ is a relatively novel one, so using a
hypothesis I am trying to prove that it is possible and can be done.
3.5 Software Development Lifecycle Models
There are many models that can be used to develop a software project. All of these models
follow the design science principle of design, build, evaluate. What I aim to achieve in this
section is to identify and describe two common models, offering adequate analysis of
each. I will then isolate the best model that can be adapted to my project.
3.5.1 Rapid Application Development (RAD)
Rapid Application Development is an iterative model that favors rapid, early software
prototyping as opposed to traditional planning. This approach consequently allows the
development of software to take place much sooner. It also keeps stakeholders at the heart of
the development process and allows requirement changes to take place easily. RAD typically
follows four phases in its model: the Requirements Planning Phase, User Design Phase,
Construction Phase and Cutover Phase. (Wikipedia, 2014) (David C. Yen, 1999)
1. Requirements Planning Phase – The inaugural phase of the project, where the
project team meets the stakeholders to go over the business needs of the client,
the project scope, system requirements and constraints. This is followed by an
agreement on the key issues that need to be addressed, after which the relevant
authorization needs to be obtained in order to proceed
2. User Design Phase – The second phase of the project aims for the stakeholders to
maintain dialogue with the project analysts to develop prototype models of the
system that shows clear representation of all system input and output features plus
all the processes within the system. This phase of RAD is perceived to be a continuous
interactive process that allows the stakeholders to play an active role in
understanding, modifying and consequently approving a working prototype model
once they see a model that caters to their business needs
3. Construction Phase – The penultimate phase of the project focuses on
program and application development. Stakeholders further participate by suggesting
changes and improvements to any user interfaces or reports that are typically
developed at this phase. Unit integration, system testing, programming and
application development are done at this phase of RAD.
4. Cutover Phase – The final phase of RAD is typically when the whole project is
brought to a head. Tasks such as testing, data conversion, user training and system
changeover are done at this stage. The compression of all these tasks into the final stage
enables the new system to be delivered back to the stakeholders in a much quicker
timeframe.
Figure 6 - RAD Diagram
3.5.2 Analysis
The RAD model comes with many advantages as well as disadvantages. However, the key is to
be able to synthesize them and relate them back to my project. One of the common advantages
of the RAD model is that it drastically reduces the time needed for requirements analysis and
software requirement specification. Also, all prototypes created can be stored for
future use; this will consequently speed up the software development of the product.
Relatively speaking, heavy prototyping is not necessary for my project, as it is a fairly short,
small project with strict user requirements. (Rouse, 2007) (ISTQB Exam Certification,
2012)
3.6 Waterfall Model
The waterfall model is a sequential design model that establishes software development
through a downward flow of tasks/activities through several phases (reminiscent of an actual
waterfall). It differs from agile development models as it seeks to fully describe
the application through written documents before actual software development commences.
Originally described by Royce in 1970, the waterfall model follows seven sequential phases.
(The Waterfall Development Methodology, 2012)
1. Requirements Specification – The requirements are gathered from the
stakeholders and agreed in principle with the development team.
2. Design – The blueprint of the project is drawn up and given to the developers to
commence coding and start implementation
3. Implementation - The actual system is developed at this stage, all coding is
completed resulting in the actual program being developed
4. Integration – The system created is integrated in the environment agreed on in the
preliminary phase
5. Testing – Full testing of the integrated system is performed at this stage, debugging
also happens at this stage with the view of determining any bugs and work on
potential fixes and patches
6. Installation – Installation of the system, including the removal of the old system, is
done at this stage. This stage also includes training for all stakeholders and staff members.
7. Maintenance – The installed system is maintained through continuous updates and
patches being developed and installed.
The waterfall model follows a strict principle that you can only move forward to the next
phase once the existing phase has been completed and perfected, meaning that once a phase
is finished it cannot be revisited. (ISTQB Exam Certification, 2012)
Figure 7 - Waterfall Model
3.7 Analysis
The waterfall model comes with many advantages. One of the most commonly cited is the
sequential nature of the model, which makes it very easy to understand and execute. Another
advantage is that it works well on projects that are fairly small with strict, set-in-stone
requirements, which suits my project well. Another reason I favour this SDLC is that it
goes hand in hand with the design science approach (design, build & evaluate).
(Select Business Solutions, Inc., 2010)
3.8 User Interface Evaluation
One of the most integral parts of any software project is being able to coherently evaluate the
design of the artefact. As previously stated, the user requirements are used to inform the
design of the application; once this is done, a framework or set of principles needs to be
implemented in order to evaluate it. One of the most popular techniques for usability
evaluation is Nielsen's heuristics. In this section of the report I discuss Nielsen's heuristics
in detail, as well as another usability inspection method, the cognitive walkthrough, in order
to draw qualitative comparisons between the two methods. This in turn will help me decide
on the most suitable approach for evaluating the usability of the Persona Identification
Application.
3.8.1 Nielsen Heuristics
As previously stated, Nielsen's heuristics are among the most popular usability evaluation
techniques and the most used today. It is important to remember that heuristic evaluation
complements conventional user testing: it provides a template, or set of principles, that
helps uncover the problems a user is likely to come across. It was Jakob Nielsen's work with
Rolf Molich in the 1990s that originated the heuristics widely used today; however, it was in
his 1994 publication Usability Engineering that the actual ten heuristics were published for
the first time. (Nielsen, 1994)
(Some of the heuristics have been shortened for brevity)
1. Simple and Natural Dialogue – The dialogue should not contain information that is
irrelevant or rarely needed
2. Speak the User’s Language – The dialogue should be expressed clearly in words,
phrases, and concepts familiar to users rather than in system oriented terms
3. Minimize the User’s Memory Load – The user should not have to remember
information from one part of the dialogue to another
4. Consistency – Users should not have to wonder whether different words, situations
or actions mean the same thing
5. Feedback – The system should always keep users informed about what is going on,
through appropriate feedback within reasonable time.
6. Clearly Marked Exits – Users often choose system functions by mistake and would
need a clearly marked ’emergency exit’
7. Shortcuts (Accelerators) – Unseen by novice users, but often speed up the
interaction for expert users.
8. Good Error Messages – They should be expressed in plain language (no codes) to
precisely indicate the problem.
9. Prevent Errors – Even better than good error messages is a careful design that
prevents a problem from occurring in the first place.
10. Help and Documentation – Even though it is better if the system can be used
without documentation, it may be necessary to provide help and documentation. Any
such information should be easy to search, be focused on the user's tasks, list
concrete steps to be carried out and not be too large.
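In practice, findings from a heuristic evaluation are usually recorded against the heuristic they violate, along with a severity rating, so that the worst-offending areas stand out. The short Python sketch below illustrates one way such findings might be tallied; the example issues and the 1-4 severity scale are entirely hypothetical and are not drawn from this project's evaluation.

```python
from collections import Counter

# The ten heuristics listed above, abbreviated.
HEURISTICS = [
    "Simple and natural dialogue", "Speak the user's language",
    "Minimize the user's memory load", "Consistency", "Feedback",
    "Clearly marked exits", "Shortcuts (accelerators)",
    "Good error messages", "Prevent errors", "Help and documentation",
]

# Hypothetical findings: (violated heuristic, severity 1=cosmetic..4=severe).
findings = [
    ("Feedback", 3),
    ("Consistency", 2),
    ("Feedback", 2),
    ("Good error messages", 4),
]

# Tally how many issues were logged against each heuristic.
issue_counts = Counter(h for h, _ in findings)

# Record the worst severity seen per heuristic, to prioritise fixes.
worst_severity = {}
for heuristic, severity in findings:
    worst_severity[heuristic] = max(worst_severity.get(heuristic, 0), severity)
```

Aggregating findings this way is what allows an evaluation to suggest corrective measures: the heuristic with the most, or most severe, violations points directly at the area of the design to revisit.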
3.8.2 Advantages and Disadvantages
Nielsen's heuristics come with many advantages as well as disadvantages. One of the
advantages of this principle is that it is a very useful and relatively inexpensive way of
providing quick feedback to designers, which can reduce the overall time that a product
spends in the usability evaluation stage. Furthermore, it can be a good way of obtaining
qualitative feedback early in the design process. Another advantage of heuristic evaluation
is that it can help immensely in suggesting the best corrective measures for designers,
provided that the correct heuristic has been assigned in the first place. This will prove
helpful when designing the user interface for the Persona Identification Application (PIA).
Looking deeper into Nielsen's heuristics, there are a few disadvantages to this evaluation
principle. One is that it requires specialist knowledge and competent experience for the
application of the heuristics to be effective. Moreover, usability experts trained to
administer the heuristics effectively are hard to come by and can be relatively expensive to
source. Another disadvantage of the heuristics is that they can tend to be misleading, in
that they can identify more of the minor issues and fewer of the actual major issues with
the design. (Usability.Gov, 2010) (Nielsen, 1994)
Moving forward, it is important to remember that heuristic evaluation does not replace
conventional usability testing and should not be seen as an alternative to it. Many of the
benefits and drawbacks have been highlighted above, and with all this discussed I am in no
doubt that Nielsen's heuristics are the right evaluation metric for evaluating the user
interface of the application. This is because, in essence, they cover all the basic
requirements set by the stakeholders, and they also give me things to consider while
designing the app (e.g. accelerators and consistency) as well as things to evaluate against
at the end of the design process.
3.9 Critical Discussion
The way I intend to go about this heuristic evaluation is to construct a usability
questionnaire as well as a system functionality test, in order to coherently ascertain the
usability of the system and to test its functionality, thus validating the user requirements.
3.9.1 Cognitive Walkthrough
In order to balance the argument for which evaluation technique to use it’s imperative to
draw on a comparison. One of the direct comparisons to the Nielsen Heuristics is the
Cognitive Walkthrough approach. Cognitive Walkthrough was developed as an additional
tool in usability engineering. The technique involves a group of evaluators undertaking a set
of tasks on the interface to evaluate its ease of learning and understandability. Lewis and
Polson first set out the concept of cognitive walkthrough, and it works by tasking the
evaluators with four questions; (usabilityfirst, 2011) (Cathleen Wharton, 1994)
• Will the user try to achieve the right effect?
• Will the user notice that the correct action is available?
• Will the user associate the correct action with the effect to be achieved?
• If the correct action is performed will the user see that the progress is being made
toward solution of the task?
After these questions have been answered, the evaluator attempts to construct a ‘success
story’ for each incremental step of the process. If this turns out to be impossible, the
evaluator will then create a ‘failure story’, which aims to assess why the user cannot
accomplish the task based on the GUI. The findings from the walkthrough are later
aggregated and used to make improvements to the application, in this case the Persona
Identification App. (Lewis, 1997)
3.10 Critical Discussion
Like the heuristics discussed earlier, the cognitive walkthrough has many advantages as
well as disadvantages. One of the main advantages is that it is useful for identifying
problems early in the design phase, and it can help define users' goals and assumptions
with fewer resources than, say, full user testing would demand. This technique fits well
with the scope of my project, as it provides a short and concise evaluation of the user
interface I will be designing; it also provides a user-centred perspective similar to what the
heuristics offer. However, one of the main issues with the cognitive walkthrough is that it
is more susceptible to subjective bias from the evaluators, which may result in the main
issues not being covered. Another issue is that it can be very difficult for a seasoned
evaluator to assume the perspective of an inexperienced user of the system.
3.11 Summary
In this chapter I have looked in depth at three design principles, evaluating each of them
and choosing the most appropriate one for my project. In addition, I looked into software
development lifecycles and picked out the waterfall model as the most suitable lifecycle for
this project. Finally, I looked into user interface evaluation, choosing Nielsen's heuristics
as my way of evaluating the application interface. The findings of this chapter have helped
me choose the appropriate methodology and evaluation for this project.
4 Requirements Analysis and Design
In this chapter I will be reviewing and discussing the fundamental requirements of this
project. There are many categories of requirements that can be used; in this project I will
be using three: customer requirements, functional requirements and non-functional
requirements. In addition, I will be discussing the design process of my project, making
use of activity diagrams, use case diagrams and narrative to help illustrate the design of
my application.
4.1 Customer Requirements
Customer requirements are direct statements or expectations that come from the principal
stakeholders or prime actors of the project being developed. They directly impact the scope
of the project and have unequivocal ramifications for the key features of the system being
developed. In this particular case I spoke directly to some of the principal stakeholders for
the Persona Identification Application, who told me directly that their mission
statement/requirements were the following:
1. To be able to use a whole dataset (Excel)
2. To be able to cluster the dataset through an application
interface
3. Be given back a visual representation of the clustering results
through the application interface
4. To be able to download a CSV table that shows the clustering
results, which can help facilitate the identification of personas
Table 1 – User Requirements
4.2 Functional Requirements
Functional requirements are the mandatory tasks and activities that need to be fulfilled in
order to deliver the full functionality of the app. In other words, they should depict what
the system should do and the features it should provide to its users. The table below shows
the functional requirements for the Persona Identification Application.
Table 2 - Functional Requirements
4.3 Non-Functional Requirements
Non-functional requirements describe the qualities and constraints of the system rather
than what it should do; in this case, the Persona Identification Application. The table
below shows the non-functional requirements for this system.
Table 3 - Non-Functional Requirements
4.4 Requirements Summary
Thus far, one of the key things to remember is that requirements gathering and analysis
plays a crucial role in informing the design of the software solution. The requirements,
along with the research conducted in the literature review, will assist me in putting
together an adequate design of the system, which will be shown in the second half of this
chapter.
4.5 Design
In this part of the chapter I will be concentrating on the design aspect of the Persona
Identification Application. As previously stated, the outcomes of my literature review,
coupled with the results of the requirements analysis, have helped put this part of the
chapter together. I will draw up different diagrams to clearly show the interaction between
the user and the system. I will also provide the reasoning behind why each method was
chosen.
4.6 Activity Diagram
One of the important UML models, an activity diagram illustrates the workflow of a
business process. In this case, the diagram below shows the set of incremental steps that an
end user would need to take to attain his or her end goal. Along the way there are different
decision points that a customer will face, which will ultimately lead them to the same main
deliverable. One of the reasons I opted to construct an activity diagram is that it is one of
the most comprehensible diagrams, offering a clear understanding of the business flow
within the system, not only to the developers but to the stakeholders as well. (Wang
Linzhang, 2004)
Figure 8 - Activity Diagram of Persona Identification Application
4.7 Use Case
Another important UML model, the use case diagram aims to offer the simplest way of
demonstrating the user's interaction with the proposed system. The diagram below shows
the user's interactions with the Persona Identification App. In addition to the diagram I put
together a use case narrative, which provides a more in-depth description of the use case
diagram. The reason I chose to produce a use case diagram and narrative is that they
provide an abstract view of the application from the user's perspective. (Elenburg, 2005)
Figure 9 - Use Case Diagram of Persona Identification Application
Table 4 - Use Case Narrative
4.8 Summary
This chapter has looked at the requirements set out by the user, setting out the functional
and non-functional requirements of the application. This chapter has also shown how I
went about designing the application; in addition, I have discussed different techniques for
evaluating the usability of the application's interface and its functionality. The findings in
this chapter will help me greatly in implementing the application, taking into consideration
the requirements from the users; equally, they will help me evaluate the application as a
whole. This will be explained further in Chapter 6.
5 Implementation
In this chapter I will be discussing the implementation of the Persona Identification App.
In particular, I will be looking at the software environment I chose to implement the
application in, which in this project is R, and providing justification for that choice. In
addition, I will be detailing the full functionality of the application by way of screenshots,
with an adequate description of each point.
5.1 Software Environment – R
R is a free, command-line-based programming language designed for statistical computing
and data mining. Its software environment enables its users to construct statistical
software as well as graphical user interfaces. As previously stated, R is a command-line-
based language, meaning it runs through an MS-DOS-style display; however, several GUI
platforms have been developed for use alongside R, such as RStudio. One of the main
reasons I decided to use R to implement this system is that it is free, meaning that I could
use it at will as opposed to having to obtain a licence. Another reason I chose it is that I
felt quite comfortable using a command-line-based system due to my prior experience with
MS-DOS. Finally, R offers a good and easy-to-understand package for developing
interactive web-based interfaces (Shiny), which I used to develop the interface.
5.2 Software Environment - MatLab
MATLAB is a high-level, interactive programming environment written in several
programming languages such as Java, C and C++. One of the advantages of MATLAB is
that it gives its users access to a wide range of features such as plotting and mapping
functions and data, implementing algorithms and using built-in maths functions.
Furthermore, MATLAB allows its users to create graphical user interfaces to work hand in
hand with the programs coded in its environment. The main reason I chose not to use
MATLAB to develop and implement the Persona Identification App was that I was unable
to obtain a licence from the university to use it at home, meaning that every time I wanted
to work on development I would have had to come onsite, which is neither feasible nor
efficient.
5.3 Persona Identification Application Implementation
As previously stated, I developed the persona identification program in R and subsequently
developed the interface using R's own Shiny package. In order to do this I had to code the
different functions and then put them together in a Shiny-based application. I have
enclosed below screenshots of the code for the most important functions, with annotations
to help depict what each function is doing. For convenience's sake I have also listed the
functions below:
1. Import CSV file and convert to data matrix
2. Choose variables
3. Standardize data option and cluster data
4. Show within-groups sum of squared errors (number of clusters)
5. Show results
6. Download cluster results CSV file
5.3.1 Application Coding Screenshots
1. Import CSV File
Figure 10 - Import csv file plus description
2. Choose variables
Figure 11 – Choose variables plus description
3. Standardize data and run K-Means algorithm
Figure 12 – Standardize data and run k-means plus description
4. Show within-groups sum of squared errors (number of clusters)
Figure 13 – Choose K function plus description
5. Show Analysis Results
Figure 14 – Show analysis results plus description
6. Download cluster results CSV file
Figure 15 – Download results csv file plus description
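The application's own code is in R and appears above only as screenshots. As a language-neutral illustration of the pipeline those screenshots describe (import the data, select the two variables, standardize, run K-Means, and report the within-groups sum of squared errors used to judge the number of clusters), here is a small pure-Python sketch. All data and names here are made up for illustration; this is not the application's actual code.

```python
import math
import random

def standardize(column):
    """Z-score a list of numbers: (x - mean) / standard deviation."""
    mean = sum(column) / len(column)
    sd = math.sqrt(sum((x - mean) ** 2 for x in column) / len(column))
    return [(x - mean) / (sd or 1.0) for x in column]

def kmeans(points, k, iterations=100, seed=42):
    """Plain k-means on 2-D points: returns (assignments, centroids, wss)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                            + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = (sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members))
    # Within-groups sum of squared errors, used when choosing K.
    wss = sum((p[0] - centroids[a][0]) ** 2 + (p[1] - centroids[a][1]) ** 2
              for p, a in zip(points, assignments))
    return assignments, centroids, wss

# Made-up stand-in for the imported CSV: (hkey, prodcatID) pairs.
rows = [(h, c) for h in range(1, 31) for c in (1001, 1002, 1010)]
hkey = standardize([r[0] for r in rows])
prodcat = standardize([r[1] for r in rows])
points = list(zip(hkey, prodcat))
assignments, centroids, wss = kmeans(points, k=3)
```

A lower within-groups sum of squared errors is expected as K increases; the point where the decrease levels off (the "elbow") is the usual guide when choosing the number of clusters.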
5.3.2 Application Interface Screenshots
In this part of the chapter I will be presenting screenshots depicting the actual interface of
the application. This adds a visual impression to the lines of code explained earlier. The
screenshots are further annotated to provide more in-depth descriptions of what is
transpiring within the application.
Figure 16 - Screenshot of Persona Application Interface 1.0
Figure 17 – Screenshot of Persona Identification Application 2.0
5.4 Assumptions
In order to run the application successfully there are some prerequisites that need to be
adhered to. One is that all the data in the CSV file needs to be numeric, or else the K-Means
algorithm will simply throw errors. In addition, the data input has to be pre-processed in
order to gain tangible results; this will be further discussed in Chapter 6. Finally, when
running this application in R, the shiny library needs to be loaded; after this is done, a
simple command of runApp(".") needs to be entered to run the application.
5.5 Summary
This chapter has shown the implementation of the application, as well as the reasoning
behind my choice of software environment. I have also discussed the prerequisites that
need to be fulfilled in order for the application to work. The findings in this chapter have
demonstrated my ability to code an application and present it in a user-friendly manner.
6 Results and Evaluation
In this chapter I will be looking at the results gained from the application developed. I will
also be detailing how I went about deriving personas from the results data. It is important
to remember that this application can work with any dataset as long as it is numeric; for the
purposes of this project I have focused on a dataset containing 500 families' weekly
shopping over a two-month period. Furthermore, I will be evaluating the application's
usability through the Nielsen heuristics and conducting black-box testing to test the
system's functionality.
6.1 Data Pre-Processing
As previously stated, data pre-processing is an essential part of the data mining process, as
it helps lay the foundation for more concise results analysis. It also helps clear up the
so-called ‘garbage’ data that may skew the results. To pre-process the data used for this
project, I first chose the two most important variables that would help me identify
personas from the dunnhumby dataset, which in this case were household key (hkey) and
product category (prodcatID). I used a technique called “quota sampling” to select which
data I wanted to use for this analysis (Riley, 2012), after which I created my own data
subset containing only the two variables in the CSV file. Finally, to adhere to the rules of
K-Means, I assigned each of the 22 product categories a numeric value and input them into
the data subset, keeping a reference of each category and the numeric value it is assigned
to, which can be seen below. For ease of understanding I used the product category as the
“persona”, e.g. GROCERY will be a grocery persona, etc.
Figure 18 – Evidence of data pre-processing Results
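The category-to-code mapping described above can be sketched in a few lines of Python. The category names and codes here are illustrative (the text confirms only that GROCERY maps to 1001, and the real reference table covers all 22 categories):

```python
# Illustrative slice of the reference table; the real table maps all 22
# product categories. Only GROCERY -> 1001 is confirmed by the text.
categories = ["GROCERY", "DRUG GM", "PRODUCE", "MEAT", "DELI"]
code_of = {name: 1001 + i for i, name in enumerate(categories)}

def encode(rows):
    """Replace the text category in (hkey, category) rows with its numeric
    code, so the subset satisfies K-Means' numeric-only requirement."""
    return [(hkey, code_of[category]) for hkey, category in rows]

subset = encode([(1, "GROCERY"), (1, "PRODUCE"), (2, "GROCERY")])
```

Keeping the mapping in one table means the numeric codes in the clustering output can always be translated back to the category, and hence the persona, they stand for.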
Once the results CSV file is downloaded, the contents show four columns: kclust, which
shows how many clusters there are; hkey and prodcatID, the two variables we chose to
analyse; and finally fit.cluster, which shows the cluster each row has been assigned to.
Figure 19 - Screenshot of results out CSV file
I can see from here that each prodcatID and hkey pair has been assigned to a fit.cluster,
the number of clusters having already been set by the user. From this I can then filter the
rows in the CSV file to see how many of each numeric variable, e.g. 1001, 1002, are in each
cluster. Once I have found out how many of each variable are in each cluster, I aggregate
the total amount, which in turn helps me work out a persona percentage for each category
in each cluster. I make sure all the results are documented, which can be seen below.
Figure 20 - Identifying Personas Breakdown
45. Digital Prosumer - Identification of Personas through Intelligent Data Mining (Clustering)
43
The formula I used to work out the percentage was relatively straightforward. After
aggregating the total amount, I calculated the instances of each variable against the total
amount within the cluster. For example, 1001 (GROCERY) has 2050 instances in cluster 1;
I ran that number against the total number of instances in cluster 1 using an online
percentage calculator.
Figure 21 –Percentage Calculator Example
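The same arithmetic the online calculator performs can be written down directly: divide a category's instance count by the cluster's total and multiply by 100. In this Python sketch, only the 2050 GROCERY instances come from the text; the other cluster-1 tallies are invented for illustration.

```python
def persona_percentage(instances, cluster_total):
    """Share of a cluster accounted for by one category, as a percentage."""
    return 100.0 * instances / cluster_total

# Hypothetical cluster-1 tallies; only the GROCERY count of 2050 is quoted
# in the text, the rest are made up.
cluster1 = {"GROCERY": 2050, "DRUG GM": 340, "PRODUCE": 310, "OTHER": 500}
total = sum(cluster1.values())
shares = {category: round(persona_percentage(count, total), 1)
          for category, count in cluster1.items()}
```

Computing the shares per cluster in one pass like this also guarantees the percentages within a cluster sum to roughly 100, which is a useful sanity check on the tallies.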
6.2 Results Summary
To be able to identify personas, thus meeting my aim, I conducted some tests on my own
data subset (Figure 11). The first test I ran was with K (the number of clusters) set to 3,
which is the optimum number of clusters for this dataset (see Figure 10). After mining the
raw data based on the method stated above, the following results were found:
Figure 22 - Persona Percentage Results (Test 1)
From the results found, I can say that the GROCERY persona was the most consistent and
populous persona found in the dataset, averaging around 60-65% in terms of persona
percentage. The next most prominent persona found was the DRUG GM persona, averaging
around 10-11% persona percentage. This tells me that the dataset is heavily populated with
GROCERY personas, with very few other persona variants following. To validate this
finding I ran the application again on the same dataset, this time with K = 4. The results
were as follows:
Figure 23- Persona Percentage Results (Test 2)
From this particular test I can see a clear correlation with the first test I conducted with K
set to 3. I can deduce that the GROCERY persona averages between 63-66% persona
percentage spread across 4 clusters, which is very similar to the first test run. The DRUG
GM persona holds its mark at around 10% persona percentage, with PRODUCE coming in
at around 9-10% on average in terms of persona percentage. This indicates to me that the
dataset is densely populated with GROCERY personas.
6.3 Evaluation
As previously mentioned in section 3.8.1, I have chosen to use the Nielsen heuristics to
evaluate the usability of the application interface. To go about this I have used a System
Usability Scale (SUS) questionnaire, which was developed by John Brooke (Brooke, 2011).
The questionnaire itself is ten questions long, based on a Likert scoring scale (1 = strongly
disagree, 5 = strongly agree); if the participant is uncertain of an answer they select 3. The
reason for choosing this questionnaire is that the questions asked are similar to Nielsen's
1994 heuristics, which is what I planned to evaluate the system with to begin with. In
addition, using a Likert scale makes it more coherent and easier for the participants to
complete, thus saving time (Dane Bertram, 2012). Below is an example of the
questionnaire that will be given to the participants;
Figure 24 - System Usability Questionnaire
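For reference, a SUS questionnaire is conventionally scored as follows (Brooke's method): odd-numbered, positively worded items contribute (response - 1), even-numbered, negatively worded items contribute (5 - response), and the sum is scaled by 2.5 to give a 0-100 score. The Python sketch below shows that canonical calculation; it is an assumption that the questionnaire used in this project was scored the same way.

```python
def sus_score(responses):
    """Canonical SUS scoring (Brooke): ten 1-5 Likert responses -> 0-100.

    Odd-numbered items are positively worded and contribute (score - 1);
    even-numbered items are negatively worded and contribute (5 - score);
    the sum of contributions is multiplied by 2.5.
    """
    if len(responses) != 10 or not all(1 <= r <= 5 for r in responses):
        raise ValueError("expected ten responses on a 1-5 scale")
    total = sum((r - 1) if i % 2 == 0 else (5 - r)
                for i, r in enumerate(responses))
    return total * 2.5

# A participant answering 5 to every positive item and 1 to every
# negative item yields the maximum score of 100.
best = sus_score([5, 1] * 5)
```

Scoring each questionnaire this way gives a single comparable number per participant, which is what makes it straightforward to average the results and chart them, as done later in this chapter.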
6.3.1 Participant selection
Selecting the number of participants to evaluate the application is very important,
especially as it pertains to this project. In an ideal world, the more evaluators the better, as
different evaluators can pick up different usability issues. However, according to Nielsen,
the optimum number for evaluating a software system is 5 evaluators, or at least 3.
(Nielsen, 1995)
Figure 25 - Graph showing the optimum number of evaluators
The above figure (25) shows the optimum number of evaluators against the proportion of
usability problems found. I can see here that 5 evaluators can find around 75% of usability
problems.
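The shape of that curve comes from a simple probabilistic model due to Nielsen and Landauer: if a single evaluator finds any given problem with probability L, then n evaluators together find a proportion 1 - (1 - L)^n of the problems. The sketch below uses L = 0.31, a commonly quoted average; the exact proportion at five evaluators (such as the 75% cited above) depends on the value of L assumed.

```python
def proportion_found(evaluators, find_rate=0.31):
    """Nielsen & Landauer model: proportion of usability problems found
    by n evaluators, where find_rate is the probability that a single
    evaluator spots any given problem (0.31 is a commonly quoted
    average; the true value varies from project to project)."""
    return 1 - (1 - find_rate) ** evaluators

# Coverage for 1..5 evaluators; note the diminishing returns, which is
# why a handful of evaluators (3-5) is usually considered enough.
coverage = [round(proportion_found(n), 2) for n in range(1, 6)]
```

The diminishing returns visible in the sequence are the justification for stopping at a small panel: each extra evaluator mostly rediscovers problems already found by the others.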
6.4 Black-Box Testing
Black-box testing is a form of functional testing which aims to test whether the software
developed does what it is supposed to do. The way I went about this was to create a
questionnaire based on the functional requirements, which the same participants testing
the usability would fill out. (Williams, 2006)
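Black-box checks of this yes/no form map naturally onto simple automated assertions against the application's observable behaviour. The Python sketch below is purely hypothetical (the requirement names paraphrase the customer requirements, and the stand-in check function is invented); the project itself gathered these answers through a participant questionnaire rather than code.

```python
def behaves_as_required(requirement):
    """Stand-in for exercising the real application against one
    requirement; here it trivially succeeds for illustration."""
    return True

# Checklist paraphrasing the customer requirements from Chapter 4.
requirements = [
    "Import a CSV dataset",
    "Cluster the dataset through the application interface",
    "Display a visual representation of the clustering results",
    "Download the clustering results as a CSV table",
]

# Record a yes/no answer per requirement, mirroring the questionnaire.
results = {req: ("YES" if behaves_as_required(req) else "NO")
           for req in requirements}
all_met = all(answer == "YES" for answer in results.values())
```

Whether gathered by questionnaire or by automated checks, the principle is the same: each functional requirement gets a binary verdict, and the requirements are met only if every verdict is YES.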
Figure 26 - Functional Test Questionnaire
The reason I chose to design the questions this way (Figure 26) was to be able to gauge
whether or not the functional requirements had been met with a straightforward yes or no
response. This has a direct knock-on effect, as the outcome of this questionnaire will
indicate to me how far I have gone in meeting the user requirements.
6.5 Evaluation Results
After the evaluation was completed, I collated all the results from the questionnaire and
produced a bar chart from them to add a visual representation of the evaluation results.
The first thing I did was to put all the answers from each participant into a table, which
can be seen below (Figure 27). After this I was able to construct a bar chart using Excel.
Figure 28 - Bar Chart of Usability Questionnaire Results
To make the output more meaningful to me I aggregated the results and draw up a bar chart
to give a visual representation of the average score of the usability questionnaire
Figure 27 - Table of Usability Questionnaire Results
Figure 29 - Bar Chart showing average usability questionnaire results
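The aggregation step can be sketched in a few lines; the participant scores and question count below are invented for illustration, not the real responses recorded in the results table:

```python
# Hypothetical illustration of aggregating usability questionnaire
# results: average each question's score across five evaluators.
# Participant names, question count, and scores are invented.
responses = {
    "P1": [4, 5, 4, 3, 5],
    "P2": [5, 4, 4, 4, 5],
    "P3": [4, 4, 5, 4, 4],
    "P4": [5, 5, 4, 3, 4],
    "P5": [4, 5, 5, 4, 5],
}  # participant -> one score per question (1 = poor, 5 = excellent)

n_questions = len(next(iter(responses.values())))
averages = [
    sum(scores[q] for scores in responses.values()) / len(responses)
    for q in range(n_questions)
]
print([round(a, 2) for a in averages])  # one bar per question
```

Each value in `averages` corresponds to one bar in the averaged chart, which makes low-scoring questions stand out immediately.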
6.6 Black Box Testing Results
As previously stated, the system functionality testing (black box) was conducted concurrently
with the usability testing. Every participant reported back that they could execute all the
functionality that the system offers. The results are illustrated below in Figure 30.
Figure 30 - Results of System Functionality Questionnaire
6.7 Evaluation Summary
To conclude this chapter, I can say that the usability and system evaluation was highly
successful, in particular the black-box testing. The responses from all five subject experts
who conducted the evaluation were highly positive, which tells me that, from an expert point
of view, the application is very usable and does what it sets out to do. On the functionality
side, 5/5 evaluators answered YES to all 7 functionality questions (Figure 30). This tells me
that the system functionality is fit for purpose and, crucially, it validates the customer
requirements set out in Chapter 4.
7 Conclusion
This dissertation has covered a broad range of topics as well as fresh, novel ideas such as
persona identification. However, it is important to competently draw conclusions from the
findings of this project, appraising the positives and offering constructive critique of the
weaker aspects of the dissertation.
7.1.1 Aim - Identify individual personas from prosumers' personal information.
In answer to this, I can say that I was able to identify individual "personas" from prosumer
data; however, there were issues that I came across along the way.
The first issue was the strength of the personas. The main persona found in the dataset tested
was the GROCERY "persona", which some analysts could deem too vague or not in-depth
enough. Through my own investigation into this perception, I found that a much deeper
pre-processing method, e.g. using sub-product categories instead of main product categories,
would be required in order to draw out more 'features' within the clusters. This would help
facilitate more diverse and meaningful "personas". It is important to stress that this could
have been achieved within the boundaries of this particular project; however, I believed at the
time that deriving personas from main product categories, i.e. grocery, produce, nutrition
etc., would be a better way of obtaining good individual personas. With hindsight, I believe a
deeper pre-processing method would have produced more meaningful personas. Nevertheless,
this should not take away from the fact that I was able to identify individual "personas",
which was the ultimate aim of this dissertation.
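As an illustration of the clustering idea behind this aim, the sketch below clusters hypothetical shoppers by their share of spend per main product category and labels each cluster by its dominant category. The category names, spend shares, and this pure-Python K-Means are all invented for illustration; they are not the project's actual code, dataset, or results.

```python
import random

# Illustrative persona identification: cluster shoppers by the share of
# spend per main product category, then read the dominant category of
# each centroid as that cluster's "persona". All data here is invented.
CATEGORIES = ["grocery", "produce", "nutrition"]

shoppers = [
    [0.8, 0.1, 0.1],  # spends mostly on grocery
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],  # spends mostly on produce
    [0.2, 0.7, 0.1],
    [0.1, 0.1, 0.8],  # spends mostly on nutrition
]

def kmeans(points, k, iters=50, seed=0):
    """Minimal K-Means: alternate assignment and centroid-update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from the data points
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = [sum(col) / len(cl) for col in zip(*cl)]
    return centroids, clusters

centroids, clusters = kmeans(shoppers, k=3)
personas = [CATEGORIES[c.index(max(c))] for c in centroids]
print(personas)
```

Using sub-product categories, as discussed above, would simply mean widening the feature vector, giving the centroids more dimensions in which to differentiate and hence richer personas.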
7.1.2 Objective 1 - Undertake a state-of-the-art literature review to inform and
create a design specification for identifying personas / investigate in
greater detail the pros and cons of clustering with reference to
appropriate literature
To conclude this objective, I can confidently say that a state-of-the-art literature review was
undertaken (see Chapter 2), carefully analyzing the two main clustering methods
(hierarchical and partitioning), drawing out their advantages and disadvantages and relating
them back to the aim of this project. In addition, I looked into the importance of personal
data and how it has risen to become the new "oil". I also looked at the rise of the digital
prosumer, in particular how prosumption is poised to take over typical consumption, lending
credence to Toffler's prediction that prosumption would overtake consumption by the turn of
the 21st century. All of this provided the necessary justification for undertaking the project
and exposed the potential value of building an application that can identify personas.
In essence, I believe this objective was met to a high standard, making use of a variety of
white literature. This subsequently enabled me to create a design specification for my application.
7.1.3 Objective 2 - Build a persona identification application.
This part of the project was by far the most challenging yet the most rewarding. First, I was
tasked with choosing the appropriate software environment in which the application would be
coded; once this was ascertained, the code development began. Although this was a very
tedious task, involving numerous failed attempts and heavily bugged versions, a final version
was created, bringing to life all the research and personal hypotheses set out at the beginning
of the project (see Chapter 5). Overall, I was hugely satisfied with the implementation of the
application. Despite the fact that it took a huge amount of time and resources to put
together, I believe it was a strong, well-put-together application that was indeed fit for purpose.
7.1.4 Objective 3 - Evaluate the application.
The final part of this dissertation required me to evaluate the application, not only to provide
validation against my aim but also to validate the customer requirements defined in Chapter 4.
I went about this by first evaluating the usability of the system; this was done via a
questionnaire heavily influenced by Nielsen's heuristic principles. After this, a black-box test
was put together to evaluate the functionality of the application. Both tests were a huge
success. As I was using experts to evaluate the system, extra scrutiny was applied to both the
usability and the functionality. The feedback was highly positive, which went a long way
towards validating my aim and user requirements (see Chapter 6).
7.2 Future Development
It is all too easy in any project to gloss over the things that have not been done, due to time
or resource constraints, and to over-emphasize the things that have been achieved. I believe
there is a world of benefits to be unlocked once we sit back and look at what can be developed
in the future to make this project even better.
There are a number of things that could be achieved with future work and development to
enhance the application even further. The first is obviously a much deeper pool of personas,
as explained earlier in this chapter. Another future development would be adding more
algorithms to the application instead of just K-Means alone; this was explained in more
detail in Chapter 2.8. A further development would be the ability to put the application on a
server and connect it to a database. This would enhance the application even more, as data
from the data lockers could be stored in the database and called into the application via a
database query, making the application more robust, expanding the