Health data mining


collecting digital footprint data using health data mining

Published in: Technology

  1. Sidra Ali
  2. Collecting a Citizen’s Digital Footprint for Health Data Mining. Oguzhan Gencoglu, Heidi Similä, Harri Honko, Minna Isomursu
  3. Abstract  This paper describes a case study of collecting digital footprint data for the purpose of health data mining.  The case study involved 20 subjects residing in Finland who were instructed to collect data from registries that they judged useful for understanding their current or past health or health behavior.  11 subjects were active, sending a total of 100 data requests to 49 distinct organizations.  Our results indicate that there are still practical challenges in collecting actionable digital footprint data.
  4. Abstract  Of the received data, 44 datasets (72.1%) were delivered in paper format,  4 (6.6%) in portable document format (PDF), and  13 (21.3%) in structured digital form.  The time between sending an information request and receiving a reply was 26.4 days on average.
  5. Introduction  A digital footprint, or digital shadow, refers to one's unique set of traceable digital activities, actions, contributions and communications that are manifested on the Internet or on digital devices.  There are two main classifications for digital footprints:  Passive digital footprints. A passive digital footprint is created when data is collected without the owner knowing. It can be stored in many ways depending on the situation. In an online environment, a footprint may be stored in an online database as a "hit". This footprint may track the user's IP address, when it was created, and where the user came from, with the footprint being analyzed later. In an offline environment, a footprint may be stored in files, which administrators can access to view the actions performed on the machine without being able to see who performed them.  Active digital footprints. An active digital footprint is created when a user deliberately releases personal data in order to share information about oneself through websites or social media.
  6. Introduction  Digital footprints can tell a lot about the behavior, characteristics and preferences of an individual [2] [3] [4] [5] [6], provided they are accessible in digitally digestible, machine-readable form.  Increasingly, data sets, open or closed, are being made available over an application programming interface (API). Where accessible, a person’s digital footprint is used today, for example, for personalized recommendation services tailored to the person, income and even location context [7].  Some argue that digital footprint data, when properly gathered and analyzed with modern data analytics, could provide significant opportunities for new, more personalized and timely health services.  Aggregated and analyzed data can help individuals themselves learn about their health condition [10] [11].
  7. Introduction  Better access to electronic health records can help communication between carers, health professionals and other service providers [12].  This can create opportunities for entirely new kinds of health and wellbeing services, which create new business opportunities for companies and help increase the efficiency of health interventions through targeted care.  In this paper, we examine the state of the practice of collecting a 2010s citizen’s personal digital footprint for the purpose of health data mining.
  8. Introduction  Our research question is “Can the digital footprint of an individual be collected successfully today for health data mining?”.  For the purpose of the study, we recruited volunteers to send information requests to organizations of their own choice. They tried to maximize the number of responses.  Our results summarize how successful our case subjects were in collecting their digital footprint data.  Did the organizations provide them access to their personal footprint data?  In what format was the data presented to them?
  9. Introduction  What procedures, roughly, would be needed to make that data actionable, so that anyone attempting to refine and analyze it could use it for computerized health data mining to provide insights and health-related value?  Our discussion summarizes our experience and suggests further work on how such data can be examined to reveal health behavior patterns.
  10. METHODOLOGY  A total of 20 volunteer participants were recruited from among the active researchers in this study.  The participants were instructed to print, sign and mail the information request with the covering letter to 5-10 target organizations of their own choice.  A preliminary list of candidate sources for digital footprint information was collected to serve as an example for the participants, although they were instructed to decide for themselves which data sources could be valuable for health data analytics.  In order to follow the process, the participants kept a record of the dates when the information requests were sent, when the replies were received and in which format.
  11. METHODOLOGY  The data was requested to be delivered to each participant’s home address or email.  The information request form stated that delivery via an API, a memory stick or a DVD was preferred over printed paper documents.  After receiving the data, the participants were instructed to go through it and decide which representative set of each individual register’s data they were willing to donate to the research program.  Sensitive personal information was removed or edited when needed. Each participant signed an informed consent form when handing over the data.
  12. RESULTS AND DISCUSSION  The study had 20 voluntary participants, all residing in Finland (18 natives, 2 foreigners).  11 (55.0%) individuals were active during a period of five months (11/2014-03/2015), sending a total of 100 information requests (9.09 per active participant) to 49 distinct data sources (2.04 requests per source).  With respect to their content, the researchers classified these data sources into 15 categories: banking, education, energy, fitness, groceries, healthcare, housing, insurance, library, mobility, municipality, police, retail, telecommunication and web.  The average number of distinct data sources and the average number of sent requests per category are 3.27 and 6.67, respectively.  The healthcare category has both the maximum number of distinct data sources and the maximum number of sent requests, with 30 requests to 13 data sources.  For each category, a detailed summary of the number of data sources, sent requests, received replies and replies resulting in access to data can be seen in Table I.
  13. RESULTS AND DISCUSSION  The overall response rate and the data response rate of the study were 75.0% and 61.0%, respectively.  As the main purpose of a digital footprint collection process is eventually to perform data analysis on each individual’s data, the amount of collected data has a great effect on analysis performance.
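The derived figures quoted in the results can be re-computed from the counts reported in the slides. The Python sketch below is only a consistency check: the variable names are illustrative and not taken from the paper, and it assumes that the 61 received datasets (44 paper, 4 PDF, 13 structured) are the replies counted in the data response rate.

```python
# Counts as reported in the slides; names are illustrative, not from the paper.
participants = 20
active = 11
requests = 100
sources = 49
categories = 15
format_counts = {"paper": 44, "pdf": 4, "structured": 13}
datasets = sum(format_counts.values())  # assumed to equal the data replies


def pct(part, whole):
    """Percentage of part in whole, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


summary = {
    "active_share_pct": pct(active, participants),
    "requests_per_active_participant": round(requests / active, 2),
    "requests_per_source": round(requests / sources, 2),
    "sources_per_category": round(sources / categories, 2),
    "requests_per_category": round(requests / categories, 2),
    "data_response_rate_pct": pct(datasets, requests),
}
format_share_pct = {fmt: pct(n, datasets) for fmt, n in format_counts.items()}

for name, value in summary.items():
    print(name, value)
print(format_share_pct)
```

Under these assumptions the sketch reproduces the reported 55.0%, 9.09, 2.04, 3.27, 6.67 and 61.0%, as well as the format shares of 72.1%, 6.6% and 21.3%.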
  14. RESULTS AND DISCUSSION  The format of the collected data is also crucial for the analysis to be conducted properly.  Even though more than half of the data sources provided some data to the individuals, in most cases the format of the returned data is not analysis-friendly, and often not even digitized.  The format of the delivered data can be categorized into three groups: paper format (hard copy), portable document format (PDF) and spreadsheet/structured format, which includes formats such as comma-separated values (CSV), Microsoft Excel file formats (XLS/XLSX) and JavaScript Object Notation (JSON).  The listed order is from least to most analysis-friendly. A detailed view of the format of the collected data for different categories can be seen in Table II.  Hard copy, i.e., paper format, accounts for the majority of the collected data with 72.1%. Only 21.3% of the collected data can be considered structured. None of the data sources offered an API for such data ingestion.
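To illustrate why the structured group is called the most analysis-friendly, the Python sketch below loads the same records from CSV and from JSON using only the standard library. The purchase records, the field names and the helper functions are invented for illustration and are not data or code from the study.

```python
import csv
import io
import json

# Hypothetical register extract delivered in two structured formats.
csv_text = (
    "date,item,amount_eur\n"
    "2015-01-12,rye bread,2.40\n"
    "2015-01-12,milk,0.95\n"
)
json_text = (
    '[{"date": "2015-01-12", "item": "rye bread", "amount_eur": 2.40},'
    ' {"date": "2015-01-12", "item": "milk", "amount_eur": 0.95}]'
)


def normalize(row):
    """Coerce one raw row into a common record shape."""
    return {
        "date": row["date"],
        "item": row["item"],
        "amount_eur": float(row["amount_eur"]),
    }


def load_csv(text):
    return [normalize(row) for row in csv.DictReader(io.StringIO(text))]


def load_json(text):
    return [normalize(row) for row in json.loads(text)]


csv_records = load_csv(csv_text)
json_records = load_json(json_text)
total_spent = sum(r["amount_eur"] for r in csv_records)
print(csv_records == json_records, round(total_spent, 2))
```

Both formats reach an analyzable form in a few lines, with no scanning or character recognition step, which is the contrast the slide draws with paper and PDF deliveries.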
  15. RESULTS AND DISCUSSION  When the process of transforming non-analysis-friendly data into analysis-friendly form is considered, the drawbacks become more obvious.  Data delivered in paper format, first of all, has to be printed and mailed, which comes at a cost.  As an individual can easily own hundreds of pages of data residing in several data sources, logistics, security and storage problems arise.  Then, the data has to be digitized by the recipient, for example by scanning. Such a process is not only burdensome but also error-prone.  After digitization, the data is in the form of PDFs or digital images, which have to be fed into an optical character recognition (OCR) algorithm.
  16. RESULTS AND DISCUSSION  As paper-form data is likely to contain artifacts (lines, logos, bright/dark spots due to scanning, irrelevant text, folded or torn parts) that act as noise to the OCR system, the likelihood of error increases.  Furthermore, the OCR system has to be tuned specifically for the structure of the text on the paper; thus, parsing the relevant information becomes even more demanding.  In addition, as there is no guarantee that the data source will deliver the data on paper in the same format in the future, such tasks are discouraged with respect to the reproducible research paradigm.
  17. RESULTS AND DISCUSSION  Another interesting aspect of the data collection process is the responsiveness of the data sources, i.e., how quickly each registry replies to the requests.  56 of the requests have both sending and reply dates recorded.  On average, a reply (providing data or not) took 26.4 days to arrive.  Average reply times for different categories can be seen in Table III.  The average durations for data registries with a small number of recorded times are given for the sake of completeness rather than to support conclusions.  The average reply time for requests resulting in data reception was 29.6 days, while replies that did not provide data arrived in 14.8 days on average.
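Reply-time averages of this kind can be derived from the participants' records of sending and reply dates. The Python sketch below shows one way to do so; the log entries are hypothetical and not study data, and the tuple layout (sent date, reply date, whether data was received) is an assumption made for illustration.

```python
from datetime import date
from statistics import mean

# Hypothetical request log: (sent, replied, data_received). Not study data.
log = [
    (date(2014, 11, 3), date(2014, 12, 1), True),
    (date(2014, 11, 10), date(2014, 11, 20), False),
    (date(2015, 1, 5), None, False),  # no reply date recorded
    (date(2015, 1, 12), date(2015, 2, 16), True),
]


def reply_days(entries):
    """Days between sending and reply, skipping entries without a reply date."""
    return [
        (replied - sent).days
        for sent, replied, _ in entries
        if replied is not None
    ]


# Only requests with both dates recorded enter the averages.
answered = [entry for entry in log if entry[1] is not None]
overall = mean(reply_days(answered))
with_data = mean(reply_days([e for e in answered if e[2]]))
without_data = mean(reply_days([e for e in answered if not e[2]]))

print(len(answered), round(overall, 1), with_data, without_data)
```

Requests without a recorded reply date are excluded, mirroring the slide's restriction to the 56 requests that have both dates.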
  18. CONCLUSION  One’s behavior is reflected in one’s actions, and those actions are recorded in great amounts in today’s world as a digital footprint.  As advancing data mining algorithms enable efficient harmonization of multi-modal data to perform inferential, predictive and even causal analysis of people’s behavior, these digital footprints are of considerable value for health data mining purposes.  An expected rise in the demand for personal data from various data registries is likely to change the current state of the information retrieval process presented in this paper.  Our results show that the utilization of digital footprints in services currently faces practical challenges. Companies and institutions in control of individuals’ data are not responsive or attentive to the emerging value of the digital footprint.  Even in the Finnish context, where individuals have a legal right to access their personal data, many organizations ignored the request or refused access to the data.  Very few provided data in a format that could be easily digested by digital tools.
  19. CONCLUSION  Providing high-quality data to cutting-edge data mining and machine learning systems is essential for high-performance predictive analysis, health behavioral modeling and personalized services.  To achieve this goal, controlled and secure data access via service web portals or, even better, through machine-readable APIs is needed.  Our work continues with exploration of the collected datasets in terms of validity, suitability and information value for health data mining, leading to in-depth analysis of how the digital footprint can be used in health services.
  20. REFERENCES  [1] A. Sellen, Y. Rogers, R. Harper, and T. Rodden, “Reflecting human values in the digital age,” Communications of the ACM, vol. 52, no. 3, pp. 58–66, 2009.  [2] “World economic forum - rethinking personal data: Strengthening trust,” 2012.  [3] D. Zhang, B. Guo, B. Li, and Z. Yu, “Extracting social and community intelligence from digital footprints: an emerging research area,” in Ubiquitous Intelligence and Computing. Springer, 2010, pp. 4–18.
  21. REFERENCES  [4] C. Moiso and R. Minerva, “Towards a user-centric personal data ecosystem the role of the bank of individuals’ data,” in Intelligence in Next Generation Networks (ICIN), 2012 16th International Conference on. IEEE, 2012, pp. 202–209.  [5] A. Malhotra, L. Totti, W. Meira Jr, P. Kumaraguru, and V. Almeida, “Studying user footprints in different online social networks,” in Proceedings of the 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012). IEEE Computer Society, 2012, pp. 1065–1070.
  22. REFERENCES  [6] N. Eagle and A. Pentland, “Reality mining: sensing complex social systems,” Personal and Ubiquitous Computing, vol. 10, no. 4, pp. 255–268, 2006.  [7] M. Venkataramanan, “My identity for sale,” /magazine/archive/2014/11/features/my-identity-for-sale/viewall, accessed: 2015-27-03.  [8] “Mac basics: Notifications keep you informed,” lb/HT204079, accessed: 2015-27-03.  [9] “Google now,” accessed: 2015-27-03.
  23. REFERENCES  [10] J. H. Frost and M. P. Massagli, “Social uses of personal health information within PatientsLikeMe, an online patient community: what can happen when patients have access to one another’s data,” Journal of Medical Internet Research, vol. 10, no. 3, 2008.  [11] S. Kumar, W. Nilsen, M. Pavel, and M. Srivastava, “Mobile health: Revolutionizing healthcare through transdisciplinary research,” Computer, no. 1, pp. 28–35, 2013.  [12] C. Pagliari, D. Detmer, and P. Singleton, “Potential of electronic personal health records,” BMJ: British Medical Journal, vol. 335, no. 7615, p. 330, 2007.  [13] “Finnish legislation - personal data act, 523/199,” translation completed: 2001-31-03.