Advancing “Info-demiology”: Methods and Applications for Cancer Research

2016 ORAU Annual Meeting of the Council of Sponsoring Institutions
Georgia Tourassi, PhD
Health Data Sciences Institute

Published in: Government & Nonprofit
  1. 1. Advancing “Info-demiology”: Methods and Applications for Cancer Research
     Georgia Tourassi, PhD, Health Data Sciences Institute
     Presented at the 2016 ORAU Annual Meeting, March 9, 2016
  2. 2. “Infodemiology”
     • The epidemiology of digital (mis)information
       – Collect, describe, and analyze health information and communication patterns using online sources
  3. 3. Next-Gen Epidemiology
     • Digital Epidemiology
       – Conducting epidemiological studies using data coming from digital tools and sensors, e.g. the internet or smartphones
       – Different data collection methods online:
         • Recruit patients or collect openly available content from candidate subjects
  4. 4. Cancer Community
     • One in five internet users with cancer
     • A growing number of cancer patients share online
       – their personal stories regarding their symptoms, treatments, emotional and physical concerns, and many other issues arising throughout the cancer diagnosis, treatment, and recovery phases.
     • Promising potential of knowledge discovery via analyzing user-generated content in online cancer communities
  5. 5. Why Digital Epidemiology
     • Advantage
       – Fully automated process for dynamic monitoring and continuous discovery
     • Major Challenge
       – Variable amount of detail and quality of information provided by each subject.
     • Big BUT Dirty Data
  6. 6. Important Requirements
     • Computational efficiency for data collection
     • Scalability of information extraction algorithms
  7. 7. Case Studies
     • Cancer Surveillance
     • Parity and Cancer Risk
     • Air Quality and Lung Cancer Risk
     • Mobility and Lung Cancer Risk
  8. 8. Case Study 1: US Cancer Mortality Trends (2008-2012)
  9. 9. Case-Control Study Using Online Obituaries
  10. 10. Information Retrieval – Age / Gender / Cause of Death
  11. 11. Data Collection
      [Diagram: Websites (e.g. online US newspapers) → Web Crawler → Parser → Age, Gender, Cause of Death]
  12. 12. Intelligent Web Crawling Technologies
      • A self-supervised, adaptive crawler using a utility function based on supervised learning
      • Acquires online content matching the user’s needs without a predefined topic ontology or the manual effort of composing explicit search queries
      • Balances the time cost between repeatedly training the utility predictor and crawling the web
      S. Xu, H.J. Yoon, G.D. Tourassi. “A user-oriented web crawler for selectively acquiring online content in e-health research.” Bioinformatics 30.1 (2014): 104-114.
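
The utility-driven idea on this slide can be illustrated with a best-first crawl frontier. This is only a minimal sketch and not the algorithm from the cited paper: the `utility` function here is a fixed, hypothetical scorer (the paper describes a predictor that is retrained while crawling), and `get_links` is a hypothetical callback standing in for fetching and parsing a page.

```python
import heapq
from typing import Callable, Iterable, List


def best_first_crawl(
    seeds: Iterable[str],
    get_links: Callable[[str], Iterable[str]],  # hypothetical: outgoing links of a page
    utility: Callable[[str], float],            # hypothetical: predicted usefulness of a URL
    budget: int,
) -> List[str]:
    """Visit at most `budget` pages, always expanding the highest-utility URL first."""
    seeds = list(seeds)
    # heapq is a min-heap, so utilities are negated to pop the best URL first.
    frontier = [(-utility(url), url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited: List[str] = []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-utility(link), link))
    return visited


# Toy in-memory "web" (made up for illustration only).
TOY_WEB = {
    "home": ["obits", "sports"],
    "obits": ["obit/1", "obit/2"],
    "sports": ["scores"],
}


def toy_utility(url: str) -> float:
    # Stand-in scorer: prefer URLs that look like obituary pages.
    return 1.0 if "obit" in url else 0.1


order = best_first_crawl(["home"], lambda u: TOY_WEB.get(u, []), toy_utility, budget=4)
```

With a fixed budget, the frontier spends its page visits on the obituary section before touching lower-scored pages, which is the behavior a learned utility predictor is meant to produce.
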
  13. 13. Data Source: Obituaries
  14. 14. Results: Information Extraction
                 Age    Gender   Cause of Death
      Precision  0.94   0.98     0.86
      Recall     0.98   0.97     0.90
      F-score    0.96   0.98     0.88
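
As a quick check on the extraction results, the F-score can be recomputed from the reported precision and recall. The sketch below assumes the slide's F-score is the usual F1 (harmonic mean of precision and recall), which the slide does not state explicitly. Because the inputs are already rounded to two decimals, the recomputed values can differ from the slide in the last digit.

```python
def f_score(precision: float, recall: float) -> float:
    """F1 score: harmonic mean of precision and recall (assumed definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs as reported on the slide.
reported = {
    "Age": (0.94, 0.98),
    "Gender": (0.98, 0.97),
    "Cause of Death": (0.86, 0.90),
}

recomputed = {field: f_score(p, r) for field, (p, r) in reported.items()}
for field, value in recomputed.items():
    print(f"{field}: {value:.3f}")
```
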
  15. 15. Scaling Up to Leverage OLCF
      • Implementation of text understanding module based on CoreNLP and MPI for parallel computing
      • Due to the computational demands of the text parsing stage, the NLP platform used resources of the Oak Ridge Leadership Computing Facility, which is supported by the Office of Science, DOE.
      [Figure: throughput (number of pages per sec.) vs. number of processors (1 to 8), measured throughput against ideal]
      Number of pages to be processed:  5,000,000
      Estimated processing time:        60 seconds / page
      Required machine hours:           83,333 hours
      TITAN: Number of Nodes = 18,688, Number of Cores per Node = 16, Total Number of Cores = 18,688 * 16 = 299,008
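
The machine-hour estimate on this slide follows directly from the page count and the per-page time. The sketch below reproduces that arithmetic and adds a best-case wall-clock figure under a strong assumption that is not on the slide: perfectly linear scaling with one page per core and no communication or I/O overhead.

```python
PAGES = 5_000_000          # number of pages to be processed (from the slide)
SECONDS_PER_PAGE = 60      # estimated processing time per page (from the slide)
NODES = 18_688             # TITAN nodes (from the slide)
CORES_PER_NODE = 16        # cores per node (from the slide)


def machine_hours(pages: int, seconds_per_page: float) -> float:
    """Total single-core compute time in hours."""
    return pages * seconds_per_page / 3600


def ideal_wall_clock_hours(total_hours: float, cores: int) -> float:
    """Best-case elapsed time, ASSUMING perfectly linear scaling across cores."""
    return total_hours / cores


total_cores = NODES * CORES_PER_NODE
hours = machine_hours(PAGES, SECONDS_PER_PAGE)
ideal_hours = ideal_wall_clock_hours(hours, total_cores)

print(f"machine hours: {hours:,.0f}")
print(f"total cores: {total_cores:,}")
print(f"ideal wall-clock time: {ideal_hours * 60:.1f} minutes")
```

Real throughput would fall short of this bound, as the measured-versus-ideal throughput figure on the slide suggests.
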
  16. 16. Distribution of Cancer Deaths by Age
      Correlation between number of cancer obituaries vs. SEER mortality
      • Breast Cancer: Correlation: 0.981
      • Lung Cancer: Correlation: 0.994
      http://seer.cancer.gov/csr/1975_2012/results_single/sect_01_table.13_2pgs.pdf
  17. 17. Distribution of Cancer Deaths by State
      • Breast Cancer: Correlation: 0.939
      • Lung Cancer: Correlation: 0.881
  18. 18. Distribution of Cancer Deaths by Year
      • Breast Cancer: Correlation: 0.611
      • Lung Cancer: Correlation: 0.839 (all), 0.673 (males), 0.455 (females)
  19. 19. Use Case 2: Parity and Cancer Risk
      • breast • lung • colon • ovarian • pancreatic
  20. 20. Traditional vs. In Silico Studies
      [Diagram labels: Disease, Population, Exposure]
  21. 21. Data Collection
      [Diagram: Websites (e.g. online US newspapers) → Web Crawler → Parser → Age, Gender, Offspring, Cause of Death]
  22. 22. Information Retrieval – Offspring
  23. 23. Results: Information Extraction
                 Age    Gender   Offspring   Cause of Death
      Precision  0.94   0.98     0.95        0.86
      Recall     0.98   0.97     0.93        0.90
      F-score    0.96   0.98     0.94        0.88
  24. 24. Collected Data: 51,911 cases & 27,483 controls
      Breast   Lung    Colon   Ovarian   Pancreatic   Control
      27,330   9,470   2,273   6,342     6,496        27,483
      [Figure: Age-Adjusted ORs with 95% CIs]
  25. 25. Summary
      • The positive association between increased parity and lower cancer risk was significant for all cancers.
      • All linear trend tests were statistically significant.
      • The trends were:
        – more pronounced for breast (χ² = 301.60, p < 2.20e-16) & ovarian (χ² = 121.45, p < 2.20e-16) cancers;
        – less pronounced for pancreatic cancer (χ² = 38.95, p = 4.35e-10);
        – least pronounced for colon (χ² = 24.69, p = 6.75e-07) and lung cancer (χ² = 21.33, p = 3.87e-06)
      • Limitation of obituaries:
        – Cannot derive effect of additional factors (e.g., age at first pregnancy, breastfeeding, lifestyle choices)
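
The p-values in this summary can be recovered from the χ² statistics. The sketch below assumes each trend test has one degree of freedom; the slide does not state this, but the reported p-values are consistent with it. For one degree of freedom, the upper-tail probability has the closed form erfc(sqrt(x / 2)), so only the standard library is needed.

```python
import math


def chi2_p_value_1df(statistic: float) -> float:
    """Upper-tail p-value of a chi-squared statistic with 1 degree of freedom (assumed)."""
    return math.erfc(math.sqrt(statistic / 2.0))


# Chi-squared statistics as reported on the slide.
reported_chi2 = {
    "breast": 301.60,
    "ovarian": 121.45,
    "pancreatic": 38.95,
    "colon": 24.69,
    "lung": 21.33,
}

p_values = {site: chi2_p_value_1df(x) for site, x in reported_chi2.items()}
for site, p in p_values.items():
    print(f"{site}: p = {p:.3g}")
```

The value 2.20e-16 reported for breast and ovarian cancer is the floor that some statistical packages print for very small p-values, so the recomputed numbers for those two sites are simply far below it.
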
  26. 26. Use Case 3: Lung Cancer Risk and Air Quality
  27. 27. General Approach • Instead of studying exposure profiles of individuals • Study exposure profiles of geographical regions
  28. 28. Exposure Element
      • Particulate Matter – small particles
        – PM10 < 10 micrometers in diameter
        – PM2.5 < 2.5 micrometers in diameter
        – IARC Group 1 carcinogen
      • STUDY OBJECTIVE: Can we predict the geographical variation of lung cancer incidence by examining the spatiotemporal trend of particulate matter air pollution levels?
  29. 29. Data Sources
      • US EPA monitoring PM exposures since 1982
        – Hourly PM10 exposures from 3,356 distinct sites
        – Hourly PM2.5 exposures from 2,398 distinct sites
      • Average exposure, US county level
        – Inverse Distance Weighting interpolation
        – Monitoring stations within 50 miles of county’s geolocation center
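
The county-level averaging described above can be sketched as inverse distance weighting over stations within 50 miles of a county center. This is a minimal illustration under stated assumptions: the weighting exponent (`power=2`), the haversine great-circle distance, and the function names are all assumptions, since the slide gives only the method name and the 50-mile cutoff.

```python
import math
from typing import List, Optional, Tuple

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two (lat, lon) points in miles."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def idw_exposure(
    center: Tuple[float, float],
    stations: List[Tuple[float, float, float]],  # (lat, lon, measured value)
    max_miles: float = 50.0,                     # cutoff from the slide
    power: float = 2.0,                          # ASSUMED exponent, not from the slide
) -> Optional[float]:
    """Inverse-distance-weighted average of station values near `center`.

    Returns None when no station lies within `max_miles`.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for lat, lon, value in stations:
        distance = haversine_miles(center[0], center[1], lat, lon)
        if distance > max_miles:
            continue
        if distance == 0.0:
            return value  # a station exactly at the center dominates
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total if weight_total > 0 else None
```

A county with no station inside the cutoff yields `None` here; how such counties were actually handled is not stated on the slide.
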
  30. 30. Data Sources
      • NCI State Cancer Profiles
        – 5-year averaged (2008-2012) age-adjusted cancer incidence
        – National average = 63.7 cases (per 100,000)
        – US county level precision
      • Determine high and low risk counties
        – High risk counties: Risk, 95% confidence > 63.7 cases
        – Low risk counties: Risk, 95% confidence < 63.7 cases
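
One plausible reading of the high/low risk rule is that a county is labeled only when its entire 95% confidence interval lies on one side of the national average. The sketch below encodes that interpretation; it is an assumption about what "Risk, 95% confidence > 63.7" means, and the function name is hypothetical.

```python
from typing import Optional

NATIONAL_AVERAGE = 63.7  # cases per 100,000 (from the slide)


def label_county(ci_lower: float, ci_upper: float,
                 reference: float = NATIONAL_AVERAGE) -> Optional[str]:
    """Label a county by where its 95% CI sits relative to the reference rate.

    ASSUMED rule: 'high' if the whole interval is above the reference,
    'low' if the whole interval is below it, otherwise unlabeled (None).
    """
    if ci_lower > reference:
        return "high"
    if ci_upper < reference:
        return "low"
    return None
```

Under this reading, counties whose interval straddles 63.7 fall into neither class and would be left out of the classification task.
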
  31. 31. Shapelet Analysis
      • Shapelets – time series analysis
        – Comprehensive way to explore local shape similarity
        – (12 monthly measurements × 10 years) = 120
      [Figure: Anderson County, TN, average exposure by date (Jan-98 to Dec-99), y-axis 0 to 30, with the PM2.5 shapelet]
  32. 32. Shapelet Analysis
      • Collect maximally informative shapelets Sopt
        – With respect to level of support of candidate shapelets
      • Classification
        – Feature vector: existence or absence of Sopt
      • Prediction
        – Observing air pollution data from 1998 to 2007
        – Classifying high/low risk of lung cancer incidence
      [Diagram: PM Exposure Profile in 1997-2008 → Predictive Model → High or Low Lung Cancer Incidence in 2008-2012]
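
The "existence or absence" feature vector can be sketched with the common shapelet distance: the minimum Euclidean distance between a shapelet and any same-length window of the series. This is an illustrative sketch, not the authors' implementation: it omits normalization and shapelet selection, and the per-shapelet distance thresholds are hypothetical parameters.

```python
import math
from typing import List, Sequence


def shapelet_distance(series: Sequence[float], shapelet: Sequence[float]) -> float:
    """Minimum Euclidean distance between `shapelet` and any window of `series`."""
    m = len(shapelet)
    if m == 0 or m > len(series):
        raise ValueError("shapelet must be non-empty and no longer than the series")
    best = math.inf
    for start in range(len(series) - m + 1):
        window = series[start:start + m]
        dist = math.sqrt(sum((w - s) ** 2 for w, s in zip(window, shapelet)))
        best = min(best, dist)
    return best


def shapelet_features(
    series: Sequence[float],
    shapelets: Sequence[Sequence[float]],
    thresholds: Sequence[float],  # hypothetical per-shapelet distance cutoffs
) -> List[int]:
    """Binary feature vector: 1 if a shapelet is present in the series, else 0."""
    return [
        1 if shapelet_distance(series, shapelet) <= threshold else 0
        for shapelet, threshold in zip(shapelets, thresholds)
    ]
```

For the 120-point monthly series described in the slides, each county would yield one such binary vector, which a standard classifier could then map to high or low risk.
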
  33. 33. Results
      • Classification Accuracy
      • AUC (average PM10) = 0.596
      • AUC (average PM2.5) = 0.729
  34. 34. Results: PM10 shapelets (M = mean, O = odds ratio)
      High Risk
        M=21.24   O=6.64 (3.38, 11.36)
        M=22.21   O=5.64 (3.52, 9.50)
        M=21.15   O=4.44 (2.91, 6.78)
      Low Risk
        M=23.10   O=0.30 (0.20, 0.46)
        M=20.44   O=0.35 (0.25, 0.51)
        M=18.24   O=0.36 (0.25, 0.51)
  35. 35. Results: PM2.5 shapelets (M = mean, O = odds ratio)
      High Risk
        M=13.67   O=11.58 (8.72, 15.39)
        M=13.86   O=11.57 (8.72, 15.34)
        M=14.30   O=11.15 (8.39, 14.80)
      Low Risk
        M=1.34    O=0.07 (0.04, 0.10)
        M=4.76    O=0.07 (0.05, 0.11)
        M=6.38    O=0.08 (0.05, 0.12)
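
Odds ratios with an interval in parentheses, as in the shapelet results above, are commonly computed from a 2×2 table with a Wald interval on the log scale. The sketch below shows that standard calculation; whether the slides used this exact interval is an assumption, and the counts in the example are hypothetical, not data from the study.

```python
import math
from typing import Tuple


def odds_ratio_ci(a: int, b: int, c: int, d: int, z: float = 1.96) -> Tuple[float, float, float]:
    """Odds ratio and Wald confidence interval from a 2x2 table.

    a: high-risk counties with the shapelet    b: high-risk counties without it
    c: low-risk counties with the shapelet     d: low-risk counties without it
    All four counts must be positive. z=1.96 gives an approximate 95% interval.
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all cell counts must be positive")
    odds_ratio = (a * d) / (b * c)
    standard_error = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * standard_error)
    upper = math.exp(math.log(odds_ratio) + z * standard_error)
    return odds_ratio, lower, upper


# Hypothetical counts, for illustration only.
example_or, example_lower, example_upper = odds_ratio_ci(40, 10, 20, 30)
print(f"OR = {example_or:.2f} ({example_lower:.2f}, {example_upper:.2f})")
```

An interval that excludes 1, as every interval on these two slides does, indicates an association between the shapelet and the risk class at that confidence level.
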
  36. 36. Summary
      • Experimental results confirmed that prolonged high exposure to PM adversely influences lung cancer risk
      • Individual shapelets suggest an association between high lung cancer risk and highly fluctuating PM exposures
      • Future Work: apply time series of multiple channels of various environmental, lifestyle, and socio-economic factors that may be associated with cancer risk
  37. 37. Use Case 4: Lung Cancer Risk and Mobility
  38. 38. Mobility & Cancer
      • Frequent relocation has been linked to health decline, particularly with respect to emotional and psychological wellbeing.
      • Is there a similar relationship with cancer?
      • Hypothesis: The prevalence of lung cancer is higher in populations with more frequent relocations.
      • Methods: In silico case-control observational study using cyber-informatics
  39. 39. Data Collection
      • Augmented LinkedIn Profiles
        – 1,458 Lung Cancer patients in open forums with matching LinkedIn profiles
        – 14,886 Cancer-Free LinkedIn subjects
        – TOTAL: 16,344 subjects
  40. 40. Results
      Average Number of Relocations:
      • Cancer: 3.49
      • Non-Cancer: 2.65
  41. 41. Results
      [Figure: Effect of Age Threshold for Major Relocations]
      [Figure: Effect of Distance Threshold for Major Relocations]
  42. 42. Conclusions
      Creative (Re)Use of (Unexpected) Big Data:
      • Bigger can be better even if unusual, noisy or sparse
      • Cost-effective way for epidemiological discovery and hypothesis generation in cancer research
      • Must be mindful of potential sources of selection and sampling biases
      • Need advanced web mining and text mining tools
      • Need workflows for computational efficiency and scalability
  43. 43. Thank you! • Hong-Jun Yoon, PhD – ORNL • Songhua Xu, PhD – NJIT • National Cancer Institute at the National Institutes of Health (Grant No. 1R01-CA170508) • tourassig@ornl.gov
