Denise Esserman MedicReS World Congress 2015


The Future of Frequentist Hypothesis Testing. Presentation to the MedicReS 5th World Congress, October 19-25, 2015, New York, by Denise Esserman, PhD.

Published in: Data & Analytics

  1. The Future of Frequentist Hypothesis Testing
     Denise Esserman, PhD
     Yale Center for Analytical Sciences (YCAS), Yale School of Public Health
     October 19-25 | 2015 New York
  2. Outline
     • Definition of Big Data
     • Expanding into Medicine
     • Statistical Concerns
     • The Future
  3. “Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks that everyone else is doing it, so everyone claims they are doing it…”
     Dan Ariely, 2013 [Terry Speed talk, 2014]
  4. What are/is “Big Data”?
     • “Buzzword”
     • Multiple Definitions
       – The V’s: volume, velocity, variability, variety, veracity, value (and complexity)
     • Variability in the quantity and quality of the data
  5. Origin
     • Believed to have originated with Web search companies
       – Querying very large distributed aggregations of loosely-structured data [1]
     • Does not always refer to the volume of data
  6. Big Data Trends
     • Google Trends
  7. “The hopeful vision of big data is that organizations will be able to harvest and harness every byte of relevant data and use it to make the best decisions. Big data technologies not only support the ability to collect large amounts, but more importantly, the ability to understand and take advantage of its full value.”
     Mark Troester [14]
  8. Big Data in Medicine
     • “Potential lies in innovative ways it can be linked, related, and integrated to provide more detailed and personalized information than is possible with data from a single source” [7]
     • Allows health care providers to offer personalized medicine
  9. NIH – Big Data to Knowledge [6]
     “The ability to harvest the wealth of information contained in biomedical Big Data will advance our understanding of human health and disease; however, lack of appropriate tools, poor data accessibility, and insufficient training, are major impediments to rapid translational impact. To meet this challenge, the National Institutes of Health (NIH) launched the Big Data to Knowledge (BD2K) initiative in 2012.”
  10. BD2K Mission Statement
      “BD2K is a trans-NIH initiative established to enable biomedical research as a digital research enterprise, to facilitate discovery and support new knowledge, and to maximize community engagement.”
  11. BD2K Four Major Aims
      • To facilitate broad use of biomedical digital assets by making them discoverable, accessible, and citable.
      • To conduct research and develop the methods, software, and tools needed to analyze biomedical Big Data.
      • To enhance training in the development and use of methods and tools necessary for biomedical Big Data science.
      • To support a data ecosystem that accelerates discovery as part of a digital enterprise.
  12. Biomedical Big Data [6]
      • More than just very large data or a large number of data sources.
        – Complexity, challenges, and new opportunities presented by the combined analysis of data.
      • Diverse and complex.
        – Imaging, phenotypic, molecular, exposure, health, behavioral, and many other types of data.
  13. • Faces many challenges.
        – Unwieldy amount of information
        – Lack of organization and access to data and tools
        – Insufficient training in data science methods
      • Spectacular opportunities.
        – Maximize the potential of existing data and enable new directions for research.
  14. Quantity of data does not mean one can ignore foundational issues of measurement, construct validity and reliability, and dependencies among data [5]
  15. Google Flu Trends [2,3]
      • Machine Learning Algorithm
        – Predict number of flu cases based on Google Search Terms
        – Theory Free
        – Misunderstanding about uncertainties in data collection and modeling process
      • Inaccurate results over time
        – Lack of statistics: did not know what linked search terms to spread of flu
  16. Google Flu Trends (cont)
      • Correlation rather than causation
        – Cheaper and easier
      • Theory-free analysis is fragile
      • Intended as a “complementary signal” rather than a stand-alone forecasting tool [4]
  17. “There are a lot of small data problems that occur in big data. They don’t disappear because you’ve got lots of the stuff. They get worse.”
      David Spiegelhalter
      “Big data is like a big trash dump. You have to know how to find the nuggets so it’s profitable.”
      Vin Gupta
  18. Indexing vs. Analyzing Big Data
      • Search companies index it
        – Make relevant data easy to use
      • Statisticians analyze it
        – Find structure within the data
  19. Data Scientists [7]
      • People who draw insights from large quantities of data
        – Innovative problem solvers
        – Expertise in statistical modeling and machine learning
        – Specialized programming skills
        – Solid grasp of problem domain
      • Data science is a blend of statistical, mathematical, and computational sciences
  20. Statistical Disconnect
      • Statisticians should be leaders of the Big Data and data science movement [8]
        – Scope goes beyond traditional activities
      • Statisticians need to be more engaged
        – Need to develop the skills to handle the sheer volume of data
      • Data scientists need to engage more statisticians (or more statisticians need to become data scientists)
  21. “Better data matters because simply having Big Data does not guarantee reliable answers for Big Questions.”
      Robert Rodriguez
  22. Some of the Statistical Concerns
      • Sampling populations
        – Sampling error: not representative
      • Confounders
      • Multiple Testing
      • Bias
        – Sampling bias: not randomly chosen
      • Overfitting
  23. Big n problem
      • Myth: with large n, the problem is only computational in nature, not statistical
      • Standard errors can be large even with Big Data
        – Issue with large p
  24. • Scale of the data requires spreading it across a cluster or grid of computers
      • Computational work needs to be distributed along with the data
        – Google MapReduce model for parallel programming (Apache Hadoop)
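
As a rough illustration of the MapReduce pattern named on this slide, the sketch below computes a per-key mean in plain Python. It runs in a single process and does not use Hadoop. The record layout (`site`, `value`), the sample records, and the function names `map_phase`, `reduce_phase`, and `mean_by_key` are all assumptions made for illustration.

```python
from collections import defaultdict
from functools import reduce


def map_phase(chunk):
    """Map step: summarize one chunk of (site, value) records as {site: (sum, count)}."""
    partial = defaultdict(lambda: (0.0, 0))
    for site, value in chunk:
        s, n = partial[site]
        partial[site] = (s + value, n + 1)
    return dict(partial)


def reduce_phase(left, right):
    """Reduce step: merge two partial summaries by adding sums and counts per key."""
    merged = dict(left)
    for site, (s, n) in right.items():
        s0, n0 = merged.get(site, (0.0, 0))
        merged[site] = (s0 + s, n0 + n)
    return merged


def mean_by_key(chunks):
    """Run the map step on every chunk, fold the results together, then divide."""
    totals = reduce(reduce_phase, map(map_phase, chunks), {})
    return {site: s / n for site, (s, n) in totals.items()}


# Hypothetical records, pre-split into three chunks as if stored on three nodes.
chunks = [
    [("A", 120.0), ("B", 135.0)],
    [("A", 130.0), ("B", 145.0)],
    [("A", 125.0)],
]
print(mean_by_key(chunks))
```

Each chunk carries sums and counts rather than means, because partial means cannot be merged correctly when chunks differ in size. On a real cluster the map step would run on the node that holds each chunk.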
  25. Software Alchemy (SA)
      • Simple, powerful method to reduce computation
      • Partition the data into r groups and calculate the average of the estimator across groups
        – Requires partitions to be distributed similarly
          • May need initial “shuffle” step
      • Works well for any asymptotically normal estimator
        – Also works for p growing
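
The partition-and-average idea on this slide can be sketched as follows, using a simple regression slope as the estimator. This is a minimal sketch on simulated data, not the method's reference implementation. The names `ols_slope` and `chunk_average`, the true slope of 2.0, and the choice of r = 10 are assumptions for illustration.

```python
import random
import statistics


def ols_slope(xs, ys):
    """Least-squares slope of y on x for one chunk of data."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def chunk_average(xs, ys, r, seed=0):
    """Shuffle, split into r chunks, estimate on each chunk, and average the estimates."""
    rows = list(zip(xs, ys))
    random.Random(seed).shuffle(rows)        # initial "shuffle" step
    chunks = [rows[i::r] for i in range(r)]  # r roughly equal-sized chunks
    slopes = [ols_slope(*zip(*chunk)) for chunk in chunks]
    return statistics.fmean(slopes)


# Simulated data with a true slope of 2.0 (assumed for illustration).
rng = random.Random(42)
x = [rng.gauss(0, 1) for _ in range(20000)]
y = [2.0 * xi + rng.gauss(0, 1) for xi in x]

full = ols_slope(x, y)                # estimator on the full data set
averaged = chunk_average(x, y, r=10)  # average of 10 per-chunk estimators
print(full, averaged)
```

The per-chunk estimates are independent of one another, so in practice each chunk could be processed on a separate core or node before the final averaging step.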
  26. Big p problem
      • High-dimensional data
      • Dimension reduction
        – e.g. Principal components analysis, variable selection
      • Issues:
        – Multiple comparisons and simultaneous inference
        – Sparsity of data
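
To make the multiple-comparisons issue concrete, the sketch below applies the Benjamini-Hochberg false discovery rate procedure to a small set of p-values. The slide does not name a specific correction, so the choice of this procedure, the function name `benjamini_hochberg`, and the p-values are assumptions for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return booleans marking which hypotheses are rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value satisfies p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values from ten tests.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

naive = sum(pv <= 0.05 for pv in p)      # tests "significant" with no correction
adjusted = sum(benjamini_hochberg(p))    # tests that survive the correction
print(naive, adjusted)
```

With these values, five tests fall below 0.05 without correction, while only the two smallest survive the procedure.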
  27. In the future…
      • Can we find methods that allow larger values of p than “safe” o(√n)?
        – Dimension reduction may distort results
      • Can we more easily verify technical assumptions?
        – e.g. potential lack of consistency of the LASSO
      • Can we find more general methods?
  28. Theoretical Null Distributions
      • In classic hypothesis testing, the null distribution is most often hypothesized rather than estimated
      • An incorrect null might lead to false inference
      • Influences
        – Correlation
        – Incorrect assumptions
        – Unobserved covariates
  29. Empirical Null Distribution
      • Empirical null estimated from the study’s data
      • Does not assume “nice” normal with variance going to 0
      • Needs the independence assumption, but not the identically-distributed assumption
        – Can have heterogeneous groups
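
One way to see the difference between a theoretical and an empirical null is to estimate the null's center and spread from the bulk of the observed z-scores. The sketch below uses the median and a scaled median absolute deviation, a deliberately simple stand-in for the fitting methods used in the empirical-null literature. The function name `empirical_null` and the simulated mixture are assumptions for illustration.

```python
import random
import statistics


def empirical_null(z_scores):
    """Estimate the null center and spread from the bulk of the z-scores."""
    center = statistics.median(z_scores)
    mad = statistics.median(abs(z - center) for z in z_scores)
    scale = 1.4826 * mad  # rescales the MAD to match the SD under normality
    return center, scale


# Simulated z-scores: 9500 null cases that are wider than N(0, 1), as might
# happen with correlation or unobserved covariates, plus 500 true effects.
rng = random.Random(7)
z = [rng.gauss(0.1, 1.3) for _ in range(9500)] + [rng.gauss(4.0, 1.0) for _ in range(500)]

center, scale = empirical_null(z)

# Count cases flagged at the two-sided 5% level under each null.
flagged_theoretical = sum(abs(zi) > 1.96 for zi in z)
flagged_empirical = sum(abs((zi - center) / scale) > 1.96 for zi in z)
print(center, scale, flagged_theoretical, flagged_empirical)
```

Because the simulated null is wider than the standard normal, the theoretical null flags many more cases than the empirical one.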
  30. Big-data Clinical Trials (BCT) [10]
      • Neglected problem in RCTs: analysis is typically based on the differing effectiveness of the interventions provided at baseline
      • Want to be able to analyze the association between baseline treatment and its subsequent dynamic processes
  31. Example: Blood Pressure
      • RCT:
        – Effectiveness of antihypertensive on BP control
        – BP measured at specified outcomes
        – Long term outcomes (e.g. stroke)
      • BCT:
        – Maintenance of stable BP every day, hour, minute
        – “Dosing” equipment
  32. Epidemiologic Perspective
      • Big data is useful to detect rare drug-related side effects, not likely to be observed in RCT
      • Chance to look at rare diseases
      • Can contribute to an understanding of the strengths and limitations of new population sources [12]
  33. Future of BCT [10]
      • How will this be defined?
      • What is the “right” data to collect?
      • Who is in a position to design this trial?
      • How do we handle threats of big data?
      • How do we incorporate the non-static populations?
  34. Moving Beyond Rectangular Data
      • Structure of the data may be irregular, non-structured
        – Knowledge will change over time [11]
      • Varying types of data
        – Pictures, videos, images
        – Unstructured and semi-structured from social media
  35. Machine Learning
      “…a subfield of computer science that evolved from the study of pattern recognition and computational learning theory in artificial intelligence. Machine learning explores the study and construction of algorithms that can learn from and make predictions on data. Such algorithms operate by building a model from example inputs in order to make data-driven predictions or decisions, rather than following strictly static program instructions.” [9]
  36. Machine Learning Tasks
      • Supervised learning
        – Example inputs and outputs: learn a rule that maps inputs to outputs
      • Semi-supervised learning
        – Incomplete training signal (i.e. target outputs missing)
      • Unsupervised learning
        – Left on its own to find structure in the input
      • Reinforcement learning
        – Interaction with dynamic environment
  37. Example: Support Vector Machines
      • Supervised learning method
      • Used for classification and regression analysis
      • Can perform non-linear classification
  38. Challenges with Machine Learning
      • Need to choose the appropriate algorithm
      • Need to be able to define hyper-parameters
        – One or more model parameters
      • Data needs to be in appropriate format
        – This is not trivial
      • Execution time grows with number of attributes and data instances
  39. Future of Machine Learning
      • Automatic searches for optimal algorithm and hyper-parameters
        – Very time consuming, limited usefulness at present
      • More user friendly approaches
        – e.g. allow healthcare researcher to efficiently and independently build predictive model [13]
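
A hyper-parameter search of the kind mentioned on this slide can be sketched as a grid search with cross-validation. The example below tunes the number of neighbors k in a one-dimensional k-nearest-neighbor classifier on simulated data. The classifier, the candidate grid, and the names `knn_predict`, `cv_accuracy`, and `grid_search` are assumptions for illustration, not taken from any particular tool.

```python
import random


def knn_predict(train, x, k):
    """Classify x by majority vote among its k nearest training points (1-D feature)."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0


def cv_accuracy(data, k, folds=5):
    """Mean accuracy of k-NN across `folds` cross-validation folds."""
    scores = []
    for f in range(folds):
        test = data[f::folds]
        train = [row for i, row in enumerate(data) if i % folds != f]
        correct = sum(knn_predict(train, x, k) == y for x, y in test)
        scores.append(correct / len(test))
    return sum(scores) / folds


def grid_search(data, candidates):
    """Return the candidate k with the best cross-validated accuracy, plus all scores."""
    scores = {k: cv_accuracy(data, k) for k in candidates}
    return max(scores, key=scores.get), scores


# Simulated two-class data: class 0 centered at 0, class 1 centered at 2.
rng = random.Random(1)
data = [(rng.gauss(0, 1), 0) for _ in range(100)] + [(rng.gauss(2, 1), 1) for _ in range(100)]
rng.shuffle(data)

best_k, scores = grid_search(data, [1, 5, 15])  # odd values of k avoid tied votes
print(best_k, scores)
```

Every candidate requires a full round of cross-validation, so the cost grows with the size of the grid and the number of folds, which is one reason such searches are time consuming on large clinical data sets.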
  40. Example: Machine Learning for Big Clinical Data (MLBCD) [13]
      • Supports whole process of iterative machine learning on big clinical data
        – Clinical parameter extraction
        – Feature construction
        – Algorithm and hyper-parameter selection
        – Model building
        – Model evaluation
      • Can be used once the researcher has (1) defined the study population and research question, (2) obtained the clinical data set, and (3) prepped the data, including cleaning and filling in missing data
  41. Data Set Preparation
      • Tremendous amount of work goes into getting a data set together
        – Acquire, normalize, clean
          • e.g. Pivoting entity-attribute-value format EMR to relational table formats [13]

      Entity-attribute-value format:

      ID (Entity)   Test # (Attribute)   Pulse (Value)
      100100        Test 1               98
      989021        Test 2               101
      100100        Test 2               75
      989021        Test 3               99
      989021        Test 4               88

      Relational table format:

      ID       Test_1   Test_2   Test_3
      100100   98       75       null
      989021   null     101      99
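
The pivot from entity-attribute-value rows to a relational layout shown on this slide can be sketched with plain dictionaries. The rows are the ones on the slide. The function name `pivot_eav` and the `Test_n` column labels are assumptions for illustration, and only the three columns shown in the slide's output table are requested.

```python
def pivot_eav(rows, attributes):
    """Pivot (entity, attribute, value) rows into one record per entity.

    Attributes with no value for an entity are filled with None
    (shown as "null" on the slide). Attributes not listed are dropped.
    """
    table = {}
    for entity, attribute, value in rows:
        record = table.setdefault(entity, {a: None for a in attributes})
        if attribute in record:
            record[attribute] = value
    return table


# Rows from the slide: (ID, test number, pulse).
eav_rows = [
    (100100, "Test_1", 98),
    (989021, "Test_2", 101),
    (100100, "Test_2", 75),
    (989021, "Test_3", 99),
    (989021, "Test_4", 88),
]

wide = pivot_eav(eav_rows, ["Test_1", "Test_2", "Test_3"])
for entity, record in wide.items():
    print(entity, record)
```

Passing the column list explicitly keeps the output schema fixed even when some patients have tests that others lack, at the cost of dropping any attribute that is not requested (here, Test 4).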
  42. • Even bigger challenge with large data sets
        – Different formats
        – Different locations
        – Data quality and governance
        – Security, privacy and regulatory challenges
      • Surprisingly little work done here, and it should be a priority!
  43. Example: Fitbit
      • Study of Cardiac Risk
      • Use Fitbit to measure daily step counts for 3 years (4000 participants)
      • Participants need to upload to laptop/phone and then sync to Fitbit server
      • Device holds one month of data
      • Need to then connect with other study data
  44. Where should the field head?
      • Need new techniques for data management
      • New tools for data analysis
      • New tools for data visualization
      • Ways to acquire and analyze unstructured text data
  45. “Big data is not about the technologies to store massive amounts of data, it is about creating a flexible infrastructure with high-performance computing, high-performance analytics and governance – in a deployment model that makes sense for the organization.” [14]
      Mark Troester
  46. References
      1. (accessed October 15, 2015)
      2. (accessed October 15, 2015)
      3. (accessed October 15, 2015)
      4. (accessed October 16, 2015)
      5. Lazer D, Kennedy R, King G, Vespignani A. The Parable of Google Flu: Traps in Big Data Analysis. Science, 343: 1203-1205, (2014)
      6. (accessed October 15, 2015)
      7. (accessed October 17, 2015)
      8. (accessed October 18, 2015)
      9. (accessed October 18, 2015)
      10. Wang SD. Opportunities and challenges of clinical research in the big-data era: from RCT to BCT. J Thorac Dis, 5(6): 721-723, (2013)
      11. Wang SD, Shen Y. Redefining big-data clinical trial (BCT). Annals of Translational Medicine, 2(10): 96, (2014)
      12. Gange SJ, Golub ET. From smallpox to big data: The Next 100 years of epidemiologic methods. American Journal of Epidemiology. DOI: 10.1093/aje/kwv150
      13. Luo G. MLBCD: a machine learning tool for big clinical data. Health Inf Sci Syst, 3: 3, (2015)
      14. Big Data Meets Big Data Analytics. SAS White Paper