
Math, Stats and CS in Public Health and Medical Research



or, What are Biostatistics and Bioinformatics?

Published in: Data & Analytics


  1. 1. Jessica Minnier, OHSU, Lewis & Clark College Mathematics Colloquium, 3.19.14 Math, Stats and CS in Public Health and Medical Research
  2. 2. or, “What is Biostatistics?” ¨  “Biostatistics (a portmanteau of biology and statistics; sometimes referred to as biometry or biometrics) is the application of statistics to a wide range of topics in biology.” – Wikipedia ¨  “Bioinformatics is an interdisciplinary scientific field that develops methods for storing, retrieving, organizing and analyzing biological data” – Wikipedia ¨  “Computational biology involves the development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.” – Wikipedia
  3. 3. Sample (n = 1) ¨  L&C mathematics major (2007), CS minor ¨  PhD in Biostatistics (2007-2012) ¤  “Inference and Prediction for High Dimensional Data via Penalized Regression and Kernel Machine Methods” ¨  Postdoc (2012-2013) ¤  Cancer risk prediction with gene-environment interactions ¨  Assistant Professor (2013-now) ¤  Division of Biostatistics ¤  Department of Public Health & Preventive Medicine ¤  School of Medicine (soon to be School of Public Health) ¤  Oregon Health & Science University
  4. 4. Outline ¨  Biostatistics and Bioinformatics/Computational Biology ¤  More interesting definitions, research examples, case studies ¤  Types of careers ¨  My trajectory ¤  LC math to grad school to jobs ¨  Resources and advice
  5. 5. Biostatistics, in the news. Comics from Jim Borgman and XKCD. In summary: A poor understanding of statistics makes everyone look bad.
  6. 6. Biostatistics, in the news Forbes
  7. 7. Biostatistics, in the news
  8. 8. Applied math? ¨  Applied mathematics often studies deterministic models (engineering and mechanics, population models, cryptography) ¨  Some questions can’t be solved by deterministic models, but a partial answer can be given with statistics ¤  Does smoking cause lung cancer? (inference from observational studies) ¤  Is it going to rain tomorrow? (stochastic model) ¤  Do statins lower cholesterol? (randomized trial) ¨  Source: Rafa Irizarry’s math major talk
  9. 9. Example data ¨  Collection of measurements from a sampled population ¨  Measurements of a lab experiment ¨  Medical images of subjects’ brains over time ¨  Results of a clinical trial ¨  Gene expression from different types of cultured tissue ¨  Simulated data modeling HIV progression ¨  Values from electronic medical records sampled retrospectively ¨  3 million genetic mutations from 20,000 subjects Brian Caffo’s MOOC: Biostatistics Bootcamp I, lecture 1
  10. 10. Inform medical decisions ¨  A large clinical trial by the Women’s Health Initiative was stopped early in 2002 because preliminary data showed that hormone replacement therapy (HRT) had a negative health impact. ¨  These data contradicted prior evidence on the efficacy of HRT for postmenopausal women. ¨  Ending the trial to prevent further harm was a statistical decision. Brian Caffo’s MOOC: Biostatistics Bootcamp I, lecture 1; JAMA 2002;288(3):321-333
  11. 11. Inform medical decisions ¨  Guidelines for mammogram screening are based on probabilities of false positives and false negatives, cost-benefit analyses, and survival analysis ¨  Analysis of adverse effects in a clinical trial determines drug safety, dosage, and affected subpopulations ¨  Even the general public must weigh risk when making their own medical decisions ¨  Experts cannot make decisions without data
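The false-positive reasoning behind screening guidelines can be sketched with Bayes' theorem. The prevalence, sensitivity, and specificity below are hypothetical round numbers chosen only for illustration; they are not taken from the talk or from any screening guideline.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Return P(disease | positive test) using Bayes' theorem."""
    true_positive = prevalence * sensitivity
    false_positive = (1.0 - prevalence) * (1.0 - specificity)
    return true_positive / (true_positive + false_positive)


# Hypothetical inputs: 1% prevalence, 90% sensitivity, 90% specificity.
ppv = positive_predictive_value(prevalence=0.01, sensitivity=0.90, specificity=0.90)
print(f"P(disease | positive screen) = {ppv:.3f}")
```

With a rare condition, most positive screens are false positives even when the test itself is fairly accurate, which is one reason guidelines weigh costs and benefits rather than test accuracy alone.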
  12. 12. Bioinformatics & Computational biology ¨  Sequencing the human genome (aligning, matching, searching) ¨  Algorithms for turning massive information from electronic medical records into useful predictors of disease progression ¨  Machine learning algorithms for risk prediction models with large and complex data (imaging, genetic) ¨  Analysis of networks (protein interactions, genetic pathways, social behavior influencing health outcomes) ¨  Simulation of complex data (methylation patterns in the genome)
  13. 13. Biomathematics ¨  Mathematical models to study infectious disease progression (in a population or in a body’s cells) ¨  Steady-state simulations of cancer cell growth ¨  Usually in joint biostatistics/biomathematics or applied mathematics departments, some epidemiology
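As a rough illustration of a population-level infectious disease model, here is a minimal sketch of the classic SIR (susceptible, infected, recovered) compartmental model, integrated with a simple forward-Euler step. The choice of SIR and every parameter value are assumptions made for this example, not something the talk specifies.

```python
def simulate_sir(beta, gamma, s0, i0, r0, days, dt=0.1):
    """Forward-Euler integration of the SIR equations.

    dS/dt = -beta*S*I/N,  dI/dt = beta*S*I/N - gamma*I,  dR/dt = gamma*I
    Returns a list of (time, S, I, R) tuples.
    """
    n = s0 + i0 + r0
    s, i, r = float(s0), float(i0), float(r0)
    history = [(0.0, s, i, r)]
    steps = int(round(days / dt))
    for k in range(1, steps + 1):
        new_infections = beta * s * i / n * dt
        new_recoveries = gamma * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((k * dt, s, i, r))
    return history


# Hypothetical parameters: transmission rate 0.3/day, recovery rate 0.1/day.
history = simulate_sir(beta=0.3, gamma=0.1, s0=990, i0=10, r0=0, days=160)
peak_time, _, peak_infected, _ = max(history, key=lambda row: row[2])
print(f"Infections peak near day {peak_time:.0f} with about {peak_infected:.0f} people infected")
```

A deterministic model like this gives one trajectory per parameter set; fitting `beta` and `gamma` to observed case counts is where the statistics comes in.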
  14. 14. Where do we work? (non-random sample = my classmates) ¨  Assistant professors: OHSU School of Medicine, UNC School of Medicine, UIUC Statistics Dept, University of New Mexico School of Medicine ¨  Consultant/Manager, Analysis Group ¨  Assistant Member, RAND Corporation ¤  Nonprofit global policy think tank ¨  Computational Biologist, Genentech ¨  Instructors: UPenn School of Medicine, Harvard School of Public Health ¨  Research Associate, Dana-Farber Cancer Institute ¨  Statistician, Partners Health Care ¨  Other possibilities: ¤  Government: National Institutes of Health, Food & Drug Administration, Centers for Disease Control and Prevention, WHO, health departments in foreign countries ¤  Google, Intel, etc. ¤  Liberal arts colleges or smaller universities focused on teaching ¤  Pharma, Consulting, Labs, Hospitals, Hospital Research Centers, Research Institutes, Universities
  15. 15. Real data, please? ¨  Two examples…
  16. 16. Case study 1: RNA-Seq Data ¨  RNA sequencing uses Next Generation Sequencing (NGS) to detect and quantify RNA in a biological sample at a given moment in time ¨  Studies the dynamic transcriptome of a cell ¨  The problem: How does gene expression differ between heart and brain tissue? Which genes are turned off in heart and on in brain?
  17. 17. Case study 1: RNA-Seq Data ¨  Step 1: Biologists collect samples, send to lab for sequencing ¨  Step 2: Genetic material is transformed into millions of ‘reads’ ¤  AACTAGACCTGG ¨  Step 3: The reads are mapped to the genome, transformed into counts for each gene ¨  Step 4: The distribution of gene counts for different tissues is compared
  18. 18. RNA-seq: Step 3 ¨  Step 3: The reads are mapped to the genome, transformed into counts for each gene ¨  Computational biologists developed fast searching algorithms to map a short read (likely containing errors) to a genome with billions of base pairs, much repetition, some variability (SNPs)
  19. 19. RNA-seq: Step 3 ¨  Bowtie (Langmead 2009 Genome Biology) uses an index based on the Burrows-Wheeler transform to shorten the mapping to less than a day (it used to take days, if not months) lecture_notes/bwt_and_fm_index.pdf ¨  TopHat (Trapnell 2009 Bioinformatics) can detect splice junctions, which matter because some genes code for multiple proteins via alternatively spliced mRNA
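To make the indexing idea concrete, here is a toy sketch of a Burrows-Wheeler transform with backward search for exact matches. It is not Bowtie's implementation: it builds the suffix array naively, stores a full occurrence table, and ignores sequencing errors. The function names and the short "genome" are invented for illustration.

```python
def bwt_index(text):
    """Build a toy BWT index for text (a '$' terminator is appended)."""
    text = text + "$"
    # Naive suffix array: fine for a toy string, far too slow for a real genome.
    suffix_array = sorted(range(len(text)), key=lambda k: text[k:])
    bwt = "".join(text[k - 1] for k in suffix_array)
    alphabet = sorted(set(text))
    # first_col[c] = number of characters in text that sort before c.
    first_col, total = {}, 0
    for c in alphabet:
        first_col[c] = total
        total += text.count(c)
    # occ[c][k] = number of occurrences of c in bwt[:k].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for k, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][k + 1] = occ[c][k] + (1 if ch == c else 0)
    return suffix_array, bwt, first_col, occ


def find_read(read, suffix_array, first_col, occ):
    """Backward search: sorted start positions where read matches exactly."""
    lo, hi = 0, len(suffix_array)
    for c in reversed(read):
        if c not in first_col:
            return []
        lo = first_col[c] + occ[c][lo]
        hi = first_col[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(suffix_array[lo:hi])


genome = "AACTAGACCTGGAACT"  # made-up toy genome
suffix_array, bwt, first_col, occ = bwt_index(genome)
hits = find_read("AACT", suffix_array, first_col, occ)
print("AACT found at positions", hits)
```

The search cost depends on the read length rather than the genome length, which is the property that makes this style of index attractive for mapping millions of short reads.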
  20. 20. RNA-seq: Step 4 ¨  Step 4: The distribution of gene counts for different tissues is compared ¨  Bioinformaticians and biostatisticians clean the data, normalize the data, and conduct statistical tests to determine if certain genes are expressed in one tissue differently than another ¨  Tests based on models: negative binomial distribution of counts, likelihood ratio tests ¨  Clustering algorithms ¨  Study genetic pathway enrichment, up- or down-regulated genes ¨  Biologists then study these genes more closely
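A minimal sketch of the kind of test mentioned above: a likelihood ratio test for one gene under a negative binomial model. To keep it short, the dispersion (`size`) is treated as known, which is an assumption of this example; real analyses estimate it from the data. The counts are made up.

```python
import math


def nb_loglik(counts, mean, size):
    """Negative binomial log-likelihood with the given mean and fixed size."""
    total = 0.0
    for y in counts:
        total += math.lgamma(y + size) - math.lgamma(size) - math.lgamma(y + 1)
        total += size * math.log(size / (size + mean))
        if y > 0:
            total += y * math.log(mean / (size + mean))
    return total


def nb_lrt(group_a, group_b, size):
    """Test equal means in two groups; returns (statistic, p-value).

    With size fixed, the maximum likelihood estimate of each mean is the
    sample mean. The statistic is compared to a chi-square with 1 df.
    """
    pooled = list(group_a) + list(group_b)
    ll_null = nb_loglik(pooled, sum(pooled) / len(pooled), size)
    ll_alt = (nb_loglik(group_a, sum(group_a) / len(group_a), size)
              + nb_loglik(group_b, sum(group_b) / len(group_b), size))
    stat = max(0.0, 2.0 * (ll_alt - ll_null))
    # Survival function of a chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical read counts for one gene in four samples per tissue.
heart_counts = [5, 8, 6, 7]
brain_counts = [50, 62, 48, 55]
stat, p_value = nb_lrt(heart_counts, brain_counts, size=10.0)
print(f"LRT statistic = {stat:.2f}, p-value = {p_value:.2g}")
```

In practice this test is repeated for every gene, so the resulting p-values also need a multiple-testing correction.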
  21. 21. Heatmap and dendrogram from a clustering algorithm comparing genes in cultured mouse heart and brain tissues
  22. 22. Case study 2: Electronic Medical Records ¨  Medical and health records are becoming increasingly digitized ¨  EMR can contain records of health measurements (blood pressure), diagnoses (depression), treatments prescribed (statins), family history information, and even detailed descriptions of doctor visits (clinician notes) ¨  Thousands of patients can have dozens of records, some can have just 2 ¨  Question: How to select subjects with bipolar disorder from a large pool of patients?
  23. 23. Case study 2: Electronic Medical Records ¨  Step 1: All the records must be collected, stored, put in a database, managed, tracked ¨  Step 2: A small subset must be read by a team of clinicians and scored as “case” versus “control” ¨  Step 3: Transform codes and paragraphs of words into predictors of disease ¨  Step 4: Determine important predictors of disease and build a prediction model with these variables ¨  Step 5: Validate the model, assess its performance ¨  Step 6: Implement the model in a larger pool of subjects to select the bipolar cases for a future genetic study
  24. 24. EMR: Step 1 ¨  Step 1: All the records must be collected, stored, put in a database, managed, tracked ¨  Computer scientists and bioinformaticians must perform these steps (SQL, anyone? MUMPS? Python, Perl…) ¨  Doing this efficiently is no small task
  25. 25. EMR: Step 3 ¨  Step 3: Transform codes and paragraphs of words into predictors of disease ¨  Natural language processing (NLP) is used by bioinformaticians to mine the paragraphs of data for terms that occur often in cases and less often in controls ¨  Certain words in a doctor’s note become possible predictors of disease
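The term-mining step can be sketched very simply: count how many case notes and control notes mention each word, then rank words by the difference. The notes below are invented for illustration, and real clinical NLP involves far more (negation handling, phrases, coded vocabularies) than this word count.

```python
import re
from collections import Counter


def document_frequency(notes):
    """Count how many notes contain each lowercase word at least once."""
    counts = Counter()
    for note in notes:
        counts.update(set(re.findall(r"[a-z]+", note.lower())))
    return counts


def rank_terms(case_notes, control_notes):
    """Rank words by (share of case notes) minus (share of control notes)."""
    case_df = document_frequency(case_notes)
    control_df = document_frequency(control_notes)
    scores = {}
    for term in set(case_df) | set(control_df):
        scores[term] = (case_df[term] / len(case_notes)
                        - control_df[term] / len(control_notes))
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical clinician notes, already labeled by chart review.
case_notes = [
    "patient reports manic episode and started lithium",
    "history of manic episodes, lithium continued",
    "mood swings noted, patient sleeping poorly",
]
control_notes = [
    "patient reports knee pain after running",
    "blood pressure stable, patient sleeping well",
    "follow up for statins, mood normal",
]
ranked = rank_terms(case_notes, control_notes)
for term, score in ranked[:3]:
    print(f"{term}: {score:+.2f}")
```

Words that appear equally often in both groups, such as "patient" here, score zero and drop out as candidate predictors.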
  26. 26. EMR: Step 4-6 ¨  Step 4-6: Determine important predictors of disease, build a prediction model with these variables, assess/validate performance, implement model ¨  Biostatisticians develop ¤  high dimensional regression methods or machine learning methods ¤  to select important predictors and build models ¤  to predict outcomes based on a large number of variables (e.g., LASSO, support vector machines)
  27. 27. Regularized logistic regression with NLP predictors Solution path for coefficients of predictors based on adaptive LASSO
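The idea of a solution path can be sketched with a small L1-penalized (LASSO) logistic regression, fit here by proximal gradient descent. This is the plain LASSO rather than the adaptive LASSO named on the slide, and the eight-row data set, with two hypothetical term indicators as predictors, is invented for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; values inside the band become 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_lasso_logistic(rows, labels, penalty, step=1.0, iterations=2000):
    """L1-penalized logistic regression with an unpenalized intercept."""
    n, p = len(rows), len(rows[0])
    intercept, weights = 0.0, [0.0] * p
    for _ in range(iterations):
        residuals = [
            sigmoid(intercept + sum(w * x for w, x in zip(weights, row))) - y
            for row, y in zip(rows, labels)
        ]
        grad_intercept = sum(residuals) / n
        grad_weights = [
            sum(r * row[j] for r, row in zip(residuals, rows)) / n
            for j in range(p)
        ]
        intercept -= step * grad_intercept
        # Gradient step followed by soft-thresholding (the proximal step).
        weights = [
            soft_threshold(w - step * g, step * penalty)
            for w, g in zip(weights, grad_weights)
        ]
    return intercept, weights


# Hypothetical data: columns are "note mentions term A" and "note mentions term B".
rows = [[1, 0], [1, 1], [1, 0], [1, 1], [0, 0], [0, 1], [0, 0], [0, 1]]
labels = [1, 1, 1, 0, 0, 0, 0, 1]

# Coefficients along a decreasing sequence of penalties trace a solution path.
path = [(penalty, fit_lasso_logistic(rows, labels, penalty)[1])
        for penalty in (1.0, 0.1, 0.05, 0.01)]
for penalty, weights in path:
    print(penalty, [round(w, 3) for w in weights])
```

With a large penalty every coefficient is held at exactly zero; as the penalty shrinks, the more informative predictor enters the model first, which is what a solution path plot displays.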
  28. 28. Back to me. ¨  Began with Yung-Pin’s research project on CpG islands (related to new field of epigenetics) ¨  Enjoyed journal clubs/biostatistics meetings at OHSU ¨  Pure math vs. applied math vs. something else ¨  Did you want to be a doctor? Do you want to help people? ¨  Ended up in grad school, what did I learn?
  29. 29. Biostatistics grad school ¨  Statistics ≠ pure math! ¨  A master’s would have helped with intuition, but it is not usually funded ¨  Research universities ≠ Lewis & Clark! ¨  Depend on self-teaching, your classmates, and especially the TAs to get by (when interviewing, meet the students!) ¨  Light teaching load, (hopefully) heavy collaborative/consulting load ¨  Lots of women in public health (like LC)! ¨  Grad school is always hard.
  30. 30. Bioinformatics grad school ¨  So far mostly the same ¨  More focused on biology ¨  Incorporating more biology training, wet labs ¨  Software/Bioconductor/R package development ¨  Diverging from traditional biostat?
  31. 31. Helpful classes ¨  Statistics and probability (obviously) ¨  All the computer science classes, ever (Python, more C!) ¨  Linear algebra ¨  Genetics (molecular biology would have been nice, though no biology required for biostat) ¨  Advanced calculus/real analysis (for theoretical classes such as Prob II and Inference II and writing my thesis, not always required) ¨  Discrete mathematics ¨  Abstract algebra (don’t worry, not required either) ¨  Liberal arts education in general
  32. 32. Helpful skills ¨  LaTeX ¨  R ¨  Python or Perl ¨  Unix, cluster/cloud computing ¨  Teaching/tutoring ¨  Research experience! ¨  Programming, software development ¨  C, Fortran ¨  GitHub ¨  You must enjoy talking to people, collaborating, explaining math/stat/cs to non-mathematical people!
  33. 33. Pros & Cons Pros ¨  Interesting & meaningful research problems ¨  Always in demand, more so every day ¨  Collaborate with clinicians, biologists, researchers of all kinds ¨  Salary isn’t too shabby Cons ¨  Soft money :( ¨  Grants, grants, always grants (but not necessarily our own)
  34. 34. Last thoughts ¨  Consider Epidemiology ¨  Applied vs. theoretical research ¨  My day: mostly programming (cleaning data + analysis, simulations), lots of meetings, a bit of pen & paper research and thinking of new grants, reading articles, reading clinical trial protocols, sample size and power calculations ¨  This will vary depending on where you work ¨  Master’s vs. PhD
  35. 35. More talks like this ¨  Excellent overview of bioinformatics & computational biology fields and careers in medicine by Dr. Shannon McWeeney at OHSU ¨  Rafa Irizarry’s (at HSPH) math major talk ¨  Plenty of interesting talks at JSM, the big statistical meeting/conference; it will be nearby in Seattle in August of 2015 (in Boston this year)
  36. 36. Learning resources ¨  Summer Institute for Training in Biostatistics (for undergrads) ¤  U Wisc at Madison, Columbia, Emory, Boston U, NC State, U of Iowa, U of Minnesota, U of Pittsburgh (All of the websites have “What is Biostatistics?” pages) ¨  MOOCs (Massive Open Online Courses) ¤  Learn R ¤  Learn biostats ¤  Learn statistical learning ¤  Learn bioinformatics ¨  UW’s Summer Institutes (scholarships for students) ¤  Statistical Genetics; Statistics and Modeling in Infectious Diseases; Statistics for Clinical Research ¨  Comprehensive list of job postings for statistics/biostatistics/bioinformatics:
  37. 37. The internet ¨  YouTube ¤  Rafa Irizarry’s YouTube channel ¨  Simply Statistics blog ¨  R-bloggers ¨  Getting Genetics Done blog ¨  FiveThirtyEight ¨  Neat summary measure of types of research done in various departments (biased toward east coast)
  38. 38. Questions?