Biostatistics for Dummies


Published in: Technology, Economy & Finance
  • The field of biostatistics is extremely broad, and consists of quantitative techniques in mathematics, statistics, and computing appropriate to data in the fields of medicine and biology.
  • More specifically, biostatistics is the science of transforming biological data into knowledge about biological processes.
  • Common applications include clinical medicine, epidemiologic studies, biological laboratory and field research, genetics, environmental health, health services, ecology, fisheries and wildlife biology, agriculture, and forestry.
  • Biostatisticians participate in this process at every level, including the design of collaborative research studies, the conduct and analysis of those studies, oversight and regulation of scientific research, participation on governmental advisory or policy-making boards, training researchers in good statistical practice, and development of new statistical theory and methods useful in the analysis of biological data.
  • An internet search on the word “Biostatistics” (from Google) yields over 210,000 hits. Over 50 universities have programs offering a graduate degree (M.S. or Ph.D.) in Biostatistics. Clearly this field is huge, and it would be impossible to cover all of it in just one hour. And so, in trying to decide exactly what I would talk about in this seminar, …
  • … I studied the statement of focus in the description of the NIH Program of Excellence in Biomedical Computing document at the ERC web site. It says,
    “Biomedical computing research encompasses an extremely broad area, and NIH welcomes biomedical computing research centers targeting research areas of any of its institutes. We have chosen to target computational simulation in environmental health science research for several strategic reasons. MSU has very significant and recognized strength in computational simulation in the physical sciences in the ERC. MSU also has such strength in environmental health sciences in the CEHS, and this is an NIH area expected to be less targeted by significant competitors for full centers. Bioinformatics (genomics, proteomics, etc.), on the other hand, is now a crowded field. Thus computational simulation, as opposed to bioinformatics, is especially well suited to our opportunity. Therefore, computational simulation in environmental health sciences is felt to be a focus area in which we can build on appreciable MSU strength and in which we can establish ourselves as a unique capability of particular appeal to the NIEHS.”
  • And so I have chosen to focus on statistical methodologies which are directly related to computational simulation in the environmental health sciences, and which fall under the umbrella of “biostatistics” (but which are also used in other areas of research requiring statistics). Specifically, I have chosen the areas of nonlinear multivariate stochastic modeling and spatial statistics*.
    *In November, Dr. Lance Waller, from the Graduate School of Public Health at Emory University, will be giving another talk on spatial statistics in which he analyzes data in an attempt to determine if a newly constructed pier off the Florida coast has altered the nesting and mating habits of sea turtles.
  • A number of studies have been conducted to determine if exposure to pesticide run-off has an effect on cancer incidence rates. The task at hand is to first suitably adjust the data for known risk factors such as gender, age, race, and socioeconomic status. Then the next step is to determine if a relationship exists between (the adjusted) cancer incidence rates and pesticide exposure. If we determine there is a relationship, then the next more challenging task will be to develop a parsimonious model that will explain that relationship and estimate or predict cancer incidence rates for other areas and times.
  • Our data can be described in the following manner:
    From the National Statistical Service and the U.S. Department of Commerce National Technical Information Service, for the years 1972-2001, on the county level, we have the following crop information. This data acts as a proxy for type and amount of pesticide used.
    Number of acres harvested
    Type of crop (corn, soybeans, rice, cotton, etc.)
    From the Mississippi State Department of Health Central Cancer Registry for the years 1996 - 1998 on the individual (personal) level we have the following information. The registry for 1996 is complete. The registry for the years 1997 and 1998 is partially complete, and is nearing completion.
    Tumor type
    County of residence
    Cancer morbidity and mortality data provided as a frequency of crude incidence per 100,000 and age adjusted incidence rate per 100,000 on a county basis for different types of cancers.
    We also hope to work with the Mississippi State University Social Science Research Center, Remote Sensing Technology Center, and the State and Federal Census agencies to obtain other county-wide socioeconomic and demographic information including employment, land use, population density, personal income, and poverty level.
  • Statistics, the science of uncertainty, attempts to model order in disorder. Even when the disorder is discovered to have a perfectly rational explanation at one scale, there is very often a smaller scale where the data do not fit the theory exactly, and the need arises to investigate the new, residual uncertainty. The level of disorder may be measured through a quantity called entropy. A number of proposed measures of entropy exist. Shannon (1948) defined entropy in the following manner. Suppose there are $k$ possible states of nature, indexed $i = 1, \ldots, k$, that occur at random according to a probability distribution $(p_1, \ldots, p_k)$, where $0 \le p_i \le 1$, $i = 1, \ldots, k$, and $p_1 + \cdots + p_k = 1$. Then entropy $E$ is defined as
    $$ E = -\sum_{i=1}^k p_i \log(p_i), $$
    where $x \log(x) = 0$ for $x = 0$.
    There is a very interesting entropy inequality that can be proved. Consider the initial probability distribution $p' = (p_1, \ldots, p_k)$ evolving into a subsequent probability distribution $p'A$, where $A$ is a $k \times k$ matrix of nonnegative elements, each of whose rows sum to one. Write $A = (a_{ij})$; then $a_{ij}$ represents the conditional probability that the subsequent state of nature is $j$, given that the initial state of nature was $i$, and $A$ is called the transition probability matrix. When states of nature evolve through repeated application of a stochastic mechanism described by the matrix $A$, the result is called a Markov chain. The entropy inequality is
    $$ E(\pi^{(1)}:\nu^{(1)}) \ge E(\pi^{(0)}:\nu^{(0)}), $$
    where $E(\pi^{(m)}:\nu^{(m)}) = -\sum_{j=1}^k \pi_j^{(m)}\log(\pi_j^{(m)}/\nu_j^{(m)})$, $m = 0, 1$. Initially $\pi_j^{(0)} = p_j$ and $\nu_j^{(0)} = 1$, $j = 1, \ldots, k$, so that $E(\pi^{(0)}:\nu^{(0)})$ is simply $-\sum_{i=1}^k p_i \log(p_i)$. The subsequent measures are $\pi_j^{(1)} = \sum_{i=1}^k a_{ij}\pi_i^{(0)} = \sum_{i=1}^k a_{ij} p_i$ and $\nu_j^{(1)} = \sum_{i=1}^k a_{ij}\nu_i^{(0)} = \sum_{i=1}^k a_{ij}$, $j = 1, \ldots, k$, depending upon the transition matrix $A$.
    A physical interpretation of this inequality is that the universe tends to seek levels of entropy that are higher in relation to previous levels. Technology attempts to slow the increase in entropy by constraining evolving systems, but there are convincing arguments that it will never decrease total entropy.
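    As a rough numerical check of the entropy inequality, the sketch below computes the Shannon entropy of an initial distribution and the relative-entropy quantity before and after one Markov step. The names `pi` and `nu` for the two measures, and the 3-state distribution and transition matrix, are assumptions chosen only for illustration; they do not come from the seminar.

```python
import math

def shannon_entropy(p):
    """E = -sum_i p_i log(p_i), using the convention 0 * log(0) = 0."""
    return -sum(x * math.log(x) for x in p if x > 0)

def relative_entropy(pi, nu):
    """E(pi:nu) = -sum_j pi_j log(pi_j / nu_j), skipping terms with pi_j = 0."""
    return -sum(a * math.log(a / b) for a, b in zip(pi, nu) if a > 0)

def markov_step(v, A):
    """One step of the chain: the j-th entry is sum_i a_ij * v_i."""
    k = len(v)
    return [sum(A[i][j] * v[i] for i in range(k)) for j in range(k)]

# Hypothetical 3-state example; each row of A sums to one.
p = [0.7, 0.2, 0.1]
A = [[0.8, 0.1, 0.1],
     [0.2, 0.6, 0.2],
     [0.3, 0.3, 0.4]]

pi0, nu0 = p, [1.0, 1.0, 1.0]
pi1, nu1 = markov_step(pi0, A), markov_step(nu0, A)

e0 = relative_entropy(pi0, nu0)  # reduces to the Shannon entropy of p
e1 = relative_entropy(pi1, nu1)
print("initial:", e0, "after one step:", e1)
```

    With these illustrative values the second quantity is no smaller than the first, which is what the inequality asserts for any row-stochastic matrix.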
    Another approach which leads to the same logical conclusion stems from the area of nonlinear difference equations, a discrete time analog of differential equations. For each integer $t$, let $x_t$ denote a $k$-dimensional (state) vector in ${\bR}^k$, satisfying the equation
    $$ x_t = f(x_{t-1}) \qquad (1) $$
    where $f$ is a vector-valued function. Let $f^{(k)}$ represent the $k$th iteration of $f$; that is, with $x_0 = x$,
    $$ f^{(k)}(x) = x_k = f(x_{k-1}) = f(f(\cdots f(x) \cdots)) \quad {\mbox{($k$ times).}} $$
    The deterministic iteration defined by (1) can exhibit trajectories almost indistinguishable from realizations of a stochastic process. Thus, randomness can be generated by a strictly deterministic equation. This phenomenon is loosely described as chaos. Fundamentally, randomness is generated because of sensitive dependence on initial conditions. In other words, a small perturbation of the initial condition can lead to vastly different realizations. In short, statistics is important because disorder -- chaos -- randomness is here to stay.
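    Sensitive dependence on initial conditions can be seen in a few lines. The sketch below iterates a one-dimensional map of the form $x_t = f(x_{t-1})$ from two starting values that differ by $10^{-10}$. The logistic map with parameter 4 is an assumption made for illustration (a standard textbook example of chaos); the seminar does not name a specific map.

```python
def logistic(x, r=4.0):
    """One application of the logistic map f(x) = r * x * (1 - x)."""
    return r * x * (1.0 - x)

def iterate(f, x0, n):
    """Return the trajectory [x_0, x_1, ..., x_n] with x_t = f(x_{t-1})."""
    xs = [x0]
    for _ in range(n):
        xs.append(f(xs[-1]))
    return xs

# Two trajectories whose initial conditions differ by a tiny perturbation.
a = iterate(logistic, 0.2, 100)
b = iterate(logistic, 0.2 + 1e-10, 100)

gaps = [abs(u - v) for u, v in zip(a, b)]
print("initial gap:", gaps[0])
print("largest gap over 100 steps:", max(gaps))
```

    The equation is fully deterministic, yet the two trajectories soon bear no resemblance to each other, which is the behavior loosely described as chaos.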
  • Independent Identically Distributed Data Modeling
    This is the simplest case, and is what is considered in most introductory statistics classes, including most first year Masters level classes in statistics. It is so often studied, that the initials “i.i.d.” will mean something to someone who has had only one or two courses in elementary statistics. In the i.i.d. case, the observations x1, x2, …, xn are the result of a random sample chosen from the same population. The randomness of selection assures independence of one observation from the others. This independence implies directly that knowledge about the value of one observation gives no increased knowledge about the values of any other observation. Because all observations were taken from the same population, they will all have the same restrictions and properties, albeit, these restrictions and properties are, for the most part, unknown. Typically the purpose of the study of such data is to determine the “best” family of distributions for describing the overall pattern of the data or to infer a decision about specified properties of the population from information in the sample data.
    Inhomogeneous Data Modeling
    Lack of homogeneity in data is often accounted for in the modeling process by including a non-constant mean or non-constant variance assumption. Often, this non-constancy is accounted for by modeling procedures specific to the source of inhomogeneity, and the adjusted data are then analyzed as in the i.i.d. case. However, it is often the case that, even after these large-scale variations are accounted for, reasons exist to suspect inhomogeneous small-scale variations.
    Dependent Data Modeling
    Independence is a very convenient assumption that makes much of mathematical-statistical theory tractable. However, models that involve statistical dependence are more often realistic; two classes of models that have commonly been used involve intraclass-correlation structures and serial correlation structures. A third, newer class of dependent data is spatial data, where dependence is present in all directions and becomes weaker as data locations become more dispersed.
    The notion that data that come close together are likely to be correlated (i.e., cannot be modeled as statistically independent) is a natural one, and has been successfully used by statisticians to model physical and social phenomena. Time series and spatial statistics are areas in statistics specifically targeted at analyzing this type of data.
  • Purely temporal models, or time series models, are usually based on identically distributed observations that are dependent and occur at equally spaced time points. The unidirectional flow of time underlies the construction of these models. Simply put, a time series model tries to describe the relationship between the past, present, and future, and also describe the randomness inherent in the data.
  • The objectives of a time series analysis can be grouped into four primary categories:
    Description -- time series analysis is extremely graphical in nature because of the lack of independence among the quantities that it investigates. Graphs describing time series fall into three categories:
    Time plot
    Correlation plots
      Partial autocorrelation
      Cross-correlation (multivariate series)
    Spectral plots
      Spectral density
      Spectral cumulative distribution
      Cross-spectrum (multivariate series)
    Modeling -- we assume that some random mechanism is generating the observed data. We then attempt to determine the form of that mechanism, and estimate the parameters within the form. In general, models fall into one of two classes: linear or nonlinear.
    Inference -- using the model, we seek to determine from a time series data set what other data could have been observed or are expected to be observed in the future. If the model generating the data is a linear model, then many inferential methods are parametric (usually asymptotically). However, for nonlinear models, the methods are largely non-parametric.
    Prediction -- use the correlation in the time series to predict the future.
  • Let $\{X(t), t \in \Omega\}$ represent a time series with index set $\Omega$. (Typical sets for $\Omega$ are ${\bZ}^+$ or ${\bR}^+$. In the discussion that follows, $\Omega$ is taken to be the set of positive integers ${\bZ}^+$.) The series $\{X(t)\}$ is a linear time series or is said to admit a linear representation if it can be written as
    $$ \sum_{j=0}^p \alpha_j X(t-j) = \sum_{i=0}^q \beta_i \varepsilon(t-i), $$
    where $\alpha_0 = \beta_0 \equiv 1$, or equivalently as
    $$ X(t) = \sum_{l = 0}^\infty \phi_l \varepsilon(t-l), $$
    where $\{\varepsilon(t)\}$ is a purely random error process with zero mean and finite variance $\sigma_\varepsilon^2$, and $\{\phi_l\}$ is a given sequence of constants satisfying $\sum_{l=0}^\infty \phi_l^2 < \infty$. Under reasonable constraints on the error process and the coefficients $\phi_l$, linear models have desirable probabilistic properties that make the mathematical-statistical theory for inferential and predictive purposes easier to describe in the long run. Of specific interest is a property known as second-order or covariance stationarity, which ensures that
    the mean $\mu_X = {\mbox{E}}[X(t)]$ is constant as a function of time,
    and all second moments (covariances) depend only upon how far apart in time points are, and not on when they occur; that is, the covariance $$ \gamma(\nu) = {\mbox{E}}[(X(t) - \mu_X)(X(t+\nu) - \mu_X)], \quad \nu = 0, 1, 2, \ldots $$ is a function of $\nu$ only, and not of $t$. Note that for $\nu = 0$, $\gamma(0) = \sigma^2_X$, the variance of the series, so the variance is also constant as a function of time.
    Time series satisfying these constraints have long-range forecasts which converge to a constant. If the distribution of the $\varepsilon(t)$ is symmetric, then the series is also time reversible.
    It is often the case that a time series will not initially satisfy second-order stationarity. In many of these cases, a simple transformation will result in a time series that is linear and second-order stationary. In these cases, linear models do an excellent job of describing the behavior observed in the data set.
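    The equivalence between the two ways of writing a linear time series can be checked on the simplest case. The sketch below takes a first-order autoregression, $X(t) + \alpha_1 X(t-1) = \varepsilon(t)$, and compares the recursion with the moving-average form $\sum_l \phi_l \varepsilon(t-l)$, where $\phi_l = (-\alpha_1)^l$. The coefficient value, the Gaussian errors, the seed, and the start-up convention $X(0) = \varepsilon(0)$ are assumptions made for illustration.

```python
import random

def ar1_recursive(alpha1, eps):
    """Solve X(t) + alpha1 * X(t-1) = eps(t), starting from X(0) = eps(0)."""
    x = [eps[0]]
    for t in range(1, len(eps)):
        x.append(-alpha1 * x[-1] + eps[t])
    return x

def ar1_ma_form(alpha1, eps):
    """Same series as sum_{l=0}^{t} phi_l * eps(t-l) with phi_l = (-alpha1)**l."""
    return [sum((-alpha1) ** l * eps[t - l] for l in range(t + 1))
            for t in range(len(eps))]

rng = random.Random(0)                       # fixed seed, illustration only
eps = [rng.gauss(0.0, 1.0) for _ in range(200)]
alpha1 = -0.6                                # hypothetical coefficient, |alpha1| < 1

x_rec = ar1_recursive(alpha1, eps)
x_ma = ar1_ma_form(alpha1, eps)
max_diff = max(abs(u - v) for u, v in zip(x_rec, x_ma))

# Square summability of the phi_l: the sum converges to 1 / (1 - alpha1**2).
phi_sq_sum = sum(((-alpha1) ** l) ** 2 for l in range(500))
print(max_diff, phi_sq_sum, 1.0 / (1.0 - alpha1 ** 2))
```

    The two constructions agree to rounding error, and the squared coefficients sum to a finite value, which is the condition placed on $\{\phi_l\}$.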
  • The strength of linear models in the areas of parametric inferential methods and prediction should never be set aside unless it is clear that the random mechanism generating the data is nonlinear. If a time series clearly exhibits nonlinear behavior, then a nonlinear model should be used. Nonlinear models have the potential of describing features such as limit cycles, jump phenomena, harmonics, time irreversibility, synchronization, and other phenomena that linear models are unable to capture. There are a number of tests for nonlinearity. A thorough exposition and study of these tests is given in Harvill (1999). There are a number of very important nonlinear time series models that have been shown to work well in modeling a variety of nonlinear behaviors.
    Possibly the single most important class of non-linear time series models directly related to dynamical systems is the class of non-linear autoregressive models. Specifically, $\{X_t\}$ is said to follow a {\it non-linear autoregressive model of order $p$ with general noise} (NLAR) if there exists a function $f\,:{\bR}^{p+1} \rightarrow {\bR}$ such that $$ X_t = f(X_{t-1},X_{t-2},\ldots,X_{t-p},\varepsilon_t), \quad t \in {\bZ}, $$ where $\{\varepsilon_t\}$ is a sequence of zero mean identically distributed random errors with finite variance. Without the error term, the NLAR model reduces to a non-linear difference equation of order $p$; this noise-free case is commonly referred to as the skeleton. If the NLAR model can be written as $$ X_t = f(X_{t-1},X_{t-2},\ldots,X_{t-p}) + \varepsilon_t, \quad t \in {\bZ}, $$ it is an additive noise model. Most non-linear time series procedures in existence are for additive noise models.
    A general {\it threshold model} allows for the analysis of a complex stochastic system by decomposing it into simpler subsystems. Threshold models encompass a huge set of behaviors. Of special interest are threshold autoregressive models (TAR), smoothed threshold autoregressive models (STAR), Markov chain driven models, and fractals.
    Amplitude-dependent exponential autoregressive models (EXPAR) are of the basic form $$ X(t) = \sum_{j=1}^p [\alpha_j + \beta_j\exp\{-\delta X^2(t-j)\}]X(t-j) + \varepsilon(t), \quad \delta > 0. $$ A model of this type is particularly useful for modeling amplitude-dependent behavior.
    A time series is said to follow a bilinear model if it satisfies the equation $$ X(t) + \sum_{l=1}^p \alpha_lX(t-l) - \sum_{k=1}^q \beta_k\varepsilon(t-k) = \sum_{i=1}^r\sum_{j=1}^s b_{ij}X(t-i)\varepsilon(t-j) + \varepsilon(t). $$
    Generalized autoregressive models with conditional heteroscedasticity (GARCH) modify the additive noise model by allowing the variance of the error terms to change with values of $X$. That is, $$ X(t) = \varepsilon(t)\sqrt{V_t}, $$ where the conditional variance is $$ V_t = \delta + \sum_{i=1}^q \phi_i X^2(t-i) + \sum_{j=1}^p \psi_j V_{t-j}, \quad {\mbox{$\psi_j \ge 0$ for all $j$.}} $$
    An emerging, more general class of non-linear models is the class of functional coefficient autoregressive models (FCAR). This class of models is fairly new in the statistical literature. A time series $X(t)$ is said to follow an FCAR model if it satisfies $$ X(t) = f_1({\bX}^*(t-d))X(t-1) + \cdots + f_p({\bX}^*(t-d))X(t-p) + \varepsilon(t), $$ where $f_i,~i = 1, \ldots, p$ are all measurable functions from ${\bR}^k \rightarrow {\bR}$, and $${\bX}^*(t-d) = (X(t-d-i_1), \ldots, X(t-d-i_k))^\prime, \quad {\mbox{with $i_j > 0$ for $j = 1,\ldots,k$.}} $$ Without loss of generality, $d + \max(i_1, \ldots, i_k) \le p$. This class of models includes the non-linear autoregressive models, as well as threshold models, bilinear models, and exponential autoregressive models, as special cases. However, due to their non-parametric nature, most results (estimation, inference, and prediction) are highly computational in nature.
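    To make the amplitude dependence of the EXPAR model concrete, the sketch below simulates an order-1 case, $X(t) = [\alpha + \beta\exp\{-\delta X^2(t-1)\}]X(t-1) + \varepsilon(t)$. The parameter values, Gaussian errors, seed, and zero starting value are assumptions chosen only for illustration; the function names are hypothetical.

```python
import math
import random

def expar_coefficient(alpha, beta, delta, x_prev):
    """State-dependent AR coefficient alpha + beta * exp(-delta * x_prev**2)."""
    return alpha + beta * math.exp(-delta * x_prev * x_prev)

def simulate_expar1(alpha, beta, delta, n, sigma=1.0, seed=0):
    """Simulate n values of an order-1 EXPAR series started at X(0) = 0."""
    rng = random.Random(seed)
    x_prev, series = 0.0, []
    for _ in range(n):
        coef = expar_coefficient(alpha, beta, delta, x_prev)
        x_prev = coef * x_prev + rng.gauss(0.0, sigma)
        series.append(x_prev)
    return series

series = simulate_expar1(alpha=0.3, beta=0.5, delta=1.0, n=500)

# Near zero the coefficient is close to alpha + beta; far from zero it
# decays toward alpha, so the dynamics depend on the amplitude of the series.
print(expar_coefficient(0.3, 0.5, 1.0, 0.0))
print(expar_coefficient(0.3, 0.5, 1.0, 10.0))
```

    Setting `sigma=0` in this sketch gives the noise-free skeleton of the same model.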
  • Correlation in time series is analogous to correlation in traditional statistics, which is an attempt to graphically illustrate and numerically quantify the strength of the relationship between a set of pairs of $n$ points. In traditional statistics, the pairs of points are a set of two observations on $n$ individuals (for example, height and weight); in time series, these pairs are the set of points $\nu$ units apart in time. In other words, for the time series $\{X(t)\}$, we want to describe the correlation for all pairs of points 1 unit apart in time, the correlation for all pairs of points 2 units apart in time, and so on. Because the correlation is of the time series with itself, it is most often referred to as {\it autocorrelation}. If the time series is second-order stationary, then we have an autocorrelation {\it function} $\rho(\nu),~\nu = 1, 2, \ldots$, given by
    $$ \rho(\nu) = \dfrac{\gamma(\nu)}{\sigma^2_X}, \quad \nu = 1, 2, \ldots $$
    If the time series is linear, then a fairly obvious re-writing of the expression for correlation from traditional statistics yields an estimator of the linear association between $X(t)$ and $X(t+\nu)$. Let $x(1), x(2), \ldots, x(n)$ represent the realization of the time series $\{X(t)\}$. Then the estimated autocorrelation function is
    $$ \hat\rho(\nu) = \dfrac{\sum_{t=1}^{n-\nu}[x(t) - \bar x][x(t+\nu) - \bar x]}{\sum_{t=1}^{n-\nu}[x(t) - \bar x]^2}, \quad \nu < n, $$
    where $\bar x = n^{-1}\sum_{t=1}^n x(t)$ is the sample mean of $x(t)$. As in traditional statistics, this is a measure of the strength of the linear association between points $\nu$ units apart in time.
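    The estimator can be written out directly. The sketch below follows the formula as stated in these notes, with both sums running over $t = 1, \ldots, n-\nu$; note that many references instead take the denominator over all $n$ observations. The function name and the toy series are assumptions for illustration.

```python
def sample_acf(x, lag):
    """Estimated autocorrelation at the given lag, as written in the notes.

    Numerator and denominator both sum over the first n - lag observations;
    the mean is the full-sample mean.
    """
    n = len(x)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(x)")
    xbar = sum(x) / n
    num = sum((x[t] - xbar) * (x[t + lag] - xbar) for t in range(n - lag))
    den = sum((x[t] - xbar) ** 2 for t in range(n - lag))
    return num / den

trend = [1.0, 2.0, 3.0, 4.0, 5.0]              # toy series, illustration only
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

print(sample_acf(trend, 0))
print(sample_acf(trend, 1))
print(sample_acf(alternating, 1))
```

    A series that flips sign at every step gives a lag-1 value at the negative extreme, while a steadily increasing series gives a positive one.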
    If the time series is non-linear, it could be that there is a strong non-linear association, but that $\hat\rho(\nu) \approx 0$ for all $\nu$ (Harvill and Ray, 2000). In this case, alternate non-parametric estimators for the strength of the association between points $\nu$ units apart in time must be considered. There are numerous proposed methods for doing so.
    Partial autocorrelation is an attempt to explain the correlation between points $\nu$ units apart in time with the common effects of the points in between removed. This is accomplished by regressing each of the two points on the points in between and taking the correlation between the two sets of residuals.
    More generally, if ${\bX}_t = (X_{1,t}, X_{2,t}, \ldots, X_{k,t})$ represents a $k$-valued process at time $t,~t = 1, 2, \ldots, n$, the process is referred to as a {\it vector} or {\it multivariate time series}. The $k$-vector of error terms $(\varepsilon_{i,t}),~i = 1, \ldots, k,~t = 1, 2, \ldots, n$, is such that the errors within each component $\varepsilon_{i,\cdot},~i = 1,\ldots,k$, are independent, but a possible cross-correlation of error terms exists between components. The added dimensionality and correlation structure enriches the class of models that can be considered, but also adds a level of mathematical and statistical complexity. The ``curse of dimensionality'' applies, and innovative, creative, but rigorous methods become difficult to come by. Tsay (1998) and Harvill and Ray (1999) extend tests of linearity and non-linear modeling into the multivariate framework for threshold models and non-linear autoregressive and bilinear models, respectively. Harvill and Ray (2000) extend some non-parametric methods for measuring the strength of non-linear association that are less affected by higher dimensions. Finally, Ray and Harvill (pre-print, 2003) have begun extending results on functional coefficient autoregressive models into the multivariate time series literature.
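    The residual idea behind partial autocorrelation can be sketched for lag 2, where there is a single point in between. Below, $x(t)$ and $x(t-2)$ are each regressed on $x(t-1)$ by least squares, and the two sets of residuals are correlated. This is one common way of computing the quantity, offered as an assumption rather than as the seminar's exact procedure; the simulated first-order autoregression (coefficient 0.7, Gaussian errors, fixed seed) is likewise illustrative.

```python
import random

def residuals(y, z):
    """Residuals from the least-squares line of y on z (with intercept)."""
    n = len(y)
    ybar, zbar = sum(y) / n, sum(z) / n
    szz = sum((v - zbar) ** 2 for v in z)
    slope = sum((u - ybar) * (v - zbar) for u, v in zip(y, z)) / szz
    return [(u - ybar) - slope * (v - zbar) for u, v in zip(y, z)]

def correlation(a, b):
    """Ordinary sample correlation between two equal-length sequences."""
    n = len(a)
    abar, bbar = sum(a) / n, sum(b) / n
    sab = sum((u - abar) * (v - bbar) for u, v in zip(a, b))
    saa = sum((u - abar) ** 2 for u in a)
    sbb = sum((v - bbar) ** 2 for v in b)
    return sab / (saa * sbb) ** 0.5

def pacf_lag2(x):
    """Correlation of x(t) and x(t-2) after removing the effect of x(t-1)."""
    first, middle, last = x[:-2], x[1:-1], x[2:]
    return correlation(residuals(last, middle), residuals(first, middle))

# Simulated first-order autoregression, for illustration only.
rng = random.Random(1)
x = [0.0]
for _ in range(5000):
    x.append(0.7 * x[-1] + rng.gauss(0.0, 1.0))

lag1 = correlation(x[:-1], x[1:])
partial2 = pacf_lag2(x)
print(lag1, partial2)
```

    For a first-order autoregression the lag-1 correlation is sizeable while the lag-2 partial autocorrelation should be near zero, consistent with the "AR: 0 past p" pattern listed on the correlation slide.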
    Spectral plots and correlation plots yield the same type of information in different settings. For a second-order stationary time series, if $\gamma(\nu)$ is absolutely summable, then there exists a function $f(\omega),~\omega \in [0,1]$, symmetric about $\omega = 1/2$, such that $\gamma(\nu)$ is the Fourier transform of $f(\omega)$; that is,
    $$ \gamma(\nu) = \int_0^1\! f(\omega) e^{2\pi i \nu \omega}\,d\omega, \qquad f(\omega) = \sigma_X^2 + 2\sum_{\nu=1}^\infty \gamma(\nu)\cos(2\pi \nu \omega). $$
    The absolutely continuous function $F$ defined by
    $$ F(\omega) = \int_0^\omega\! f(x)\,dx $$
    is the {\it cumulative spectral distribution function}. The function $f(\omega)$ is the {\it spectral density function} of $X(t)$.
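    The Fourier pair can be checked numerically on a case where the autocovariance is known in closed form. The sketch below assumes a first-order moving average, $X(t) = \varepsilon(t) + \theta\varepsilon(t-1)$, for which $\gamma(0) = (1+\theta^2)\sigma_\varepsilon^2$, $\gamma(1) = \theta\sigma_\varepsilon^2$, and $\gamma(\nu) = 0$ beyond lag 1. It uses the cosine form of the series, $f(\omega) = \gamma(0) + 2\sum_{\nu \ge 1}\gamma(\nu)\cos(2\pi\nu\omega)$, which applies to a real, symmetric autocovariance. The model choice, parameter values, and function names are assumptions for illustration.

```python
import cmath
import math

def spectral_density(gammas, omega):
    """f(w) = gamma(0) + 2 * sum_{v>=1} gamma(v) * cos(2*pi*v*w); gammas[v] = gamma(v)."""
    tail = sum(g * math.cos(2.0 * math.pi * v * omega)
               for v, g in enumerate(gammas) if v > 0)
    return gammas[0] + 2.0 * tail

def autocovariance_from_density(f, lag, m=2000):
    """Midpoint-rule approximation of the integral of f(w) * exp(2*pi*i*lag*w) over [0, 1]."""
    total = 0j
    for j in range(m):
        w = (j + 0.5) / m
        total += f(w) * cmath.exp(2j * math.pi * lag * w)
    return total / m

theta, sigma2 = 0.5, 1.0                     # hypothetical MA(1) parameters
gammas = [(1.0 + theta ** 2) * sigma2, theta * sigma2]

def f(w):
    return spectral_density(gammas, w)

for lag in range(3):
    print(lag, autocovariance_from_density(f, lag).real)
```

    Integrating the density recovers the variance at lag 0, $\theta\sigma_\varepsilon^2$ at lag 1, and approximately zero at lag 2, and the density is symmetric about $\omega = 1/2$.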
  • Spatial models are a more recent addition to the statistics literature. Any discipline that works with data collected from different spatial locations needs to develop models that indicate when there is dependence between measurements at different locations. However, the models need to be more flexible than their temporal counterparts, because ``past, present, and future'' have no analog in space, and furthermore it is simply not reasonable to assume that the spatial locations of data occur regularly, as most time series models assume of time points. Although the field is relatively new, volumes of work have already been written on the analysis of spatial data.
    The basic components of spatial data are the spatial locations $\{{\bs}_1, \ldots, {\bs}_n\}$ and the data $\{Z({\bs}_1), \ldots, Z({\bs}_n)\}$ observed at the locations. Usually the data are assumed random, and sometimes the locations are assumed random. Once the locations are given, the possibility of mistaken or imprecise positioning is generally not modeled.
    Let ${\bs} \in {\bR}^k$ be a generic data location in $k$-dimensional Euclidean space and suppose that the {\it potential} data ${\bZ}({\bs})$ at spatial location ${\bs}$ is a random quantity. The locations ${\bs}$ vary over some index set $D \subset {\bR}^k$; the nature of $D$ determines to a large extent the method of analysis.
    Just as in time series analysis, an attempt is made to determine the correlation structure of points that are some distance apart (across space). If the mean is constant and the covariance function ${\mbox{cov}}[{\bZ}({\bs}_1),{\bZ}({\bs}_2)] = C({\bs}_1 - {\bs}_2)$ is a function of the difference of the locations only, and not of the locations themselves, for all locations, then the process is second-order stationary. The function $C(\cdot)$ is called the {\it covariogram}. If $C(\cdot)$ is a function only of $||{\bs}_1 - {\bs}_2||$, then $C(\cdot)$ is called {\it isotropic}. The property of {\it ergodicity} is also important in spatial statistics. Basically, it allows expectations over the set of all possible realizations of ${\bZ}({\bs})$ to be estimated by spatial averages. It says that the process, when successively translated, completely fills up the space of all possible trajectories. There are sufficient conditions for ergodicity. Often the assumption is made to allow inference to proceed for a series of dependent observations. It might only be verifiable in the sense that one fails to reject it.
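    Under second-order stationarity and isotropy, the covariogram depends only on the distance between locations, so a crude estimate averages products of mean-corrected values over pairs of locations falling in the same distance bin. The sketch below does this for scalar data; the binning scheme, function name, and four-point toy data set are assumptions for illustration, not a method taken from the seminar.

```python
import math

def empirical_covariogram(locs, z, bin_width, n_bins):
    """Average of (z_i - zbar)(z_j - zbar) over pairs grouped by distance bin.

    Bin b collects pairs with b * bin_width <= distance < (b + 1) * bin_width.
    Bins with no pairs are reported as None.
    """
    n = len(z)
    zbar = sum(z) / n
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for i in range(n):
        for j in range(i, n):              # j = i contributes the distance-0 (variance) term
            h = math.dist(locs[i], locs[j])
            b = int(h / bin_width)
            if b < n_bins:
                sums[b] += (z[i] - zbar) * (z[j] - zbar)
                counts[b] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]

# Toy data on the corners of a unit square (illustration only).
locs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
z = [1.0, 2.0, 2.0, 3.0]

cov_by_bin = empirical_covariogram(locs, z, bin_width=0.4, n_bins=4)
print(cov_by_bin)
```

    With real data the bins would be chosen so that each contains enough pairs for a stable average; this toy example is only meant to show the bookkeeping.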
    The type of spatial data will determine precisely how correlation is estimated, modeling is conducted, and predictions are obtained.
    Geostatistical data: $D$ is a fixed subset of ${\bR}^k$ that contains a $k$-dimensional rectangle of positive volume. The data $\{{\bZ}({\bs})\}$ is a random vector at location ${\bs} \in D$. The name ``geostatistics'' stems from the early beginnings of analysis of data where the spatial index ${\bs}$ is allowed to vary continuously over a subset of ${\bR}^k$. Other applications of methods in geostatistics include hydrology, soil science, public health, uniformity trials, and acid rain, to name a few.
    Lattice data: $D$ is a fixed regular or irregular collection of countably many points of ${\bR}^k$. The data $\{{\bZ}({\bs})\}$ is a random vector at location ${\bs} \in D$. A lattice of locations evokes an idea of regularly spaced points in ${\bR}^k$, linked to nearest neighbors, second-nearest neighbors, and so on. Of all of the possible spatial structures, a data set with spatial locations on a regular lattice in ${\bR}^k$ is the closest analog to a time series at equally spaced time points.
    Point patterns or marked spatial point process: $D$ is a point process in ${\bR}^k$ or a subset of ${\bR}^k$, and the data $\{{\bZ}({\bs})\}$ is a random vector at location ${\bs} \in D$. When no ${\bZ}$ is specified, the usual spatial point process is obtained. Point patterns arise when the important variable to be analyzed is the location of the events. Most often, the question to be answered is whether the pattern is exhibiting complete spatial randomness, clustering, or regularity. In the simplest case, the $Z$ variable is called the {\it mark variable}, and the whole process is a marked spatial point process.
    Objects: $D$ is a point process in ${\bR}^k$; $Z({\bs})$ is a random set. The mark variable does not have to be a real variable; it could be a set. This yields processes such as the Boolean model.
    The typical assumptions made in spatial analysis are that either ${\bZ}$ or $D$ is fixed (and the other random), or that ${\bZ}$ and $D$ are independent if both are random. Therefore spatial modeling occurs within the ${\bZ}$ process (geostatistical data and lattice data), within the $D$ process, or within both processes (point patterns), and typically involves modeling the large- and small-scale variations in terms of a finite number of parameters.
  • Geostatistical data: other applications of methods in geostatistics include hydrology, soil science, public health, uniformity trials, and acid rain, to name a few.
    Lattice data: data from remote sensing satellites offer an efficient means of gathering data of this type. There is a large overlap between the remote sensing techniques and (low-level) medical imaging techniques; although the spatial scales are vastly different, the form of the data and the questions being asked are often similar. Statistical models for such data need to express the fact that observations nearby (in time or space) tend to be more alike.
    Point patterns and objects: data of this form are often images. The goal of analyzing such a data set is typically to estimate parameters of the random set and the point process. Boolean models have been successfully used to describe tumor growth rate. Another application is modeling cells growing in vitro, where the analysis is conducted in a manner that takes shape as well as size into account.
  • It is probably no surprise at this point to say that the two approaches have been combined into a field known as ``spatio-temporal modeling,'' or simply {\it space-time modeling}. The field is complex, but it is exactly what is needed for the cancer incidence and pesticide exposure application described earlier. A person does not develop cancer instantaneously, with some probability, immediately after exposure to a pesticide, and so there is a temporal element involved. Moreover, it would seem that the more a person is exposed to a pesticide, the more likely they would be to develop cancer. A person who lives downstream from a crop would be more likely to be exposed to a pesticide over the long run. Using methods in spatio-temporal modeling, these relationships are precisely what we intend to examine.
  • Biostatistics for Dummies

    1. Biostatistics for Dummies: Biomedical Computing Cross-Training Seminar, October 18th, 2002
    2. What is “Biostatistics”? Techniques: mathematics, statistics, computing. Data: medicine, biology.
    3. What is “Biostatistics”? Biological data; Knowledge of biological process
    4. Common Applications (Medical and otherwise): Clinical medicine; Epidemiologic studies; Biological laboratory research; Biological field research; Genetics; Environmental health; Health services; Ecology; Fisheries; Wildlife biology; Agriculture; Forestry
    5. Biostatisticians Work: Develop study design; Conduct analysis; Oversee and regulate; Determine policy; Training researchers; Development of new methods
    6. Some Statistics on Biostatistics: Internet search (Google) > 210,000 hits; > 50 graduate programs in U.S.; Too much to cover in one hour!
    7. Center Focus: MSU strengths (computational simulation in physical sciences; environmental health sciences); Bioinformatics is crowded; Computational simulation in environmental health sciences (build on appreciable MSU strength; establish ourselves; unique capability; particular appeal to NIEHS)
    8. Focus of Seminar: Statistical methodologies; Computational simulation in environmental health sciences; Can be classified as “biostatistics”; Stochastic modeling; Time series; Spatial statistics*
    9. The Application: Of interest (cancer incidence rate; pesticide exposure); Of concern (age; gender; race; socioeconomic status); Objectives (suitably adjust cancer incidence rate; determine if relationship exists; develop model; explain relationship; estimate cancer rate; predict cancer rate)
    10. The Data: N.S.S. & U.S. Dept. of Commerce National T.I.S., 1972-2001, by county (number of acres harvested; type of crop); MS State Dept. Health Central Cancer Registry, 1996-1998, by person (tumor type; age; gender; race; county of residence; cancer morbidity; crude incidence/100,000; age adjusted incidence/100,000)
    11. Why (Bio)statistics? Statistics (science of uncertainty; model order from disorder); Disorder exists (large scale rational explanation; smaller scale residual uncertainty); Chaos; Deterministic equation; Randomness; x0; Entropy
    12. (Bio)statistical Data: Independent identically distributed; Inhomogeneous data; Dependent data; Time series; Spatial statistics
    13. Time Series: Identically distributed; Time dependent; Equally spaced; Randomness
    14. Objectives in Time Series: Graphical description (time plots; correlation plots; spectral plots); Modeling; Inference; Prediction
    15. Time Series Models: Linear models; Covariance stationary (constant mean; constant variance; covariance function of distance in time); $\varepsilon(t)$ ~ i.i.d. (zero mean; finite variance); $\{\phi_l\}$ square summable
    16. Nonlinear Time Series: Amplitude-frequency dependence; Jump phenomenon; Harmonics; Synchronization; Limit cycles; Biomedical applications (respiration; lupus erythematosus; urinary nitrogen excretion; neural science; human pupillary system)
    17. Some Nonlinear Models: Nonlinear AR (additive noise); Threshold (AR; smoothed TAR; Markov chain driven; fractals); Amplitude-dependent exponential AR; Bilinear; AR with conditional heteroscedasticity; Functional coefficient AR
    18. A Threshold Model
    19. A Threshold Model
    20. Describing Correlation: Autocorrelation (AR: exponential decay; MA: 0 past q); Partial autocorrelation (AR: 0 past p; MA: exponential decay); Cross-correlation; Relationship to spectral density
    21. Spatial Statistics*: Data components (spatial locations S = {s1,s2,…,sn}; observable variable {Z(s1),Z(s2),…,Z(sn)}; s ∈ D ⊂ R^k); Correlation; Data structures (geostatistical; lattice; point patterns or marked spatial point processes; objects); Assumptions on Z and D
    22. Biological Applications: Geostatistics (soil science; public health); Lattice (remote sensing; medical imaging); Point patterns (tumor growth rate; in vitro cell growth)
    23. Spatial Temporal Models: Combine time series with spatial data; Application; Time element (time from pesticide exposure to developing cancer); Spatial element (proximity to pesticide use)