
"Biomedical computing research encompasses an extremely broad area, and NIH welcomes biomedical computing research centers targeting research areas of any of its institutes. We have chosen to target computational simulation in environmental health science research for several strategic reasons. MSU has very significant and recognized strength in computational simulation in the physical sciences in the ERC. MSU also has such strength in environmental health sciences in the CEHS, and this is an NIH area expected to be less targeted by significant competitors for full centers. Bioinformatics (genomics, proteomics, etc.), on the other hand, is now a crowded field. Thus computational simulation, as opposed to bioinformatics, is especially well suited to our opportunity. Therefore, computational simulation in environmental health sciences is felt to be a focus area in which we can build on appreciable MSU strength and in which we can establish ourselves as a unique capability of particular appeal to the NIEHS."

*In November, Dr. Lance Waller, from the Graduate School of Public Health at Emory University, will be giving another talk on spatial statistics in which he analyzes data in an attempt to determine if a newly constructed pier off the Florida coast has altered the nesting and mating habits of sea turtles.

From the National Statistical Service and the U.S. Department of Commerce National Technical Information Service, for the years 1972-2001, on the county level, we have the following crop information. This data acts as a proxy for type and amount of pesticide used.

Number of acres harvested

Type of crop (corn, soybeans, rice, cotton, etc.)

From the Mississippi State Department of Health Central Cancer Registry for the years 1996 - 1998 on the individual (personal) level we have the following information. The registry for 1996 is complete. The registry for the years 1997 and 1998 is partially complete, and is nearing completion.

Tumor type

Age

Gender

Race

County of residence

Cancer morbidity and mortality data provided as a frequency of crude incidence per 100,000 and age adjusted incidence rate per 100,000 on a county basis for different types of cancers.

We also hope to work with the Mississippi State University Social Science Research Center, Remote Sensing Technology Center, and the State and Federal Census agencies to obtain other county-wide socioeconomic and demographic information including employment, land use, population density, personal income, and poverty level.

$$
E = -\sum_{i=1}^k p_i \log(p_i),
$$

where $x \log(x)$ is taken to be $0$ for $x = 0$.
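The entropy of a discrete distribution can be computed directly. A minimal Python/NumPy sketch (the helper name `entropy` is ours, not from the text), using the convention that zero cells contribute nothing:

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a probability vector, with x*log(x) taken as 0 at x = 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0                      # skip zero cells: x log x -> 0 as x -> 0+
    return -np.sum(p[nz] * np.log(p[nz]))

# The uniform distribution maximizes entropy: E = log(k)
print(entropy([0.25, 0.25, 0.25, 0.25]))  # log(4) ~ 1.386
print(entropy([1.0, 0.0, 0.0, 0.0]))      # a degenerate distribution has E = 0
```

The masking step is what implements the $x\log(x) = 0$ convention without triggering a log-of-zero warning.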

There is a very interesting entropy inequality that can be proved. Consider the initial probability distribution $p = (p_1, \ldots, p_k)$ evolving into a subsequent probability distribution $pA$, where $A$ is a $k \times k$ matrix of nonnegative elements, each of whose rows sums to one. Write $A = (a_{ij})$; then $a_{ij}$ represents the conditional probability that the subsequent state of nature is $j$, given that the initial state of nature was $i$, and $A$ is called the transition probability matrix. When states of nature evolve through repeated application of a stochastic mechanism described by the matrix $A$, the result is called a Markov chain. The entropy inequality is

$$
E(\mu^{(1)}:\nu^{(1)}) \ge E(\mu^{(0)}:\nu^{(0)}),
$$

where $E(\mu^{(m)}:\nu^{(m)}) = -\sum_{j=1}^k \mu_j^{(m)}\log(\mu_j^{(m)}/\nu_j^{(m)})$, $m = 0,1$. Initially $\mu_j^{(0)} = p_j$ and $\nu_j^{(0)} = 1$, $j = 1, \ldots, k$, so that $E(\mu^{(0)}:\nu^{(0)})$ is simply $-\sum_{i=1}^k p_i \log(p_i)$. The subsequent measures are $\mu_j^{(1)} = \sum_{i=1}^k a_{ij}\mu_i^{(0)} = \sum_{i=1}^k a_{ij}p_i$ and $\nu_j^{(1)} = \sum_{i=1}^k a_{ij}\nu_i^{(0)} = \sum_{i=1}^k a_{ij}$, $j = 1, \ldots, k$, depending upon the transition matrix $A$.
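The inequality can be checked numerically. In the sketch below (Python/NumPy; the particular distribution and transition matrix are randomly generated, purely for illustration), we write the initial measures as a probability vector paired with the all-ones measure and evolve both through a row-stochastic matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 4
p = rng.dirichlet(np.ones(k))             # an arbitrary initial distribution
A = rng.dirichlet(np.ones(k), size=k)     # random transition matrix; rows sum to 1

def E(mu, nu):
    """Entropy functional -sum_j mu_j * log(mu_j / nu_j)."""
    nz = mu > 0
    return -np.sum(mu[nz] * np.log(mu[nz] / nu[nz]))

mu0, nu0 = p, np.ones(k)                  # initial measures
mu1 = A.T @ mu0                           # mu_j(1) = sum_i a_ij * mu_i(0)
nu1 = A.T @ nu0                           # nu_j(1) = sum_i a_ij * nu_i(0)

print(E(mu0, nu0), E(mu1, nu1))           # the second value is never smaller
```

The guarantee holds for any row-stochastic $A$: this is the monotonicity of relative entropy under stochastic maps.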

A physical interpretation of this inequality is that the universe tends to seek levels of entropy that are higher in relation to previous levels. Technology attempts to slow the increase in entropy by constraining evolving systems, but there are convincing arguments that it will never decrease total entropy.

Another approach which leads to the same logical conclusion stems from the area of nonlinear difference equations, a discrete-time analog of differential equations. For each integer $t$, let $x_t$ denote a $k$-dimensional (state) vector in ${\bR}^k$, satisfying the equation

$$
x_t = f(x_{t-1}) \qquad (1)
$$

where $f$ is a vector-valued function. Let $f^{(k)}$ represent the $k$th iterate of $f$; that is,

$$
f^{(k)}(x) = x_k = f(x_{k-1}) = f(f(f(\cdots x))) \quad (k \ \mbox{times}).
$$

The deterministic iteration defined by (1) can exhibit trajectories almost indistinguishable from realizations of a stochastic process. Thus, randomness can be generated by a strictly deterministic equation. This phenomenon is loosely described as chaos. Fundamentally, randomness is generated because of sensitive dependence on initial conditions. In other words, a small perturbation of the initial condition can lead to vastly different realizations. In short, statistics is important because disorder -- chaos -- randomness is here to stay.
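Sensitive dependence on initial conditions is easy to exhibit. The sketch below (Python/NumPy) iterates the logistic map $x \mapsto 4x(1-x)$, a standard chaotic example of our choosing, from two initial conditions differing by $10^{-10}$:

```python
import numpy as np

def iterate(f, x0, n):
    """Return the trajectory x0, f(x0), f^(2)(x0), ..., f^(n)(x0)."""
    traj = [x0]
    for _ in range(n):
        traj.append(f(traj[-1]))
    return np.array(traj)

logistic = lambda x: 4.0 * x * (1.0 - x)   # chaotic on [0, 1] for r = 4

a = iterate(logistic, 0.3, 50)
b = iterate(logistic, 0.3 + 1e-10, 50)     # a tiny perturbation of x0

# Early iterates agree; after enough steps the trajectories separate completely.
print(np.max(np.abs(a - b)))
```

Although the rule is strictly deterministic, the two trajectories become effectively uncorrelated within a few dozen iterations.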

Independent Identically Distributed (i.i.d.) Data Modeling

This is the simplest case, and is what is considered in most introductory statistics classes, including most first-year Masters-level classes in statistics. It is so often studied that the initials “i.i.d.” will mean something to someone who has had only one or two courses in elementary statistics. In the i.i.d. case, the observations x1, x2, …, xn are the result of a random sample chosen from the same population. The randomness of selection assures independence of one observation from the others. This independence implies directly that knowledge about the value of one observation gives no increased knowledge about the values of any other observation. Because all observations were taken from the same population, they all have the same restrictions and properties, although these restrictions and properties are, for the most part, unknown. Typically the purpose of the study of such data is to determine the “best” family of distributions for describing the overall pattern of the data or to infer a decision about specified properties of the population from information in the sample data.

Inhomogeneous Data Modeling

Lack of homogeneity in data is often accounted for in the modeling process by including a non-constant mean or non-constant variance assumption. Often, this non-constancy is accounted for by modeling procedures specific to the source of inhomogeneity, and then the adjusted data analyzed as the i.i.d. case. However, it is often the case that, even after these large-scale variations are accounted for, reasons exist to suspect inhomogeneous small-scale variations.

Dependent Data Modeling

Independence is a very convenient assumption that makes much of mathematical-statistical theory tractable. However, models that involve statistical dependence are often more realistic; two classes of models that have commonly been used involve intraclass-correlation structures and serial correlation structures. A third class of dependent data is spatial data, where dependence is present in all directions and becomes weaker as data locations become more dispersed.

The notion that data that come close together are likely to be correlated (i.e., cannot be modeled as statistically independent) is a natural one, and has been successfully used by statisticians to model physical and social phenomena. Time series and spatial statistics are areas in statistics specifically targeted at analyzing this type of data.

Description -- time series analysis is extremely graphical in nature because of the lack of independence among the quantities that it investigates. Graphs describing time series fall into three categories:

Time plot

Correlation plots

Autocorrelation

Partial autocorrelation

Cross-correlation (multivariate series)

Spectral plots

Spectral density

Spectral cumulative distribution

Cross-spectrum (multivariate series)

Modeling -- we assume that some random mechanism is generating the observed data. We then attempt to determine the form of that mechanism, and estimate the parameters within the form. In general, models fall into one of two classes: linear or nonlinear.

Inference -- using the model, we seek to determine from a time series data set what other data could have been observed or are expected to be observed in the future. If the model generating the data is a linear model, then many inferential methods are parametric (usually asymptotically). However, for nonlinear models, the methods are largely non-parametric.

Prediction -- use the correlation in the time series to predict the future.

$$

\sum_{j=0}^p \alpha_j X(t-j) = \sum_{i=0}^q \beta_i \varepsilon(t-i),

$$

where $\alpha_0 = \beta_0 \equiv 1$, or equivalently as

$$

X(t) = \sum_{l = 0}^\infty \phi_l \varepsilon(t-l)

$$

where $\{\varepsilon(t)\}$ is a purely random error process with zero mean and finite variance $\sigma_\varepsilon^2$, and $\{\phi_l\}$ is a given sequence of constants satisfying $\sum_{l=0}^\infty \phi_l^2 < \infty$. Under reasonable constraints on the error process and the coefficients $\phi_l$, linear models have desirable probabilistic properties that make mathematical-statistical properties for inferential and predictive purposes easier to describe in the long run. Of specific interest is a property known as second-order or covariance stationarity, which ensures that

the mean $\mu_X = {\mbox{E}}[X(t)]$ is constant as a function of time,

and all second moments (covariances) depend only upon how far apart in time the points are, and not upon when they occur; that is, the covariance $\gamma(\nu) = {\mbox{E}}[X(t)X(t+\nu)], \quad \nu = 0, 1, 2, \ldots$ is a function of $\nu$ only, and not of $t$. Note that for $\nu = 0$, $\gamma(\nu) = \sigma^2_X$, the variance of the series, so this also implies that the variance is constant as a function of time.

Time series satisfying these constraints have long-range forecasts which converge to a constant. If the distribution of the $\varepsilon(t)$ is symmetric, then the series is also time reversible.
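A concrete special case is the AR(1) process $X(t) = \phi X(t-1) + \varepsilon(t)$ with $|\phi| < 1$, whose MA($\infty$) weights are $\phi_l = \phi^l$ (square summable). The sketch below (Python/NumPy; the coefficient and sample size are illustrative) simulates such a series and compares the sample variance and lag-1 autocorrelation against the stationary theory:

```python
import numpy as np

rng = np.random.default_rng(1)
phi, n = 0.6, 200_000                     # illustrative coefficient and sample size
eps = rng.standard_normal(n)              # purely random errors, mean 0, variance 1

# AR(1): X(t) = phi*X(t-1) + eps(t); |phi| < 1 gives covariance stationarity.
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + eps[t]

rho1 = np.corrcoef(x[:-1], x[1:])[0, 1]
print(x.var(), 1.0 / (1.0 - phi**2))      # sample vs. theoretical variance
print(rho1, phi)                          # lag-1 autocorrelation is phi in theory
```

Both empirical quantities settle near their stationary values, illustrating that the mean, variance, and covariances do not drift with time.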

It is often the case that a time series will not initially satisfy second-order stationarity. In many of these cases, a simple transformation will result in a time series that is linear and second-order stationary. In these cases, linear models do an excellent job of describing the behavior observed in the data set.

Nonlinear models, by contrast, are capable of describing features such as limit cycles, jump phenomena, harmonics, time irreversibility, synchronization, and other phenomena that linear models are unable to capture. There are a number of tests for nonlinearity. A thorough exposition and study of these tests is given in Harvill (1999). There are a number of very important nonlinear time series models that have been shown to work well in modeling a variety of nonlinear behaviors.

Possibly the single most important class of non-linear time series models which are directly related to dynamical systems is the class of non-linear autoregressive models. Specifically, $\{X_t\}$ is said to follow a {\it non-linear autoregressive model of order $p$ with general noise} (NLAR) if there exists a function $f\,:{\bR}^{p+1} \rightarrow {\bR}$ such that $$ X_t = f(X_{t-1},X_{t-2},\ldots,X_{t-p},\varepsilon_t), \quad t \in {\bZ} $$ where $\{\varepsilon_t\}$ is a sequence of zero-mean identically distributed random errors with finite variance. With the error term absent, (\ref{nlar}) is a non-linear difference equation of order $p$; this noise-free case is commonly referred to as the skeleton. If (\ref{nlar}) can be written as $$ X_t = f(X_{t-1},X_{t-2},\ldots,X_{t-p}) + \varepsilon_t, \quad t \in {\bZ} $$ it is an additive noise model. Most non-linear time series procedures in existence are for additive noise models.

A general {\it threshold model} allows for the analysis of a complex stochastic system by decomposing it into simpler subsystems. Threshold models encompass a huge set of behaviors. Of special interest are threshold autoregressive models (TAR), smoothed threshold autoregressive models (STAR), Markov chain driven models, and fractals.
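A TAR model can be simulated with a few lines. The sketch below (Python/NumPy) generates a hypothetical two-regime TAR(1); the threshold at 0 and the two AR coefficients are our illustrative choices, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = np.zeros(n)

# A hypothetical two-regime TAR(1): the AR coefficient switches according to
# whether the previous value lies below or above the threshold 0.
for t in range(1, n):
    if x[t - 1] <= 0.0:
        x[t] = 0.8 * x[t - 1] + rng.standard_normal()
    else:
        x[t] = -0.5 * x[t - 1] + rng.standard_normal()
```

Each regime is a simple linear AR(1); the nonlinearity comes entirely from the regime switching.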

Amplitude-dependent exponential autoregressive models (EXPAR) are of the basic form $$ X(t) = \sum_{j=1}^p [\alpha_j + \beta_j\exp\{-\delta X^2(t-j)\}]X(t-j) + \varepsilon(t), \quad \delta > 0. $$ A model of this type is particularly useful for modeling amplitude-dependent behavior.
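An EXPAR(1) series is equally simple to simulate. In the sketch below (Python/NumPy; all parameter values are illustrative), the effective AR coefficient $\alpha + \beta\exp\{-\delta X^2(t-1)\}$ moves smoothly between $\alpha + \beta$ for small $|X|$ and $\alpha$ for large $|X|$:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
alpha, beta, delta = 0.8, -1.1, 1.0       # illustrative parameters, delta > 0
x = np.zeros(n)

# EXPAR(1): the AR coefficient depends on the amplitude of the previous value.
for t in range(1, n):
    coef = alpha + beta * np.exp(-delta * x[t - 1] ** 2)
    x[t] = coef * x[t - 1] + rng.standard_normal()
```

With these values the coefficient always stays inside $(-1, 1)$, so the simulated series remains stable.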

A time series is said to follow a bilinear model if it satisfies the equation $$ X(t) + \sum_{l=1}^p \alpha_lX(t-l) - \sum_{k=1}^q \beta_k\varepsilon(t-k) = \sum_{i=1}^r\sum_{j=1}^s b_{ij}X(t-i)\varepsilon(t-j) + \varepsilon(t). $$

Generalized autoregressive models with conditional heteroscedasticity (GARCH) modify the model in (\ref{nlaran}) by allowing the variance of the error terms to change with values of $X$. That is, $$ X(t) = \varepsilon(t)V_t, $$ where $$ V_t = \delta + \sum_{i=1}^q \phi_i X^2(t-i) + \sum_{j=1}^p \psi_j V_{t-j}, \quad {\mbox{$\psi_j \ge 0$ for all $j$.}} $$
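The sketch below (Python/NumPy) simulates the ARCH(1) special case (all $\psi_j = 0$), reading $V_t$ as a conditional variance and setting $X(t) = \varepsilon(t)\sqrt{V_t}$, a common parameterization; the coefficients are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
n, delta, phi1 = 20_000, 0.5, 0.4          # illustrative ARCH(1) parameters
x = np.zeros(n)
v = np.full(n, delta)

# ARCH(1): conditional variance rises after a large |X|, producing
# the volatility clustering that constant-variance linear models miss.
for t in range(1, n):
    v[t] = delta + phi1 * x[t - 1] ** 2
    x[t] = rng.standard_normal() * np.sqrt(v[t])

print(x.var(), delta / (1.0 - phi1))       # unconditional variance delta/(1-phi1)
```

The series is uncorrelated but not independent: its squared values are autocorrelated, which is the signature of conditional heteroscedasticity.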

An emerging, more general class of non-linear models is the class of functional coefficient autoregressive models (FCAR). This class of models is fairly new in the statistical literature. A time series $X(t)$ is said to follow an FCAR model if it satisfies $$ X(t) = f_1({\bX}^*(t-d))X(t-1) + \cdots + f_p({\bX}^*(t-d))X(t-p) + \varepsilon(t), $$ where $f_i,~i = 1, \ldots, p$ are all measurable functions from ${\bR}^k \rightarrow {\bR}$, and $${\bX}^*(t-d) = (X(t-d-i_1), \ldots, X(t-d-i_k))^\prime, \quad {\mbox{with $i_j > 0$ for $j = 1,\ldots,k$.}} $$ Without loss of generality, $d + \max(i_1, \ldots, i_k) \le p$. This class of models includes the non-linear autoregressive models, as well as threshold models, bilinear models, and exponential autoregressive models as special cases. However, due to their non-parametric nature, most results (estimation, inference, and prediction) are highly computational in nature.

Just as correlation in traditional statistics describes the linear association between pairs of measurements (e.g., height and weight), in time series these pairs are the sets of points $\nu$ units apart in time. In other words, for the time series $\{X(t)\}$, we want to describe the correlation for all pairs of points 1 unit apart in time, the correlation for all pairs of points 2 units apart in time, and so on. Because the correlation is of the time series with itself, it is most often referred to as {\it autocorrelation}. If the time series is second-order stationary, then we have an autocorrelation {\it function} $\rho(\nu),~\nu = 1, 2, \ldots$, given by

$$

\rho(\nu) = \dfrac{\gamma(\nu)}{\sigma^2_X}, \quad \nu = 1, 2, \ldots

$$

If the time series is linear, then a fairly obvious re-writing of the expression for correlation from traditional statistics yields an estimator of the linear association between $X(t)$ and $X(t+\nu)$. Let $x(1), x(2), \ldots, x(n)$ represent the realization of the time series $\{X(t)\}$. Then the estimated autocorrelation function is $$

\hat\rho(\nu) = \dfrac{\sum_{t=1}^{n-\nu}[x(t) - \bar x][x(t+\nu) - \bar x]}{\sum_{t=1}^{n-\nu}[x(t) - \bar x]^2} \quad \nu &lt; n,

$$

where $\bar x = n^{-1}\sum_{t=1}^n x(t)$ is the sample mean of $x(t)$. As in traditional statistics, this is a measure of the strength of the linear association between points $\nu$ units apart in time.
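The estimator above translates directly into code. A minimal Python/NumPy sketch (the function name `acf` is ours), following the formula term by term, including its denominator $\sum_{t=1}^{n-\nu}[x(t)-\bar x]^2$:

```python
import numpy as np

def acf(x, nu):
    """Sample autocorrelation at lag nu, following the estimator above."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()                                       # center at the sample mean
    num = np.sum(d[:-nu] * d[nu:]) if nu > 0 else np.sum(d * d)
    den = np.sum(d[:len(x) - nu] ** 2)
    return num / den

# White noise has (approximately) zero autocorrelation at every nonzero lag.
rng = np.random.default_rng(5)
e = rng.standard_normal(5000)
print(acf(e, 1))
```

Applied to a strongly trending series, the same function returns a value near 1, since neighboring points move together.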

If the time series is non-linear, it could be that there is a strong non-linear association, but that $\hat\rho(\nu) \approx 0$ for all $\nu$ (Harvill and Ray, 2000). In this case, alternate non-parametric estimators for the strength of the association between points $\nu$ units apart in time must be considered. There are numerous proposed methods for doing so.

Partial autocorrelation is an attempt to explain the correlation between points $\nu$ units apart in time with the common effects of the points in between removed. This is accomplished by fitting a full model and a reduced model, and computing the correlation between the residuals.
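The residual-correlation recipe can be sketched for lag 2 (Python/NumPy; the helper name `pacf2` is ours): regress $X(t)$ and $X(t-2)$ each on the intermediate point $X(t-1)$, then correlate the two residual series:

```python
import numpy as np

def pacf2(x):
    """Partial autocorrelation at lag 2: correlate the residuals of X(t) and
    X(t-2) after each has been regressed on the intermediate point X(t-1)."""
    x = np.asarray(x, dtype=float)
    y2, y1, y0 = x[2:], x[1:-1], x[:-2]
    b_fwd = np.polyfit(y1, y2, 1)          # regress X(t) on X(t-1)
    b_bwd = np.polyfit(y1, y0, 1)          # regress X(t-2) on X(t-1)
    r_fwd = y2 - np.polyval(b_fwd, y1)
    r_bwd = y0 - np.polyval(b_bwd, y1)
    return np.corrcoef(r_fwd, r_bwd)[0, 1]

# For an AR(1) process the partial autocorrelation is zero past lag 1.
rng = np.random.default_rng(7)
n, phi = 20_000, 0.6
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.standard_normal()
print(pacf2(x))   # near 0
```

This is why the PACF of an AR($p$) process cuts off after lag $p$: once the intervening values are accounted for, nothing remains.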

More generally, if ${\bX}_t = (X_{1,t}, X_{2,t}, \ldots, X_{k,t})$ represents a $k$-valued process at time $t,~t = 1, 2, \ldots, n$, the process is referred to as a vector or multivariate time series. The $k$-vector of error terms $(\varepsilon_{i,t}),~i = 1, \ldots, k,~t = 1, 2, \ldots, n$, is such that the within-component errors $\varepsilon_{i,\cdot},~i = 1,\ldots,k$, are independent, but a possible cross-correlation of error terms exists between components. The added dimensionality and correlation structure enriches the class of models that can be considered, but also adds a level of mathematical and statistical complexity. The ``curse of dimensionality'' applies, and innovative, creative, yet rigorous methods become difficult to come by. Tsay (1998) and Harvill and Ray (1999) extend tests of linearity and non-linear modeling into the multivariate framework for threshold models and for non-linear autoregressive and bilinear models, respectively. Harvill and Ray (2000) extend some non-parametric methods for measuring the strength of non-linear association that are less affected by higher dimensions. Finally, Ray and Harvill (pre-print, 2003) have begun extending results on functional coefficient autoregressive models into the multivariate time series literature.

Spectral plots and correlation plots yield the same type of information in different settings. For a second-order stationary time series, if $\gamma(\nu)$ is absolutely summable, then there exists a function $f(\omega),~\omega \in [0,1]$, symmetric about $\omega = 1/2$, such that $\gamma(\nu)$ is the Fourier transform of $f(\omega)$; that is

\begin{eqnarray*}

\gamma(\nu) & = & \int_0^1\! f(\omega) e^{2\pi i \nu \omega}\,d\omega \\

f(\omega) & = & \sigma_X^2 +

2\sum_{\nu=1}^\infty \gamma(\nu)e^{-2\pi i \nu \omega}.

\end{eqnarray*}

The absolutely continuous function $F$ defined by $$

F(\omega) = \int_0^\omega\! f(x)\,dx

$$

is the {\it cumulative spectral distribution function}. The function $f(\omega)$ is the {\it spectral density function} of $X(t)$.
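The covariance/spectral-density pair can be verified numerically for an AR(1) process, for which $\gamma(\nu) = \sigma_X^2\phi^\nu$ and the density has the closed form $\sigma_\varepsilon^2/(1 - 2\phi\cos(2\pi\omega) + \phi^2)$, a standard result not stated in the text. The Python/NumPy sketch below sums the (real, cosine) form of the series and compares:

```python
import numpy as np

phi, sigma2 = 0.6, 1.0                     # illustrative AR(1) parameters
omega = np.linspace(0.0, 1.0, 201)

# Truncated series: f(w) = sigma_X^2 + 2*sum_nu gamma(nu)*cos(2*pi*nu*w)
sigma_x2 = sigma2 / (1.0 - phi**2)
nus = np.arange(1, 200)
gamma = sigma_x2 * phi ** nus
f_series = sigma_x2 + 2.0 * (gamma[None, :]
                             * np.cos(2.0 * np.pi * nus[None, :] * omega[:, None])
                             ).sum(axis=1)

# Closed form for AR(1): sigma_eps^2 / (1 - 2*phi*cos(2*pi*w) + phi^2)
f_closed = sigma2 / (1.0 - 2.0 * phi * np.cos(2.0 * np.pi * omega) + phi**2)

print(np.max(np.abs(f_series - f_closed)))  # tiny truncation error
```

The computed density is also symmetric about $\omega = 1/2$, as stated above.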

The basic components of spatial data are the spatial locations $\{{\bs}_1, \ldots, {\bs}_n\}$ and the data $\{Z({\bs}_1), \ldots, Z({\bs}_n)\}$ observed at the locations. Usually the data are assumed random, and sometimes the locations are assumed random. Once the locations are given, the possibility of mistaken or imprecise positioning is generally not modeled.

Let ${\bs} \in {\bR}^k$ be a generic data location in $k$-dimensional Euclidean space and suppose that the {\it potential} data ${\bZ}({\bs})$ at spatial location ${\bs}$ is a random quantity. The locations ${\bs}$ vary over some index set $D \subset {\bR}^k$; the nature of $D$ determines to a large extent the method of analysis.

Just as in time series analysis, an attempt is made to determine the correlation structure of points that are some distance apart (across space). If the covariance function $C[{\bZ}({\bs}_1),{\bZ}({\bs}_2)] = C({\bs}_1 - {\bs}_2)$ is a function of the difference of the locations, and not of the locations themselves, then the process is second-order stationary. If the covariance $C(\cdot)$ is a function only of $||{\bs}_1 - {\bs}_2||$, then $C(\cdot)$ is called {\it isotropic}. The function $C(\cdot)$ is called the {\it covariogram}. The property of {\it ergodicity} is also important in spatial statistics. Basically, it allows expectations over the set of all possible realizations of ${\bZ}({\bs})$ to be estimated by spatial averages. It says that the series, when successively translated, completely fills up the space of all possible trajectories. There are sufficient conditions for ergodicity. Often the assumption is made to allow inference to proceed for a series of dependent observations; it might only be verifiable in the sense that one fails to reject it.
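The ergodic idea, estimating the covariogram from a single realization by averaging over pairs of locations, can be sketched as follows (Python/NumPy; the exponential covariogram, region size, and bin widths are all our illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 400
s = rng.uniform(0.0, 10.0, size=(n, 2))        # random locations in a 10 x 10 region

# Simulate an isotropic Gaussian field with an (assumed) exponential covariogram
# C(h) = exp(-||h|| / range_param).
range_param = 2.0
d = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=-1)
C = np.exp(-d / range_param)
z = np.linalg.cholesky(C + 1e-8 * np.eye(n)) @ rng.standard_normal(n)

# Empirical covariogram: average z_i*z_j over pairs grouped into distance bins.
iu = np.triu_indices(n, 1)
bins = np.linspace(0.0, 5.0, 11)
idx = np.digitize(d[iu], bins)
prod = (z[:, None] * z[None, :])[iu]
chat = np.array([prod[idx == b].mean() for b in range(1, len(bins))])
print(chat)    # decays with distance, roughly tracking exp(-h/2)
```

The binned averages are spatial averages standing in for expectations, which is exactly what ergodicity licenses.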

The type of spatial data will determine precisely how correlation is estimated, modeling is conducted, and predictions are obtained.

Geostatistical data: $D$ is a fixed subset of ${\bR}^k$ that contains a $k$-dimensional rectangle of positive volume. The data $\{{\bZ}({\bs})\}$ is a random vector at location ${\bs} \in D$. The name ``geostatistics'' stems from the early beginnings of analysis of data where the spatial index ${\bs}$ is allowed to vary continuously over a subset of ${\bR}^k$. Other applications of methods in geostatistics include hydrology, soil science, public health, uniformity trials, and acid rain, to name a few.

Lattice data: $D$ is a fixed regular or irregular collection of countably many points of ${\bR}^k$. The data $\{{\bZ}({\bs})\}$ is a random vector at location ${\bs} \in D$. A lattice of locations evokes an idea of regularly spaced points in ${\bR}^k$, linked to nearest neighbors, second-nearest neighbors, and so on. Of all of the possible spatial structures, a data set with spatial locations on a regular lattice in ${\bR}^k$ is the closest analog to a time series at equally spaced time points.

Point patterns or marked spatial point processes: $D$ is a point process in ${\bR}^k$ or a subset of ${\bR}^k$, and the data $\{{\bZ}({\bs})\}$ is a random vector at location ${\bs} \in D$. When no ${\bZ}$ is specified, the usual spatial point process is obtained. Point patterns arise when the important variable to be analyzed is the location of the events. Most often, the question to be answered is whether the pattern exhibits complete spatial randomness, clustering, or regularity. In the simplest case, the $Z$ variable is called the {\it mark variable}, and the whole process is a marked spatial point process. The mark variable does not have to be a real variable; it could be a set, which yields processes such as the Boolean model.

Objects: $D$ is a point process in ${\bR}^k$; $Z({\bs})$ is a random set.

The typical assumptions made in spatial analysis are that either ${\bZ}$ or $D$ is fixed (and the other random), or that ${\bZ}$ and $D$ are independent if both are random. Therefore spatial modeling occurs within the ${\bZ}$ process (geostatistical data and lattice data), within the $D$ process, or within both processes (point patterns), and typically involves modeling the large- and small-scale variations in terms of a finite number of parameters.

Data from remote sensing satellites offers an efficient means of gathering data of this type. There is a large overlap between the remote sensing techniques and (low-level) medical imaging techniques; although the spatial scales are vastly different, the form of the data and the questions being asked are often similar. Statistical models for such data need to express the fact that observations nearby (in time or space) tend to be more alike.

Data that form this construct are often images. The goal of analyzing such a data set is typically to estimate parameters of the random set and the point process. Boolean models have been successfully used to describe tumor growth rate. Another application of these is modeling cells growing in vitro, where the analysis is conducted in such a manner that takes shape as well as size into account.

- 1. Biostatistics for Dummies Biomedical Computing Cross-Training Seminar October 18th , 2002
- 2. What is “Biostatistics”? Techniques Mathematics Statistics Computing Data Medicine Biology
- 3. What is “Biostatistics”? Biological data Knowledge of biological process
- 4. Common Applications (Medical and otherwise) Clinical medicine Epidemiologic studies Biological laboratory research Biological field research Genetics Environmental health Health services Ecology Fisheries Wildlife biology Agriculture Forestry
- 5. Biostatisticians Work Develop study design Conduct analysis Oversee and regulate Determine policy Training researchers Development of new methods
- 6. Some Statistics on Biostatistics Internet search (Google) > 210,000 hits > 50 Graduate Programs in U.S. Too much to cover in one hour!
- 7. Center Focus MSU strengths Computational simulation in physical sciences Environmental health sciences Bioinformatics is crowded Computational simulation in environmental health sciences Build on appreciable MSU strength Establish ourselves Unique capability Particular appeal to NIEHS
- 8. Focus of Seminar Statistical methodologies Computational simulation in environmental health sciences Can be classified as “biostatistics” Stochastic modeling Time series Spatial statistics*
- 9. The Application Of interest Cancer incidence rate Pesticide exposure Of concern Age Gender Race Socioeconomic status Objectives Suitably adjust cancer incidence rate Determine if relationship exists Develop model Explain relationship Estimate cancer rate Predict cancer rate
- 10. The Data N.S.S. & U.S. Dept. of Commerce National T.I.S. (1972-2001, by county) Number of acres harvested Type of crop MS State Dept. Health Central Cancer Registry (1996 – 1998, by person) Tumor type Age Gender Race County of residence Cancer morbidity Crude incidence/100,000 Age adjusted incidence/100,000
- 11. Why (Bio)statistics? Statistics Science of uncertainty Model order from disorder Disorder exists Large scale rational explanation Smaller scale residual uncertainty Chaos Deterministic equation Randomness x0 Entropy
- 12. (Bio)statistical Data Independent identically distributed Inhomogeneous data Dependent data Time series Spatial statistics
- 13. Time Series Identically distributed Time dependent Equally spaced Randomness
- 14. Objectives in Time Series Graphical description Time plots Correlation plots Spectral plots Modeling Inference Prediction
- 15. Time Series Models Linear Models Covariance stationary Constant mean Constant variance Covariance function of distance in time ε(t) ~ i.i.d. Zero mean Finite variance Square summable
- 16. Nonlinear Time Series Amplitude-frequency dependence Jump phenomenon Harmonics Synchronization Limit cycles Biomedical applications Respiration Lupus erythematosus Urinary nitrogen excretion Neural science Human pupillary system
- 17. Some Nonlinear Models Nonlinear AR Additive noise Threshold AR Smoothed TAR Markov chain driven Fractals Amplitude- dependent exponential AR Bilinear AR with conditional heteroscedasticity Functional coefficient AR
- 18. A Threshold Model
- 19. A Threshold Model
- 20. Describing Correlation Autocorrelation AR: exponential decay MA: 0 past q Partial autocorrelation AR: 0 past p MA: exponential decay Cross-correlation Relationship to spectral density
- 21. Spatial Statistics* Data components Spatial locations S = {s1,s2,…,sn} Observable variable {Z(s1),Z(s2),…,Z(sn)} s ∈ D ⊂ Rk Correlation Data structures Geostatistical Lattice Point patterns or marked spatial point processes Objects Assumptions on Z and D
- 22. Biological Applications Geostatistics Soil science Public health Lattice Remote sensing Medical imaging Point patterns Tumor growth rate In vitro cell growth
- 23. Spatial Temporal Models Combine time series with spatial data Application Time element Time from pesticide exposure to cancer development Spatial element Proximity to pesticide use
