Advanced Statistical Manual for Ayurveda Research

CONFIDENTIAL © 2019 AyurData
Balance
Calmness
Serenity
Introducing
Advanced Statistical Manual
for Ayurveda Research
Kadiroo Jayaraman
Praveen Venugopal
AyurData

Earlier Release
The basic manual covered the syllabus on Medical Statistics specified for
M.D. students by the Central Council of Indian Medicine.

Current Release
The advanced manual covers more advanced statistical applications,
including those used in data science.
Each concept is introduced first, followed by an illustration and its use
in a real context.
Some mathematics is involved, but it is explained in the text.

TOPICS COVERED
1. Analysis of Repeated Measures
2. Multiple Linear Regression
3. Superiority, Bioequivalence and Non-inferiority Trials
4. Logistic Regression
5. Decision Trees
6. Random Forest
7. Support Vector Machines
8. Naïve Bayes Classifier
9. Neural Networks
10. K-Nearest Neighbour Technique
11. Principal Component Analysis
12. Cluster Analysis
13. Stratified Multistage Sampling
14. Analysis of Time Series Data
15. Analysis of Time-to-event Data

ANALYSIS OF REPEATED MEASURES
Repeated measurements are common in Ayurveda experiments.
Measurements repeated on the same individual are correlated and
therefore require special analysis.

ANOVA table for a two-way design with repeated measures on one factor
For instance, Factor A could be treatment and Factor B could be age group,
provided the patients have been stratified by age group. If age was instead
recorded as a continuous baseline variable, it can be included as a covariate
in the model. The computations and interpretations are explained in the manual.

MULTIPLE LINEAR REGRESSION
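The model we are covering is
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \varepsilon_i, \quad i = 1, \dots, n,$$
which in matrix form would be
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}.$$
The model fitting, testing and residual analysis are illustrated using
a real-life dataset.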
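As a rough illustration of what fitting a multiple linear regression
involves, the sketch below solves the least-squares normal equations in plain
Python. The function name fit_ols and the small dataset are hypothetical and
are not taken from the manual; in practice a statistical package would be used.

```python
def fit_ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y.

    X is a list of rows of predictor values; an intercept column is added here.
    Returns [b0, b1, ..., bp].
    """
    n = len(X)
    A = [[1.0] + list(row) for row in X]      # design matrix with intercept
    k = len(A[0])
    xtx = [[sum(A[r][i] * A[r][j] for r in range(n)) for j in range(k)]
           for i in range(k)]
    xty = [sum(A[r][i] * y[r] for r in range(n)) for i in range(k)]

    # Gaussian elimination with partial pivoting on the augmented matrix.
    M = [xtx[i] + [xty[i]] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, k):
            factor = M[r][col] / M[col][col]
            for c in range(col, k + 1):
                M[r][c] -= factor * M[col][c]

    # Back-substitution.
    beta = [0.0] * k
    for i in reversed(range(k)):
        rest = sum(M[i][j] * beta[j] for j in range(i + 1, k))
        beta[i] = (M[i][k] - rest) / M[i][i]
    return beta


# Hypothetical data generated exactly from y = 2 + 3*x1 - 1*x2 (no noise).
X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
y = [3, 7, 7, 11, 12]
beta = fit_ols(X, y)
residuals = [yi - (beta[0] + beta[1] * x1 + beta[2] * x2)
             for (x1, x2), yi in zip(X, y)]
print([round(b, 6) for b in beta])
```

Because the illustrative data contain no noise, the residuals are zero up to
rounding; with real data they would be examined in the residual analysis.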

Superiority, Bioequivalence and Non-inferiority trials
In many practical situations, we encounter the following types of
comparisons:
• The new drug is better than the standard drug.
• The new drug is equivalent to the standard drug.
• The new drug is at least as good as the standard drug.
The hypotheses and test criteria differ for each of the above cases.
Here ∆ denotes the difference in effect between the new drug and the
standard drug. For instance,
Superiority hypothesis
H0: ∆ = 0
H1: ∆ ≠ 0 (or ∆ > 0 or ∆ < 0 for one-tailed tests)

Bioequivalence hypothesis
H0: ∆ > ∆E or ∆ < -∆E
H1: -∆E ≤ ∆ ≤ ∆E, where ∆E is a clinically relevant equivalence
margin (usually 10%).
Non-inferiority hypothesis
H0: ∆ ≤ -∆NI
H1: ∆ > -∆NI, where ∆NI is a clinically relevant non-inferiority margin
(usually 10%).
The tests involved and their interpretation are illustrated using
examples from Ayurveda.
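The hypotheses above can be tested in several ways. The sketch below assumes a
large-sample normal (z) approximation, with ∆ estimated as new minus standard
and larger values being better. The function names, the estimate, its standard
error and the 10% margin are all hypothetical; the manual's own test procedures
may differ.

```python
from statistics import NormalDist


def noninferiority_test(diff, se, margin, alpha=0.05):
    """One-sided z-test of H0: delta <= -margin against H1: delta > -margin."""
    z = (diff + margin) / se
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value, p_value < alpha


def equivalence_test(diff, se, margin, alpha=0.05):
    """Two one-sided tests (TOST): equivalence is concluded only if both reject."""
    p_lower = 1 - NormalDist().cdf((diff + margin) / se)  # H0: delta <= -margin
    p_upper = NormalDist().cdf((diff - margin) / se)      # H0: delta >= +margin
    p_value = max(p_lower, p_upper)
    return p_value, p_value < alpha


# Hypothetical result: response rate 2 points lower on the new drug,
# standard error 0.04, margin 0.10.
z, p_ni, non_inferior = noninferiority_test(diff=-0.02, se=0.04, margin=0.10)
p_eq, equivalent = equivalence_test(diff=-0.02, se=0.04, margin=0.10)
print(round(z, 2), round(p_ni, 4), non_inferior)
print(round(p_eq, 4), equivalent)
```

Note that the same observed difference would not support superiority, which
is why the hypothesis has to be fixed before the trial.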

Logistic Regression
The logistic model (or logit model) is used to model the probability of an
event with two possible outcomes, such as alive/dead or healthy/sick. The
dependent variable in this case takes the value 1 or 0. The
theoretical form of the model is
$$P(y = 1 \mid x_1, \dots, x_p) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p)}}.$$
Logistic models are especially suited to case-control studies and are
useful in understanding the predisposing factors leading to a diseased
condition.
The fitting of the logistic model, the derivation of the ROC curve and the
optimal cut-off point, and the measurement of goodness of fit are illustrated
using real data from Ayurveda.
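To make the idea of an ROC curve and an optimal cut-off concrete, the sketch
below computes sensitivity and specificity at each candidate threshold and
picks the threshold maximizing Youden's J (sensitivity + specificity - 1).
Youden's J is only one common criterion and is an assumption here, as are the
coefficients (b0 = -4, b1 = 0.1) and the data, which are hypothetical.

```python
import math


def logistic(eta):
    """Logistic function mapping a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def roc_points(y_true, p_hat):
    """(threshold, sensitivity, specificity) for each distinct predicted value.

    A case is classified as positive when its predicted probability >= threshold.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = []
    for t in sorted(set(p_hat)):
        tp = sum(1 for y, p in zip(y_true, p_hat) if p >= t and y == 1)
        tn = sum(1 for y, p in zip(y_true, p_hat) if p < t and y == 0)
        points.append((t, tp / n_pos, tn / n_neg))
    return points


def youden_cutoff(y_true, p_hat):
    """Threshold with the largest Youden's J = sensitivity + specificity - 1."""
    return max(roc_points(y_true, p_hat), key=lambda r: r[1] + r[2] - 1)


# Hypothetical risk score x and disease status y (1 = diseased).
x = [20, 30, 35, 40, 45, 50, 60, 70]
y = [0, 0, 1, 0, 1, 1, 1, 1]
b0, b1 = -4.0, 0.1                       # assumed, not fitted, coefficients
p_hat = [logistic(b0 + b1 * xi) for xi in x]

cutoff, sensitivity, specificity = youden_cutoff(y, p_hat)
print(round(cutoff, 4), sensitivity, specificity)
```

Plotting sensitivity against 1 - specificity for all points returned by
roc_points gives the ROC curve itself.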

Decision Trees
A decision tree is a supervised learning algorithm (one with a pre-defined
target variable) that is mostly used in classification problems.
In this technique, we split the population or sample into two or more
homogeneous sets (or sub-populations) based on the most significant
splitter/differentiator among the input variables.
One popular measure used for splitting is the information gain. This is
equivalent to selecting the split that gives the maximum reduction in
entropy as measured by Shannon’s index (H),
$$H = -\sum_{i=1}^{s} p_i \log p_i,$$
where s is the number of groups at a node and p_i indicates the proportion
of individuals in the ith group.
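A small worked example may help with the entropy-based splitting described
above. The sketch computes Shannon's index for a node and the information gain
of a candidate split. Logarithms to base 2 are an assumption (the base only
changes the unit), and the class labels are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon's index H = -sum(p_i * log2(p_i)) over the classes at a node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent node minus the size-weighted entropy of its children."""
    n = len(parent)
    weighted = sum(len(child) / n * shannon_entropy(child) for child in children)
    return shannon_entropy(parent) - weighted


# Hypothetical node with 5 diseased and 5 healthy individuals,
# split into two child nodes by some exposure factor.
parent = ["CAD"] * 5 + ["healthy"] * 5
left = ["CAD"] * 4 + ["healthy"] * 1
right = ["CAD"] * 1 + ["healthy"] * 4

h_parent = shannon_entropy(parent)
gain = information_gain(parent, [left, right])
print(round(h_parent, 3), round(gain, 3))  # 1.0 0.278
```

Among several candidate splits, the one with the largest gain would be chosen
at that node.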

Decision tree for exposure factors of Coronary Artery Disease

Random Forest
Random forest is an ensemble technique: the idea is to generate multiple
models on a training dataset and then combine their outputs (by averaging
or voting) to obtain a stronger model than any single one.
For instance, instead of fitting one decision tree, we fit several decision
trees and aggregate their predictions. The resampling plan draws not only
subsamples of the cases but also a subset of the features for each sample
generated.
This typically gives a model with better accuracy in prediction or
classification than a single tree.
The use of random forest is illustrated in the context of some real data in
Ayurveda.
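The resampling idea can be sketched in a few lines. To keep the example short,
each "tree" below is only a one-split stump, so this is a simplified stand-in
for a real random forest rather than a full implementation. The function
names, the data and the settings (25 trees, one feature per tree) are
hypothetical.

```python
import random
from collections import Counter


def fit_stump(rows, labels, features):
    """Best one-split rule (feature, threshold, left_label, right_label)."""
    best = None
    for f in features:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0]
            correct = left.count(l_lab) + right.count(r_lab)
            if best is None or correct > best[0]:
                best = (correct, f, t, l_lab, r_lab)
    if best is None:  # no valid split: always predict the majority class
        majority = Counter(labels).most_common(1)[0][0]
        return (features[0], float("inf"), majority, majority)
    return best[1:]


def fit_forest(rows, labels, n_trees=25, n_features=1, seed=0):
    """Fit stumps on bootstrap samples, each using a random subset of features."""
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample of cases
        feats = rng.sample(range(p), n_features)     # random subset of features
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feats))
    return forest


def predict(forest, row):
    """Majority vote over the individual stumps."""
    votes = [l if row[f] <= t else r for f, t, l, r in forest]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical two-feature data with two well-separated classes.
rows = [(1, 2), (2, 1), (2, 3), (3, 2), (7, 8), (8, 7), (8, 9), (9, 8)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(rows, labels)
print(predict(forest, (1, 1)), predict(forest, (9, 9)))
```

For a classification problem the combination step is a majority vote; for a
continuous outcome the individual predictions would be averaged instead.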

Support Vector Machines
Given a set of training examples, each marked as belonging to one or the
other of two categories, a support vector machine (SVM) training algorithm
builds a model that assigns new examples to one category or the other. In
theory, SVM is a discriminative classifier formally defined by a separating
hyperplane.
It is also useful for non-linear separation of the data points.

Naïve Bayes Classifier
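Naïve Bayes is a common technique in medical science and has been applied,
for example, to cancer detection.
The foundation of the naïve Bayes algorithm is Bayes' theorem, which
states that
P(B|A) = P(B) * P(A|B) / P(A)
For the case of multiple variables x_1, ..., x_n, assumed conditionally
independent given the class y, we evaluate the posterior probability as
$$P(y \mid x_1, \dots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \dots, x_n)}.$$
For classification, we may use the following equation:
$$\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y).$$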
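As an illustration of how the posterior probability is evaluated in practice,
the sketch below estimates the prior and the conditional probabilities from
simple counts for categorical features. It assumes the features are
conditionally independent given the class and applies no smoothing, so an
unseen feature value gives a zero probability. The symptoms and labels are
hypothetical.

```python
from collections import Counter, defaultdict


def fit_naive_bayes(rows, labels):
    """Count classes and, per class, the values taken by each feature."""
    class_counts = Counter(labels)
    # feature_counts[(j, y)][v] = number of class-y cases with feature j equal to v
    feature_counts = defaultdict(Counter)
    for row, y in zip(rows, labels):
        for j, v in enumerate(row):
            feature_counts[(j, y)][v] += 1
    return class_counts, feature_counts


def posterior(model, row):
    """Posterior probability of each class for one case (normalized to sum to 1)."""
    class_counts, feature_counts = model
    n = sum(class_counts.values())
    scores = {}
    for y, n_y in class_counts.items():
        score = n_y / n                                  # prior P(y)
        for j, v in enumerate(row):
            score *= feature_counts[(j, y)][v] / n_y     # P(x_j | y)
        scores[y] = score
    total = sum(scores.values())
    return {y: s / total for y, s in scores.items()}


# Hypothetical training data: (fever, cough) -> status.
rows = [("yes", "yes"), ("yes", "yes"), ("yes", "no"), ("no", "yes"),
        ("no", "no"), ("no", "no"), ("no", "yes"), ("yes", "no")]
labels = ["sick"] * 4 + ["healthy"] * 4

model = fit_naive_bayes(rows, labels)
post = posterior(model, ("yes", "yes"))
predicted = max(post, key=post.get)
print(predicted, round(post["sick"], 3))  # sick 0.9
```

Here the score for "sick" is 0.5 * 0.75 * 0.75 and for "healthy" it is
0.5 * 0.25 * 0.25, which normalizes to a posterior of 0.9 for "sick".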

Neural Networks
A neural network is a series of algorithms that endeavours to recognize
underlying relationships in a set of data through a process that mimics the
way the human brain operates.
A neural network is akin to the human brain’s neural network. The basic
computational unit of the brain is a neuron. A ‘neuron’ in a neural network
is a mathematical function that collects and classifies information
according to a specific architecture.

A neural network could involve both forward and backward propagation.
After a series of iterations, the algorithm arrives at a decision on
identification or classification.
The use of neural networks in classification is illustrated using an example
from Ayurveda.
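The forward and backward steps can be seen in miniature with a single
artificial neuron. The sketch below is a deliberately reduced illustration,
not a full network: one sigmoid neuron trained by gradient descent on a
hypothetical two-input problem. The learning rate and number of epochs are
arbitrary choices.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(w, b, x):
    """Forward pass: weighted sum of the inputs passed through the sigmoid."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_neuron(inputs, targets, lr=0.5, epochs=2000):
    """Repeated forward and backward passes for a single sigmoid neuron."""
    w = [0.0] * len(inputs[0])
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(inputs, targets):
            out = forward(w, b, x)
            # Backward pass: gradient of the cross-entropy loss with respect
            # to the weighted sum is (out - t); adjust weights against it.
            grad = out - t
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b


# Hypothetical task: output 1 if at least one of two signs is present.
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
targets = [0, 1, 1, 1]
w, b = train_neuron(inputs, targets)
predictions = [round(forward(w, b, x)) for x in inputs]
print(predictions)
```

A real network stacks many such units in layers, and the backward pass
propagates the error through every layer in turn.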

K-Nearest Neighbour Technique (K-NN)
K-NN is a simple algorithm that stores all available cases and classifies
new cases based on a similarity measure (e.g., a distance function). K-NN
can be used for both classification and regression (prediction) problems.
A case is classified by a majority vote of its neighbours: it is assigned
to the most common class among its K nearest neighbours, as measured by a
distance function. Some of the distance measures applicable to continuous
variables are the Euclidean, Manhattan and Minkowski distances.
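A minimal sketch of the majority-vote rule follows, using three distance
measures that are commonly applied to continuous variables (Euclidean,
Manhattan and the more general Minkowski distance). The choice of these three,
the function names and the data are assumptions made for illustration.

```python
from collections import Counter


def minkowski(a, b, p):
    """Minkowski distance of order p between two points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def euclidean(a, b):
    return minkowski(a, b, 2)


def manhattan(a, b):
    return minkowski(a, b, 1)


def knn_classify(train, labels, query, k=3, dist=euclidean):
    """Assign the most common class among the k nearest training cases."""
    nearest = sorted(zip(train, labels), key=lambda pair: dist(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Hypothetical training cases measured on two continuous variables.
train = [(1.0, 1.1), (1.2, 0.9), (0.8, 1.0), (3.0, 3.2), (3.1, 2.9), (2.9, 3.0)]
labels = ["A", "A", "A", "B", "B", "B"]

print(euclidean((0, 0), (3, 4)), manhattan((0, 0), (3, 4)))  # 5.0 7.0
print(knn_classify(train, labels, (1.1, 1.0)))               # A
print(knn_classify(train, labels, (3.0, 3.0), dist=manhattan))
```

Variables on very different scales are usually standardized first, since the
distance would otherwise be dominated by the variable with the largest range.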

Principal Component Analysis
Principal component analysis (PCA) is a technique for reducing the
dimensionality of a dataset, increasing interpretability but at the same time
minimizing information loss.
PCA uses an orthogonal transformation to convert a set of observations of
possibly correlated variables into a set of values of linearly uncorrelated
variables called principal components.
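To show what the transformation does, the sketch below extracts the first
principal component of a small two-variable dataset by power iteration on the
covariance matrix. This is one simple way of finding the leading component and
is an assumption here; statistical packages normally compute all components at
once, often after standardizing the variables. The data are hypothetical.

```python
import math


def covariance_matrix(data):
    """Sample covariance matrix of the columns of data (rows are observations)."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in data]
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]


def first_component(cov, iterations=200):
    """Leading eigenvector (loadings) and eigenvalue (variance) by power iteration."""
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p))
                     for i in range(p))
    return v, eigenvalue


# Hypothetical measurements on two strongly correlated variables.
data = [(2.0, 1.9), (3.0, 3.2), (4.0, 3.9), (5.0, 5.1), (6.0, 5.8)]
cov = covariance_matrix(data)
loadings, variance = first_component(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))
share = variance / total_variance
print([round(x, 3) for x in loadings], round(share, 3))
```

With two highly correlated variables, a single component carries nearly all of
the total variance, which is the sense in which the dimensionality is reduced.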

The real advantage is that the variables on the transformed scale are
uncorrelated and a few components can usually account for a substantial
part of the total variance.
By interpreting the components as functions of the original variables,
hidden factors operating in the system, which are not directly measurable,
can be identified.
Variable         PC1     PC2     PC3     PC4     PC5
Age            0.125   0.066  -0.207   0.821   0.253
BMI            0.260   0.499   0.426  -0.071  -0.232
Glucose        0.439  -0.186  -0.131   0.126   0.200
Insulin        0.444  -0.386   0.094  -0.060  -0.298
HOMA           0.493  -0.375  -0.012  -0.006  -0.139
Leptin         0.331   0.234   0.583   0.058   0.288
Adiponectin   -0.173  -0.481   0.282  -0.277   0.529
Resistin       0.282   0.304  -0.289  -0.303   0.598
MCP.1          0.255   0.210  -0.497  -0.359  -0.119

Cluster Analysis
Cluster analysis is a multivariate method which aims to classify a
collection of objects, on the basis of a set of measured variables, into a
number of different groups such that similar objects are placed in the
same group.
Cluster analysis starts by computing a distance measure between the
objects, which are then grouped by a clustering algorithm. The appropriate
distance measure depends on the scale of measurement, and the many
clustering algorithms available can be broadly classed as
hierarchical and non-hierarchical.
In the final step, we are left with a cluster diagram that depicts the different
groups and the extent of similarity/dissimilarity between the objects.
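The two steps described above (compute distances, then group) can be sketched
with a small hierarchical example. The sketch uses single-linkage
agglomerative clustering with Euclidean distance, which is only one of many
possible choices and is an assumption here. The points are hypothetical.

```python
import math


def single_linkage(points, n_clusters):
    """Agglomerative clustering: repeatedly merge the two closest clusters.

    The distance between two clusters is the smallest Euclidean distance
    between any pair of their members (single linkage).
    Returns clusters as lists of point indices.
    """
    clusters = [[i] for i in range(len(points))]

    def cluster_distance(a, b):
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs,
                   key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]   # merge cluster j into cluster i
        del clusters[j]
    return clusters


# Hypothetical objects measured on two variables.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
clusters = single_linkage(points, n_clusters=2)
print(sorted(sorted(c) for c in clusters))  # [[0, 1, 2], [3, 4]]
```

Recording the distance at which each merge happens is what produces the
tree-like cluster diagram of a hierarchical method.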

Hierarchical vs. Non-hierarchical clustering

Stratified Multistage Sampling
This is one of the most popular sampling schemes in large-scale surveys.
In Ayurveda, this sampling design can be used effectively in prevalence
studies and other types of status surveys.
The basic idea is to group the population into homogeneous strata based
on geographical proximity or on other characteristics and then
implement multistage sampling within each stratum.
Two-stage sampling takes a sample of larger clusters of sampling
units (first-stage units) and then a subsample of smaller clusters of
units (second-stage units) within each selected first-stage unit.
Stratified sampling can provide greater precision than a simple random
sample of the same size. Multistage sampling works the other way:
subsampling inflates the variance but saves time and effort.

Analysis of Time Series Data
A time series is a sequence of observations recorded at successive time
intervals. It could be an output from an ECG or EEG, a serial recording of pulse
rate, or recordings of gait or tremor through digital devices from patients
suffering from Parkinson’s disease.
The peculiarity of time series data is the correlation between successive
measurements (autocorrelation), which calls for special methods of analysis.
Quite often, the object of interest is to recognize the pattern of movements or
fluctuations over time and compare such patterns across different experimental
settings.
Methods for time series analysis may be divided into two classes:
time-domain methods and frequency-domain methods.
Time-domain methods split the series into trend, cyclical, seasonal and random
components. Frequency-domain methods mainly identify cyclical patterns through
spectral analysis, after making the series stationary.

The power spectrum obtained through spectral analysis of heart rate shows a
concentration of power at a frequency of about 0.015 Hz, equivalent to a cycle
duration of about 64 seconds. This pattern is commonly observed in congestive
heart failure, where circulatory delays interfere with the regulation of carbon
dioxide and oxygen in the blood, leading to slow oscillations of heart rate.
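To show how a spectral peak translates into a cycle duration, the sketch below
computes a simple periodogram with a direct discrete Fourier transform and
reads off the dominant frequency. The series is simulated (a 64-second
oscillation sampled once per second), not real heart-rate data, and the only
preprocessing is removal of the mean.

```python
import cmath
import math


def periodogram(series, sampling_interval):
    """Return [(frequency in Hz, power)] for the positive Fourier frequencies."""
    n = len(series)
    mean = sum(series) / n
    x = [v - mean for v in series]               # remove the mean level
    result = []
    for k in range(1, n // 2 + 1):
        coef = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        result.append((k / (n * sampling_interval), abs(coef) ** 2 / n))
    return result


# Simulated heart rate: 70 beats/min with a slow oscillation of period 64 s,
# sampled once per second for 256 seconds.
series = [70 + 5 * math.sin(2 * math.pi * t / 64) for t in range(256)]

peak_frequency, peak_power = max(periodogram(series, 1.0), key=lambda fp: fp[1])
cycle_seconds = 1 / peak_frequency
print(peak_frequency, cycle_seconds)  # 0.015625 64.0
```

The cycle duration is simply the reciprocal of the peak frequency, so a peak
at 1/64 Hz corresponds to one full oscillation every 64 seconds.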

Analysis of Time-to-event Data
Time-to-event (TTE) data is unique because the outcome of interest is not only
whether or not an event occurred, but also ‘when’ that event occurred, such as in
survival studies.
The major objects of interest are:
• S(t) = 1 – F(t): The survival function equals one minus the cumulative
distribution function F(t).
• h(t) = f(t)/S(t): The instantaneous hazard equals the unconditional probability
density of experiencing the event at time t, scaled by the fraction alive at time t.
• H(t) = -log[S(t)]: The cumulative hazard function equals the negative log of the
survival function.
• S(t) = e^(–H(t)): The survival function equals the exponentiated negative
cumulative hazard function.
One challenge specific to survival analysis is that only some individuals will have
experienced the event by the end of the study, and therefore survival times will be
unknown (censored) for a subset of the study group.
One of

Special estimation methods and models are available to deal with such data:
• Kaplan-Meier estimator
• Cox proportional hazards model
• Accelerated Failure Time (AFT) models
The advanced manual discusses all the above methods.
Survival curves for the two treatment groups
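As a minimal sketch of how the Kaplan-Meier estimator handles individuals
whose event time is unknown, the code below computes the survival curve for
one hypothetical group. Individuals censored at a given time are treated as
still at risk at that time, which is the usual convention. The follow-up times
and event indicators are invented for illustration.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate of S(t) at each distinct event time.

    times  : follow-up time for each individual
    events : 1 if the event was observed at that time, 0 if censored
    Returns a list of (time, estimated survival probability).
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = sum(1 for tt, e in data if tt == t and e == 1)
        n_leaving = sum(1 for tt, _ in data if tt == t)   # events plus censored
        if n_events:
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
        at_risk -= n_leaving
        i += n_leaving
    return curve


# Hypothetical follow-up times in months; 0 marks a censored observation.
times = [2, 3, 3, 5, 8, 9]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 4))
```

The censored individuals at months 3 and 8 contribute to the number at risk up
to those times but never cause a drop in the curve. Computing the curve
separately for each treatment group gives survival curves that can be compared.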

Recommendations
To bring Ayurveda research in line with mainstream scientific research,
there is a need to incorporate modern methods in research.
Ours was a humble attempt in this regard to bring to light some of the
more popular analytical methods in the Ayurveda context.
Our recommendation is to understand and utilize these techniques in your
research activities and bring Ayurveda to the forefront.
We are willing to conduct a training workshop on these methods for the
benefit of researchers in Ayurveda.
The Kindle edition of the book is available on Amazon (link available on the
AyurData Facebook page):
https://www.facebook.com/pg/AyurData/posts/?ref=page_internal

THANK YOU
AyurData Team
Website: http://ayurdata.in/#service-content
Facebook: https://www.facebook.com/AyurData/
