1. BIOSTATISTICS AND RESEARCH METHODOLOGY
Unit-4: statistical analysis
PRESENTED BY
Himanshu Rasyara
B. Pharmacy IV Year
UNDER THE GUIDANCE OF
Gangu Sreelatha M.Pharm., (Ph.D)
Assistant Professor
CMR College of Pharmacy, Hyderabad.
email: sreelatha1801@gmail.com
2. • What is Microsoft Excel?
• Microsoft Excel is a spreadsheet program used to record and analyse numerical and statistical data.
Microsoft Excel provides multiple features to perform various operations like calculations, pivot tables,
graph tools, macro programming, etc.
• An Excel spreadsheet can be understood as a collection of columns and rows that form a table.
Alphabetical letters are usually assigned to columns, and numbers are usually assigned to rows. The point
where a column and a row meet is called a cell.
MS Excel is part of the Microsoft Office suite. It is an electronic spreadsheet with numerous
rows and columns, used for organizing data, representing data graphically, and performing different
calculations. A worksheet consists of 1,048,576 rows and 16,384 columns; the intersection of a row and a column is a cell.
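The lettering of columns described above can be illustrated with a short Python sketch. It is not part of Excel; the function name `column_letters` is an assumption made for this example, and it only shows how a column number maps to a letter label (A to Z, then AA, AB, and so on).

```python
def column_letters(index: int) -> str:
    """Convert a 1-based column number to spreadsheet-style letters."""
    if index < 1:
        raise ValueError("column index must be 1 or greater")
    letters = ""
    while index > 0:
        # Shift to 0-based before dividing, because there is no "zero" letter.
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


# A cell reference combines the column letters with the row number.
cell_reference = column_letters(27) + str(5)  # column 27, row 5
```

Here `cell_reference` is the label of the cell where column 27 meets row 5.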
3. • Understanding the Ribbon
• The ribbon provides shortcuts to commands in Excel. A command is an action that the user performs,
such as creating a new document or printing a document.
You can perform statistical analysis with the help of Excel. It is used by many data scientists and analysts who
need to understand statistical concepts and the behaviour of their data. When the data set is very large, or
you need specialized analysis models beyond what Excel offers, you should move to more advanced
tools such as Python or R. Here, we go through the basic concepts of statistical analysis and
apply them to our own data.
Before starting, check whether the Analysis ToolPak (an add-in provided with Microsoft Excel) is
enabled. To do so, go to Excel → Data and look for the Data Analysis option in the top right corner.
4. • If it is not there, go to Excel → File → Options → Add-ins, select Excel Add-ins in the Manage box,
and click Go. This opens a small window; tick the Analysis ToolPak option and click OK to enable it.
• Descriptive Analysis
• You can find descriptive analysis by going to Excel → Data → Data Analysis → Descriptive Statistics. It is the
most basic set of analyses that can be performed on any data set. It describes the general behaviour and pattern
of the data, and it is helpful when you have a set of data and want a summary of it. The tool reports
the following statistics for the chosen data set:
• Mean and Standard Error
• Median, Mode and Standard Deviation
• Sample Variance
• Kurtosis and Skewness
• Range, Minimum, Maximum, Sum and Count
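As a rough illustration of what these summary figures mean, the sketch below computes most of them with Python's standard `statistics` module. It is not Excel's implementation; the function name `describe` and the sample values are assumptions made for this example, and kurtosis and skewness are left out to keep it short.

```python
import math
import statistics


def describe(values):
    """Return a dictionary of basic descriptive statistics for a sample."""
    n = len(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 divisor)
    return {
        "mean": statistics.mean(values),
        "standard_error": sd / math.sqrt(n),  # standard error of the mean
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "standard_deviation": sd,
        "sample_variance": statistics.variance(values),
        "range": max(values) - min(values),
        "minimum": min(values),
        "maximum": max(values),
        "sum": sum(values),
        "count": n,
    }


# Hypothetical data set, used only for illustration.
summary = describe([2, 4, 4, 4, 5, 5, 7, 9])
```

For this sample the mean is 5, the median is 4.5 and the mode is 4.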
• ANOVA (Analysis Of Variance)
• It is a data analysis method that shows whether the means of two or more data sets differ significantly
from each other. In other words, it analyses two or more groups simultaneously and tests whether the
differences among the group means are larger than chance alone would explain. For example, you can use ANOVA to
compare traffic measurements from three different cities and find out whether the cities differ significantly in how they handle traffic (or whether
there are no significant differences among them).
You will find three types of ANOVA in Excel:
1. Anova: Single Factor
2. Anova: Two-Factor With Replication
3. Anova: Two-Factor Without Replication
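The idea behind the single-factor case can be sketched in a few lines of Python. This is a minimal illustration of the F statistic (variation between groups divided by variation within groups), not Excel's own routine; the function name and the three small groups are hypothetical, and no p-value is computed.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return the F statistic for a one-way (single factor) ANOVA."""
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical measurements for three groups.
f_statistic = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```

A large F value suggests that the group means differ by more than the scatter within the groups would explain; the value still has to be compared with the F distribution to judge significance.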
5. • Moving Average
• A moving average is usually applied to time series data such as stock prices, weather reports, or class
attendance. For example, it is heavily used as a technical indicator for stock prices. If you want to estimate
today's stock price, the last ten days of data are likely to be more relevant than the last year. So you can plot a
10-day moving average of the stock and use it to predict the price to some
extent. The same applies to the temperature of a city: the current temperature is better estimated by
averaging the last few weeks than by averaging earlier months.
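A simple moving average can be sketched in Python as follows. The function name `moving_average` and the short price series are assumptions made for illustration; a 3-period window is used instead of 10 days only to keep the example small.

```python
def moving_average(values, window):
    """Return the simple moving average of `values` over `window` periods."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    # Each output point averages the current value and the window - 1 before it.
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Hypothetical closing prices for five consecutive days.
prices = [10, 11, 12, 13, 14]
smoothed = moving_average(prices, 3)
```

The result has `window - 1` fewer points than the input, because the first full window is only available from the third day onward.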
• Regression
• Regression is the process of establishing a relationship among variables. Usually, we model the
relationship between a dependent variable and one or more independent variables. For example, you may want to
see whether there is any increase in the revenue of a product that is not explained by an increase in advertising.
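For a single independent variable, the relationship can be sketched as a least-squares straight line. The Python example below is a minimal illustration, not Excel's Regression tool; the function name `fit_line` and the advertising and revenue figures are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical data: advertising spend (x) and revenue (y).
advertising = [1, 2, 3, 4, 5]
revenue = [3, 5, 7, 9, 11]
slope, intercept = fit_line(advertising, revenue)
```

The slope estimates how much revenue changes per unit of advertising; any revenue change the line does not account for shows up in the residuals.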
6. • Sampling
• This data analysis tool is used for creating samples from a large population. You can
select data at random from the data set, or select every nth item. For example, to
measure the effectiveness of a call center employee, you can use this tool to select
a few calls at random every month, listen to the recordings, and give a rating based on the selected calls.
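Both sampling modes described above can be sketched in Python. The function names and the list of call IDs are hypothetical; the periodic version assumes that "every nth item" starts counting at the nth item.

```python
import random


def random_sample(items, k, seed=None):
    """Pick k distinct items at random (without replacement)."""
    rng = random.Random(seed)  # a seed makes the selection repeatable
    return rng.sample(items, k)


def every_nth(items, n):
    """Pick every nth item, starting with the nth one."""
    if n < 1:
        raise ValueError("n must be 1 or greater")
    return items[n - 1 :: n]


# Hypothetical call IDs recorded in one month.
call_ids = list(range(1, 101))
calls_to_review = random_sample(call_ids, 5, seed=1)
periodic_calls = every_nth(call_ids, 25)
```

Random selection avoids bias from any regular pattern in the data, while periodic selection is easier to audit.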
7. SPSS
• SPSS (Statistical Package for the Social Sciences) is a versatile and responsive program designed to
undertake a range of statistical procedures. SPSS software is widely used in a range of disciplines and is
available from all computer pools within the University of South Australia.
• SPSS is a Windows based program that can be used to perform data entry and analysis and to create
tables and graphs. SPSS is capable of handling large amounts of data and can perform all of the analyses
covered in the text and much more. SPSS is commonly used in the Social Sciences and in the business
world.
Task: Open SPSS
Click on the Start menu > All Programs > IBM SPSS Statistics > IBM SPSS Statistics 21
(or whatever is the latest version number) to open the SPSS program.
• Layout of SPSS: The Data Editor window has two views that can be selected from the lower left-hand
side of the screen. Data View is where you see the data you are using. Variable View is where you can
specify the format of your data when you are creating a file, or where you can check the format of a
pre-existing file. The data in the Data Editor is saved in a file with the extension .sav.
• Data View: the spreadsheet that is visible when you first open the Data Editor; this sheet contains
the data. Unlike in MS Excel, formulas and variable names cannot be entered here.
• Variable View: contains information about the variables in the data set.
8. Syntax
Another important window in the SPSS environment is the Syntax Editor. In earlier versions of SPSS, all
of the procedures performed by SPSS were submitted through the use of syntax, which instructed SPSS on
how to process your data. Using SPSS syntax allows you access to additional commands which are not
available through the menus and dialog boxes, and syntax files can be stored and rerun at a later date,
allowing you to repeat an analysis.
From the menu in the Data Editor window
File >> New >> Syntax
9. Output Viewer
When you execute a command for a statistical analysis, regardless of whether you used syntax or dialog boxes, the output will be
printed in the Output Viewer.
From the menu in the Data Editor window
File >> New >> Output
10. DESIGN OF EXPERIMENTS
• DOE is an essential tool to ensure products and processes satisfy Quality by Design requirements
imposed by regulatory agencies. Using a QbD approach to develop your testing process can help you
reduce waste, meet compliance criteria and get to market faster.
• DOE helps you create a reliable QbD process for assessing formula robustness, determining critical
quality attributes and predicting shelf life by using a few months of historical data.
Why Use a Quality By Design Approach?
Using a Quality by Design (QbD) approach to develop the testing process and to choose the critical quality
attributes for a pharmaceutical product can help to:
• Ensure products meet defined critical quality attributes
• Meet regulatory compliance criteria
• Predict formula robustness
• Reduce waste in production
• Get to market faster
11. Using DOE to Optimize Processes
• When it comes to creating an optimal manufacturing process that limits variation and conserves energy
or resources, or developing a new formula that is most likely to meet customer expectations, design of
experiments (DOE) is an indispensable tool.
DOE helps you to:
• Minimize the number of experiments you have to do to find the ideal formula or recipe
• Create a robust process (one that holds up to changes in environment, humidity, ingredient variation,
etc.)
• Adapt a recipe for changes in ingredients or packaging needs (availability, eco-compliance, regulations,
consumer trends, etc.)
Using DOE to Predict Formula Robustness
• Being able to demonstrate product robustness and deliver the intended quality of the product within
allowable ranges for the claimed shelf-life period is critical for pharmaceutical manufacturers. Both
international and country specific regulatory agencies, such as the FDA, pay close attention to shelf-life
claims.
• Predicting formulation robustness requires a careful design of experiments that holds up under statistical
analysis. Using DOE for formulation robustness studies can help you select a commercial formulation
that is sufficiently robust within the acceptable ranges around the label claim to meet the shelf life
stability requirements.
12. Steps to Predict Formula Robustness
Step 1: Choose the Right Measurement Factors
• Ensure that the factors selected to study can be used to predict an acceptable formulation parameter
range where all the values for the assessed quality attributes will be inside the specified limits.
Step 2: Design a Statistically Valid Study
• Consider how the factors being investigated fit into a full factorial design. For pharma companies, for
example, robustness studies must be able to prove that specific critical quality attributes stay within the
acceptable ranges for the entire shelf-life period. In addition:
• The study must result in a regression model that is statistically significant
• The study must provide output parameters (quality attributes) that are within predefined limits
Step 3: Analyze the Data Using Multiple Linear Regression
• One important way to produce a valid testing model is to use a tool that makes Design of Experiments
easier. For example, MODDE® Design of Experiments Software can help you set up multivariate
formulation robustness studies that demonstrate the acceptable ranges of quality for a target composition,
define the allowable edges of the composition range, and predict the stability requirements needed to
reach the end of shelf life.
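The full factorial design mentioned in Step 2 can be sketched in Python: every level of every factor is combined with every level of the others. This is only an illustration of how the run list is built, not how MODDE® works; the function name, the factor names and the levels are all hypothetical.

```python
from itertools import product


def full_factorial(factors):
    """Return one run (a dict of settings) per combination of factor levels."""
    names = list(factors)
    level_lists = (factors[name] for name in names)
    return [dict(zip(names, combo)) for combo in product(*level_lists)]


# Hypothetical formulation factors, each studied at two levels.
factors = {
    "temperature_c": [25, 40],
    "ph": [5.0, 7.0],
    "excipient_pct": [1.0, 2.0],
}
runs = full_factorial(factors)
```

Three factors at two levels each give 2 × 2 × 2 = 8 runs; the measured quality attributes of each run are then the responses for the regression model in Step 3.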
13. DOE EduPack is designed to give students hands-on skills to solve problems and learn:
• How to create efficient experimental designs to match the objectives
• How to analyze data based on sound statistical principles to evaluate results of the experiments
• How to interpret results by using graphical and statistical tools
• How to convert modeling results into concrete action with MODDE® optimizer & verifying experiments
• How to define a design space and find robust setpoints
• APPLICATIONS OF DESIGN OF EXPERIMENTS IN QbD AND AQbD
• The Quality by Design approach was accepted by the FDA in 2004 and described in ‘Pharmaceutical cGMPs for
the 21st Century – A Risk-Based Approach’.
• International Conference on Harmonisation (ICH) guidelines Q8 (Pharmaceutical Development), Q9 (Quality Risk
Management), and Q10 (Pharmaceutical Quality System) provide detailed requirements regarding
pharmaceutical product quality.
• QbD and DoE approaches help to implement ICH Q8 and ICH Q9.
14. • Since the QbD approach was accepted by the FDA, DoE has been widely employed to provide a
complete understanding of the product and its manufacturing process. Many applications of DoE for
screening and optimizing pharmaceutical products and their manufacturing processes
can be found in the literature. Several input factors (independent variables), such as excipient
concentrations, stirring time, stirring speed, temperature, and pressure, among others, may be screened and
optimized using DoE. Output responses (dependent variables) studied include particle size, entrapment
efficiency, and dissolution rate, among others.
• Applying screening designs in pharmaceutical QbD makes it possible to identify the critical material attributes
(CMAs) and critical process parameters (CPPs) (independent variables) that affect the critical quality
attributes (CQAs) (dependent variables) and, therefore, the quality target product profile (QTPP). In
addition, optimization designs, response surface methodology, and multiple response optimization make it
possible to define a design space in which the CQAs and QTPP are met. Adopting a design space
based on product and process understanding allows regulatory flexibility, because changes within
the design space do not require prior regulatory approval.
• Recently, DoE has been used in the rational development and optimization of analytical methods.
Culture media composition, mobile phase composition, flow rate, and time of incubation are examples of
input factors (independent variables) that may be screened and optimized using DoE. Several output
responses (dependent variables), such as retention time, resolution between peaks, and microbial growth,
among others, are reported in the literature.
17. MINITAB
• Minitab is a statistics package developed at the Pennsylvania State University by researchers Barbara F.
Ryan, Thomas A. Ryan, Jr., and Brian L. Joiner in conjunction with Triola Statistics Company in 1972.
• It began as a light version of OMNITAB 80, a statistical analysis program by NIST, which was
conceived by Joseph Hilsenrath in 1962-1964 as the OMNITAB program for the IBM 7090. The
documentation for OMNITAB 80 was last published in 1986, and there has been no significant
development since then.
• In 2020, during the COVID-19 pandemic, Minitab LLC requested and received between $5 million and
$10 million under the Paycheck Protection Program to avoid having to let go of 250 employees. As of
2021, Minitab LLC had subsidiaries in the UK, France, Germany, Hong Kong, and Australia.
• Minitab is a statistics package developed to help Six Sigma professionals analyse and interpret data in
support of business processes. Data input is simplified so that the data can be used easily for statistical
analysis, and the package also helps in manipulating the data set.
• Key Features of Minitab
1. Basic Statistics: This feature covers all kinds of statistical tests, descriptive statistics, correlations, and
covariances.
2. Graphics: This enables users to draw various statistical graphs such as scatter plots, histograms,
boxplots, matrix plots, marginal plots, bubble charts, etc.
3. Regression: This feature enables users to find the relationship between variables (a key feature
of any statistical tool). Regression is available in linear, non-linear, ordinal, nominal, and other forms.
18. 4. Analysis of Variance: Analysis of variance (ANOVA) is used to analyse the differences between group
means.
5. Statistical Process Control: This feature helps you create cause-and-effect diagrams, variable control charts,
multivariate control charts, time-weighted charts, etc.
6. Measurement System Analysis: MSA is a mathematical method to determine the amount of variation that exists
within a measurement process. Variability in the measurement process directly affects the overall variability observed in a process.
7. Design of Experiments: This feature helps you identify cause-and-effect relationships. It helps in
creating and experimenting with various designs and recording all the relevant outputs, so that you can settle on
a method and optimize it.
8. Reliability/Survival: It enables you to select the best distribution for modelling data by helping you identify
the function that best describes your data.
• One of the most common methods used in statistical analysis is hypothesis testing. Minitab offers many
hypothesis tests, including t-tests and ANOVA (analysis of variance). Usually, when you perform a
hypothesis test, you assume an initial claim to be true, and then test this claim using sample data.
• Hypothesis tests include two hypotheses (claims), the null hypothesis (H0) and the alternative hypothesis
(H1). The null hypothesis is the initial claim and is often specified based on previous research or
common knowledge. The alternative hypothesis is what you believe might be true.
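To make the idea of a test statistic concrete, the Python sketch below computes a two-sample t statistic (Welch's version, which does not assume equal variances). It is not Minitab's implementation; the function name `welch_t` and the two samples are hypothetical, and converting the statistic to a p-value (which requires the t distribution) is left out.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic for the difference between two sample means."""
    # Standard error of the difference, using each sample's own variance.
    standard_error = math.sqrt(
        variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b)
    )
    return (mean(sample_a) - mean(sample_b)) / standard_error


# Hypothetical measurements from two groups.
group_a = [1, 2, 3, 4, 5]
group_b = [2, 3, 4, 5, 6]
t_statistic = welch_t(group_a, group_b)
```

Under the null hypothesis (H0) that the two population means are equal, values of the statistic far from zero are unlikely, which is the evidence used to decide between H0 and H1.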
19. Perform an ANOVA
1. Choose Stat > ANOVA > One-Way.
2. Choose Response data are in one column for all factor levels.
3. In Response, enter Days. In Factor, enter Center.
4. Click Comparisons.
5. Under Comparison procedures assuming equal variances, check Tukey.
20. 6. Click OK.
7. Click Graphs. For many statistical commands, Minitab includes graphs that help you interpret the
results and assess the validity of statistical assumptions. These graphs are called built-in graphs.
8. Under Data plots, check Interval plot, Individual value plot, and Boxplot of data.
9. Under Residual plots, choose Four in one.
10. Click OK in each dialog box.
21. R-ONLINE
• "R is a language and environment for statistical computing and graphics."
• "R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests,
time-series analysis, classification, clustering) and graphical techniques, and is highly extensible."
• "One of R's strengths is the ease with which well-designed publication-quality plots can be produced,
including mathematical symbols and formulae where needed."
Importance of R Programming Language
• R is a well-developed, simple, and effective programming language that includes conditionals, loops,
user-defined recursive functions, and input and output facilities.
• R provides graphical facilities for data analysis and display.
• R is a very flexible language. It does not necessitate that everything should be done in R itself. It allows
the use of other tools, like C and C++ if required.
• R has an effective data handling and storage facility.
• R provides an extensive, coherent, and integrated collection of tools for data analysis.
R also includes a package system that allows the users to add their individual functionality in a manner
that is indistinguishable from the core of R.
• R is actively used for statistical computing and design. It has contributed substantially to
big data and data analytics, and it is one of the most widely used languages in data science. Large
companies such as Google, LinkedIn, and Facebook rely on R for many of their
operations.
22. Programming Features of R
• R has various programming features which we will discuss below:
1. Data Inputs and Data Management
Data inputs such as data type, importing data, keyboard typing.
Data management such as data variables, operators.
2. Distributed Computing and R Packages –
• Distributed computing: Distributed R is an open-source, high-performance platform for the R language. It splits tasks
between multiple processing nodes to reduce execution time and analyse large datasets.
R Packages – R packages are a collection of R functions, compiled code and sample data. By default, R
installs a set of packages during installation.
Advantages and Disadvantages of R Programming
• There are several benefits and some limitations of the R programming language. Let us discuss them one
by one:
• Pros of R Language
• R is one of the most comprehensive statistical analysis packages, as new technology and ideas often appear first
in R.
• It is cross-platform and runs on many operating systems, including GNU/Linux and Microsoft
Windows.
23. • In R, everyone is welcome to provide bug fixes, code enhancements, and new packages.
• Cons of R Language
• The quality of some packages in R is less than perfect.
• There is no official customer support for R to turn to if something does not work.
• R commands pay little attention to memory management, so R can consume all the available
memory.
• USE OF R PROGRAMMING FOR CLINICAL TRIAL DATA ANALYSIS
• R programming has not been the most popular or obvious choice in clinical trials. Despite its
growth over the past few years, its practical use still seems to be hindered by several factors, sometimes
due to misunderstandings (e.g. about validation) but also because of a lack of knowledge of its capabilities.
Despite these bottlenecks, R is steadily creating its own growing niche in the
pharmaceutical industry.
• R can be used to create TLFs (tables, listings and figures) much as the combination of PROC
REPORT/PROC TABULATE and the ODS currently does in SAS®. This shows its power and its capability to play
an important role in our industry in the years to come, not as a replacement for SAS® but as an
alternative option.
24. USES OF R-PROGRAMMING
Although R is a popular language used by many programmers, it is especially effective when used for:
• Data analysis
• Statistical inference
• Machine learning algorithms
R offers a wide variety of statistics-related libraries and provides a favourable environment for statistical
computing and design. In addition, many quantitative analysts use R
as a programming tool because it is useful for importing and cleaning data.
As of August 2021, R was one of the top five programming languages of the year, and it is a favourite among
data analysts and research programmers. It is also used as a fundamental tool in finance, which relies
heavily on statistical data.
The Popularity of R by Industry
Thanks to its versatility, many different industries use the R programming language. Here is a list of
disciplines that use the R programming language:
Fintech Companies (financial services)
Academic Research
Government (FDA, National Weather Service)
Retail
Social Media
Data Journalism
Manufacturing
Healthcare