Adam Ralph from the Irish Centre for High End Computing presented this Introduction to Basic R during the Big Data Workshop hosted by the Social Sciences Computing Hub at the Whitaker Institute on the 14th November 2013
3. What is Analytics?
Lies, DAMN LIES and STATISTICS.
The dictionary definition is "the systematic computational analysis of
data or statistics."
Today we shall look at three areas:
1. Hypothesis testing,
2. Model construction,
3. Prediction.
BasicR 3
4. Some Definitions
Population: the complete group of observations or measurements of
interest.
For example, the heights or ages of all people in Ireland.
Sample: a subset of the observations or measurements from the
population.
For example, the heights or ages of the people in this room.
Variable (or random variable): a set of observations or measurements
of the same type. For instance, age measurements would be one
variable and height measurements another.
6. Hypothesis Testing
The simplest form of hypothesis is: does this sample come from this
population?
This might not seem particularly useful until we consider the
effects of a drug.
Patients' blood pressure is measured before and after the drug is
administered.
Using a paired t-test, we can assess whether the drug changed blood
pressure by more than chance alone would explain.
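The drug example can be sketched in a few lines of code. The following is a minimal illustration in Python using only the standard library; the blood pressure readings are invented for illustration and are not data from the talk. It computes the t statistic from the per-patient differences and compares it with the two-sided 5% critical value for 7 degrees of freedom (about 2.365), which is hard-coded here as an assumption tied to the sample size of 8.

```python
import math
import statistics


def paired_t_statistic(before, after):
    """Return the paired t statistic and its degrees of freedom."""
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    # A paired test works on the per-subject differences.
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    # Standard error of the mean difference (sample standard deviation).
    std_err = statistics.stdev(diffs) / math.sqrt(n)
    return mean_diff / std_err, n - 1


# Hypothetical systolic readings (mmHg) for 8 patients.
before = [150, 142, 160, 155, 148, 152, 158, 145]
after = [141, 138, 152, 150, 146, 143, 151, 140]

t_stat, dof = paired_t_statistic(before, after)

# Two-sided 5% critical value of Student's t for 7 degrees of freedom.
T_CRIT_5PCT_DF7 = 2.365
significant = abs(t_stat) > T_CRIT_5PCT_DF7

print(f"t = {t_stat:.2f} on {dof} degrees of freedom; significant: {significant}")
```

In R, the same test is available directly as `t.test(before, after, paired = TRUE)`, which also reports the p-value.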
7. Some Definitions
When modeling, there is usually one variable that you want to model;
this is called the "response variable".
The other variables are the "explanatory variables".
The goal of the model is to "explain" the variation in the response
variable by the variation in the explanatory ones.
8. Model Building
The simplest model is a linear regression model with one response
and one explanatory variable.
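As a rough sketch of what fitting such a model involves, the Python snippet below computes the least-squares intercept and slope for one response and one explanatory variable. The data points and the helper name `fit_line` are made up for illustration.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical measurements: x is explanatory, y is the response.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(x, y)
print(f"y = {intercept:.2f} + {slope:.2f} * x")
```

In R, the equivalent fit is `lm(y ~ x)`.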
9. Regression
Regression techniques can be extended to many explanatory variables.
With this comes the possibility of variables interacting and a choice of
models or model selection.
It is important to realize that even if an explanatory variable perfectly
models the response variable, this does not imply a causal effect!
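To make interaction and model selection concrete, here is a small sketch in Python. It fits two candidate models to the same synthetic, noise-free data: one with two additive explanatory variables and one that also includes their product (the interaction term). The data-generating formula, the helper names, and the use of the residual sum of squares as the comparison are all assumptions made for this illustration; real model selection would usually also penalize model complexity.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(design, y):
    """Least-squares coefficients via the normal equations (X'X) b = X'y."""
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)]
           for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def residual_sum_of_squares(design, y, coefs):
    fitted = [sum(c * v for c, v in zip(coefs, row)) for row in design]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


# Synthetic data on a small grid; the response includes an interaction.
points = [(x1, x2) for x1 in range(4) for x2 in range(3)]
y = [1 + 2 * x1 + 3 * x2 + 0.5 * x1 * x2 for x1, x2 in points]

# Two candidate design matrices (first column is the intercept).
additive = [[1, x1, x2] for x1, x2 in points]
with_interaction = [[1, x1, x2, x1 * x2] for x1, x2 in points]

coefs_additive = fit_ols(additive, y)
coefs_interaction = fit_ols(with_interaction, y)
rss_additive = residual_sum_of_squares(additive, y, coefs_additive)
rss_interaction = residual_sum_of_squares(with_interaction, y, coefs_interaction)

print("additive RSS:", round(rss_additive, 6))
print("interaction RSS:", round(rss_interaction, 6))
```

In R, the two models would be written `lm(y ~ x1 + x2)` and `lm(y ~ x1 * x2)`.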
10. Classification
Regression is used when the response variable is continuous.
Classification techniques play the same role for categorical data.
Typically you can train a machine-learning algorithm to classify
objects/people from a set of explanatory variables.
Given a new set of measurements, the algorithm can then classify the
new object/person.
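The train-then-classify workflow can be sketched with one of the simplest classifiers, a nearest-centroid rule: training computes the mean of the explanatory variables for each class, and a new observation is assigned to the class whose mean is closest. This is an illustrative choice in Python, not a method named in the talk, and the measurements and labels below are invented.

```python
import math
from collections import defaultdict


def train_centroids(samples, labels):
    """Compute the mean feature vector (centroid) of each class."""
    groups = defaultdict(list)
    for features, label in zip(samples, labels):
        groups[label].append(features)
    return {
        label: [sum(column) / len(rows) for column in zip(*rows)]
        for label, rows in groups.items()
    }


def classify(centroids, features):
    """Assign the class whose centroid is nearest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))


# Hypothetical training set: two explanatory variables per object.
samples = [[1.0, 1.2], [0.8, 1.0], [1.2, 0.9],
           [3.0, 3.2], [3.3, 2.9], [2.8, 3.1]]
labels = ["low", "low", "low", "high", "high", "high"]

centroids = train_centroids(samples, labels)

# Classify new sets of measurements.
print(classify(centroids, [1.1, 1.0]))
print(classify(centroids, [3.1, 3.0]))
```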
11. Prediction
Models are used to make predictions outside the range of experimental
values, for example predicting the phases of the moon and the tides.
Care must be taken when using statistically derived models, in that
they may not hold outside this range.
Even when a system is completely deterministic, predictions can be
difficult if it is chaotic.
Monte-Carlo approaches can be used to determine the range of
responses (hence the error) in such systems.
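A short sketch shows how a Monte-Carlo approach exposes the range of responses in a chaotic system. The logistic map with r = 4 is used here as a stand-in for a deterministic but chaotic system; it is an assumption chosen for illustration, as are the starting value and the size of the measurement error. Each trial perturbs the initial condition slightly and runs the system forward.

```python
import random


def logistic_map(x0, steps, r=4.0):
    """Iterate x -> r * x * (1 - x); fully deterministic, chaotic at r = 4."""
    x = x0
    for _ in range(steps):
        x = r * x * (1.0 - x)
    return x


def monte_carlo_spread(x0, noise, steps, trials, seed=0):
    """Return (min, max) of the outcomes over randomly perturbed starts."""
    rng = random.Random(seed)
    outcomes = [
        logistic_map(x0 + rng.uniform(-noise, noise), steps)
        for _ in range(trials)
    ]
    return min(outcomes), max(outcomes)


# Same tiny uncertainty in the initial value, two prediction horizons.
short_lo, short_hi = monte_carlo_spread(0.3, 1e-6, steps=5, trials=1000)
long_lo, long_hi = monte_carlo_spread(0.3, 1e-6, steps=50, trials=1000)

print("spread after 5 steps: ", short_hi - short_lo)
print("spread after 50 steps:", long_hi - long_lo)
```

The spread of outcomes stays tiny at the short horizon but covers most of the possible range at the long one, which is the error estimate the Monte-Carlo runs provide.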
12. Time Series Analysis
Time series data are measurements collected at regular time intervals.
The data can be split into three components:
1. Seasonal: regular fluctuations with a period shorter than the length
of the dataset.
2. Trend: fluctuations with a period longer than the length of the dataset.
3. Random: fluctuations with no apparent pattern.
Time series analysis is a technique that allows prediction of events
into the future using data from the past.
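The three-way split described above can be sketched as a classical additive decomposition. The Python code below is a minimal illustration: the trend is estimated with a centred moving average over one seasonal period, the seasonal component is the average detrended value at each position in the cycle, and whatever remains is the random part. The monthly series is synthetic (a straight line plus a sine wave, with no noise), and the function names are hypothetical.

```python
import math

PERIOD = 12  # assumed seasonal period, e.g. monthly data with a yearly cycle


def centred_moving_average(series, period):
    """Trend estimate for an even period; None where the window does not fit."""
    half = period // 2
    trend = [None] * len(series)
    for t in range(half, len(series) - half):
        window = series[t - half: t + half + 1]
        # End points get half weight so the window spans exactly one period.
        total = sum(window) - 0.5 * (window[0] + window[-1])
        trend[t] = total / period
    return trend


def decompose(series, period):
    """Split a series into additive trend, seasonal and random components."""
    trend = centred_moving_average(series, period)
    detrended = [
        (t % period, y - tr)
        for t, (y, tr) in enumerate(zip(series, trend))
        if tr is not None
    ]
    seasonal_means = []
    for phase in range(period):
        values = [d for p, d in detrended if p == phase]
        seasonal_means.append(sum(values) / len(values))
    # Centre the seasonal component so it sums to zero over one period.
    offset = sum(seasonal_means) / period
    seasonal = [seasonal_means[t % period] - offset for t in range(len(series))]
    random_part = [
        y - tr - s if tr is not None else None
        for y, tr, s in zip(series, trend, seasonal)
    ]
    return trend, seasonal, random_part


# Synthetic series: linear trend plus a yearly cycle, four years of months.
series = [0.5 * t + 10 * math.sin(2 * math.pi * t / PERIOD) for t in range(48)]
trend, seasonal, random_part = decompose(series, PERIOD)

print("trend at t=24:", trend[24])
print("seasonal at t=3:", seasonal[3])
```

R provides this kind of decomposition as the built-in `decompose()` function for `ts` objects.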
13. Trend Discovery
A trend is a steady one-way change in a response variable after
removing the random and/or known variation.
One of the most topical trends at the moment is global warming.
Trends are linked to model building in some sense in that discovery of
a trend indicates that the model is incomplete.
In the case of global warming, we know that temperature varies daily
and seasonally and over much longer time periods. The temperature
trend is the change in temperature when these effects are removed.
15. Why is it Important?
From a scientific standpoint, all measurements we take are subject to
error.
That means any conclusion drawn from these imperfect data must also
carry an error.
The use of analytics provides a mechanism to objectively evaluate the
error in our conclusion given the data and some assumptions about
the data.
16. ICHEC and BDI
The first example is a collaboration with Biomedical Devices Ireland
(BDI).
We are measuring properties of blood platelets from healthy individuals
versus those with blood disorders, under arterial shear.
We hope to be able to flag individuals for further testing using a
machine-learning algorithm.
18. ICHEC and Wind Energy
Wind farms are mandated to provide an estimate of their future
power production.
Penalties exist for inaccurate information.
ICHEC has developed a system that takes weather forecasts from
Met Éireann and other sources and applies them to the farm in question.
Using model averaging techniques, the inevitable forecast errors can
be reduced.
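Why averaging helps can be seen in a toy simulation. The Python sketch below is not the ICHEC system: it assumes two hypothetical forecasts of the same power output whose errors are independent and of equal size, and combines them with a plain equal-weight average, the simplest form of model averaging. All numbers are simulated.

```python
import math
import random


def rmse(forecast, actual):
    """Root-mean-square error between a forecast and the observed values."""
    squared = sum((f - a) ** 2 for f, a in zip(forecast, actual))
    return math.sqrt(squared / len(actual))


rng = random.Random(42)

# Simulated "true" hourly power output (MW) of a hypothetical wind farm.
actual = [50 + 20 * math.sin(hour / 4) for hour in range(1000)]

# Two hypothetical forecasts with independent Gaussian errors (sd = 5 MW).
forecast_a = [a + rng.gauss(0, 5) for a in actual]
forecast_b = [a + rng.gauss(0, 5) for a in actual]

# Equal-weight model average of the two forecasts.
averaged = [(fa + fb) / 2 for fa, fb in zip(forecast_a, forecast_b)]

rmse_a = rmse(forecast_a, actual)
rmse_b = rmse(forecast_b, actual)
rmse_avg = rmse(averaged, actual)

print(f"RMSE A: {rmse_a:.2f}  RMSE B: {rmse_b:.2f}  averaged: {rmse_avg:.2f}")
```

Because the two sets of errors are independent, they partly cancel in the average. If the forecasts shared the same errors, averaging would bring little or no improvement.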