A sample of IELTS learners was tested, and the time they spent studying and the number of mock tests they practised were analysed with respect to their performance. Two machine learning algorithms were used for classification and regression, and instructional insights were derived to help learners succeed.
2. FACTORS THAT AFFECT IELTS PERFORMANCE
A group of 27 students in an IELTS (International English Language Testing System) coaching
program was tested, and the number of months spent studying, weekly practice
hours and number of mock tests taken were recorded. (Zakir. (2023). IELTS Success
Stories Dataset [Data set]. Kaggle.)
Two machine learning algorithms, SVM (Support Vector Machine) and Decision Trees,
were used for both classification and regression, based on the number of months spent
studying, weekly practice hours and number of mock tests taken.
Models based on these algorithms were built and used to make predictions. The models
were also tested for accuracy.
We may use the predictions for:
Understanding how the IELTS Score depends on number of months spent studying, weekly practice hours and number of mock tests taken.
Classifying learners on the basis of the IELTS Score to find out how each variable affects performance.
Predicting scores on the basis of the values of the variables.
3. SVM (SUPPORT VECTOR MACHINE) CLASSIFICATION
Support Vector Machine (SVM) was used for classification as it is useful for smaller datasets and has versatile kernel functions.
The IELTS scores are grouped into classes 1, 2 and 3 (1 = Band 7 and below, 2 = Bands 7.5 and 8, 3 = Band 8.5 and above).
The independent variables are:
Months (Number of months spent studying)
WeeklyPracticeHrs (Weekly practice hours)
MockTeststaken (Number of mock tests taken before the final test date)
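The class boundaries described above can be sketched as a small mapping function. This is a minimal illustration in Python; the function name and the sample band scores are hypothetical and are not taken from the dataset.

```python
def band_to_class(band: float) -> int:
    """Map an IELTS band score to the class labels 1, 2 or 3 used on the slides."""
    if band <= 7.0:
        return 1  # class 1: Band 7 and below
    if band <= 8.0:
        return 2  # class 2: Bands 7.5 and 8
    return 3      # class 3: Band 8.5 and above


# Hypothetical band scores, for illustration only
sample_bands = [6.5, 7.0, 7.5, 8.0, 8.5, 9.0]
sample_classes = [band_to_class(b) for b in sample_bands]
print(sample_classes)  # [1, 1, 2, 2, 3, 3]
```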
4. SVM – RADIAL KERNEL
A radial kernel was selected first and the model was run.
The misclassification rate was 33%.
The model was plotted against number of months and weekly practice hours, with the number of mock tests taken held at 3 and at 5 in separate plots.
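A misclassification rate is the share of learners whose predicted class differs from their actual class. The sketch below shows the calculation on six hypothetical learners. The labels are made up for illustration, and the slides do not state which observations the 33% was computed on.

```python
def misclassification_rate(actual, predicted):
    """Return the fraction of predictions that do not match the actual class."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one observation is required")
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)


# Hypothetical class labels (1, 2 or 3), not taken from the dataset
actual_classes = [3, 3, 2, 2, 3, 1]
predicted_classes = [3, 2, 2, 2, 3, 2]

rate = misclassification_rate(actual_classes, predicted_classes)
print(f"{rate:.0%}")  # 2 of 6 wrong -> 33%
```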
5. SVM CLASSIFICATION PLOT
We observe that even after taking only 3 mock tests and studying for up to 6 months, it is possible to get a very high score.
The number of weekly practice hours is more important than the number of months spent studying.
6. SVM CLASSIFICATION PLOT
We observe that even after taking 5 mock tests, the learners might score very high.
7. SVM CLASSIFICATION – KERNELS - 1
We can use different kernels to test which model is most accurate and suitable for classification.
Least misclassification
8. SVM CLASSIFICATION – KERNELS - 2
We can use different kernels to test which model is most accurate and suitable for classification.
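A kernel is a similarity function between two learners' feature vectors, and changing it changes the shape of the class boundary the SVM can draw. The sketch below writes out the linear and radial kernels named on these slides in plain Python, plus a polynomial kernel as a common alternative. The polynomial kernel, the gamma, coef0 and degree values, and the two example learners are assumptions for illustration only.

```python
import math


def dot(x, y):
    """Dot product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    return dot(x, y)


def polynomial_kernel(x, y, gamma=1.0, coef0=0.0, degree=3):
    return (gamma * dot(x, y) + coef0) ** degree


def radial_kernel(x, y, gamma=1.0):
    # exp(-gamma * squared Euclidean distance); equals 1 when x == y
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Hypothetical learners: (Months, WeeklyPracticeHrs, MockTeststaken)
learner_a = (3, 16, 5)
learner_b = (6, 10, 3)

print(linear_kernel(learner_a, learner_b))
print(polynomial_kernel(learner_a, learner_b, gamma=0.01, degree=2))
print(radial_kernel(learner_a, learner_b, gamma=0.01))
```

Each kernel would then be fitted in turn and the misclassification rates compared, as the slides describe.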
9. TUNING THE MODEL
To capture non-linearity between the variables and to optimize model performance, the model was tuned.
11. Though the initial model with a linear kernel showed the least
misclassification, the model with a radial kernel was selected as the
best, with a misclassification rate of 33%. This could be attributed to
an imbalanced dataset with only 2 observations of class 1 (Band 7 and
below) out of a total of 27 observations. The best model also helps
capture the non-linearity of the data.
13. SVR (SUPPORT VECTOR REGRESSION)
The variables ‘WeeklyPracticeHrs’ and ‘MockTeststaken’ are strongly correlated. To avoid the impact of multicollinearity, the dependent variable, ‘Score’, was predicted using ‘WeeklyPracticeHrs’ and ‘MockTeststaken’ separately.
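How strongly two predictors are correlated can be checked with the Pearson correlation coefficient. The sketch below computes it from scratch in Python. The two value lists are hypothetical and only illustrate the kind of relationship described, not the actual dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance_sum = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance_sum / (spread_x * spread_y)


# Hypothetical values, for illustration only
weekly_practice_hrs = [8, 10, 12, 14, 16, 20]
mock_tests_taken = [2, 3, 3, 4, 5, 6]

r = pearson_r(weekly_practice_hrs, mock_tests_taken)
print(round(r, 3))
```

A coefficient close to 1 or -1 between two predictors is the usual warning sign for multicollinearity, which is why the two variables are used in separate models here.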
14. SVR (SUPPORT VECTOR REGRESSION)
A predictive analysis was conducted using various combinations of variable values to calculate the IELTS Score (Band value).
The Mean Squared Error (MSE), Mean Absolute Error (MAE) and Coefficient of Determination (R-Squared) were calculated.
Lower MSE and MAE indicate better model performance. A high value of R-squared indicates a good fit.
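The three metrics can be computed directly from the actual and predicted scores. The following is a minimal Python sketch; the band scores in the example are hypothetical and are not results from this analysis.

```python
def regression_metrics(actual, predicted):
    """Return (MSE, MAE, R-squared) for paired actual and predicted values."""
    n = len(actual)
    if n == 0 or n != len(predicted):
        raise ValueError("need two non-empty sequences of equal length")
    errors = [a - p for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_actual = sum(actual) / n
    total_ss = sum((a - mean_actual) ** 2 for a in actual)
    if total_ss == 0:
        raise ValueError("R-squared is undefined when all actual values are equal")
    residual_ss = sum(e * e for e in errors)
    r_squared = 1 - residual_ss / total_ss
    return mse, mae, r_squared


# Hypothetical band scores, for illustration only
actual_scores = [7.0, 7.5, 8.0, 8.5]
predicted_scores = [7.5, 7.5, 8.0, 8.0]

mse, mae, r_squared = regression_metrics(actual_scores, predicted_scores)
print(mse, mae, round(r_squared, 3))
```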
15. There are several outliers in each regression of score (the dependent
variable) on number of months spent studying, weekly practice hours and
number of mock tests taken (the independent variables).
Hence, the regression lines are non-linear. A larger dataset might
lead to a more reliable analysis.
16. INSTRUCTIONAL INSIGHTS
The number of months spent studying has the most outliers and a negative
correlation coefficient. It is not a very significant factor in the learner’s
score.
The number of hours spent studying per week and the number of mock tests
taken are more significant to learner performance. So the batch of learners
should be encouraged to study more every week, which gives them greater
capacity to take mock tests.
The instructor may use the SVR predictive analysis to estimate how many hours
of practice per week and how many mock tests will enable the learner to
achieve the desired band.
The existing batch of learners (if they want to reappear) or a new batch of
learners with similar profiles could be classified into classes 1, 2 and 3
using the SVM classification analysis, and then coached and tested weekly.
LET US TRY ANOTHER APPROACH ...
17. DECISION TREES
Decision trees were also used to analyse the data, to manage outliers and capture non-linearity.
The IELTS scores are grouped into classes 1, 2 and 3 (1 = Band 7 and below, 2 = Bands 7.5
and 8, 3 = Band 8.5 and above).
The independent variables are:
Months (Number of months spent studying)
WeeklyPracticeHrs (Weekly practice hours)
MockTeststaken (Number of mock tests taken before the final test date)
A separate variable, “Scoref”, was created to hold the class assigned to each Score.
18. PARTITIONING THE DATASET
The dataset was partitioned into a training set and a test set (validate).
The tree model was created and run. The decision tree was plotted.
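A random partition of this kind can be sketched as follows. The slides do not state the split ratio or the random seed used, so the 80/20 split, the seed and the function name below are assumptions for illustration only.

```python
import random


def partition(rows, test_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and split them into (train, validate)."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    if len(rows) < 2:
        raise ValueError("need at least two rows to split")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Keep at least one row on each side of the split
    n_test = min(len(shuffled) - 1, max(1, round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]


# Row indices standing in for the 27 learners
learner_rows = list(range(27))
train, validate = partition(learner_rows)
print(len(train), len(validate))  # 22 5
```

With only 27 observations, the validation set is very small, so any accuracy figure computed on it should be read with caution.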
19. PLOTTING THE TREE
The main variable that contributes to determining the score is ‘WeeklyPracticeHrs’.
For weekly practice hours less than or equal to 14, we go to node 2, which is terminal and has 10 observations (mostly of class 2).
For weekly practice hours more than 14, we go to node 3, which is also terminal and has 17 observations (mostly of class 3).
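The plotted tree has a single split, so its prediction rule can be written as one comparison. The sketch below restates that rule in Python. The function name is hypothetical, and each terminal node predicts its majority class as described above.

```python
def predict_class(weekly_practice_hrs: float) -> int:
    """Apply the single split of the plotted tree to one learner."""
    if weekly_practice_hrs <= 14:
        return 2  # node 2: 10 observations, mostly class 2
    return 3      # node 3: 17 observations, mostly class 3


# Hypothetical weekly practice hours, for illustration only
for hours in (10, 14, 15, 20):
    print(hours, predict_class(hours))
```

Because neither terminal node predicts class 1, this tree can never assign a learner to Band 7 and below.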
21. ACCURACY OF MODEL
A misclassification error of 25% occurs in the predictions because:
The dataset is imbalanced, as class 1 observations are very few.
Nonlinearities may exist between variables.
The dataset has few observations; more observations are required for better predictive analysis.
22. PREDICTIONS
The model was run for predictions with the variable Months eliminated, as it is not very significant in determining the score.
Using the two remaining independent variables (WeeklyPracticeHrs and MockTeststaken) separately as features, the score was predicted for different combinations.
24. DECISION TREES REGRESSION
The Mean Squared Error (MSE) and Mean Absolute Error (MAE) were calculated.
The values of MSE and MAE are not very low, indicating that the model does not perform very well.
It needs to capture non-linearity better. Perhaps a greater number of observations is required for better analysis.
Prediction using MockTeststaken as the main independent variable
Prediction using WeeklyPracticeHrs as the main independent variable
25. COMPARISON BETWEEN THE TWO MODELS
Both SVM and Decision Tree were suitable machine learning algorithms
for classification and regression on the dataset.
While SVM had a misclassification error of 33%, the decision tree had a
misclassification error of 25%.
For regression, SVM had better results on evaluation metrics such as
Mean Squared Error (MSE) and Mean Absolute Error (MAE), that is, lower
values.
Both models predicted the dependent variable (Score) based on the
independent variables (weekly practice hours and number of mock tests
taken). The variable Months was eliminated from the models as it was
not as statistically significant to the analysis as the other variables.
26. FINAL INSIGHTS - 1
The coaching firm should encourage learners to study more hours every week over
fewer months, rather than fewer hours every week over a greater number of months,
while practising mock tests in either case.
Weekly practice is a must as it is the most significant factor for student success.
The more the learner practises every week, the more mock tests they will be able
to attempt. So, the firm should design the program with weekly mock-test targets
based on the number of hours studied. Rewards, incentives and recognition
should be offered for taking a greater number of mock tests.
27. FINAL INSIGHTS - 2
There are many outliers when plotting the relationship between score and weekly
practice hours and number of mock tests taken. Several other factors, such as
educational background, prior level of English proficiency and study strategies,
also affect learner performance.
More data needs to be captured, as the dataset had only 27 observations. Only 2
observations belong to class 1 (Band 7 and below), making the dataset
imbalanced. An imbalanced dataset might not support predictive analysis suitably.
The classes of learners (1, 2 and 3) can be predicted on the basis of the number of
weekly practice hours and number of mock tests taken. After classifying the
learners, further batches can be formed for more focused coaching by the trainers.