2. INTRODUCTION
A teaching assistant evaluation reviews academic qualifications, relevant experience,
quality of teaching, and professional contributions. All of these aspects can be assessed by
students, by peers, or by the teachers themselves. This paper includes the teachers’ evaluation.
Aleamoni suggests that students are the main source of information about the learning
environment, including teachers' ability to motivate them for continued learning and the
rapport or degree of communication between instructors and students. They are also the most
consistent evaluators of the quality and effectiveness of the learning process and of
satisfaction with course content, method of instruction, textbooks, homework, and student
interest (Aleamoni, 1981). This project applies machine learning techniques to such
evaluation data in order to develop a model that uses past assessments to predict a future
evaluation.
This report includes a brief introduction to supervised learning, which is used to build the
Random Forest model.
SUPERVISED LEARNING
Supervised learning, as the name indicates, involves a supervisor acting as a teacher. The
machine is taught, or trained, using well-labelled data, meaning that each training example
is already tagged with the correct answer. The supervised learning algorithm analyses this
training data (the set of training examples). The machine is then given a new set of examples
(data), and the trained model predicts an outcome for each of them based on the labelled data.
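The train-then-predict cycle described above can be sketched in a few lines. This is only an illustration: the toy class sizes and the "small"/"large" tags are made up, and a single scikit-learn decision tree is used as the learner because random forests are built from such trees.

```python
from sklearn.tree import DecisionTreeClassifier

# Labelled training examples: one feature (a class size) and a made-up tag.
# These values are illustrative only, not taken from the TA data set.
X_train = [[5], [8], [10], [40], [45], [60]]
y_train = ["small", "small", "small", "large", "large", "large"]

# Training: the algorithm analyses the labelled examples.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Prediction: the trained model is given new, unlabelled examples.
X_new = [[7], [50]]
predictions = model.predict(X_new).tolist()
print(predictions)
```

The predictions are not guaranteed to be correct in general; here the two groups are separated so clearly that the tree labels both new examples as expected.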
RANDOM FOREST
Random forests or random decision forests are an ensemble learning method for
classification, regression and other tasks that operates by constructing a multitude of decision
trees at training time and outputting the class that is the mode of the classes (classification)
or mean prediction (regression) of the individual trees. Random decision forests correct for
decision trees' habit of overfitting to their training set.
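The aggregation step described above (mode for classification, mean for regression) can be illustrated without any library. The function names and the per-tree outputs below are hypothetical, and real implementations may differ in detail, for example by averaging class probabilities rather than counting hard votes.

```python
from collections import Counter
from statistics import mean


def aggregate_classification(tree_votes):
    """Return the class predicted by the largest number of trees (the mode)."""
    return Counter(tree_votes).most_common(1)[0][0]


def aggregate_regression(tree_outputs):
    """Return the mean of the individual tree predictions."""
    return mean(tree_outputs)


# Hypothetical outputs of five classification trees for one sample.
votes = ["high", "low", "high", "medium", "high"]
forest_class = aggregate_classification(votes)

# Hypothetical outputs of three regression trees for one sample.
outputs = [2.0, 3.0, 4.0]
forest_value = aggregate_regression(outputs)

print(forest_class, forest_value)
```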
The Random Forest algorithm is simple to use and maintains accuracy even when the data are
inconsistent. It gives estimates of which variables are important for the classification. It
runs efficiently on large databases while generating an internal unbiased estimate of the
generalisation error. It also provides methods for balancing error in data sets with
unbalanced class populations. On the other hand, random forests are difficult to analyse
theoretically, and building a large number of trees can slow down prediction in real-time
systems. A further drawback is that a random forest does not predict beyond the range of the
response values in the training data.
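Several of these properties can be observed with scikit-learn. The sketch below is an illustration on synthetic data, not the configuration used in this project: the data come from `make_classification`, and the parameter values (100 and 50 trees, an 80/20 class imbalance) are assumptions. It shows the out-of-bag (internal) accuracy estimate, the variable importances, class weighting for unbalanced data, and the range limitation in the regression case.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Synthetic, unbalanced two-class data (assumed for illustration only).
X, y = make_classification(
    n_samples=200, n_features=5, n_informative=2,
    weights=[0.8, 0.2], random_state=0,
)

clf = RandomForestClassifier(
    n_estimators=100,
    oob_score=True,            # internal estimate from out-of-bag samples
    class_weight="balanced",   # reweight classes to offset the imbalance
    random_state=0,
)
clf.fit(X, y)
oob_accuracy = clf.oob_score_           # estimate of generalisation accuracy
importances = clf.feature_importances_  # one value per input variable

# Range limitation: train on x = 0..9 with y = 2x, then ask about x = 100.
x_train = np.arange(10).reshape(-1, 1)
y_train = 2.0 * x_train.ravel()
reg = RandomForestRegressor(n_estimators=50, random_state=0)
reg.fit(x_train, y_train)
far_prediction = reg.predict([[100]])[0]

print(oob_accuracy, importances, far_prediction)
```

Because every tree predicts an average of training responses, `far_prediction` cannot exceed the largest training response (18.0 here), even though the underlying trend would give 200.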
3. DATA SET INFORMATION
The data consist of evaluations of teaching performance over three regular semesters and
two summer semesters of 151 teaching assistant (TA) assignments at the Statistics
Department of the University of Wisconsin-Madison. The scores were divided into 3 roughly
equal-sized categories ("low", "medium", and "high") to form the class variable.
(Fig-1-Dataset Snapshot)
Attribute Information:
1. Whether or not the TA is a native English speaker (binary); 1=English speaker, 2=non-English
speaker
2. Course instructor (categorical, 25 categories)
3. Course (categorical, 26 categories)
4. Summer or regular semester (binary); 1=Summer, 2=Regular
5. Class size (numerical)
6. Class attribute (categorical); 1=Low, 2=Medium, 3=High
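A possible way to load data with this layout into pandas is sketched below. The column names, the `load_tae` helper, and the assumption that the file has no header row are not from the original report. The three inline rows are made-up placeholders in the same six-column layout, not real records.

```python
import io

import pandas as pd

# Hypothetical column names for the six attributes listed above.
COLUMNS = [
    "native_english", "instructor", "course",
    "semester", "class_size", "score_class",
]
CLASS_LABELS = {1: "low", 2: "medium", 3: "high"}


def load_tae(source):
    """Read a header-less CSV with the six attributes and add a text label."""
    df = pd.read_csv(source, header=None, names=COLUMNS)
    df["score_label"] = df["score_class"].map(CLASS_LABELS)
    return df


# Made-up placeholder rows; a file path could be passed to load_tae instead.
sample = io.StringIO("1,5,2,2,30,3\n2,12,7,1,18,1\n2,3,20,2,45,2\n")
df = load_tae(sample)
print(df)
```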
5. LIBRARIES IMPORTED
● import pandas as pd -- pandas is a Python package providing fast, flexible, and
expressive data structures designed to make working with structured (tabular,
multidimensional, potentially heterogeneous) and time series data both easy and
intuitive.
● from sklearn.model_selection import train_test_split -- Splits arrays or matrices into
random train and test subsets.
● from sklearn.ensemble import RandomForestClassifier -- Imports scikit-learn's random
forest classifier.
● from sklearn.metrics import accuracy_score -- The accuracy_score function
computes the accuracy, either the fraction (default) or the count (normalize=False) of
correct predictions.
● from sklearn.metrics import confusion_matrix -- Computes a confusion matrix to
evaluate the accuracy of a classification.
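The imports listed above fit together roughly as follows. This is a sketch under stated assumptions, not the project's actual code: the data are randomly generated stand-ins with the same six columns (hypothetical column names), the categorical codes are passed to the model as plain integers, and the 70/30 split and 100 trees are arbitrary choices. Since the labels are random, the accuracy will be near chance level.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Random stand-in data with the same layout as the TA data set (assumed).
rng = np.random.default_rng(0)
n = 150
data = pd.DataFrame({
    "native_english": rng.integers(1, 3, n),   # 1 or 2
    "instructor": rng.integers(1, 26, n),      # 25 categories
    "course": rng.integers(1, 27, n),          # 26 categories
    "semester": rng.integers(1, 3, n),         # 1 or 2
    "class_size": rng.integers(3, 67, n),
    "score_class": rng.integers(1, 4, n),      # 1=Low, 2=Medium, 3=High
})
X = data.drop(columns="score_class")
y = data["score_class"]

# Hold out 30% of the rows for testing, keeping class proportions similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)                    # fraction correct
n_correct = accuracy_score(y_test, y_pred, normalize=False)  # count correct
cm = confusion_matrix(y_test, y_pred, labels=[1, 2, 3])      # rows: true class

print("Accuracy: {:.2f}".format(accuracy))
print(cm)
```

The diagonal of the confusion matrix holds the correctly classified samples, so its sum equals the count returned with `normalize=False`.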
Methods used:
read_csv() -- pandas function used to read (load) data from CSV files.
fillna() -- fills NaN values with a given substitute value.
fit() -- fits the model to the given training data.
predict() -- predicts the output for new samples.
describe() -- pandas method used to view basic statistical details such as percentiles,
mean, and standard deviation of a DataFrame or a Series of numeric values.
format() -- the string format() method inserts values into a string to produce more
readable output in Python.
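The pandas and string methods listed above can be tried on a tiny in-memory CSV. The two column names, the three rows, and the choice of 0 as the fill value are assumptions made for illustration; they are not taken from the project data.

```python
import io

import pandas as pd

# A made-up CSV with one missing class size (second data row).
raw = io.StringIO("class_size,score_class\n20,1\n,2\n40,3\n")

df = pd.read_csv(raw)                          # load the CSV into a DataFrame
n_missing = int(df["class_size"].isna().sum()) # count NaN values before filling

df = df.fillna(0)                              # substitute NaN with a chosen number
summary = df.describe()                        # count, mean, std, percentiles, ...

mean_size = summary.loc["mean", "class_size"]
message = "Mean class size: {:.1f}".format(mean_size)
print(message)
```

Filling with 0 lowers the mean in this example; whether a constant, the column mean, or another value is appropriate depends on the data.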