NLP_Presentation
1. AUTOMATED SHORT ANSWERS GRADING
GROUP 2
• Manali Shah
• Aravind Ram Nathan
• Ismail Enchikalathil Jelal
• Ankita Tiwari
2. GOAL
• Build a machine learning model that can grade
short written responses.
• Advantages:
• Fairness
• Less human resource cost
• Timely feedback
[Diagram: Graded Short Answers, NLP Features, Machine Learning Model]
3. RESOURCES
Dataset:
• Provided by the Hewlett Foundation through the Kaggle data platform.
• 17,000 short responses written by 10th
grade students.
• 10 different essay sets covering various topics ranging
from Science to Arts.
• Average length of response is 50 words.
• Training sets are graded by humans, with scores
ranging from 0 to 3.
Technologies:
• Python: nltk, scikit-learn, pandas, pyplot, skll
• R: h2o
5. PREPROCESSING
• Remove non-printable characters from raw text.
• Convert to lowercase.
• Spelling correction using Peter Norvig’s spelling corrector
• POS Tagging using NLTK pos_tag() function on corrected
text
• Extraction of numbers using regex.
• Remove stop words.
• Stemming using the NLTK PorterStemmer class.
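The dependency-free steps above can be sketched in a few lines. This is a minimal illustration, not the group's actual code: the function name `preprocess` and the tiny `STOP_WORDS` set are hypothetical, and spelling correction, POS tagging, and stemming are left out because they rely on external resources.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word list; the deck does not say
# which list was used.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "it"}

# string.printable covers ASCII only, so accented letters are dropped too.
PRINTABLE = set(string.printable)


def preprocess(raw_text):
    """Return (tokens, numbers) for one short response."""
    # 1. Drop non-printable characters.
    text = "".join(ch for ch in raw_text if ch in PRINTABLE)
    # 2. Convert to lowercase.
    text = text.lower()
    # 3. Extract numbers (integers and decimals) with a regex.
    numbers = re.findall(r"\d+(?:\.\d+)?", text)
    # 4. Tokenize on alphabetic runs and remove stop words.
    tokens = [t for t in re.findall(r"[a-z]+", text) if t not in STOP_WORDS]
    return tokens, numbers


tokens, numbers = preprocess("The sample\x00 lost 2.5 grams in 10 minutes")
print(tokens, numbers)
```

The numbers are extracted before tokenizing so that they survive as a separate feature even though the alphabetic tokenizer discards digits.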
6. FEATURE ENGINEERING
• Term usage:
• Statistics for each part-of-speech category
• Word-length statistics
• Spelling errors
• Sentence Quality:
• Grammar errors using 3-gram, 4-gram, and 5-gram dictionaries
• Bag of words:
• The 10 most frequent unigrams in the training set
• Content Fluency and Richness:
• Cosine similarity with essays at each score level (0-3),
computed from TF*IDF vectors
• Essay length
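The TF*IDF cosine similarity feature could look roughly like the sketch below. It assumes a smoothed IDF and averages the similarity over all graded essays that share a score; the deck does not state either detail, and the function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per doc."""
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumption) so every term keeps a nonzero weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_features(response, graded):
    """graded: list of (tokens, score). Returns {score: mean similarity}."""
    vecs = tfidf_vectors([response] + [tokens for tokens, _ in graded])
    sims = {}
    for (_, score), vec in zip(graded, vecs[1:]):
        sims.setdefault(score, []).append(cosine(vecs[0], vec))
    return {score: sum(v) / len(v) for score, v in sims.items()}


graded = [(["plants", "need", "light"], 3), (["i", "dont", "know"], 0)]
features = similarity_features(["plants", "need", "sunlight"], graded)
print(features)
```

Each response then gets one similarity value per score level, so a response that resembles high-scoring essays produces a larger value for score 3 than for score 0.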
7. FEATURE SELECTION
• Remove features that have little effect on the output.
• A large number of features:
• Induces greater computational cost
• May lead to overfitting
• Sequential Forward Selection (SFS) algorithm
• Goodness of a feature is measured by the kappa score.
• The kappa score measures the agreement between two raters,
corrected for chance agreement.
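A minimal sketch of greedy forward selection driven by a kappa score follows. It uses plain (unweighted) Cohen's kappa, since the deck does not say which kappa variant was used, and the `evaluate` callback is a hypothetical stand-in for "train a model on this feature subset and return its kappa against the human scores".

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa between two equal-length label lists."""
    n = len(rater_a)
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is
        # undefined, and this sketch treats it as perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


def sequential_forward_selection(features, evaluate, min_gain=1e-9):
    """Greedily add the feature that most improves evaluate(subset)."""
    selected, best = [], float("-inf")
    remaining = list(features)
    while remaining:
        candidates = [(evaluate(selected + [f]), f) for f in remaining]
        score, feature = max(candidates, key=lambda c: c[0])
        if score - best <= min_gain:
            break  # no remaining feature improves the score
        selected.append(feature)
        remaining.remove(feature)
        best = score
    return selected, best


print(cohen_kappa([0, 1, 2, 3], [0, 1, 2, 3]))
```

Stopping as soon as no candidate improves the score keeps the selected set small, which addresses both the computational cost and the overfitting risk listed above.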
8. MACHINE LEARNING ALGORITHMS USED
FOR BUILDING THE MODEL
• K-Nearest Neighbors
• Naive Bayes
• Decision Tree
• Support Vector Machines (SVM)
• Gradient Boosting
• Deep Learning
• Random Forest
• Ensemble of all the above algorithms
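The deck does not say how the ensemble combines its members. One common option is a majority vote over the predicted scores, sketched here as an assumption; the function name and the tie-breaking rule are illustrative only.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model score lists into one list by majority vote.

    predictions_per_model: list of equal-length lists, one per model.
    """
    ensemble = []
    for votes in zip(*predictions_per_model):
        counts = Counter(votes)
        top = max(counts.values())
        # Tie-break toward the lowest score; an arbitrary choice here.
        ensemble.append(min(s for s, c in counts.items() if c == top))
    return ensemble


# Three hypothetical models scoring the same three responses.
print(majority_vote([[0, 1, 3], [0, 2, 3], [1, 2, 2]]))
```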
9. CROSS VALIDATION
• 5-fold cross validation is used for choosing
hyperparameters in machine learning algorithms
• Hyperparameters:
• K-NN - 3 Neighbors
• Random Forest - 50 trees
• Gradient Boosting Machine - 200 trees
• Deep Learning - 3-layer network with 50 units in each layer
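The selection procedure can be sketched as follows: split the training indices into 5 folds, score each candidate hyperparameter value on every held-out fold, and keep the value with the best mean score. The names `k_fold_indices`, `cross_val_select`, and the `score_fn` callback are hypothetical, and the folds are not shuffled here, which a real run would normally do.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs; fold sizes differ by at most one."""
    indices = list(range(n_samples))
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


def cross_val_select(candidates, score_fn, n_samples, k=5):
    """Return (best_candidate, mean_scores).

    score_fn(candidate, train_idx, val_idx) should train on train_idx and
    return a validation score (for example kappa) on val_idx.
    """
    mean_scores = {}
    for candidate in candidates:
        scores = [
            score_fn(candidate, train_idx, val_idx)
            for train_idx, val_idx in k_fold_indices(n_samples, k)
        ]
        mean_scores[candidate] = sum(scores) / len(scores)
    best = max(mean_scores, key=mean_scores.get)
    return best, mean_scores


# Toy score function that peaks at 3, standing in for a real K-NN evaluation.
best, _ = cross_val_select([1, 3, 5], lambda c, tr, va: -abs(c - 3), 10)
print(best)
```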