1. Text Classification
using ULMFiT
CS 6105 : COMPILER DESIGN
Submitted To:
Dr. Vishwambhar Pathak
Submitted By:
Yashasvini Mathur (BE/25001/16)
Divik Mittal (BE/25035/16)
2. Introduction
Natural Language Processing (NLP) needs no introduction in
today’s world. It’s one of the most important fields of study and
research, and has seen a phenomenal rise in interest in the last
decade. The basics of NLP are widely known and easy to grasp,
but things get tricky when the text data becomes huge and
unstructured. That is where deep learning becomes pivotal.
We will focus on the concept of transfer learning and how we
can leverage it in NLP to build incredibly accurate models using
the popular fastai library.
3. Overview of ULMFiT
Universal Language Model Fine-tuning (ULMFiT) achieves state-of-the-art results using novel techniques such as:
• Discriminative fine-tuning
• Slanted triangular learning rates, and
• Gradual unfreezing
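The slanted triangular learning rate schedule can be sketched in plain Python. This is a sketch of the schedule described in the ULMFiT paper; the function and variable names are ours, and the default hyperparameters follow the paper's suggested values:

```python
import math

def slanted_triangular_lr(t, T, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Learning rate at iteration t out of T total iterations.

    The rate rises linearly for the first cut_frac fraction of
    iterations (up to lr_max), then decays linearly back down
    toward lr_max / ratio.
    """
    cut = math.floor(T * cut_frac)
    if t < cut:
        p = t / cut
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return lr_max * (1 + p * (ratio - 1)) / ratio
```

The short, steep warm-up lets the model quickly converge toward a suitable region of parameter space, while the long decay phase refines the parameters.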
This method involves fine-tuning a language model (LM) pre-trained on the Wikitext-103 dataset to a
new dataset in such a way that it does not forget what it previously learned.
Language modeling captures general properties of a language that transfer well to downstream NLP
tasks, and the unlabeled text it trains on is available in enormous quantities. That is why language
modeling was chosen as the source task for ULMFiT.
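Discriminative fine-tuning, the first technique listed above, gives each layer its own learning rate; the ULMFiT paper suggests dividing the rate by 2.6 for each layer below the top. A minimal sketch (the function name is ours):

```python
def discriminative_lrs(n_layers, lr_top=0.01, factor=2.6):
    """Per-layer learning rates, lowest layer first.

    The top layer trains at lr_top; each layer below trains at
    the rate of the layer above divided by `factor`, so lower
    layers (which hold the most general features) change least.
    """
    return [lr_top / factor ** (n_layers - 1 - i) for i in range(n_layers)]
```

Lower layers capture general language features that should mostly be preserved, while upper layers are more task-specific and can move faster.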
4. Problem Statement
Our objective here is to fine-tune a pre-trained model and use it for
text classification on a new dataset. We will implement ULMFiT in this
process. The interesting thing here is that this new data is quite small
in size (<1000 labeled instances). A neural network model trained from
scratch would overfit on such a small dataset.
Dataset: We will use the 20 Newsgroup dataset available
in sklearn.datasets. As the name suggests, it includes text documents
from 20 different newsgroups.
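The dataset can be fetched through scikit-learn's `fetch_20newsgroups`; its `remove=("headers", "footers")` option performs the header/footer cleaning up front (the wrapper function name is ours, and the first call downloads the data):

```python
from sklearn.datasets import fetch_20newsgroups

def load_newsgroups(subset="train"):
    """Fetch 20 Newsgroups documents with headers and footers
    stripped, returning (texts, labels, label_names)."""
    data = fetch_20newsgroups(subset=subset,
                              remove=("headers", "footers"))
    return data.data, data.target, data.target_names
```

Passing `subset="test"` returns the held-out split in the same format.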
5. Procedure
1. Cleaning the dataset – Removing header and footer
2. Preprocessing data
2.1 Retaining only alphabets
2.2 Removing stopwords (e.g., I, me, my, is, am, are)
3. Splitting the data into training and validation sets.
4. Data Preparation - Preparing our data for the language model and the classification model
separately
5. Training the Model
6. Getting predictions
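Steps 1-2 of the procedure can be sketched with plain Python. The stopword list here is a tiny illustrative stand-in (in practice one would use a fuller list, e.g. from NLTK), and the helper name is ours:

```python
import re

# Tiny illustrative stopword list; a real pipeline would use a
# fuller list such as NLTK's English stopwords.
STOPWORDS = {"i", "me", "my", "is", "am", "are", "the", "a", "an"}

def preprocess(text):
    """Keep only alphabetic tokens and drop stopwords."""
    tokens = re.findall(r"[a-zA-Z]+", text.lower())
    return " ".join(t for t in tokens if t not in STOPWORDS)
```

For example, `preprocess("I am 100% sure the model works!")` keeps only `"sure model works"`: the number is discarded by the alphabetic filter, and "I", "am", and "the" are removed as stopwords.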