Text Classification With Weka
Waikato Environment for Knowledge Analysis
Weka is a collection of machine learning algorithms for data mining tasks.
These algorithms can either be applied directly to data sets or be called
from your own Java code (Python access is possible through wrapper libraries).
With Weka you can do classification, clustering, regression and other
machine learning and data mining tasks, such as association rule mining.
Development of Weka began at the University of Waikato, New Zealand, in 1993; it is
free software available under the GNU General Public License.
Background
Machine Learning :
- Is a type of Artificial Intelligence (AI) that gives computers the ability to
learn without being explicitly programmed.
- A computer program is said to learn from experience E with respect to some class
of tasks T and performance measure P if its performance at tasks in T, as measured
by P, improves with experience E.
- E : the data we learn from.
- T : the task we are trying to achieve.
- P : the measure of how well the program performs the task (for example, accuracy).
Machine Learning has two main types:
- Supervised Learning : the computer is provided with a set of instances. Each
instance has a set of features and a label, and the task is to learn a general rule that
maps features to labels. Text classification is an example of this type.
- Unsupervised Learning : no labels are given to the computer, and the task is to
find patterns and relations between the instances based on the features they
contain.
Text Classification
The task of assigning predefined categories to free-text documents. It can
provide conceptual views of document collections and has important
applications in the real world.
How do we do text classification :
1- Data preparation.
2- Text processing ( tokenization => removing stop words => stemming => attribute
selection )
3- Training the model using the extracted text features.
4- Evaluating the model.
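The text processing step above can be sketched in plain Python. This is only an illustration of tokenization, stop-word removal and stemming, not what Weka does internally: the stop-word list is a tiny made-up sample, the stem function is a crude suffix stripper standing in for a real stemmer, and attribute selection is left out.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "is", "in", "for", "of", "to", "with"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def stem(token):
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters of the word remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Run tokenization => stop-word removal => stemming on one document."""
    return [stem(t) for t in remove_stop_words(tokenize(text))]


print(preprocess("Selling the used cars"))
```

The output of preprocess is the list of terms that a later step would turn into features for the classifier.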
OpenSooq Text Classification
Task Definition :
Build software that predicts a post's category from its description. The software will act as an
assistant that helps users find the appropriate category for the items they are selling.
How to build this software ?
We will use text classification with Weka to build an ML model that learns from the posts
that OpenSooq already has.
- Loading posts from database : Using Weka's database support together with the MySQL connector (JDBC driver), we
will load 500 posts from three categories. Each post (instance) contains a description
and a category.
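As a rough illustration of the loading step, the sketch below pulls (description, category) pairs out of a database. It uses Python's built-in sqlite3 with an in-memory database as a stand-in for MySQL, and the table name posts, its columns, the category names and the sample rows are all hypothetical. Taking a fixed number of posts per category is also an assumption about how the posts are selected.

```python
import sqlite3


def load_posts(conn, categories, per_category):
    """Return (description, category) pairs, at most per_category rows per category."""
    rows = []
    for category in categories:
        cursor = conn.execute(
            "SELECT description, category FROM posts WHERE category = ? LIMIT ?",
            (category, per_category),
        )
        rows.extend(cursor.fetchall())
    return rows


# Demo with an in-memory database and made-up rows (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (description TEXT, category TEXT)")
conn.executemany(
    "INSERT INTO posts VALUES (?, ?)",
    [
        ("Toyota Corolla 2010 in good condition", "cars"),
        ("iPhone 6 with charger", "mobiles"),
        ("Two bedroom apartment for rent", "real_estate"),
        ("Honda Civic, low mileage", "cars"),
    ],
)

posts = load_posts(conn, ["cars", "mobiles", "real_estate"], per_category=1)
print(posts)
```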
- Text processing : Using the StringToWordVector filter provided by Weka, we will
transform each description into a word vector.
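To make the idea of a word vector concrete, here is a minimal bag-of-words sketch in plain Python. It is not Weka's StringToWordVector, which has many more options; the function names and the use_counts flag are hypothetical and only show the difference between word presence and word counts.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(documents):
    """Sorted list of every distinct word seen in the documents."""
    return sorted({word for doc in documents for word in tokenize(doc)})


def to_word_vector(text, vocabulary, use_counts=False):
    """One entry per vocabulary word: 1/0 presence flags, or counts if use_counts."""
    counts = Counter(tokenize(text))
    if use_counts:
        return [counts[word] for word in vocabulary]
    return [1 if counts[word] > 0 else 0 for word in vocabulary]


docs = ["red car for sale", "phone for sale", "red red phone"]
vocab = build_vocabulary(docs)
vectors = [to_word_vector(d, vocab) for d in docs]

print(vocab)
for doc, vector in zip(docs, vectors):
    print(vector, doc)
```

Each description becomes a fixed-length list of numbers, one per vocabulary word, which is the form a classifier can learn from.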
- Learning Model : We will build two models from the training data and
compare their performance. The two models are J48 (Weka's C4.5 decision tree) and Naive Bayes.
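For intuition about how a Naive Bayes text classifier learns from labelled posts, here is a small sketch. It implements the multinomial variant with add-one (Laplace) smoothing, which is commonly used for text and is not necessarily what Weka's Naive Bayes classifier does internally. The class name NaiveBayesSketch and the training posts are made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayesSketch:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)        # documents per category
        self.word_counts = defaultdict(Counter)    # word counts per category
        self.vocabulary = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        # Words never seen in training carry no information, so skip them.
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training posts (hypothetical data).
train_texts = [
    "toyota corolla car for sale",
    "honda civic car low mileage",
    "iphone mobile with charger",
    "samsung mobile phone for sale",
]
train_labels = ["cars", "cars", "mobiles", "mobiles"]

model = NaiveBayesSketch().fit(train_texts, train_labels)
print(model.predict("used car toyota"))
print(model.predict("samsung mobile"))
```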
- Model Evaluation : Measuring the accuracy of the model. It can be
done by applying the model to data that is already labelled (ideally data that was not
used for training) and counting how many instances are classified correctly.
- In Weka you can do model evaluation by :
- Feeding the model a separate set of testing data.
- Splitting the data into training/testing sets.
- Cross-validation.
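The evaluation options above can be sketched in plain Python. This is an illustration of the ideas (accuracy, a training/testing split, and k-fold cross-validation), not Weka's evaluation code. The helper names, the toy majority-class learner and the generated data are hypothetical, and the folds are not stratified.

```python
import random
from collections import Counter


def accuracy(predict, instances):
    """Fraction of (features, label) instances whose label is predicted correctly."""
    correct = sum(1 for features, label in instances if predict(features) == label)
    return correct / len(instances)


def train_test_split(instances, train_fraction=0.7, seed=0):
    """Shuffle a copy of the data and cut it into training and testing parts."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def cross_validate(train, instances, k=5, seed=0):
    """Average accuracy over k folds; assumes at least k instances."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    scores = []
    for i, test_fold in enumerate(folds):
        # Train on every fold except the one held out for testing.
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        predict = train(training)
        scores.append(accuracy(predict, test_fold))
    return sum(scores) / k


def train_majority(training):
    """Toy learner: always predicts the most common training label."""
    majority = Counter(label for _, label in training).most_common(1)[0][0]
    return lambda features: majority


# Made-up data: 15 "cars" posts and 5 "mobiles" posts.
data = [("post %d" % i, "cars" if i % 4 else "mobiles") for i in range(20)]

train_set, test_set = train_test_split(data)
print("hold-out accuracy:", accuracy(train_majority(train_set), test_set))
print("5-fold accuracy:", cross_validate(train_majority, data, k=5))
```

Any learner that takes training instances and returns a predict function can be passed in place of train_majority, which makes it easy to compare two models on the same folds.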
- Using the model to predict post categories :
- We will use PyWeka to load the model in
Python and use it to predict the category of a
post.
