<feature>:<value> pairs are represented by <word>:<tfidf>
(Common words are eliminated before preparing the data set.)
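As an illustration of the <word>:<tfidf> representation, here is a minimal sketch in Python. The stop-word list, the whitespace tokenizer, and the exact TF and IDF formulas (relative term frequency times log(N / document frequency)) are assumptions; the slides do not state which variants were used.

```python
import math
from collections import Counter

# Hypothetical stop-word list; the source does not say which words were removed.
STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to"}

def tokenize(text):
    """Lowercase, split on whitespace, and drop common words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tfidf_vectors(documents):
    """Return one {word: tfidf} dict per document."""
    token_lists = [tokenize(doc) for doc in documents]
    n_docs = len(token_lists)
    # Document frequency: number of documents that contain each word.
    doc_freq = Counter(word for tokens in token_lists for word in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return vectors

docs = ["the kernel of the svm", "the svm is trained", "text of a pdf"]
for vec in tfidf_vectors(docs):
    # Print each document as <word>:<tfidf> pairs.
    print(" ".join(f"{w}:{v:.3f}" for w, v in sorted(vec.items())))
```

Words that occur in every document get a weight of zero under this formula, which is one reason common words can be dropped up front.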
For each Field/Group,
the following procedure is
repeated (training phase):
[Diagram, training phase for Field/Group K. Labels: Collection Model by Dr. Zeil]
Download documents (PDF) -> Convert PDF to text -> Model documents using TF and IDF -> Positive training set and negative training set for Field/Group K -> SVM for Field/Group K
[Diagram, testing phase]
Input test document (PDF) -> Convert PDF to text -> Model document using TF and IDF -> Trained SVMs for Field/Group 1, ..., Field/Group K, ..., Field/Group N -> Estimate in the range 0 to 1 indicating how likely Field/Group K maps to the test document.
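The one-classifier-per-Field/Group idea can be sketched as follows. This is not the original implementation: it swaps in a small linear SVM trained by stochastic subgradient descent on the regularised hinge loss, and it maps the decision value to the range 0 to 1 with a fixed logistic function. The slides do not say which SVM package or which probability mapping was used, and all function names and toy vectors below are hypothetical.

```python
import math
import random

def decision_value(model, feats):
    """Linear decision value w.x + b for a sparse {feature: value} vector."""
    weights, bias = model
    return sum(weights.get(f, 0.0) * v for f, v in feats.items()) + bias

def train_linear_svm(examples, epochs=100, eta=0.1, lam=0.01, seed=0):
    """Train on (features, label) pairs with label +1 or -1."""
    rng = random.Random(seed)
    weights, bias = {}, 0.0
    for _ in range(epochs):
        order = examples[:]
        rng.shuffle(order)
        for feats, label in order:
            margin = label * decision_value((weights, bias), feats)
            # Regularisation step: shrink all weights slightly.
            for f in weights:
                weights[f] *= 1.0 - eta * lam
            # Hinge-loss step: correct only when the margin is violated.
            if margin < 1.0:
                for f, v in feats.items():
                    weights[f] = weights.get(f, 0.0) + eta * label * v
                bias += eta * label
    return weights, bias

def estimate(model, feats):
    """Squash the decision value into (0, 1); an assumed stand-in mapping."""
    return 1.0 / (1.0 + math.exp(-decision_value(model, feats)))

def train_all(training_sets):
    """One SVM per Field/Group, each with its own positive/negative set."""
    return {group: train_linear_svm(ex) for group, ex in training_sets.items()}

def score_document(models, feats):
    """Estimate, per Field/Group, how likely it maps to the test document."""
    return {group: estimate(m, feats) for group, m in models.items()}

# Toy, made-up vectors standing in for scaled <word>:<tfidf> features.
training_sets = {
    "K": [
        ({"svm": 1.0, "kernel": 0.5}, +1),
        ({"svm": 0.8}, +1),
        ({"protein": 1.0}, -1),
        ({"protein": 0.7, "cell": 0.4}, -1),
    ],
}
models = train_all(training_sets)
print(score_document(models, {"svm": 0.9}))
print(score_document(models, {"protein": 0.9}))
```

A test document is scored against every trained model, so each Field/Group gets its own independent estimate rather than a single forced choice.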
Improving the results
Scaling the vectors in the data sets
To bring the <value>s in <feature>:<value> pairs into the range 0 to 1
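One way to scale the values into the range 0 to 1 is sketched below. It assumes non-negative values (as TF-IDF weights are) and divides each feature by its largest value in the training data; the slides do not say which scaling method was used, and the function names are hypothetical.

```python
def fit_scale(vectors):
    """Record the largest value seen for each feature in the training vectors."""
    max_val = {}
    for vec in vectors:
        for feat, val in vec.items():
            max_val[feat] = max(max_val.get(feat, 0.0), val)
    return max_val

def apply_scale(vec, max_val):
    """Divide each value by its feature's training maximum, capped at 1."""
    scaled = {}
    for feat, val in vec.items():
        top = max_val.get(feat, 0.0)
        # Features never seen in training (or always zero) are dropped.
        if top > 0:
            scaled[feat] = min(val / top, 1.0)
    return scaled

train = [{"svm": 0.20, "kernel": 0.55}, {"svm": 0.40, "trained": 0.10}]
max_val = fit_scale(train)
print([apply_scale(v, max_val) for v in train])
```

The maxima are computed on the training vectors only and then reused for test vectors, so both sets are scaled consistently.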
5 Field/Groups were selected at random:
140200, 120200, 201300, 220200, 250400.
For each field/group,
70 PDF files were downloaded.
50 files were used as positive files for training
20 files were used for testing
An additional 50 files were taken randomly from all other field/groups as negative files for training.
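The split described above can be sketched as follows. The file names are invented placeholders, the random seed is arbitrary, and drawing negatives only from the other four selected Field/Groups is an assumption; the slides say only that negatives came from all other field/groups.

```python
import random

def build_sets(files_by_group, group, n_train=50, n_test=20, n_neg=50, seed=0):
    """Return (positive training, negative training, test) file lists."""
    rng = random.Random(seed)
    own = list(files_by_group[group])
    rng.shuffle(own)
    positives = own[:n_train]
    test = own[n_train:n_train + n_test]
    # Negatives are drawn at random from every other Field/Group.
    others = [f for g, files in files_by_group.items() if g != group for f in files]
    negatives = rng.sample(others, n_neg)
    return positives, negatives, test

groups = ["140200", "120200", "201300", "220200", "250400"]
# Hypothetical file names: 70 PDFs per Field/Group.
files_by_group = {g: [f"{g}_{i:02d}.pdf" for i in range(70)] for g in groups}

pos, neg, test = build_sets(files_by_group, "140200")
print(len(pos), len(neg), len(test))  # prints: 50 50 20
```

Keeping the 20 test files disjoint from the 50 positive training files is what makes the test results a fair check of each trained classifier.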