Slides presented at AI-Biz.
Title : Identifying Legality of Japanese Online Advertisements using Complex-valued Support Vector Machine with DFT-based Document Features
20211115 jsai international_symposia_slide
1. Identifying Legality of Japanese Online Advertisements
using Complex-valued Support Vector Machine
with DFT-based Document Features
The Graduate School of Arts and Sciences, The Open University of Japan
Satoshi Kawamoto
2. Background of this study
• Issues in Web Advertising
• Problematic expressions
• Violating the Pharmaceutical Affairs Law
• Wording such as "physician endorsed"
• Expressions related to patents
• Needs for a system to determine the validity of advertisements
• As the market expands, manual screening is becoming difficult.
• Benefits of implementing a discriminant model
• Ad-serving companies
• Reduce the manual workload of screening
• Advertisers
• Reduce risk of brand damage
• Media
• Prevent users from leaving
3. Related Work
• Determining Legality (Chinese Advertisements)
• SVM + Weighting of word vectors (Y. Tang et al.)
• Word weighting using log-frequency ratio (weighted binary vector)
• Highlight words that occur frequently in problematic documents.
• Issues
• No word order information
• It’s unclear why weighting is effective.
• Dependency-based CNN (H. Huang et al.)
• Word Embedding + Syntactic Structure
• Overall, CNN works better than SVM when categorizing Chinese Ads
• Issues
• Difficulty in tuning parameters
• Requires a relatively large amount of data
4. Definition of problematic advertisements (Prohibited Expressions)
• Problematic under the Pharmaceutical Affairs Law
• Restrictions on expressions related to efficacy and safety
• “Fine lines and wrinkles will disappear”, “Anti-aging effect will be obtained”
• Restrictions on efficacy guarantee expressions
• Historical phrases such as "proven to be effective over a period of 100 years"
• Presenting clinical data or experimental examples
• Wording that guarantees effectiveness
• Restrictions on wording regarding ingredients and raw materials
• Listing ingredients and raw materials without indicating their purpose
• Wording that may imply pharmacological effects
• Restrictions on slanderous advertising of other companies' products
• Restrictions on recommendations from pharmaceutical professionals
5. Words likely to appear in problematic Ads
Words related to pharmaceuticals appear throughout.
Advertisements containing these words may violate the Pharmaceutical Affairs Law;
however, merely containing them does not make an advertisement illegal.
Tang’s weighting
6. Distribution of log frequency ratios for each part of speech
High variance in verbs and nouns
In nouns, there are many outliers.
(Problematic documents often include problematic nouns.)
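The log frequency ratio behind this distribution can be sketched as below. This is a minimal illustration, not the formula from Tang et al.: the additive smoothing, the function name `log_frequency_ratios`, and the toy token lists are all assumptions made for the example.

```python
import math
from collections import Counter


def log_frequency_ratios(problem_docs, legal_docs, smoothing=1.0):
    """Return {word: log(P(word | problematic) / P(word | legal))}.

    Each document is a list of tokens. Additive smoothing is an assumption
    (not taken from the slides); it keeps the ratio finite for words that
    occur in only one of the two classes.
    """
    problem_counts = Counter(w for doc in problem_docs for w in doc)
    legal_counts = Counter(w for doc in legal_docs for w in doc)
    vocab = set(problem_counts) | set(legal_counts)

    problem_total = sum(problem_counts.values()) + smoothing * len(vocab)
    legal_total = sum(legal_counts.values()) + smoothing * len(vocab)

    ratios = {}
    for word in vocab:
        p_problem = (problem_counts[word] + smoothing) / problem_total
        p_legal = (legal_counts[word] + smoothing) / legal_total
        ratios[word] = math.log(p_problem / p_legal)
    return ratios


# Toy, made-up token lists (not data from the study).
problem_docs = [["wrinkles", "disappear", "cream"], ["cream", "guaranteed"]]
legal_docs = [["cream", "moisturizing"], ["gentle", "cream"]]
ratios = log_frequency_ratios(problem_docs, legal_docs)
for word, ratio in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print(f"{word:12s} {ratio:+.3f}")
```

Words with a positive ratio are over-represented in the problematic class; grouping the ratios by part of speech would give a distribution of the kind shown on this slide.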
7. Where do the words with a large log frequency ratio appear?
[Figure: word frequency versus relative word position, from the start to the end of a sentence]
Words used very frequently in problematic documents
tend to occur near the center of the sentence.
8. Where do the words with a large log frequency ratio appear?
[Figure: word frequency versus relative word position, from the start to the end of a sentence]
Words that are likely to appear in problematic documents
tend to appear near the start, the center, or the end of
the sentence.
There is no significant bias in the location of the words.
9. Effective document vector for classification of advertising documents
1. Certain words (e.g., medical science) are more likely to appear in problematic documents.
Statistical information is effective
2. Words with a large log frequency ratio tend to appear in characteristic locations.
- Likely to appear in specific locations in a sentence
- Some words appear periodically
Word order information and periodic information are effective
Features combining word weighting and discrete Fourier transform
If "word weighting," "word order information," and "periodic information"
are embedded into document vectors, discriminant models will be able to
categorize advertisements accurately.
10. How to create DFT-based document vector
[Figure: pipeline for building the document vector. Example words 今 (now), 話題 (hot topic), の [particle], ふるさと (hometown), 納税 (tax payments) → word2vec → word weighting → weighted embedding → DFT → random projection.]
DFT components:
• No rotation: statistical information (SWEM-Aver)
• One rotation: word-order information
• Two rotations: periodic information
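The pipeline on this slide can be sketched roughly as follows. This is an illustration under stated assumptions, not the author's implementation: the helper names `dft_features` and `random_projection`, the 1/T normalisation, the Gaussian projection matrix, and the random stand-in embeddings are all hypothetical choices; real word2vec vectors and real word weights would replace them.

```python
import numpy as np


def dft_features(embeddings, weights, n_freq=3):
    """Weight the word vectors, then apply a DFT along the word-position axis.

    embeddings: (T, D) array, one row per word in sentence order.
    weights:    (T,) array of per-word weights.
    Returns a complex vector of length n_freq * D holding components
    k = 0 (no rotation), k = 1 (one rotation), k = 2 (two rotations).
    """
    embeddings = np.asarray(embeddings, dtype=float)
    weights = np.asarray(weights, dtype=float)
    n_words = embeddings.shape[0]

    weighted = weights[:, None] * embeddings
    # Zero-pad very short documents so that n_freq components exist.
    n_fft = max(n_words, n_freq)
    spectrum = np.fft.fft(weighted, n=n_fft, axis=0) / n_words
    return spectrum[:n_freq].reshape(-1)


def random_projection(features, out_dim=64, seed=0):
    """Reduce dimensionality with a fixed random Gaussian matrix (assumed form)."""
    rng = np.random.default_rng(seed)
    matrix = rng.standard_normal((features.size, out_dim)) / np.sqrt(out_dim)
    return features @ matrix


# Stand-in for the five-word example: random vectors instead of word2vec.
rng = np.random.default_rng(42)
emb = rng.standard_normal((5, 8))
w = np.array([1.0, 2.0, 0.5, 1.5, 1.5])

feats = dft_features(emb, w)
doc_vec = random_projection(feats)
print(doc_vec.shape, doc_vec.dtype)
```

In this sketch the k = 0 component is just the average of the weighted word vectors, which is why it corresponds to SWEM-Aver; the k = 1 and k = 2 components are complex-valued, so the resulting document vector is complex as well.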
11. Outline of Complex-valued SVM (CV-SVM)
Discriminant function, defined in terms of the document vector, a basis function, and a bias.
[Figure: legal documents and illegal documents plotted on the complex plane (axes: Re, Im)]
12. Simulation using holdout method
• Data
• Cosmetics Advertisements
• Illegal: 3008, Legal: 8103
• How to divide the data
• Training (50%)
• Negative examples are downsampled and set to the same number as positive examples.
• Validation (25%)
• Adjust SVM parameters for higher F-measure (RBF kernel), balancing Precision and Recall
• Test (25%)
• Evaluation of Discrimination Performance
• Numerical evaluation (Accuracy, Precision, Recall, F-measure)
• Model & Feature
• SVM: SWEM-Aver
• CNN: word2vec
• Complex-valued SVM (CV-SVM): SWEM-Aver, DFT-Based Feature
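The holdout procedure and the four metrics listed on this slide can be sketched as below. Several details are assumptions, since the slides do not state them: the split is stratified by class, illegal advertisements are treated as the positive class (label 1), and the function names `stratified_holdout` and `binary_metrics` are hypothetical.

```python
import numpy as np


def stratified_holdout(labels, seed=0):
    """Split indices 50/25/25 per class, then balance the training set.

    labels: array of 0 (legal, negative) and 1 (illegal, positive).
    Stratifying by class is an assumption, not stated in the slides.
    """
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    parts = {"train": [], "val": [], "test": []}
    for cls in (0, 1):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_train = int(0.50 * len(idx))
        n_val = int(0.25 * len(idx))
        parts["train"].append(idx[:n_train])
        parts["val"].append(idx[n_train:n_train + n_val])
        parts["test"].append(idx[n_train + n_val:])

    # Downsample negative training examples to the number of positives.
    neg_train, pos_train = parts["train"]
    n_keep = min(len(neg_train), len(pos_train))
    neg_train = rng.choice(neg_train, size=n_keep, replace=False)
    parts["train"] = [neg_train, pos_train]
    return {name: np.concatenate(chunks) for name, chunks in parts.items()}


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F-measure for 0/1 labels."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = int(np.sum(y_true & y_pred))
    fp = int(np.sum(~y_true & y_pred))
    fn = int(np.sum(y_true & ~y_pred))
    tn = int(np.sum(~y_true & ~y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}


# Class sizes from the slide: 3008 illegal (1) and 8103 legal (0).
labels = np.array([1] * 3008 + [0] * 8103)
split = stratified_holdout(labels)
print({name: len(idx) for name, idx in split.items()})

# Toy predictions, only to show the metric computation.
scores = binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(scores)
```

Only the training portion is balanced; the validation and test portions keep the original class ratio, so the F-measure tuned on validation data reflects the imbalance the model would face in use.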
13. Simulation Results
Word order information and period information improve discrimination performance.
Accuracy improves when
Both Precision and Recall are good.
14. Discussion of simulation results
• Combining word vector weighting and DFT yields a high F-measure
• Benefits of weighting
• Higher Accuracy and Precision.
• Benefits of DFT
• Achieve both Precision and Recall at a high level (>0.75)
• Why was this simulation result obtained?
• The position of words with a high log frequency ratio is characteristic.
• They tend to appear at the beginning, at the center, or at the end of a sentence
• Word order information is embedded.
• Words with a cycle of about half the sentence length are emphasized.
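The last two points can be illustrated with a toy signal. In this sketch, which uses a made-up one-dimensional weight profile rather than data from the study, heavily weighted words sit at the beginning, the center, and the end of a 21-word sentence, and the DFT component with two rotations (a cycle of about half the sentence length) comes out much larger than the component with one rotation.

```python
import numpy as np

n_words = 21  # hypothetical sentence length

# Toy weight profile: large weights at the start, center, and end only.
signal = np.zeros(n_words)
signal[[0, n_words // 2, n_words - 1]] = 1.0

# DFT over word positions, normalised by sentence length.
spectrum = np.fft.fft(signal) / n_words
magnitudes = np.abs(spectrum[:3])

for k, magnitude in enumerate(magnitudes):
    print(f"k={k} rotations: |F_k| = {magnitude:.3f}")
```

The three spikes are spaced about half a sentence apart, so they add up almost in phase at k = 2 and largely cancel at k = 1, which is one way to read the statement that such words are emphasized.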
15. Summary
• Survey of characteristics of advertising documents
• Characteristics of words that appear in problematic documents
• Certain nouns and verbs are more likely to appear.
• There is a large bias in the position of the words.
• Discrimination Simulations
• Word weighting is highly effective.
• Discrimination performance is high when DFT and word weighting are combined.
• CV-SVM can handle complex-valued vectors and has high generalization performance.
• Future work
• Can this model also discriminate non-cosmetic advertisements?
• Is this model effective for general discrimination tasks?
• Need to compare with more recent models, such as BERT