Semantically Enriched Machine Learning Approach to Filter YouTube Comments for Socially Augmented User Models

Ahmad Ammari, Vania Dimitrova, Dimoklis Despotakis
School of Computing, University of Leeds, Leeds, UK

Presented By:
Ahmad Ammari
User and Community Modelling
School of Computing, University of Leeds, UK
Outline
• The ImREAL Project
• Socially Augmented User Modelling
• Research Objective, Roadmap, Challenges
• The Social Noise Filtering Approach
  –   Machine Learning-Based
  –   Methodology
  –   Comment Content Pre-Processing
  –   Semantic Enrichment
  –   Scoring and Labelling the Training Dataset
• Experimental Description / Results
• Evaluation
• Conclusions & Future Work
ImREAL: Immersive Reflective Experience-based Adaptive Learning
Specific Targeted Research Project (STReP) – FP7
Partners
  University of Leeds, UK;
  Trinity College Dublin, Ireland;
  Graz University of Technology, Austria;
  University of Erlangen-Nuremberg, Germany;
  Delft University of Technology, NL;
  Imaginary SRL - IMA, Italy;
  Empower The User, ETU, Ireland
Problem:
  Experience in a simulated world is disconnected from the ‘real world’
[Figure: continuum from REALITY through Augmented Reality and Augmented Virtuality to VIRTUALITY, with the ImREAL Approach in the middle]
Augmented Simulated Experiential Learning
[Figure labels, Simulated Learning Environment side: Adaptive, Interactive Simulated Experiential Learning Environment; User model; coach; Practice; Provide content; Meta-cognitive scaffolding]
[Figure labels, Real World Experience side: Real world activity modelling; Records of Real Job-related Experiences; Other participants (e.g. customers, managers)]
[Figure label linking the two sides: Augmented user modelling]
Socially Augmented User Modelling
[Figure labels: Simulated Environment; User Profiles; Existing User Model (Limited Scope!!); Open Social Spaces; Social Profiles (Sports, Psychology, Diseases, Politics); Weighted Social Interests; Socially Augmented User Model]
Broad Research Objective
Mining social media content generated by users who have awareness of and/or interest in an activity domain, to derive social profiles that augment existing user models.
Research Roadmap / Challenges
   • Three-Phase Research Roadmap towards achieving the Broad Objective
[Figure: Phase One, Phase Two and Phase Three; the only phase named on the slide is Social Noise Filtration]
The Social Noise Filtering Approach
• Supervised Machine Learning Model
  – Historic content with known relevance states is used for training
  – The machine learning model learns the underlying rules
  – The model is then used to predict the unknown relevance states of new content, with a certain prediction confidence
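
The three steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it hand-rolls a multinomial Naive Bayes classifier (one of the two classifier families named later in the deck) with add-one smoothing, and the function names, the toy comments and their labels are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def train_nbm(docs, labels):
    """Fit a multinomial Naive Bayes model on tokenised, labelled comments."""
    vocab = {t for d in docs for t in d}
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        term_counts[label].update(tokens)
    return {"vocab": vocab, "class_counts": class_counts, "term_counts": term_counts}


def predict(model, tokens):
    """Return (label, confidence); confidence is the normalised posterior."""
    vocab = model["vocab"]
    total_docs = sum(model["class_counts"].values())
    log_post = {}
    for label, n_docs in model["class_counts"].items():
        counts = model["term_counts"][label]
        denom = sum(counts.values()) + len(vocab)  # add-one smoothing
        lp = math.log(n_docs / total_docs)
        for t in tokens:
            if t in vocab:  # terms unseen in training are ignored
                lp += math.log((counts[t] + 1) / denom)
        log_post[label] = lp
    # Turn log-scores into probabilities that sum to 1.
    top = max(log_post.values())
    exp = {k: math.exp(v - top) for k, v in log_post.items()}
    best = max(exp, key=exp.get)
    return best, exp[best] / sum(exp.values())


# Toy "historic content" with known relevance states (invented data).
train_docs = [
    ["interview", "job", "experience"],
    ["confident", "interviewee", "answer"],
    ["lol", "funny", "video"],
    ["first", "lol"],
]
train_labels = ["relevant", "relevant", "noisy", "noisy"]

model = train_nbm(train_docs, train_labels)
label, confidence = predict(model, ["job", "interview", "answer"])
print(label, round(confidence, 3))
```

The confidence value is what allows a filtering service to keep only predictions it is reasonably sure about; where to set that cut-off is a design choice the slides do not specify.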
The Social Noise Filtration Service: Methodology
CASE STUDY: Filtering YouTube Comments
  Social Media Source: YouTube
  Subject Content: Public Comments on Shared Videos
  Activity Domain: Job Interview
[Figure labels, first path: Experimentally Controlled Comments; Analyze; Semantically Enriched Job Interview Bag of Words (JIBoW)]
[Figure labels, second path: Public Comments On YouTube; Pre-Process; Term – Comment Matrix (Training Corpus)]
[Figure labels, joining step: SCORE; SCORES]
YouTube Video Selection
• Selected as part of a research study by
  [Despotakis, Lau & Dimitrova, 2011]
• Four Job Interview-related categories are
  manually identified from video content
  – Guides / Best Practices
  – Interviewee’s Stories
  – Interviewer’s Stories
  – Interview Mock Examples
• Videos from all categories are selected to
  retrieve the comment set for ML training
Comment Content Pre-Processing
• Objective: Deriving dataset for Classification
• Steps:
  1. Stop Word Removal
  2. Stemming
  3. tfidf Weighting
  4. Comment – Term Matrix (CTM)
• Example:
  – Before: “I think most Americans are like the first example”
  – After: think – Americans – like – first – example
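
The four pre-processing steps can be sketched in plain Python as below. This is an illustrative stand-in only: a later slide names the Lucene API for pre-processing, whereas here the stop-word list is a tiny invented one, `naive_stem` is a deliberately crude suffix stripper rather than a real stemmer, and tf-idf is computed as raw term count times log(N / document frequency), which is one common variant and not necessarily the one used in the study.

```python
import math
from collections import Counter

# Toy stop-word list (assumption); a real list would be much longer.
STOP_WORDS = {"i", "most", "are", "the", "a", "is", "to", "of", "in"}


def tokenize(text):
    """Lower-case the text and strip surrounding punctuation from each word."""
    return [w.strip(".,!?\"'").lower() for w in text.split()]


def remove_stop_words(tokens):
    return [t for t in tokens if t and t not in STOP_WORDS]


def naive_stem(token):
    """Crude suffix stripping; a placeholder for a proper stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    return [naive_stem(t) for t in remove_stop_words(tokenize(text))]


def comment_term_matrix(comments):
    """Return (vocabulary, matrix) with one tf-idf weighted row per comment."""
    docs = [Counter(preprocess(c)) for c in comments]
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    matrix = [[d[t] * math.log(n / df[t]) for t in vocab] for d in docs]
    return vocab, matrix


comments = [
    "I think most Americans are like the first example",
    "The first example is funny",
]
tokens = preprocess(comments[0])
vocab, ctm = comment_term_matrix(comments)
print(tokens)
print(vocab)
```

Terms that occur in every comment receive a weight of zero under this variant, which is why tf-idf tends to push ubiquitous words out of the classifier's view.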
Semantically Enriched Job Interview Bag of Words
   • A Semantically Enriched Job Interview Bag of Words (JIBoW) is used as a novel means to score and label the training YouTube comment set
   • Derived from a collection of textual comments on job interview videos [*]
        – Experimentally controlled
        – Closed social space
   • Text and semantic pre-processing phases
   • Semantically expanded using the WordNet lexicon and DISCO with word synonyms, antonyms, derivations, and semantically similar words

[*] Despotakis, Lau, Dimitrova (2011): A Semantic Approach to Extract Individual Viewpoints from User Comments on An Activity, AUM Workshop, UMAP 2011, Girona, Spain
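
The expansion step can be illustrated with a small sketch. It does not call WordNet or DISCO: `TOY_LEXICON` is a hypothetical hand-written dictionary standing in for those lookups, and `expand_bag_of_words` is a name invented for this example. The point is only to show how a seed bag of words grows once each term contributes its synonyms, antonyms, derivations and similar words.

```python
# Hypothetical mini-lexicon standing in for WordNet / DISCO lookups.
TOY_LEXICON = {
    "job": {
        "synonyms": {"occupation", "employment"},
        "similar": {"career"},
    },
    "confident": {
        "synonyms": {"assured"},
        "antonyms": {"diffident"},
        "derivations": {"confidence"},
    },
}


def expand_bag_of_words(seed_terms, lexicon):
    """Return the seed terms plus every related word the lexicon lists for them."""
    expanded = set(seed_terms)
    for term in seed_terms:
        for related_words in lexicon.get(term, {}).values():
            expanded |= related_words
    return expanded


seed = {"job", "confident", "interview"}
jibow = expand_bag_of_words(seed, TOY_LEXICON)
print(sorted(jibow))
```

Terms without a lexicon entry (here "interview") simply stay in the bag unchanged.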
Scoring and Labelling Training Corpus
• A novel term frequency-based mathematical model
• Computes a relevance score for each observation in the training comment dataset
   – Intersection size between the comment BoW and the JIBoW
   – Score is normalized by the average intersection size
• A threshold is used to classify the comments for training a binary classifier
• Labels each observation (noisy, relevant) accordingly
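
A minimal sketch of this scoring and labelling idea follows. The exact formula and threshold are not given in the extracted slide text, so both are assumptions here: the score is taken to be the comment's intersection size with the JIBoW divided by the average intersection size over all comments, and the threshold of 1.0 (that is, "at least average overlap") is purely illustrative. The toy JIBoW and comments are invented.

```python
def score_comments(comment_bows, jibow):
    """Score each comment by its overlap with the JIBoW, normalised by the mean overlap."""
    sizes = [len(set(bow) & jibow) for bow in comment_bows]
    average = sum(sizes) / len(sizes) if sizes else 0.0
    if average == 0:
        return [0.0 for _ in sizes]  # no comment overlaps the JIBoW at all
    return [size / average for size in sizes]


def label_comments(scores, threshold=1.0):
    """Binary labelling; the default threshold is an assumption, not the study's value."""
    return ["relevant" if s >= threshold else "noisy" for s in scores]


toy_jibow = {"interviewee", "confident", "job", "experience", "work", "career"}
toy_comments = [
    ["interviewee", "confident", "job", "experience", "work", "life"],
    ["lol", "funny"],
    ["job", "boring"],
]

scores = score_comments(toy_comments, toy_jibow)
labels = label_comments(scores)
print(list(zip(scores, labels)))
```

Because the score is relative to the corpus average, the same comment can receive a different label when scored against a different comment set.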
Example Scoring & Labelling
C1: “The interviewee looks confident, he should have some job experience in his work life”

  Comment BoW: interviewee, confident, job, experience, work, life
  JIBoW: w10, w21, w34, w4, w57, w113, wn
Example Scored & Labelled Comments
Datasets
• YouTube API for Retrieval, Lucene API for Pre-Processing
• Corpus Description:
  [Table: Post-Analysis Data for the Experimentally Controlled Corpus and the YouTube Corpus; values not recoverable from the extracted text]
• Training Corpus: 1159 Instances
   – Classified by the scoring model for training C4.5 & Naïve Bayes Multinomial (NBM) classifiers
   – {724 Noisy, 435 Relevant}
• Derived a Comment Term Matrix: 1159 Instances × 903 tfidf Term Weights + 1 Discrete Class Column
Experimental Results
• For each classifier, models were trained & tested with three variations of the training-to-testing ratio
  [Slide links: See Evaluation Results; ROC Area]
• The two classifiers show good performance in predicting relevant & noisy comments in the testing data sets
• C4.5 is slightly better at identifying noisy comments out of the total noise in the data
• NBM shows less risk of misclassifying relevant comments as noise
Evaluation
A human-based evaluation experiment was conducted to measure how well the service:
Goal 1: Considers the comments that show awareness of the application domain (Job Interviews)
Goal 2: Considers the comments whose authors are likely interested in the application domain
[Slide links for each goal: See Example Question and Records]
Evaluation Results
Number of Evaluators                                        2
Number of Evaluated Comments (15% of Whole Dataset)       180
Number of Comments Scored as Relevant Comments             90
Number of Comments Scored as Noisy Comments                90

[Pie charts per evaluator and goal, with categories Noisy / Relevant / Doesn't know. Slice shares as extracted; the category of each slice is not recoverable: Evaluator 2, Goal 2: 17%, 24%, 59%; Evaluator 2, Goal 1: 3%, 42%, 55%; Evaluator 1, Goal 2: 9%, 46%, 45%; Evaluator 1, Goal 1: 15%, 19%, 66%]

                           Evaluator 2           Evaluator 1
Metric                   Goal 2    Goal 1      Goal 2    Goal 1
Total Match Rate          51.1%     68.3%       32.2%     60.0%
Total Mismatch Rate       48.9%     31.7%       67.8%     40.0%
Precision (Noisy)         42.2%     76.7%       36.7%     90.6%
Precision (Relevant)      76.7%     63.3%       73.3%     44.4%
Recall (Noisy)            73.1%     67.6%       84.6%     68.2%
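
For readers who want to see how metrics of this kind are typically derived, here is a small sketch comparing the service's labels with one evaluator's judgements. It uses the standard definitions of match rate, precision and recall; the slides do not state how "Doesn't know" answers were handled, so dropping them before counting is an assumption of this example, as are the label strings and the toy data.

```python
def evaluate(service_labels, human_labels, unknown="dont_know"):
    """Compare service labels with human judgements, ignoring 'unknown' answers."""
    pairs = [(s, h) for s, h in zip(service_labels, human_labels) if h != unknown]

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    def agree(label):
        return sum(1 for s, h in pairs if s == label and h == label)

    predicted_noisy = sum(1 for s, _ in pairs if s == "noisy")
    predicted_relevant = sum(1 for s, _ in pairs if s == "relevant")
    judged_noisy = sum(1 for _, h in pairs if h == "noisy")
    matches = sum(1 for s, h in pairs if s == h)

    return {
        "match_rate": ratio(matches, len(pairs)),
        "mismatch_rate": ratio(len(pairs) - matches, len(pairs)),
        "precision_noisy": ratio(agree("noisy"), predicted_noisy),
        "precision_relevant": ratio(agree("relevant"), predicted_relevant),
        "recall_noisy": ratio(agree("noisy"), judged_noisy),
    }


# Invented toy data: six comments, one of which the evaluator could not judge.
service = ["noisy", "noisy", "noisy", "relevant", "relevant", "relevant"]
human = ["noisy", "noisy", "relevant", "relevant", "noisy", "dont_know"]

metrics = evaluate(service, human)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

A different treatment of "Doesn't know" (for instance counting it as a mismatch) would change the denominators, so figures computed this way are not directly comparable with the table above.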
Summary
• Conclusions
  – A high proportion of YouTube video comments are noisy
  – ML models are good at predicting and filtering out comments that show neither author awareness of nor interest in the activity domain
• Future Work
  – Add more filters to improve the scoring and labelling mechanism, based on the evaluation baseline
  – Exploit an activity modelling ontology to derive the JIBoW
  – Evaluate the impact of semantic enrichment
YouTube-based Social Profiling Service: Methodology
[Figure labels: YouTube / SM Comments; Noise Filtration Service; Comments Predicted as Relevant (RC1 … RCn); Profiling Source Authors; YT User Profiles; Uploaded YT Video meta data; Favored YT Video meta data; Comments on the YT Videos; Social Profiling Corpus; Associations of Frequent Characteristics; Clusters of Social Profiles (Profile1, Profile2, ProfileN); ImREAL Simulators]
