AUM workshop paper presentation


Presenting work at the AUM2011 workshop of the UMAP conference, 2011, Girona, Spain

Published in: Technology, Business

  1. Semantically Enriched Machine Learning Approach to Filter YouTube Comments for Socially Augmented User Models
     Ahmad Ammari, Vania Dimitrova, Dimoklis Despotakis
     School of Computing, University of Leeds, Leeds, UK
     Presented by: Ahmad Ammari, User and Community Modelling, School of Computing, University of Leeds, UK
  2. Outline
     • The ImREAL Project
     • Socially Augmented User Modelling
     • Research Objective, Roadmap, Challenges
     • The Social Noise Filtering Approach
       – Machine Learning-Based Methodology
       – Comment Content Pre-Processing
       – Semantic Enrichment
       – Scoring and Labelling the Training Dataset
     • Experimental Description / Results
     • Evaluation
     • Conclusions & Future Work
  3. ImREAL: Immersive Reflective Experience-based Adaptive Learning
     • Specific Targeted Research Project (STReP), FP7
     • Partners: University of Leeds, UK; Trinity College Dublin, Ireland; Graz University of Technology, Austria; University of Erlangen-Nuremberg, Germany; Delft University of Technology, NL; Imaginary SRL (IMA), Italy; Empower The User (ETU), Ireland
     • Problem: Experience in a simulated world is disconnected from the ‘real world’
     • [Figure labels: Reality; Augmented Reality; Augmented Virtuality; Virtuality; ImREAL Approach]
  4. Augmented Simulated Experiential Learning
     [Diagram. Recoverable labels: Adaptive Simulated Experiential Learning Environment; Interactive coach; User model; Augmented user modelling; Real world activity modelling; Practice; Provide content; Meta-cognitive scaffolding; Records of Real Job-related Experiences; Other participants (e.g. customers, managers); Simulated Learning Environment; Real World Experience]
  5. Socially Augmented User Modelling
     [Diagram. Recoverable labels: Open Social Spaces; Simulated Environment; User Profiles; Social Profiles (Sports, Psychology, Diseases, Politics); Existing User Model (Limited Scope!!); Socially Augmented User Model; Weighted Social Interests]
  6. Broad Research Objective
     Mining social media content generated by users having awareness of and/or interest in an activity domain, to derive social profiles that augment existing user models
  7. Research Roadmap / Challenges
     • Three-phase research roadmap towards achieving the broad objective
     • [Figure labels: Phase One; Phase Two; Phase Three; Social Noise Filtration]
  8. The Social Noise Filtering Approach
     • Supervised Machine Learning Model
       – Historic content with known relevance states is used for training
       – The machine learning model learns the underlying rules
       – The model is then used to predict unknown relevance states for new content, with a certain prediction confidence
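The train-then-predict-with-confidence loop on this slide can be illustrated with a small multinomial Naive Bayes classifier, one of the two classifier families named later in the deck. This is only a minimal sketch: the class name `TinyMultinomialNB`, the toy comments, the Laplace smoothing value, and the use of raw term counts instead of tfidf weights are all assumptions made for illustration, not details taken from the slides.

```python
import math
from collections import Counter

class TinyMultinomialNB:
    """Illustrative multinomial Naive Bayes over tokenised comments (hypothetical helper)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant (assumed value)

    def fit(self, docs, labels):
        # Learn class priors and per-class term counts from labelled history.
        self.classes = sorted(set(labels))
        self.vocab = {t for d in docs for t in d}
        self.class_counts = Counter(labels)
        self.term_counts = {c: Counter() for c in self.classes}
        for doc, y in zip(docs, labels):
            self.term_counts[y].update(doc)
        self.n_docs = len(docs)
        return self

    def predict_proba(self, doc):
        # Compute normalised posterior probabilities, working in log space.
        v = len(self.vocab)
        log_post = {}
        for c in self.classes:
            total = sum(self.term_counts[c].values())
            lp = math.log(self.class_counts[c] / self.n_docs)
            for t in doc:
                if t in self.vocab:  # terms never seen in training are ignored
                    lp += math.log((self.term_counts[c][t] + self.alpha)
                                   / (total + self.alpha * v))
            log_post[c] = lp
        top = max(log_post.values())
        unnorm = {c: math.exp(lp - top) for c, lp in log_post.items()}
        z = sum(unnorm.values())
        return {c: u / z for c, u in unnorm.items()}

    def predict(self, doc):
        # Return the most probable class and its probability as the confidence.
        proba = self.predict_proba(doc)
        best = max(proba, key=proba.get)
        return best, proba[best]

# Toy "historic content with known relevance states" (invented for this sketch).
train_docs = [
    ["job", "interview", "experience"],
    ["interview", "confident", "answer"],
    ["lol", "funny", "video"],
    ["nice", "video", "song"],
]
train_labels = ["relevant", "relevant", "noisy", "noisy"]

model = TinyMultinomialNB().fit(train_docs, train_labels)
label, confidence = model.predict(["interview", "job"])
```

Here `label` is the predicted relevance state of the new comment and `confidence` is the posterior probability the model assigns to it.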
  9. The Social Noise Filtration Service: Methodology
     • CASE STUDY: Filtering YouTube Comments
       – Social Media Source: YouTube
       – Subject Content: Public Comments on Shared Videos
       – Activity Domain: Job Interview
     • [Diagram. Recoverable labels: Experimentally Controlled Comments; Analyze; Semantically Enriched Job Interview Bag of Words (JIBoW); Public Comments on YouTube; Pre-Process; Term – Comment Matrix (Training Corpus); SCORE; SCORES]
  10. YouTube Video Selection
      • Selected as part of a research study by [Despotakis, Lau & Dimitrova, 2011]
      • Four job interview-related categories were manually identified from the video content
        – Guides / Best Practices
        – Interviewee’s Stories
        – Interviewer’s Stories
        – Interview Mock Examples
      • Videos from all categories were selected to retrieve the comment set for ML training
  11. Comment Content Pre-Processing
      • Objective: Deriving a dataset for classification
      • Pipeline: (1) Stop Word Removal, (2) Stemming, (3) tfidf Weighting, (4) Comment – Term Matrix (CTM)
      • Example: "I think most Americans are like the first example" becomes: think – Americans – like – first – example
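The four pre-processing steps can be sketched in plain Python as follows. The deck states that the Lucene API was used for this stage; the code below is not that implementation. The stop word list, the crude suffix-stripping `naive_stem` function, and the `tf * ln(N / df)` weighting are stand-ins assumed purely for illustration.

```python
import math
from collections import Counter

# Assumption: a tiny illustrative stop word list, not the one used by the authors.
STOP_WORDS = {"i", "most", "are", "the", "a", "an", "is", "of", "to", "in"}

def naive_stem(token):
    """Crude suffix stripping; a hypothetical stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(comment):
    """Steps 1 and 2: lowercase, drop stop words, stem the remaining tokens."""
    tokens = [t.strip(".,!?\"'").lower() for t in comment.split()]
    return [naive_stem(t) for t in tokens if t and t not in STOP_WORDS]

def comment_term_matrix(comments):
    """Steps 3 and 4: build a comment-term matrix of tf * ln(N / df) weights."""
    docs = [preprocess(c) for c in comments]
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    matrix = []
    for d in docs:
        tf = Counter(d)
        matrix.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vocab, matrix

vocab, ctm = comment_term_matrix([
    "I think most Americans are like the first example",
    "The first example is good",
])
```

Each row of `ctm` corresponds to one comment and each column to one term in `vocab`; terms that occur in every comment receive a weight of zero under this particular weighting.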
  12. Semantically Enriched Job Interview Bag of Words
      • A Semantically Enriched Job Interview Bag of Words (JIBoW) is used as a novel means to score and label the training YouTube comment set
      • Collection of textual comments on job interview videos [*]
        – Experimentally controlled
        – Closed social space
      • Text and semantic pre-processing phases
      • Semantically expanded by the WordNet lexicon and DISCO with word synonyms, antonyms, derivations, and semantically similar words
      [*] Despotakis, Lau, Dimitrova (2011): A Semantic Approach to Extract Individual Viewpoints from User Comments on An Activity, AUM Workshop, UMAP 2011, Girona, Spain
  13. Scoring and Labelling Training Corpus
      • A novel term frequency-based mathematical model
      • Computes a relevance score for each observation in the training comment dataset
        – Intersection size between the comment BoW and the JIBoW
        – The score is normalized by the average intersection size
      • A threshold is used to classify the comments for training a binary classifier
      • Labels each observation (noisy, relevant) accordingly
  14. Example Scoring & Labelling
      • C1: “The interviewee looks confident, he should have some job experience in his work life”
      • [Figure: Comment BoW matched against JIBoW. Comment BoW terms: interviewee, confident, job, experience, work, life. JIBoW term labels as extracted: w10, w21, w34, w4, w57, w113, wn]
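One way to read the scoring and labelling model is sketched below: each comment's score is the size of the overlap between its bag of words and the JIBoW, divided by the average overlap across the training comments, and a threshold turns the score into a label. The slides do not give the exact formula or threshold, so the set-based intersection, the threshold of 1.0, the toy JIBoW, and the function names `relevance_scores` and `label_comments` are all assumptions.

```python
def relevance_scores(comment_bows, jibow):
    """Score each comment by its overlap with the JIBoW, normalised by the mean overlap."""
    sizes = [len(set(bow) & jibow) for bow in comment_bows]
    average = sum(sizes) / len(sizes) if sizes else 0.0
    if average == 0:
        # No comment overlaps the JIBoW at all: every score is zero.
        return [0.0] * len(sizes)
    return [size / average for size in sizes]

def label_comments(scores, threshold=1.0):
    """Label a comment 'relevant' when its score reaches the (assumed) threshold."""
    return ["relevant" if s >= threshold else "noisy" for s in scores]

# Hypothetical JIBoW containing the terms matched for C1 on the slide.
jibow = {"interviewee", "confident", "job", "experience", "work", "life", "interview"}

comment_bows = [
    ["interviewee", "look", "confident", "job", "experience", "work", "life"],  # C1
    ["nice", "video", "lol"],
    ["job", "interview", "tip"],
]

scores = relevance_scores(comment_bows, jibow)
labels = label_comments(scores)
```

With these toy inputs C1 overlaps the JIBoW in six terms, well above the average overlap, so it is the only comment labelled relevant. A frequency-weighted intersection would be an equally plausible reading of "term frequency-based".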
  15. Example Scored & Labelled Comments
  16. Datasets
      • YouTube API for Retrieval, Lucene API for Pre-Processing
      • Post-Analysis Data Description: YouTube Corpus; Experimentally Controlled Corpus [table not recovered]
      • Training Corpus: 1159 Instances
        – Classified by the scoring model for training C4.5 & Naïve Bayes Multinomial (NBM) classifiers
        – {724 Noisy, 435 Relevant}
      • Derived a Comment Term Matrix: 1159 Instances × 903 tfidf Term Weights + 1 Discrete Class Column
  17. Experimental Results
      • Three variations of training-to-testing ratio: models for each classifier have been trained and tested (See Evaluation ROC Area Results)
      • The two classifiers show good performance in predicting relevant and noisy comments in the testing datasets
      • C4.5 is slightly better at predicting noisy comments from within the total noise in the data
      • NBM shows less risk of misclassifying relevant comments as noise
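The ROC area mentioned on this slide can be computed directly from classifier confidence scores. The sketch below uses the pairwise-ranking definition: the fraction of (relevant, noisy) pairs in which the relevant comment receives the higher score, with ties counted as half. The function name `roc_area` and the toy scores are assumptions; the slides do not say how the reported ROC areas were obtained.

```python
def roc_area(scores, labels, positive="relevant"):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    pos = [s for s, y in zip(scores, labels) if y == positive]
    neg = [s for s, y in zip(scores, labels) if y != positive]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # A positive ranked above a negative counts 1, a tie counts 0.5.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy confidence scores for the "relevant" class on four test comments.
toy_scores = [0.9, 0.8, 0.4, 0.3]
toy_labels = ["relevant", "noisy", "relevant", "noisy"]
area = roc_area(toy_scores, toy_labels)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect separation of relevant from noisy comments, which makes the measure convenient for comparing classifiers across different training-to-testing ratios.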
  18. Evaluation
      A human-based evaluation experiment was conducted to measure how well the service:
      • Goal 1: Considers the comments that show awareness of the application domain (Job Interviews) (See Example Question and Records)
      • Goal 2: Considers the comments whose authors are likely interested in the application domain (See Example Question and Records)
  19. Evaluation Results
      • Number of Evaluators: 2
      • Number of Evaluated Comments (15% of Whole Dataset): 180
      • Number of Comments Scored as Relevant: 90
      • Number of Comments Scored as Noisy: 90
      • [Pie charts for Evaluator 1 and Evaluator 2, each for Goal 1 and Goal 2, with segments Noisy / Relevant / Doesn't know. Extracted values, segment assignment not recoverable: 9%, 3%, 15%, 17%, 24%, 46%, 19%, 42%, 45%, 66%, 59%, 55%]
      • Two metric tables, one per evaluator (the evaluator each table belongs to is not recoverable):

        Metric                  Goal 2    Goal 1
        Total Match Rate        51.1%     68.3%
        Total Mismatch Rate     48.9%     31.7%
        Precision (Noisy)       42.2%     76.7%
        Precision (Relevant)    76.7%     63.3%
        Recall (Noisy)          73.1%     67.6%

        Metric                  Goal 2    Goal 1
        Total Match Rate        32.2%     60.0%
        Total Mismatch Rate     67.8%     40.0%
        Precision (Noisy)       36.7%     90.6%
        Precision (Relevant)    73.3%     44.4%
        Recall (Noisy)          84.6%     68.2%
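Metrics of the kind reported on this slide can be derived from pairs of (service label, evaluator judgement). The sketch below is one plausible set of definitions, not necessarily the one the authors used: match rate is the share of comments where both agree, precision for a class divides the agreements by the number of comments the service put in that class, and recall divides them by the number the evaluator put in that class. Treating a "doesn't know" answer as agreeing with neither class is an assumption, as are the function names and the toy data.

```python
def match_rate(pairs):
    """Share of comments where the service label equals the evaluator's judgement."""
    return sum(1 for model, human in pairs if model == human) / len(pairs)

def precision_recall(pairs, cls):
    """Precision and recall of the service for one class against one evaluator."""
    predicted = sum(1 for model, _ in pairs if model == cls)
    judged = sum(1 for _, human in pairs if human == cls)
    agreed = sum(1 for model, human in pairs if model == cls and human == cls)
    precision = agreed / predicted if predicted else 0.0
    recall = agreed / judged if judged else 0.0
    return precision, recall

# Toy (service label, evaluator judgement) pairs; "unknown" stands for "doesn't know".
toy_pairs = [
    ("noisy", "noisy"),
    ("noisy", "relevant"),
    ("relevant", "relevant"),
    ("relevant", "unknown"),
]

overall = match_rate(toy_pairs)
noisy_precision, noisy_recall = precision_recall(toy_pairs, "noisy")
relevant_precision, relevant_recall = precision_recall(toy_pairs, "relevant")
```

Under these definitions the mismatch rate is simply one minus the match rate, which is consistent with the paired match and mismatch figures in the tables.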
  20. Summary
      • Conclusions
        – A high rate of YouTube video comments are noisy
        – ML models are good at predicting and filtering out comments that show neither author awareness of nor interest in the activity domain of interest
      • Future Work
        – Add more filters to improve the scoring and labelling mechanism, based on the evaluation baseline
        – Exploit an activity modelling ontology to derive the JIBoW
        – Evaluate the impact of semantic enrichment
  21. YouTube-based Social Profiling Service: Methodology
      [Diagram. Recoverable labels: YouTube / SM Comments; Noise Filtration Service; Comments Predicted as Relevant (RC1 ... RCn); Clusters of Social Profiles (Profile1, Profile2, ... ProfileN); Associations of Frequent Characteristics; Profiling Source; Authors; YT User Profiles; Uploaded YT Video meta data; Favored YT Video meta data; Comments on the YT Videos; Social Profiling Corpus; ImREAL Simulators]
  22. Presented by: Ahmad Ammari, User and Community Modelling, School of Computing, University of Leeds, UK