Analysis and knowledge extraction of user behaviour and social media content for art culture events

Nowadays people share everything on online social networks, from daily life stories to the latest local and global news and events. Many researchers have exploited this as a source for understanding user behaviour and profiles in various settings. In this paper, we address the specific problem of user behavioural profiling in the context of cultural and artistic events. We propose an analysis pipeline that examines the profile of online users based on the textual content they publish. The pipeline covers the following aspects: data extraction and enrichment, topic modeling, user clustering, prediction of interest, and content analysis, including profiling of images and subjects. We show our approach at work in monitoring participation in a large-scale artistic installation that attracted more than 1.5 million visitors in just two weeks (namely The Floating Piers, by Christo and Jeanne-Claude). We report our findings and discuss the pros and cons of the work.

Published in: Data & Analytics

  1. 1. Analysis & Knowledge Extraction of Online User Behaviour and Visual Content for Art and Culture Events. Marco Brambilla, Tahereh Arabghalizi, Behnam Rahdari. Contacts: Marco Brambilla, @marcobrambi, marco.brambilla@polimi.it, http://datascience.deib.polimi.it UNIVERSITY OF PITTSBURGH
  2. 2. Agenda Context Method • Pre-processing • Topic analysis • User clustering • Multimedia: Images • concepts vs. text extraction • color schema and the main color pattern(s) • Prediction of interests Challenges & Conclusions
  3. 3. Context • Role of social media in our life • Social media for cultural and artistic events • Behaviour and content • Multi-disciplinary collaboration on social media analysis and cultural heritage • Collaboration: Politecnico di Milano, Musei di Brescia, University of Pittsburgh
  4. 4. Research Questions Topics of interest of visitors? Categorization of users? Demographics of visitors? Engagement and online participation? Relation between photos, time, location, text and the event?
  5. 5. Approach Domain-specific pipeline to profile social media users and content in cultural or art events
  6. 6. Case Study: The Floating Piers by Christo and Jeanne-Claude, Iseo Lake, Italy, June 2016
  7. 7. Case Study
  8. 8. Case Study • 17 MLN $ • 220,000 floating blocks • 1.5 MLN visitors in 16 days
  9. 9. Pre-processing
  10. 10. Data Extraction • Using Instagram and Twitter APIs • Extract relevant tweets/posts during the event • Extract all relevant users o That tweet/post directly o that like, comment, retweet, etc. • Extract all properties o Textual: bio, tweet/post text, hashtag, etc. o Quantitative: #followers, #followings, etc. o Media: photos, metadata (geotag, …)
  11. 11. Collected Data (from June 10th to July 30th)
                           Twitter (tweets)   Instagram (posts)
        Tweets / Posts          14,062             30,256
        Users                   23,916             94,666
        Authors                  7,724             16,681
        Reacting                16,197             77,985
  12. 12. • Text normalization (NLP) • Language identification and translation • Gender detection • Data cleansing • Store clean and transformed data Preprocessing
  13. 13. Time Distribution (Twitter)
  14. 14. Time series – Instagram vs. Twitter
  15. 15. Instagram Likes and Comments
  16. 16. Italy Lombardy Region Iseo Lake Geographical Distribution (Instagram)
  17. 17. Data Analysis Process 1. Document Term Matrix (DTM) 2. Topic Extraction 3. Dimension Reduction 4. Cluster Analysis and Validation 5. Prediction 6. Media Analysis 7. Content Network Analysis
  18. 18. Topics
  19. 19. Document-term Matrix: a matrix that describes the frequency of terms that occur in a collection of documents
        Documents \ Terms   Art   Travel   Italy   Design   …
        Post 1               0      1        1       0
        Post 2               1      2        0       1
        Post 3               0      0        1       0
        Post 4               1      1        3       1
        …
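A document-term matrix like the one on this slide can be sketched with the standard library alone. The four posts below are hypothetical texts chosen only so that the counts match the slide's example, and whitespace tokenization is an assumption; the deck does not say which tokenizer the authors used.

```python
from collections import Counter

def build_dtm(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in document i."""
    # Assumption: lowercase + whitespace split stands in for real text normalization.
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

# Hypothetical posts (not from the collected data).
posts = [
    "travel italy",
    "art travel travel design",
    "italy",
    "art travel italy italy italy design",
]
vocabulary, dtm = build_dtm(posts)

for name, row in zip(["Post 1", "Post 2", "Post 3", "Post 4"], dtm):
    print(name, dict(zip(vocabulary, row)))
```

Columns come out in alphabetical order (art, design, italy, travel), so they are a reordering of the slide's columns rather than a different matrix.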
  20. 20. Topic Extraction. Latent Dirichlet Allocation (LDA): documents as mixtures of topics (with probability). Input: Document Term Matrix. Outputs: Topics, Topic Probabilities Matrix
                 Topic 1   Topic 2   Topic 3   Topic 4   Topic 5   Topic 6   …
        Post 1    0.19      0.16      0.27      0.14      0.11      0.13
        Post 2    0.31      0.18      0.21      0.08      0.10      0.12
        Post 3    0.25      0.24      0.20      0.17      0.09      0.05
        Post 4    0.19      0.32      0.22      0.10      0.07      0.10
        …
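As a small illustration of how the Topic Probabilities Matrix can be read, the sketch below takes the four rows shown on this slide and picks the most probable topic for each post. It does not run LDA; it only consumes the kind of output LDA produces. The function name `dominant_topic` is hypothetical.

```python
# Rows copied from the slide's example Topic Probabilities Matrix.
topic_probabilities = {
    "Post 1": [0.19, 0.16, 0.27, 0.14, 0.11, 0.13],
    "Post 2": [0.31, 0.18, 0.21, 0.08, 0.10, 0.12],
    "Post 3": [0.25, 0.24, 0.20, 0.17, 0.09, 0.05],
    "Post 4": [0.19, 0.32, 0.22, 0.10, 0.07, 0.10],
}

def dominant_topic(row):
    """Return the 1-based index of the most probable topic in one row."""
    return max(range(len(row)), key=row.__getitem__) + 1

dominant = {post: dominant_topic(row) for post, row in topic_probabilities.items()}
# Each visible row sums to 1, consistent with a mixture over topics.
row_sums = {post: sum(row) for post, row in topic_probabilities.items()}

for post, topic in dominant.items():
    print(f"{post}: dominant topic = Topic {topic}")
```

A hard assignment like this discards the mixture information; the clustering step later in the deck works on the full probability rows instead.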
  21. 21. Dimensionality Reduction • Hundreds of topics extracted with LDA • Using Principal Component Analysis (PCA) to extract a smaller set of linearly uncorrelated topics [Plot: variance share and cumulative variance share per component, with a cut-off at cumulative share > 0.95]
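One common way to read the "> 0.95" cut-off is: keep the smallest number of principal components whose cumulative variance share exceeds 0.95. The sketch below shows only that selection rule, not PCA itself. The component variances are made-up numbers, and the interpretation of the threshold is an assumption based on the plot labels.

```python
from itertools import accumulate

def components_to_keep(variances, threshold=0.95):
    """Return (k, shares, cumulative) for the smallest k whose cumulative share > threshold."""
    total = sum(variances)
    # Sort descending so components are considered from most to least informative.
    shares = [v / total for v in sorted(variances, reverse=True)]
    cumulative = list(accumulate(shares))
    for k, value in enumerate(cumulative, start=1):
        if value > threshold:
            return k, shares, cumulative
    return len(shares), shares, cumulative

# Hypothetical per-component variances (e.g. eigenvalues of the covariance matrix).
variances = [5.0, 3.0, 1.0, 0.6, 0.3, 0.1]
k, shares, cumulative = components_to_keep(variances)
print("components kept:", k)
```

With these illustrative numbers, three components cover about 90% of the variance and the fourth pushes the cumulative share past the threshold.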
  22. 22. User Clustering
  23. 23. Cluster Analysis • Apply clustering algorithms over Topic Probabilities Matrix to cluster users • Multiple data slices • Multiple algorithms o K-means o Hierarchical o DBSCAN [Figure: clusters plotted along Topic 1, Topic 2 and Topic 3]
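To make the K-means option concrete, here is a minimal sketch of Lloyd's algorithm applied to rows of topic probabilities. The six users and their three-topic profiles are hypothetical, and seeding the centroids with the first k points is a simplification chosen for reproducibility; the deck does not describe the authors' initialization or implementation.

```python
import math

def kmeans(points, k, iterations=100):
    """Lloyd's algorithm; centroids are seeded with the first k points (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    assignment = [None] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment, centroids

# Hypothetical users described by (Topic 1, Topic 2, Topic 3) probabilities.
users = [
    (0.8, 0.1, 0.1), (0.1, 0.8, 0.1), (0.1, 0.1, 0.8),
    (0.7, 0.2, 0.1), (0.2, 0.7, 0.1), (0.1, 0.2, 0.7),
]
labels, centroids = kmeans(users, k=3)
print(labels)
```

K-means needs the number of clusters up front, which is one reason the deck pairs it with hierarchical clustering, DBSCAN and the validity measures on the next slide.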
  24. 24. Cluster Validity • How to evaluate the “goodness” of the resulting clusters? • Validation Measures – Internal : ex. Silhouette Coefficient, Dunn’s Index, Calinski-Harabasz index, etc. – External: ex. Entropy, Purity, Rand index, etc.
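As a sketch of one internal and one external measure from this slide, the code below computes the mean Silhouette Coefficient (which needs only the points and cluster labels) and Purity (which needs ground-truth classes). The data points and truth labels are hypothetical.

```python
import math
from collections import Counter

def silhouette(points, labels):
    """Mean silhouette coefficient; requires at least two clusters."""
    scores = []
    for i, (p, label) in enumerate(zip(points, labels)):
        same = [math.dist(p, q) for j, (q, l) in enumerate(zip(points, labels))
                if l == label and j != i]
        if not same:  # singleton cluster: silhouette is defined as 0
            scores.append(0.0)
            continue
        a = sum(same) / len(same)  # mean distance to own cluster
        b = min(                   # mean distance to the nearest other cluster
            sum(math.dist(p, q) for q, l in zip(points, labels) if l == other)
            / labels.count(other)
            for other in set(labels) if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)

def purity(labels, truth):
    """Fraction of points that belong to the majority true class of their cluster."""
    clusters = {}
    for label, t in zip(labels, truth):
        clusters.setdefault(label, Counter())[t] += 1
    return sum(max(c.values()) for c in clusters.values()) / len(labels)

# Hypothetical one-dimensional data with two well-separated clusters.
points = [(0.0,), (0.1,), (1.0,), (1.1,)]
labels = [0, 0, 1, 1]
sil = silhouette(points, labels)
pur = purity(labels, ["a", "a", "b", "a"])
print(round(sil, 3), pur)
```

Silhouette values close to 1 indicate compact, well-separated clusters; purity can only be computed when some labelled reference is available.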
  25. 25. User Clustering
  26. 26. Cluster Labeling: word clouds of users' biographies for Travel Lovers, Art Lovers, and Internet & Tech Lovers
  27. 27. Word Network for Clusters
  28. 28. Hierarchical Clustering: Travel Lovers, Art Lovers, Tech Lovers
  29. 29. Impact of Demographics: Language, Gender
  30. 30. Prediction
  31. 31. Prediction Predict the category or the interest area of potential new users for similar cultural or art events in the future Decision Trees o Prepare Required Data o Grow Decision Tree o Extract rules from the tree o Predict using test data o Evaluate
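The "Grow Decision Tree" step boils down to repeatedly choosing the split that best separates the classes. The sketch below finds a single best threshold on one numeric feature using Gini impurity. The criterion, the feature values and the labels are all assumptions for illustration; the deck does not state which splitting criterion or tool was used.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted_gini) of the best binary split on one feature."""
    best = (None, float("inf"))
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical training data: number of status updates per user and cluster label.
status_counts = [3, 8, 12, 17, 25, 40]
clusters = ["Travel Lover"] * 3 + ["Art Lover"] * 3
threshold, impurity = best_split(status_counts, clusters)
print(threshold, impurity)
```

A full tree applies the same search recursively to each side of the split, over all features, until a stopping rule is met; each root-to-leaf path then reads as one rule.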
  32. 32. Extracted Rules (prediction rules)
        Rule 1: if (0.36 < Bio_score < 0.37 OR Bio_score < 0.35) then Travel Lover
        Rule 2: if (0.35 < Bio_score < 0.36 AND Status_count > 14.5) OR (Bio_score > 0.37 AND Language != Italian) then Art Lover
        Rule 3: if (Bio_score > 0.37 AND Language = Italian) then Tech Lover
        Otherwise: Not Interested
        Accuracy = 62%
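The extracted rules translate directly into a small function. This is a literal transcription of the slide's rules; the function name and argument names are hypothetical, and how `Bio_score` is computed is not described in the deck.

```python
def predict_interest(bio_score, status_count, language):
    """Apply the slide's three rules in order; anything unmatched is 'Not Interested'."""
    # Rule 1
    if 0.36 < bio_score < 0.37 or bio_score < 0.35:
        return "Travel Lover"
    # Rule 2
    if (0.35 < bio_score < 0.36 and status_count > 14.5) or (
        bio_score > 0.37 and language != "Italian"
    ):
        return "Art Lover"
    # Rule 3
    if bio_score > 0.37 and language == "Italian":
        return "Tech Lover"
    # The slide uses strict inequalities, so the exact boundary values
    # 0.35, 0.36 and 0.37 also end up here.
    return "Not Interested"

print(predict_interest(0.30, 5, "Italian"))    # Travel Lover
print(predict_interest(0.355, 20, "English"))  # Art Lover
print(predict_interest(0.40, 5, "Italian"))    # Tech Lover
```

With a reported accuracy of 62%, these rules are better read as a rough indication of interest than as a reliable classifier.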
  33. 33. Decision Tree
  34. 34. Image Analysis
  35. 35. Only Instagram (collected data, from June 10th to July 30th)
                           Twitter (tweets)   Instagram (posts)
        Tweets / Posts          14,062             30,256
        Users                   23,916             94,666
        Authors                  7,724             16,681
        Reacting                16,197             77,985
  36. 36. Used Instagram Filters
  37. 37. People in Pictures
  38. 38. Visitor Analytics: Age; Sex (50.4% female, 49.6% male); Race. Bias of the medium?
  39. 39. Image content analysis • Concept extraction (DNN-based third-party service) • Comparison with hashtags / text • Image low-level feature analysis
  40. 40. Object Extraction from Pictures: Concepts in Pictures vs. Hashtags. Users tend not to report the actual content of the photos in their textual descriptions/hashtags
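One simple way to quantify the gap between what a photo shows and what its author writes is the Jaccard overlap between the extracted concepts and the post's hashtags. The metric is an assumption, since the deck does not say how the comparison was measured, and the concept and hashtag lists below are hypothetical.

```python
def normalize(tags):
    """Lowercase tags and strip a leading '#' so concepts and hashtags are comparable."""
    return {tag.lower().lstrip("#") for tag in tags}

def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = normalize(a), normalize(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical output of a concept-extraction service for one photo.
concepts = ["water", "bridge", "people", "lake"]
# Hypothetical hashtags written by the author of the same post.
hashtags = ["#TheFloatingPiers", "#christo", "#lake", "#art"]

overlap = jaccard(concepts, hashtags)
print(round(overlap, 3))
```

A low overlap across many posts would be in line with the slide's observation that hashtags name the event and the artist rather than the visible content.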
  41. 41. Color Detection for Subject Identification: main color shades among all photos
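A very simple way to find main color shades is to quantize each RGB channel into a few bins and count which bins occur most often. This is a sketch of that idea only. The deck does not say which color-detection method was used, and the pixel values below are hypothetical rather than taken from the collected photos.

```python
from collections import Counter

def dominant_colors(pixels, levels=4, top=3):
    """Quantize each RGB channel into `levels` bins and return the most common bins."""
    step = 256 // levels

    def quantize(pixel):
        # Map each channel to the centre of the bin it falls into.
        return tuple((channel // step) * step + step // 2 for channel in pixel)

    return Counter(quantize(p) for p in pixels).most_common(top)

# Hypothetical pixels: three orange-like and two blue-like samples.
pixels = [
    (250, 160, 20), (245, 150, 30), (240, 170, 10),
    (30, 90, 200), (20, 100, 210),
]
shades = dominant_colors(pixels)
print(shades)
```

In practice the pixel list would come from decoded images, which requires an imaging library, and a clustering method over colors could replace the fixed bins.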
  42. 42. Confusion Matrix Simple techniques “good enough”? Objects or Colors?
  43. 43. Ongoing Challenges
  44. 44. Future Challenges of KE Determining exact positioning based on perspective
  45. 45. Future Challenges of KE Network structures and their temporal evolution Max graph perturbation Daily graph variations
  46. 46. Future Challenges • Real cross-disciplinarity (cultural heritage, humanities, social science) • No visitors for the cultural part of the event! (exhibition at the museum)
  47. 47. Conclusions • (Sometimes) Simple methods work just fine • Interesting profiling and behaviour detection • Still far from cross-disciplinary approaches
  48. 48. Contacts: Marco Brambilla, @marcobrambi, marco.brambilla@polimi.it http://datascience.deib.polimi.it http://www.marco-brambilla.com Analysis of Online User Behaviour for Art and Culture Events Marco Brambilla, Tahereh Arabghalizi, Behnam Rahdari UNIVERSITY OF PITTSBURGH
