Anime recommendation (Big Data Certification#6)

Work by the Anime Recommendation group in the Big Data Certification course, cohort 6, presented on 20 January 2018 (B.E. 2561)

Published in: Technology

  1. Anime Recommendation
  2. Executive Summary
     • Problem Statement
     • Business Values
     • Project Requirements
  3. Problem Statement
     • What rating would a user give an anime if we offered it to them?
     • How popular is each anime, based on its followers?
     • How many groups do anime fall into, based on their genres?
     • Which anime should be recommended to a user, based on their preferences?
  4. Business Values
     • Able to choose anime that match current viewers
     • Able to push advertisements to potential viewers
     • Able to upsell similar products for each anime
     • Able to accurately predict anime rating and popularity for license acquisition
  5. Requirements
     • Anime data and user ratings
     • Recommendation algorithm using ALS (alternating least squares)
     • Clustering algorithm using K-Means
     • Model evaluation using RMSE (root-mean-square error)
  6. Data
     • From https://www.kaggle.com/CooperUnion/anime-recommendations-database
     • Contains user preference data from 73,516 users on 12,294 anime
     • Each user can add anime to their completed list and give it a rating; this data set is a compilation of those ratings
  7. Data
     • 2 files: anime.csv and rating.csv
     • Data volume
       • 12,294 rows for anime.csv
       • 7,813,737 rows for rating.csv
  8. Schema
     • anime.csv
       • anime_id: myanimelist.net's unique id identifying an anime
       • name: full name of the anime
       • genre: comma-separated list of genres for this anime
       • type: movie, TV, OVA, etc.
       • episodes: number of episodes in this show (1 if movie)
       • rating: average rating out of 10 for this anime
       • members: number of community members that are in this anime's "group"
  9. Schema
  10. Schema
      • rating.csv
        • user_id: non-identifiable, randomly generated user id
        • anime_id: the anime that this user has rated
        • rating: rating out of 10 this user has assigned (-1 if the user watched it but didn't assign a rating)
  11. Schema
  12. Feature
      • Use the original dataset to build the recommendation model
      • Extract the unique genres from the genre column in anime.csv to build the clustering model
  13. Feature
      • Recommendation
        • anime_id
        • rating, also used as target
        • user_id
  14. Feature
      • Clustering
        • anime_id
        • Pivoted genres (Action, Adventure, Comedy, Drama, …)
        • type
        • episodes
        • rating
        • members
        • Use predicted cluster as target
  15. Running Prototyping Experiment
      • Get data
      • Data pre-processing
      • Feature engineering
      • Train the model
      • Model evaluation
  16. Get Data
      • Dataset was downloaded from https://www.kaggle.com/CooperUnion/anime-recommendations-database
      • Data is in comma-separated values (CSV) file format
      • See data information in the "Data" section
  17. Data Pre-Processing
      • The retrieved data are well-formed
      • Some NULL values were found in rating
      • Unknown episode counts are represented as "Unknown"
      • Rows with NULL and/or Unknown values were filtered out
      • About 500 rows were filtered out in total
  18. Feature Engineering
      • Use the original data schema
      • Process only the data in rating.csv
      • Use anime_id, user_id and rating as features
      • rating is also used as the target
  19. Train the Model
      • Process only the data in rating.csv
      • Ratio of train-to-test data is 80:20
      • Use the ALS algorithm to build the rating prediction model
  20. Model Evaluation
      • Data in anime.csv is used to map anime_id to a human-readable name
      • Predicted ratings are floating-point values
      • RMSE is used as the evaluation method
      • Some rows of test data cannot be predicted; the result for them is "NaN"
      • NaN (Not-a-Number) results were filtered out
  21. Anime Recommendation Part 2
  22. Contents
      • Clustering model with K-Means
      • Real-time data processing
      • Visualization
  23. Clustering with K-Means
  24. Environment
      • CRAN R 3.4.2
      • Anime data file (anime.csv)
      • Genres distance file (distance.csv)
  25. Build a Clustering Model
      • Try building the model with 5 to 10 clusters
      • Use the distance.csv file to determine the distance
      • Visualize the clusters
  26. Discussion
      • The distance value can be determined as indicated in the discussion "How to produce a pretty plot of the results of k-means cluster analysis?" (https://stats.stackexchange.com/questions/31083/how-to-produce-a-pretty-plot-of-the-results-of-k-means-cluster-analysis)
      • The distance value in anime clustering should be a normalized value
      • It could be the percentage of running time taken up by scenes of each genre
      • Example: action scenes run for 12 minutes out of 24 minutes, so the distance for action is 50%
  27. Real-time Data Processing
  28. Environment
      • Web API
      • Kafka
      • Spark Streaming
  29. Environment (diagram; labels: Client ×3, Request, Response, Producer, Consumer)
  30. Demonstration
  31. Visualization
  32. Demonstration
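
The recommendation workflow on the Data Pre-Processing, Train the Model and Model Evaluation slides (drop NULL/Unknown rows, split 80:20, fit ALS, compute RMSE after discarding NaN predictions) might look roughly like the sketch below. It is an illustration under stated assumptions, not the group's code: the slides do not name the ALS implementation, so a deliberately tiny rank-1 ALS in plain Python stands in for it; the rows are made up; and the function names, the seed and the `reg` value are hypothetical. Here a prediction is NaN when the user or anime never appears in the training split, which is one common cause; the slides do not state the cause.

```python
import math
import random

def drop_incomplete(rows):
    """Drop rows that contain an empty (NULL) or "Unknown" field."""
    return [row for row in rows if all(v not in ("", "Unknown") for v in row)]

def train_test_split(rows, train_ratio=0.8, seed=42):
    """Shuffle a copy of the rows and split it train_ratio : (1 - train_ratio)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

def als_rank1(ratings, iterations=10, reg=0.1):
    """Rank-1 ALS: fit rating ~ u[user] * v[anime] by alternating closed-form updates."""
    u = {user: 1.0 for user, _, _ in ratings}
    v = {anime: 1.0 for _, anime, _ in ratings}
    for _ in range(iterations):
        # Hold the anime factors fixed; solve a regularised least squares per user.
        for user in u:
            rows = [(v[a], r) for usr, a, r in ratings if usr == user]
            u[user] = sum(x * r for x, r in rows) / (reg + sum(x * x for x, _ in rows))
        # Hold the user factors fixed; solve a regularised least squares per anime.
        for anime in v:
            rows = [(u[usr], r) for usr, a, r in ratings if a == anime]
            v[anime] = sum(x * r for x, r in rows) / (reg + sum(x * x for x, _ in rows))
    return u, v

def predict(u, v, user, anime):
    """Predicted rating, or NaN when the user or anime was not seen in training."""
    if user not in u or anime not in v:
        return float("nan")
    return u[user] * v[anime]

def rmse_ignoring_nan(actual, predicted):
    """RMSE over the pairs whose prediction is a number; NaN predictions are skipped."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if not math.isnan(p)]
    if not pairs:
        return float("nan")
    return math.sqrt(sum((a - p) ** 2 for a, p in pairs) / len(pairs))

# Made-up (user_id, anime_id, rating) rows as read from a CSV; not real data.
raw = [("A", "1", "8"), ("A", "2", "6"), ("A", "3", "7"), ("B", "1", "9"),
       ("B", "2", ""), ("B", "3", "8"), ("C", "1", "4"), ("C", "2", "3"),
       ("C", "3", "5"), ("D", "1", "7"), ("D", "2", "6")]
ratings = [(user, anime, float(r)) for user, anime, r in drop_incomplete(raw)]
train, test = train_test_split(ratings)
u, v = als_rank1(train)
predicted = [predict(u, v, user, anime) for user, anime, _ in test]
score = rmse_ignoring_nan([r for _, _, r in test], predicted)
print(f"RMSE on {len(test)} test rows: {score:.3f}")
```

A single latent factor keeps each update a one-line closed form; practical ALS implementations use several factors and solve a small linear system per user and per item instead.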
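
The clustering features (genres pivoted out of the comma-separated genre column) and the K-Means step can be sketched as follows. The group worked in R with a distance.csv file whose format the slides do not show, so this Python sketch makes several assumptions: it uses only one-hot genre vectors, plain squared Euclidean distance, made-up rows, and a deterministic "first k distinct points" initialisation chosen for reproducibility. The names `pivot_genres` and `kmeans` are hypothetical.

```python
def pivot_genres(rows):
    """One-hot encode each anime's comma-separated genre string."""
    per_row = [{g.strip() for g in row["genre"].split(",")} for row in rows]
    genres = sorted(set().union(*per_row))
    vectors = [[1.0 if g in row_genres else 0.0 for g in genres]
               for row_genres in per_row]
    return genres, vectors

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20):
    """Lloyd's k-means; the first k distinct points serve as initial centroids."""
    centroids = []
    for p in points:
        if p not in centroids:
            centroids.append(list(p))
        if len(centroids) == k:
            break
    if len(centroids) < k:
        raise ValueError("need at least k distinct points")
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Made-up rows in the shape of anime.csv's genre column; not real data.
anime = [
    {"anime_id": 1, "genre": "Action, Adventure"},
    {"anime_id": 2, "genre": "Comedy, Drama"},
    {"anime_id": 3, "genre": "Action, Adventure, Fantasy"},
    {"anime_id": 4, "genre": "Comedy, Drama, Romance"},
]
genres, vectors = pivot_genres(anime)
labels, centroids = kmeans(vectors, k=2)
print(genres)
print(labels)  # [0, 1, 0, 1]
```

With the full data set the same call would be repeated for k from 5 to 10, as the Build a Clustering Model slide suggests, and the resulting clusterings compared.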