STUDENT NAME : SRIRAGAVI J
REGISTER NUMBER : 422323106022
INSTITUTION : TCET - VANDAVASI
DEPARTMENT : ECE – II ND YEAR
DATE OF SUBMISSION : 15-05-2025
GITHUB REPOSITORY LINK:
https://github.com/boo253-hue/Personalized-Movie-Recommendation-System-Using-Machine-Learning.git
PERSONALIZED MOVIE
RECOMMENDATION SYSTEM USING
MACHINE LEARNING
Problem Statement
● Aim: Build a movie recommendation system based on the ‘MovieLens’
dataset.
● We wish to integrate user personalization with the overall features
of a movie, such as genre and popularity.
ABSTRACT
Recommendation systems are becoming increasingly important in
today’s hectic world. People are always on the lookout for
products and services that are best suited to them. Recommendation
systems are therefore important, as they help people make the
right choices without having to expend their cognitive resources.
Here, I will build a Movie Recommendation System using collaborative
filtering by implementing the K-Nearest Neighbors algorithm. I will also
predict the rating of a given movie based on its neighbors and
compare it with the actual rating.
SYSTEM REQUIREMENTS
• Operating System – Windows 8/10/11
• JupyterLab
• Visual Studio Code (VS Code)
• Python
• Processor : Intel i3 processor or above
• CPU : 2.0GHz or above
• RAM : 4GB or more
• Hard Disk : 500GB
PROJECT OBJECTIVES
● This project tackles the critical challenge of credit card fraud detection and prevention.
● Our goal is to develop effective methods using machine learning, anomaly detection, and deep
learning to identify fraudulent activities.
● This widespread criminal activity leads to financial losses and identity theft for consumers, while
businesses face chargebacks and reputational damage. Secure financial transactions are the
bedrock of trust in today's digital economy.
FLOW CHART OF PROJECT WORKFLOW
[Figures: Genre Distribution; Number of ratings per user]
DATASET DESCRIPTION
● MovieLens review dataset (ml-latest-small)
○ Ratings: 100k
○ Movies: 9k
○ Users: 600
● Integrated the dataset with the publicly available IMDB and TMDB datasets.
● Split the dataset into 80% training and 20% testing based on the User ID.
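One possible reading of the split "based on the User ID" is that each user's ratings are divided 80/20, so every user appears in both sets. The sketch below assumes that reading. The tuple layout, the function name `split_per_user` and the seed are illustrative and not taken from the project.

```python
import random
from collections import defaultdict

def split_per_user(ratings, test_fraction=0.2, seed=42):
    """Split (user_id, movie_id, rating) tuples so that roughly
    test_fraction of each user's ratings lands in the test set."""
    by_user = defaultdict(list)
    for row in ratings:
        by_user[row[0]].append(row)

    rng = random.Random(seed)
    train, test = [], []
    for user_id in sorted(by_user):
        rows = by_user[user_id][:]
        rng.shuffle(rows)                      # random order within one user
        n_test = int(len(rows) * test_fraction)
        test.extend(rows[:n_test])
        train.extend(rows[n_test:])
    return train, test

# Toy data: 3 users with 10 ratings each (hypothetical values)
ratings = [(u, m, 3.0 + (u + m) % 3) for u in range(1, 4) for m in range(10)]
train, test = split_per_user(ratings)
print(len(train), len(test))
```

Splitting within each user keeps every user in the training set, which collaborative filtering needs in order to make predictions for that user.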
Models
1. Popularity based model
2. Content based model
3. Collaborative Filtering
4. Matrix Factorization method
5. Combined model ( SVD + CF)
6. Hybrid model
[Figures: Action Movies; Animated Movies]
DATA PREPROCESSING
● Encoding: converted categorical variables into numerical variables
○ Binary Encoding : Gender
○ One Hot Encoding : Transaction Category
● Standard Scaling: performed standard scaling to normalize numerical features.
This ensures all variables are on a similar scale, preventing features with
larger magnitudes from dominating the model.
● Oversampling: to handle the imbalance of the dataset by adding more copies
of the minority class.
○ SMOTE (Synthetic Minority Over-sampling Technique): a smarter way to
oversample; it creates synthetic samples that are similar to the existing
minority class samples.
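The three preprocessing steps named on this slide can be sketched in plain Python as follows. This is a minimal illustration only: the helper names are hypothetical, and the oversampling shown is simple random duplication of minority rows, not SMOTE.

```python
import random
from statistics import mean, pstdev

def standard_scale(values):
    """Rescale a numeric column to zero mean and unit variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def one_hot(values):
    """Map each category to a 0/1 indicator vector (columns sorted by name)."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until both classes
    have the same number of rows (assumes two classes)."""
    rng = random.Random(seed)
    counts = {c: labels.count(c) for c in set(labels)}
    minority = min(counts, key=counts.get)
    n_missing = max(counts.values()) - counts[minority]
    minority_rows = [r for r, l in zip(rows, labels) if l == minority]
    extra = [rng.choice(minority_rows) for _ in range(n_missing)]
    return rows + extra, labels + [minority] * n_missing

print(standard_scale([1.0, 2.0, 3.0]))
print(one_hot(['food', 'travel', 'food']))
print(random_oversample([[1], [2], [3], [4]], [0, 0, 0, 1])[1])
```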
EDA (Exploratory Data Analysis)
● Data Cleaning
○ Removed the columns that are not required for model building
○ No nulls were present; rectified inappropriate data types
● Feature Engineering
○ Created some new features as required
○ E.g., 'is_fraud_cat' for categorical analysis
○ For numerical analysis: 'age', 'trans_month', 'trans_year', 'month_name', etc.
● Categorical Variable Analysis
○ Visualized transaction categories and gender distribution, both for the
entire dataset and specifically for fraudulent transactions
○ Top 10 fraudulent transactions by job, city, and state
● Numerical Variable Analysis
○ Visualized overall skewness
○ Class balance: Not Fraud (99.4%), Fraud (0.6%)
● Bivariate Analysis
○ Visualization with 'is_fraud': age groups, latitudinal & longitudinal
distance, and month & year
FEATURE ENGINEERING
1. User profile based on item profiles
a. Genre
b. Year of release of movie
2. Movie - Movie similarity
import pandas as pd

# Read the CSV file
movies = pd.read_csv('dataset.csv')
# Print all details of the first 10 movies
movies.head(10)
# Calculate statistical data such as count, mean and std
movies.describe()
# Print all columns with their non-null counts and data types
movies.info()
# Return the number of missing values in each column
movies.isnull().sum()
movies.columns

# Keep the required columns, then combine overview and genre into 'tags'
movies = movies[['id', 'title', 'overview', 'genre']].copy()
# The space keeps the last overview word separate from the first genre
movies['tags'] = movies['overview'] + ' ' + movies['genre']
new_data = movies.drop(columns=['overview', 'genre'])
new_data
MODEL BUILDING
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Convert the text tags into numerical count vectors
cv = CountVectorizer(max_features=10000, stop_words='english')
vector = cv.fit_transform(new_data['tags'].values.astype('U')).toarray()
vector.shape
# Pairwise cosine similarity between all movies
similarity = cosine_similarity(vector)
# Row of the query movie (the original hard-coded index 2 here)
index = new_data[new_data['title'] == "The Godfather"].index[0]
# Rank all movies by similarity to the query movie, most similar first
distance = sorted(enumerate(similarity[index]), reverse=True, key=lambda pair: pair[1])
# Position 0 is the query movie itself, so print the next five
for i in distance[1:6]:
    print(new_data.iloc[i[0]].title)
def recommend(movie):
    # Row of the requested title
    index = new_data[new_data['title'] == movie].index[0]
    distance = sorted(enumerate(similarity[index]), reverse=True, key=lambda pair: pair[1])
    for i in distance[1:6]:  # top 5 movies, skipping the query movie itself
        print(new_data.iloc[i[0]].title)

import pickle
# Save the movie list and the similarity matrix for the Streamlit app
pickle.dump(new_data, open('movies_list.pkl', 'wb'))
pickle.dump(similarity, open('similarity.pkl', 'wb'))
pickle.load(open('movies_list.pkl', 'rb'))
import streamlit as st
import pickle
import requests
def fetch_poster(movie_id):
    # Query the TMDB API for the movie details and build the poster URL
    url = "https://api.themoviedb.org/3/movie/{}?api_key=43c2c7148a22f65595a5dcc10a9d6c8b".format(movie_id)
    data = requests.get(url)
    data = data.json()
    poster_path = data['poster_path']
    full_path = "https://image.tmdb.org/t/p/w500/" + poster_path
    return full_path
movies = pickle.load(open("movies_list.pkl", 'rb'))
similarity = pickle.load(open("similarity.pkl", 'rb'))
movies_list=movies['title'].values
st.header("Movie Recommender System")
Snapshots
VISUALIZATION OF RESULTS & MODEL INSIGHTS
import streamlit.components.v1 as components

imageCarouselComponent = components.declare_component("image-carousel-component", path="frontend/public")
# imageCarouselComponent(imageUrls=imageUrls, height=200)
selectvalue = st.selectbox("Select movie from dropdown", movies_list)

def recommend(movie):
    # Row of the selected title in the movie list
    index = movies[movies['title'] == movie].index[0]
    # Rank all movies by similarity to the selected one, most similar first
    distance = sorted(enumerate(similarity[index]), reverse=True, key=lambda pair: pair[1])
    recommend_movie = []
    recommend_poster = []
    # Skip position 0 (the selected movie itself) and keep the next five
    for i in distance[1:6]:
        movies_id = movies.iloc[i[0]].id
        recommend_movie.append(movies.iloc[i[0]].title)
        recommend_poster.append(fetch_poster(movies_id))
    return recommend_movie, recommend_poster

if st.button("Show Recommend"):
    movie_name, movie_poster = recommend(selectvalue)
    # One column per recommendation: title above poster
    for col, name, poster in zip(st.columns(5), movie_name, movie_poster):
        with col:
            st.text(name)
            st.image(poster)
MODEL BUILDING
Item Vector:
A vector whose length is the total number of genres, with 1
at the indices of the movie's genres
User Vector:
A vector whose length is the total number of genres, holding
the average rating for each genre based on the user's ratings
in the train set
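The two vectors can be built as in the sketch below. The genre list, the movie data and the choice of 0.0 for genres a user never rated are assumptions made for illustration.

```python
GENRES = ['Action', 'Animation', 'Comedy', 'Drama']  # hypothetical genre list

def item_vector(movie_genres):
    """Binary vector of length len(GENRES): 1 where the movie has the genre."""
    return [1 if g in movie_genres else 0 for g in GENRES]

def user_vector(user_ratings, movie_genres):
    """Average rating per genre from one user's training ratings.
    Genres the user never rated get 0.0 (an assumption of this sketch)."""
    totals = [0.0] * len(GENRES)
    counts = [0] * len(GENRES)
    for movie_id, rating in user_ratings.items():
        for i, flag in enumerate(item_vector(movie_genres[movie_id])):
            if flag:
                totals[i] += rating
                counts[i] += 1
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]

movie_genres = {1: {'Action', 'Comedy'}, 2: {'Drama'}, 3: {'Action'}}
train_ratings = {1: 4.0, 3: 5.0, 2: 2.0}  # movie_id -> rating for one user

profile = user_vector(train_ratings, movie_genres)
print(profile)  # [4.5, 0.0, 4.0, 2.0]
```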
Evaluation metrics
Metric       Content based (Genre)
Precision    0.800932214
Recall       0.495168862
F-Measure    0.6119842046
NDCG         0.945576877

Metric       Content based (Genre)
RMSE         0.9185
MAE          0.7095
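For reference, the error metrics and the F-measure can be computed as in this short sketch. The rating lists are toy values; only the precision and recall passed to `f_measure` come from the table above.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

def mae(predicted, actual):
    """Mean absolute error between predicted and actual ratings."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

predicted = [3.5, 4.0, 2.0, 5.0]   # toy values
actual = [4.0, 4.0, 3.0, 4.0]
print(rmse(predicted, actual))     # 0.75
print(mae(predicted, actual))      # 0.625
print(round(f_measure(0.800932214, 0.495168862), 4))  # 0.612
```

The last line reproduces the reported F-Measure from the reported precision and recall.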
Movie-Movie Similarity
● TF-IDF using the overview and tagline of movies (from TMDb)
● Issue: This only finds movies that have similar descriptions.
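A minimal pure-Python sketch of TF-IDF similarity between descriptions is given below. It uses one common weighting variant (term count times log(N/df) + 1) and whitespace tokenization, both chosen for illustration; the overviews are toy sentences, and the project's actual vectorizer settings are not stated in the slides.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return one {term: weight} dict per document, with
    weight = term count * (log(N / document frequency) + 1)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({t: c * (math.log(n_docs / doc_freq[t]) + 1.0)
                        for t, c in counts.items()})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

overviews = [  # toy descriptions
    "the world comes to an end after an apocalyptic event",
    "a painter and an actor spend their last day before the world comes to an end",
    "a toy cowboy and a space ranger become friends",
]
vecs = tfidf_vectors(overviews)
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))  # True
```

The first two descriptions share several words, so they score higher than the unrelated third one, which is exactly the "similar description" behaviour noted as an issue.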
Movie-Movie Similarity (Cont.)
Overview of ‘Doctor Who: Last Christmas’
'The Doctor and Clara face their Last Christmas.
Trapped on an Arctic base, under attack from
terrifying creatures, who are you going to call?
Santa Claus!'
Improvement
● Adding the genre two times to give it more weight
● Changing TF-IDF to Count Vector
○ TF-IDF gives lesser weight to terms that occur frequently across
documents
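The effect of repeating the genre tokens can be seen in a small count-vector sketch. The function names, the `genre_weight` parameter and the two toy movies are illustrative assumptions.

```python
import math
from collections import Counter

def count_vector(overview, genres, genre_weight=2):
    """Bag-of-words counts for the overview, with every genre token
    repeated genre_weight times."""
    tokens = overview.lower().split() + [g.lower() for g in genres] * genre_weight
    return Counter(tokens)

def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(c * b[t] for t, c in a.items())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0

genres = ['Drama', 'Fantasy', 'Sci-Fi']
overview_a = "survivors wander a ruined land"        # toy overview
overview_b = "two friends share one final evening"   # toy overview

once = cosine(count_vector(overview_a, genres, 1), count_vector(overview_b, genres, 1))
twice = cosine(count_vector(overview_a, genres, 2), count_vector(overview_b, genres, 2))
print(twice > once)  # True
```

The two overviews share no words, so all of the similarity comes from the shared genres, and doubling the genre tokens raises it.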
Movie 1: '20 Years After'
Overview:
'In the middle of nowhere, 20 years after an apocalyptic
terrorist event that obliterated the face of the world!'
Genre: ['Drama', 'Fantasy', 'Sci-Fi']
Movie 2: '4:44 Last Day on Earth'
Overview:
'A look at how a painter and a successful actor spend their
last day together before the world comes to an end.'
Genre: ['Drama', 'Fantasy', 'Sci-Fi']
Query movie: 'Doctor Who: Last Christmas'
Overview:
'The Doctor and Clara face their Last Christmas. Trapped on an
Arctic base, under attack from terrifying creatures, who are
you going to call? Santa Claus!'
Genre: ['Adventure', 'Drama', 'Fantasy', 'Sci-Fi']
MODEL EVALUATION
● KNN (k-nearest neighbors) algorithm using the Surprise library
● Variations of KNN based approaches:
○ KNNBasic
○ KNNWithMeans
○ KNNWithZScore
○ KNNBaseline : integrates the baseline estimate ratings
● Similarity metrics:
○ Cosine similarity
○ Mean squared difference (MSD) based similarity
○ Pearson coefficient (mean-centered cosine similarity)
○ Pearson Baseline (uses global baselines for centering instead of means)
User-User and Item-Item comparison
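To make the neighbourhood idea concrete, here is a small user-user sketch in plain Python: Pearson similarity over co-rated movies, then a mean-centered prediction from the most similar users. It illustrates the idea behind a "KNN with means" approach and is not the Surprise implementation; the data and function names are hypothetical.

```python
import math

def pearson(ratings_a, ratings_b):
    """Pearson correlation over the movies both users rated (0.0 if undefined)."""
    common = set(ratings_a) & set(ratings_b)
    if len(common) < 2:
        return 0.0
    mean_a = sum(ratings_a[m] for m in common) / len(common)
    mean_b = sum(ratings_b[m] for m in common) / len(common)
    num = sum((ratings_a[m] - mean_a) * (ratings_b[m] - mean_b) for m in common)
    den = math.sqrt(sum((ratings_a[m] - mean_a) ** 2 for m in common)
                    * sum((ratings_b[m] - mean_b) ** 2 for m in common))
    return num / den if den else 0.0

def predict_knn_with_means(data, user, movie, k=2):
    """Predict the user's rating as the user's mean plus the similarity-weighted,
    mean-centered ratings of the k most similar users who rated the movie."""
    user_mean = sum(data[user].values()) / len(data[user])
    neighbours = []
    for other, ratings in data.items():
        if other != user and movie in ratings:
            sim = pearson(data[user], ratings)
            if sim > 0:  # keep positively correlated neighbours only
                other_mean = sum(ratings.values()) / len(ratings)
                neighbours.append((sim, ratings[movie] - other_mean))
    top = sorted(neighbours, reverse=True)[:k]
    if not top:
        return user_mean  # no usable neighbour: fall back to the user's mean
    return user_mean + sum(s * d for s, d in top) / sum(s for s, _ in top)

data = {  # user -> {movie: rating}, toy values
    'u1': {'A': 5, 'B': 3, 'C': 4},
    'u2': {'A': 4, 'B': 2, 'C': 3, 'D': 4},
    'u3': {'A': 1, 'B': 5, 'D': 2},
}
print(predict_knn_with_means(data, 'u1', 'D'))  # 4.75
```

An item-item variant follows the same pattern with the roles of users and movies swapped.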
Latent Factor Methods
● Matrix Factorisation algorithms using the Surprise library
○ SVD : baseline estimates + latent factor predictions
○ SVDpp : SVD that also considers implicit ratings
● Hyperparameter tuning using GridSearchCV
○ Number of epochs, number of factors, regularization parameter
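The "baseline estimates + latent factor predictions" form can be sketched as a biased matrix factorization trained with stochastic gradient descent. This is a simplified illustration under assumed defaults, not the Surprise code; it exposes the three hyperparameters named above (`n_epochs`, `n_factors`, `reg`) plus an assumed learning rate `lr`.

```python
import random

def train_svd(ratings, n_factors=2, n_epochs=50, lr=0.01, reg=0.02, seed=0):
    """Fit r_ui ~ mu + b_u + b_i + p_u . q_i on (user, item, rating) tuples
    and return a predict(user, item) function."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)   # global mean
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    b_u = {u: 0.0 for u in users}                       # user biases
    b_i = {i: 0.0 for i in items}                       # item biases
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    for _ in range(n_epochs):
        for u, i, r in ratings:
            pred = mu + b_u[u] + b_i[i] + sum(pf * qf for pf, qf in zip(p[u], q[i]))
            err = r - pred
            # Gradient steps with L2 regularization
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
            for f in range(n_factors):
                pf, qf = p[u][f], q[i][f]
                p[u][f] += lr * (err * qf - reg * pf)
                q[i][f] += lr * (err * pf - reg * qf)

    zeros = [0.0] * n_factors

    def predict(u, i):
        # Unknown users or items fall back to the global mean plus known biases
        return (mu + b_u.get(u, 0.0) + b_i.get(i, 0.0)
                + sum(pf * qf for pf, qf in zip(p.get(u, zeros), q.get(i, zeros))))
    return predict

ratings = [(1, 'A', 5), (1, 'B', 3), (2, 'A', 4), (2, 'B', 2),
           (2, 'C', 5), (3, 'B', 1), (3, 'C', 4)]  # toy data
predict = train_svd(ratings, n_epochs=200)
print(round(predict(1, 'C'), 2))
```

A grid search would simply call `train_svd` for each combination of the three hyperparameters and keep the one with the lowest validation error.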
Evaluation of various algorithms:
Precision and Recall @ 5
Relevant : rating >=3.75
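Precision and recall at 5 with the 3.75 relevance threshold could be computed per user as sketched below. This follows one common definition (an item counts as recommended only if it is in the top k and its estimate also reaches the threshold); the slides do not state which variant was used, and the sample predictions are toy values.

```python
def precision_recall_at_k(predictions, k=5, threshold=3.75):
    """predictions: list of (estimated_rating, true_rating) pairs for one user.
    Relevant: true rating >= threshold.
    Recommended: among the k highest estimates with estimate >= threshold."""
    ranked = sorted(predictions, key=lambda pair: pair[0], reverse=True)
    n_relevant = sum(true >= threshold for _, true in ranked)
    top_k = ranked[:k]
    n_recommended = sum(est >= threshold for est, _ in top_k)
    n_hits = sum(est >= threshold and true >= threshold for est, true in top_k)
    precision = n_hits / n_recommended if n_recommended else 0.0
    recall = n_hits / n_relevant if n_relevant else 0.0
    return precision, recall

user_predictions = [(4.8, 5.0), (4.5, 3.0), (4.2, 4.0),
                    (4.0, 4.5), (3.9, 2.0), (3.5, 5.0)]  # toy values
print(precision_recall_at_k(user_predictions))  # (0.6, 0.75)
```

The overall score is then the average of these per-user values.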
Evaluation of different algorithms
NDCG scores for different algorithms
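NDCG compares the ranking induced by the estimated ratings against the ideal ranking by true ratings. The sketch below uses the true rating directly as the gain and a log2 discount; the gain function and cut-off used in the project are not stated, so both are assumptions.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance scores."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg_at_k(predictions, k=5):
    """predictions: list of (estimated_rating, true_rating) pairs for one user.
    True ratings act as graded relevance."""
    ranked = sorted(predictions, key=lambda pair: pair[0], reverse=True)
    by_estimate = [true for _, true in ranked][:k]
    ideal = sorted((true for _, true in predictions), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(by_estimate) / ideal_dcg if ideal_dcg else 0.0

perfect = [(5.0, 5.0), (4.0, 4.0), (3.0, 3.0)]     # toy values
reversed_order = [(1.0, 5.0), (2.0, 4.0), (3.0, 3.0)]
print(ndcg_at_k(perfect))                            # 1.0
print(ndcg_at_k(reversed_order) < 1.0)               # True
```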
Which model is best when there are few ratings in the training
data?
(Fewer than 18 ratings per user)
Combined Model
● Matrix Factorization + CF
● Weighted linear combination of prediction ratings
● Combined:
○ KNNBaseline (with pearson baseline similarity)
○ SVDpp
○ SVD
○ BaselineOnly
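A weighted linear combination of the four models' predicted ratings can be written as follows. The weights and predictions are hypothetical placeholders, and dividing by the weight total is an assumption that keeps the result on the rating scale.

```python
def combine_predictions(predictions, weights):
    """Weighted linear combination of per-model rating predictions.
    predictions and weights are dicts keyed by model name."""
    total = sum(weights[name] for name in predictions)
    return sum(weights[name] * predictions[name] for name in predictions) / total

# Hypothetical predictions for one (user, movie) pair and hypothetical weights
model_predictions = {'KNNBaseline': 4.2, 'SVDpp': 4.0, 'SVD': 3.8, 'BaselineOnly': 3.6}
model_weights = {'KNNBaseline': 0.4, 'SVDpp': 0.3, 'SVD': 0.2, 'BaselineOnly': 0.1}

print(round(combine_predictions(model_predictions, model_weights), 2))  # 4.0
```

In practice the weights would be chosen on a validation set, for example by minimizing RMSE.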
SOURCE CODE
● User Id = 1
● User top genre list from User vector:
○ [‘Film-Noir’, ‘Animation’, ‘Musical’]:
FUTURE SCOPE
• Provides relevant content to the user.
• It saves time and money.
• It increases customer engagement.
• Specially designed for binge watchers.
TEAM MEMBERS AND CONTRIBUTIONS
BOOPATHI K : PROBLEM STATEMENT & ABSTRACT, OBJECTIVE,
FLOWCHART OF THE PROJECT WORKFLOW, DEPLOYMENT
SRIRAGAVI J : DATASET DESCRIPTION & PREPROCESSING, EDA, MODEL
BUILDING, SOURCE CODE
VENNILAVAN K : MODEL BUILDING & FUTURE SCOPE, SYSTEM REQUIREMENTS
