This document discusses using Reddit comment data to cluster subreddits and authors by shared interests over time. The size of the dataset makes clustering difficult, and the document proposes three remedies: filtering for active users and subreddits, principal component analysis (PCA) to reduce dimensionality, and randomized PCA to speed up the decomposition. Silhouette analysis is used to determine the optimal number of clusters. The goal is to analyze, through subreddit clustering, how the personalization of Reddit has changed over time.
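
To make the filtering and silhouette steps concrete, here is a minimal sketch on toy data using only the Python standard library. The comment records, the two activity thresholds, and the small k-means helper are all illustrative assumptions and are not taken from the original analysis. The PCA step is omitted for brevity; on real data it would sit between building the vectors and clustering them.

```python
from collections import Counter, defaultdict
from math import dist

# Hypothetical toy data: one (author, subreddit) pair per comment.
comments = [
    ("a1", "gaming"), ("a1", "gaming"), ("a1", "pcgaming"),
    ("a2", "gaming"), ("a2", "pcgaming"),
    ("a3", "gaming"), ("a3", "pcgaming"), ("a3", "pcgaming"),
    ("b1", "cooking"), ("b1", "baking"),
    ("b2", "cooking"), ("b2", "cooking"), ("b2", "baking"),
    ("b3", "cooking"), ("b3", "baking"),
    ("x", "rare"),  # a drive-by author who should be filtered out
]
MIN_COMMENTS = 2  # assumed activity threshold for authors
MIN_AUTHORS = 2   # assumed activity threshold for subreddits

# Step 1: keep active authors, then subreddits with enough of them.
per_author = Counter(author for author, _ in comments)
active = [(a, s) for a, s in comments if per_author[a] >= MIN_COMMENTS]
authors_in = defaultdict(set)
for a, s in active:
    authors_in[s].add(a)
subreddits = sorted(s for s, who in authors_in.items() if len(who) >= MIN_AUTHORS)
authors = sorted({a for a, s in active if s in subreddits})

# Step 2: one vector per subreddit, counting comments by each author.
counts = Counter(active)
vectors = [tuple(counts[(a, s)] for a in authors) for s in subreddits]


def kmeans(points, k):
    """Small deterministic k-means with farthest-first initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    labels = None
    while True:
        new = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        if new == labels:
            return labels
        labels = new
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centre if a cluster empties
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))


def mean_silhouette(points, labels):
    """Mean silhouette coefficient; singleton clusters score 0 by convention."""
    scores = []
    for i, p in enumerate(points):
        by_cluster = defaultdict(list)
        for j, q in enumerate(points):
            if j != i:
                by_cluster[labels[j]].append(dist(p, q))
        own = by_cluster.pop(labels[i], [])
        if not own:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance within the point's own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other cluster
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Step 3: try each candidate k and keep the one with the best silhouette.
results = {k: kmeans(vectors, k) for k in range(2, len(vectors))}
scores = {k: mean_silhouette(vectors, lab) for k, lab in results.items()}
best_k = max(scores, key=scores.get)
print(subreddits, scores, best_k)
```

On this toy data the highest mean silhouette is at k = 2, which groups the two gaming subreddits together and the two cooking subreddits together. At real scale, library implementations would be the likelier choice, for example scikit-learn's `PCA(svd_solver="randomized")` and `silhouette_score`.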