2. …but reddit is a popularity contest
Given that I’m interested in a post
1. What are similar posts I can read?
2. Who are the best people I can talk to?
Goal: find related DIY projects
3. Collecting / organizing data
• Scraped reddit for do-it-yourself projects
– Collected text content from each project,
i.e. title, externally linked blog post, comments, and general topic
4. Data conversion
• Combine all of the text to create one “document”
• Treat the document as a list of words
• Convert the list of words to a list of numbers
– Each number represents the “uniqueness” of a particular word to its document,
i.e. 0 means the word appears in every post;
1 means the word appears only in that post
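The conversion above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the talk: the function name `uniqueness_scores`, the whitespace tokenizer, the sample posts, and the normalization log(N / df) / log(N) are all assumptions, chosen only so that the score is 0 for a word found in every post and 1 for a word found in a single post.

```python
import math
from collections import Counter

def uniqueness_scores(documents):
    """Return one {word: score} dict per document, with scores in [0, 1].

    Hypothetical helper: the score is a normalized inverse document
    frequency, not necessarily the weighting used in the original project.
    """
    if len(documents) < 2:
        raise ValueError("need at least two documents to compare")
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each word.
    doc_freq = Counter(word for words in tokenized for word in set(words))
    scores = []
    for words in tokenized:
        scores.append({
            # log(N / df) / log(N) is 0 when df == N and 1 when df == 1.
            word: math.log(n_docs / doc_freq[word]) / math.log(n_docs)
            for word in set(words)
        })
    return scores

# Made-up posts, for illustration only.
posts = [
    "build a wooden desk",
    "build a concrete planter",
    "paint a wooden fence",
]
scores = uniqueness_scores(posts)
```

Here "a" appears in every post and scores 0, while "desk" appears in only one post and scores 1; words shared by some posts fall in between.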
13. About me (Kristofor Nyquist)
PhD Biophysics, UC Berkeley
BS Physics, WSU
Hobbies
14. Algorithm
Post similarity / Classification
• Turn the text into a list of numbers using term frequency-inverse document frequency (TF-IDF)
• “Compress” the data for speed
– ~70,000 dimensions to 80 dimensions
For similarity:
• Calculate cosine-similarity between documents
• Present user with 5 most similar posts
For classification:
• Logistic regression (L1 regularization)
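The similarity step can be sketched as follows. This is a hedged illustration only: the helper names `cosine_similarity` and `most_similar`, the post ids, and the weights are made up, and the sketch compares small sparse vectors directly rather than the 80-dimensional compressed vectors described above. The classification step (logistic regression) is not shown.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

def most_similar(query_id, vectors, k=5):
    """Return the ids of the k posts most similar to query_id, best first."""
    query = vectors[query_id]
    ranked = sorted(
        ((cosine_similarity(query, vec), post_id)
         for post_id, vec in vectors.items() if post_id != query_id),
        reverse=True,
    )
    return [post_id for _, post_id in ranked[:k]]

# Made-up term weights, for illustration only.
vectors = {
    "desk": {"wood": 0.9, "saw": 0.4},
    "shelf": {"wood": 0.8, "saw": 0.5},
    "planter": {"concrete": 0.9, "mold": 0.3},
    "fence": {"wood": 0.6, "paint": 0.7},
}
top = most_similar("desk", vectors, k=2)
```

With these weights the "desk" post ranks "shelf" first and "fence" second, because they share the most weighted terms with it; "planter" shares none and scores 0.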
15. Settling on 80 PCs
The choice of 80 principal components is somewhat arbitrary,
BUT the overall accuracy of the classifier has definitely converged by that point…
even though 80 PCs capture only ~30% of the variance
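For reference, "variance captured" is usually measured as the cumulative explained variance ratio. The notation below is an assumption, not taken from the slides: \lambda_i are the eigenvalues of the data covariance matrix sorted in decreasing order, d is the original dimensionality, and k is the number of components kept.

```latex
\mathrm{EVR}(k) = \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{d} \lambda_i},
\qquad
\mathrm{EVR}(80) \approx 0.30 \quad \text{with } d \approx 70{,}000
```

Read this way, the slide says that classifier accuracy levels off well before most of the variance is retained.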