Prep Fo/r/ DIY



Summary of my demo project for the Insight Data Science Fellows program. **Find relevant content to plan your next hobby or home improvement**


prep fo/r/diy
Find relevant content to plan your
next hobby or home-improvement
Kristofor Nyquist

…but reddit is a popularity contest
Given that I’m interested in a post
1. What are similar posts I can read?
2. Who are the best people I can talk to?
Goal: find related DIY projects

Collecting / organizing data
• Scraped reddit for do-it-yourself projects
– Collected text content from each project, i.e. title, externally linked blog post, comments, and general topic

Data conversion
• Combine all of the text to create one “document”
• Treat the document as a list of words
• Convert the list of words to a list of numbers
– Each number represents the “uniqueness” of a particular word to its document, e.g.
0 means the word appears in every post
1 means the word appears only in that post
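
The 0-to-1 “uniqueness” scale above can be sketched as a normalized inverse document frequency. This is only an illustration: the deck does not give the exact formula, so the normalization log(N/df)/log(N), the function name normalized_idf, and the sample posts are all assumptions.

```python
import math
from collections import Counter


def normalized_idf(documents):
    """Map each word to a 0..1 score: 0 = in every document, 1 = in only one."""
    n_docs = len(documents)
    doc_freq = Counter()
    for words in documents:
        # Count each word once per document, however often it repeats.
        doc_freq.update(set(words))
    if n_docs < 2:
        # With a single document the scale is undefined; treat every word as common.
        return {word: 0.0 for word in doc_freq}
    return {
        word: math.log(n_docs / df) / math.log(n_docs)
        for word, df in doc_freq.items()
    }


# Hypothetical posts, already split into words.
posts = [
    "build a cedar planter box".split(),
    "build a walnut cutting board".split(),
    "build a raspberry pi weather station".split(),
]
scores = normalized_idf(posts)
print(scores["build"], scores["cedar"])
```

Here “build” appears in all three posts and “cedar” in only one, so they land at opposite ends of the scale.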

Clustering
Text content is rich enough to cluster projects

Clustering
Classify projects with missing categories based on user inputs

Validating post similarity
Similar posts group well by general topic.
Deviations can make sense

Against random
Picking posts at random is clearly not suitable

About me (Kristofor Nyquist)
PhD Biophysics, UC Berkeley
BS Physics, WSU
Hobbies

Algorithm
Post similarity / Classification
• Turn the text into a list of numbers using term frequency-inverse document frequency (TF-IDF)
• “Compress” the data for speed
– ~70,000 dimensions to 80 dimensions
For similarity:
• Calculate cosine-similarity between documents
• Present user with 5 most similar posts
For classification:
• Logistic regression (L1 regularization)
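
The similarity step above can be sketched as follows. This is a minimal illustration, not the project's code: it skips the compression step, stores each post as a sparse dict of word weights, and the names cosine_similarity and most_similar plus the sample weights are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as {term: weight}."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty post is similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(query_id, vectors, k=5):
    """Return the ids of the k posts most similar to the query post."""
    query = vectors[query_id]
    scored = [
        (cosine_similarity(query, vec), post_id)
        for post_id, vec in vectors.items()
        if post_id != query_id
    ]
    scored.sort(reverse=True)
    return [post_id for _, post_id in scored[:k]]


# Hypothetical weighted posts; real vectors would come from the TF-IDF step.
vectors = {
    "oak desk": {"oak": 0.8, "desk": 0.6},
    "oak shelf": {"oak": 0.8, "shelf": 0.6},
    "led lamp": {"led": 0.7, "lamp": 0.7},
    "desk lamp": {"desk": 0.6, "lamp": 0.8},
}
print(most_similar("oak desk", vectors, k=2))
```

Cosine similarity ignores vector length, so a long post and a short post about the same topic can still score as close.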

Settling on 80 PCs
The choice of 80 principal components is somewhat arbitrary,
but the classifier's overall accuracy has converged by that point,
even though 80 PCs capture only ~30% of the variance
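
The share of variance captured by the first k components is the sum of the k largest component variances divided by the total. A minimal sketch of that arithmetic, using made-up variances rather than the project's values (the function name is hypothetical):

```python
def explained_variance_fraction(component_variances, k):
    """Fraction of total variance carried by the k largest components."""
    ordered = sorted(component_variances, reverse=True)
    total = sum(ordered)
    if total == 0:
        return 0.0
    return sum(ordered[:k]) / total


# Made-up variances: two strong components and a long flat tail.
variances = [4.0, 2.0, 1.0, 1.0, 1.0, 1.0]
print(explained_variance_fraction(variances, 2))
```

A low captured fraction does not by itself mean a poor classifier: the discarded components may carry variance that is irrelevant to the categories.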

Validation numbers
                 NLP algorithm   Random
accuracy         0.62            0.17
recall:
  auto           0.73            0.05
  electronic     0.6             0.00
  home improve   0.54            0.18
  metalwork      0.44            0.00
  other          0.52            0.08
  outdoor        0.89            0.12
  woodworking    0.68            0.32
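
For reference, accuracy and per-category recall of this kind can be computed from paired truth and prediction labels. A minimal sketch with made-up labels (not the project's data; the function name is hypothetical):

```python
from collections import Counter


def accuracy_and_recalls(truth, predicted):
    """Overall accuracy plus recall (correct / total) for each true category."""
    totals = Counter()
    correct = Counter()
    for true_label, predicted_label in zip(truth, predicted):
        totals[true_label] += 1
        if true_label == predicted_label:
            correct[true_label] += 1
    accuracy = sum(correct.values()) / len(truth)
    recalls = {label: correct[label] / count for label, count in totals.items()}
    return accuracy, recalls


# Made-up labels for six posts.
truth = ["auto", "auto", "outdoor", "woodworking", "woodworking", "woodworking"]
predicted = ["auto", "other", "outdoor", "woodworking", "woodworking", "auto"]
accuracy, recalls = accuracy_and_recalls(truth, predicted)
print(accuracy, recalls)
```

Recall is reported per category because overall accuracy can hide weak categories when the categories differ in size.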

Full BoW vs. 80 PCs
[Figure: two panels, “80 PCs” and “Full tfidf vector”]

Validating classifier
[Figure: column-normalized confusion matrix, Prediction vs. Truth, over the categories auto, metalwork, home, electronic, other, outdoor, woodwork]
[Figure: one-vs.-rest ROC curves for Woodworking, Other, Home improvement]
With this application, we can tolerate false positives. We can also present users with more limited options, maintaining higher accuracy
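
One possible reading of “more limited options” is to show the user the classifier's top few categories and count a hit when the true category is among them. The sketch below follows that reading only; the interpretation, the function name, and the scores are assumptions rather than anything stated in the deck.

```python
def top_k_hit_rate(truth, class_scores, k):
    """Share of posts whose true category is among the k highest-scoring ones."""
    hits = 0
    for true_label, scores in zip(truth, class_scores):
        # Rank category names by score, best first.
        ranked = sorted(scores, key=scores.get, reverse=True)
        if true_label in ranked[:k]:
            hits += 1
    return hits / len(truth)


# Made-up per-category scores for three posts.
truth = ["woodworking", "other", "home improve"]
class_scores = [
    {"woodworking": 0.6, "other": 0.3, "home improve": 0.1},
    {"woodworking": 0.5, "other": 0.4, "home improve": 0.1},
    {"woodworking": 0.2, "other": 0.5, "home improve": 0.3},
]
print(top_k_hit_rate(truth, class_scores, 1), top_k_hit_rate(truth, class_scores, 2))
```

Offering two options instead of one trades some precision for a higher chance that the right category is on screen, which fits an application that tolerates false positives.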


