SlideShare a Scribd company logo
SPRINGBOARD CAPSTONE PROJECT TWO
WHAT'S FOR FREE ON CRAIGSLIST?
Author: Josh Mayer
Date: November 15, 2017
OUTLINE
• Introduction & Problem Statement
• The Dataset
• Analysis & Findings
• Statistical Inference
• Machine Learning
• Conclusions
11/17/17 2
INTRODUCTION & PROBLEM STATEMENT
Craigslist was founded in 1995 and has become one of the most powerful sources for classified
advertisements found on the web today.
11/17/17 3
Source:	https://denver.craigslist.org/
THE PROBLEM
Oftentimes productsare
misclassified or the category
itself is too broad to be useful.
PRIMARY OBJECTIVE
The goal is to explore the
“For Sale/Free” category,
leverage unsupervised
learning to cluster and
categorize products listed.
DATA ACQUISITION & WRANGLING
11/17/17 4
Data was scraped fromthe Denver, Boulder, Colorado Springs, and Fort Collins (Greater Denver
Metro) Craigslist sitesvia the Python package “Scrapy”.
DATA WRANGLING
• Duplicate Records – 133 records (3% of the acquired records) were dropped since they contained duplicate title
AND timestamp. New timestamps are created if listings are “reposted”.
• Non-letter Removal – remove numbers, punctuation, emojis, etc. in order to have only letters.
• Lowercase Conversion – all letters were converted to be lowercase for comparison.
2) Grab the title for each
listing on the page.
3) Parse the pages until
all listings have been
scraped.
FINAL DATASET
3,927 Listings
1) Initiate a “Scrapy”
spider/crawler to acquire
data based on HTML tag.
EXPLORATORY DATA ANALYSIS
11/17/17 5
The dataset acquired was highly skewedtowards infrequently found words.
DATASET CHALLENGES
• 2,585 unique words were found in 3,927
unique listings.
• The average word is found 4.51 times
with a min count of 1 and max count of
274 counts.
• 57% of the words are found only one time
throughout the many listings.
• 82% of the words are found fewer than 5
times.
FREQUENCY VS UNIQUE WORD COUNT
Charts	above:	frequency	 of	all	words	(left)	along	side	the	frequency	of	words	found	fewer	than	5	
times	(right).Frequency
FURTHER ANALYSIS
11/17/17 6
Analyzing the most (and least) frequent words begins to show some of the potential categories.
MOST FREQUENT VS INFREQUENT WORDS
COUNT TOP	10	WORDS
274 wood
267 couch
160 tv
143 boxes
132 chair
125 dirt
121 pallets
121 curb
115 moving
68 firewood
COUNT BOTTOM	10	WORDS
1 ability
1 newspaper
1 newer
1 neutral
1 network
1 net
1 nepro
1 needing
1 necklaces
1 mtb
Charts	above:	the	most	frequent	word	found	in	the	listings	was	“wood”	at	274	(left	
chart).	There	are	many	 words	with	a	count	of	w	(right	chart).
We can already start to see some
of the potential categories based
on word frequency alone (e.g.
“wood” and ”firewood” are
amongst the top ten words found).
MACHINE LEARNING
11/17/17 7
The primary goal of the machine learning (ML) section is to identify the best algorithm to properly
classify the “free stuff” listings on Craigslist.
MACHINE LEARNING APPROACH
1) Text analysis was performed with the term frequency-
inverse document frequency (tf-idf) algorithm and the
vectorizers were then transformed to the tf-idf matrix.
2) Parameter tuning included setting min/max term
frequency, ”N grams”, stop words, and logarithmic
data transformations.
3) Perform clustering with the ”K-Means” algorithm.
4) Understand model results and identify the best model
to name the clusters.
FOCUS ON UNSUPERVISED LEARNING
This is an unsupervised ML exercise and focused on
testing and understanding key parameters as relates
to the exploratory data analysis findings.
MACHINE LEARNING – TEXT ANALYSIS
11/17/17 8
Text analysis was performed with the term frequency-inverse document frequency (tf-idf) algorithm and
the vectorizers were then transformed to the tf-idf matrix.
STEPS TO CREATE THE TF-IDF MATRIX
1. Count word (term) occurrences by document.
2. Transform into a document-term matrix.
3. Apply the term frequency-inverse document
frequency weighting.
4. Perform sensitivity analysis on frequency weighting
as well as some of the other key parameters.
VISUAL EXAMPLE TF-IDF MATRIX
MACHINE LEARNING – PARAMETER TUNING
11/17/17 9
Sensitivity analysis was performed on key model parameters to determine the impact.
KEY MODEL PARAMATERS
• Min/Max Frequency – this is the frequency within the listings a given term can have to be used in the tf-idf matrix.
E.g. If the term is in greater than X% of the listings it probably carries little meaning.
• N grams: accounts for term relativity. As an example, setting this value equal to “2” would mean that I consider
unigrams and bigrams, e.g. “leather” and “leather couch” which are both important to me.
• Stop Words: some words are too common (e.g. the, and) to be considered (stop words) for analysis and need to
be removed. The only customization made for this model was to include the word “free” in the stop words list as it
was found in nearly all listings.
• Logarithmic Transformation: setting this value to “true” provided a log10 transformation of the data, which in this
case was sorely needed given the dataset was skewed towards infrequent words.
MACHINE LEARNING – K-MEANS CLUSTERING
11/17/17 10
The clustering algorithm of choice for this model was “K-means” which required a pre-determined
number of clusters for initialization.
K	Value Sum-of-Squares	Score Silhouette	Score
2 -2,273 0.134
3 -2,201 0.145
4 -2,128 0.163
5 -2,081 0.181
6 -1,981 0.202
7 -1,994 0.191
8 -1,893 0.223
9 -1,861 0.233
COMPARING VISUAL RESULTS OF K CLUSTERS
Charts	above:	at	k=6	(left)	Interesting	to	see	the	supposed	cluster	on	the	right-hand	side	of	the	
chart	with	multiple	colors	of	dots	next	to	each	other.	At	k=3	(right)	the	clusters	visually	appear	
more	relevant,	however,	yellow	and	green	dots	are	still	very	close	together	on	the	right-hand	side	
of	the	page.		Looking	back	at	the	scores	for	k=3,	 the	SS	is	0.145	which	means	a	weak	structure.	
K	=	6	(clusters) K	=	3	(clusters)
ITERATING OVER POTENTIAL K VALUES
The silhouette score (SS) ranges from -1 (a poor
clustering) to +1 (a very dense clustering) with 0
denoting the situation where clusters overlap. SS
scores ranged from 0.134 to 0.233 depending
on the K value.
MACHINE LEARNING – MODEL RESULTS
11/17/17 11
Performing sensitivity analysis on key model parametersproduced a model with k=3 clusters and the
following results.
Cluster	1
“Moving	Items”
Cluster	2
“Scrap	Materials”
Cluster	3
“Furniture”
Wood Firewood Couch
Tv Wood Leather
Pallets Pallets Leather	couch
Chair Today Loveseat
Boxes Come Recliner
Desk Scrap Chair
Dirt Yard Reclining
Stuff Small Sofa
Table Mulch Brown
Moving Pallet Bed
NAMING THE CLUSTERSA few key observations:
• As expected, some of the words are present across
clusters. This was to be expected given the low
silhouette scores and the visual of the many different
colored dots close together.
• Only one bigram is present, meaning that the non-
unigram words were not as important as initially
thought.
• An attempt at naming the clusters might be:
• Cluster 1 would be named “Moving Items”.
• Cluster 2 would be named “Scrap Materials”.
• Cluster 3 would be named “Furniture”. Chart	above:	 top	10	terms	in	each	cluster	of	the	model	results	of	k=3.
RESULTS SUMMARY
11/17/17 12
KEY FINDINGS
• The data set acquired for the Greater Denver Metro “Free Stuff” analysis was heavily skewed by infrequent
terms. Even a log 10 transformation was not able to properly redistribute or normalize the data.
• This does not mean it is “bad data”, only that the majority of the postings in the “free stuff” section represent
very few terms.
• With a Silhouette Score of 0.14 on the training set and 0.15 on the test set, the K-means clustering model
provided weak (at best) structure for the designated clusters.
• In this analysis of the “Greater Denver Metro” Craigslist sites, there was an abundance of items in the
Craigslist “free stuff” that was classified into three categories:
• Category 1 is “Moving Items”.
• Category 2 is “Scrap Materials”.
• Category 3 is “Furniture”.
RECOMMENDATIONS & FUTURE IMPROVEMENTS
11/17/17 13
Key results show that the Craigslist team could implement a clustering model to help users filter and find
the products.
RECOMMENDATIONS & FUTURE IMPROVEMENTS
• Craigslist site administrators should create and enable “tags” or “categories” to assist in filtering the “free
stuff” section.
• Perform unsupervised learning on the broader, national data set of all Craigslist site for better
understanding of what is offered for free across the country.
• There could be a similar analysis done regarding miscategorized or mislabeled listings where the goal
would be to find the outlying data points to understand their value.

More Related Content

Similar to What's "For Free" on Craigslist?

Genetic Algorithms
Genetic AlgorithmsGenetic Algorithms
Genetic Algorithms
Karthik Sankar
 
Data Science Academy Student Demo day--Michael blecher,the importance of clea...
Data Science Academy Student Demo day--Michael blecher,the importance of clea...Data Science Academy Student Demo day--Michael blecher,the importance of clea...
Data Science Academy Student Demo day--Michael blecher,the importance of clea...
Vivian S. Zhang
 
Semantic Search and Result Presentation with Entity Cards
Semantic Search and Result Presentation with Entity CardsSemantic Search and Result Presentation with Entity Cards
Semantic Search and Result Presentation with Entity Cards
Faegheh Hasibi
 
How Graph Databases used in Police Department?
How Graph Databases used in Police Department?How Graph Databases used in Police Department?
How Graph Databases used in Police Department?
Samet KILICTAS
 
Quicksort with Equal Keys
Quicksort with Equal KeysQuicksort with Equal Keys
Quicksort with Equal Keys
Sebastian Wild
 
Follow the money with graphs
Follow the money with graphsFollow the money with graphs
Follow the money with graphs
Stanka Dalekova
 
Cluster Analysis - Keyword Clustering
Cluster Analysis -  Keyword ClusteringCluster Analysis -  Keyword Clustering
Cluster Analysis - Keyword Clustering
Justine Jes Thomas
 
Text Mining
Text MiningText Mining
Text Mining
sathish sak
 
Elasticsearch - basics and beyond
Elasticsearch - basics and beyondElasticsearch - basics and beyond
Elasticsearch - basics and beyond
Ernesto Reig
 
Data Mining Email SPam Detection PPT WITH Algorithms
Data Mining Email SPam Detection PPT WITH AlgorithmsData Mining Email SPam Detection PPT WITH Algorithms
Data Mining Email SPam Detection PPT WITH Algorithms
deepika90811
 
SQL For Programmers -- Boston Big Data Techcon April 27th
SQL For Programmers -- Boston Big Data Techcon April 27thSQL For Programmers -- Boston Big Data Techcon April 27th
SQL For Programmers -- Boston Big Data Techcon April 27th
Dave Stokes
 
Explainable AI
Explainable AIExplainable AI
Explainable AI
Arithmer Inc.
 
2013 11-06 lsr-dublin_m_hausenblas_solr as recommendation engine
2013 11-06 lsr-dublin_m_hausenblas_solr as recommendation engine2013 11-06 lsr-dublin_m_hausenblas_solr as recommendation engine
2013 11-06 lsr-dublin_m_hausenblas_solr as recommendation engine
lucenerevolution
 
Data Mining: Data Preprocessing
Data Mining: Data PreprocessingData Mining: Data Preprocessing
Data Mining: Data Preprocessing
Lakshmi Sarvani Videla
 
How We Use Functional Programming to Find the Bad Guys
How We Use Functional Programming to Find the Bad GuysHow We Use Functional Programming to Find the Bad Guys
How We Use Functional Programming to Find the Bad Guys
New York City College of Technology Computer Systems Technology Colloquium
 
Combining machine learning and search through learning to rank
Combining machine learning and search through learning to rankCombining machine learning and search through learning to rank
Combining machine learning and search through learning to rank
Jettro Coenradie
 
Table Retrieval and Generation
Table Retrieval and GenerationTable Retrieval and Generation
Table Retrieval and Generation
krisztianbalog
 
Performance By Design
Performance By DesignPerformance By Design
Performance By Design
Guy Harrison
 
04-Data-Analysis-Overview.pptx
04-Data-Analysis-Overview.pptx04-Data-Analysis-Overview.pptx
04-Data-Analysis-Overview.pptx
Shree Shree
 
It's Not You. It's Your Data Model.
It's Not You. It's Your Data Model.It's Not You. It's Your Data Model.
It's Not You. It's Your Data Model.
Alex Powers
 

Similar to What's "For Free" on Craigslist? (20)

Genetic Algorithms
Genetic AlgorithmsGenetic Algorithms
Genetic Algorithms
 
Data Science Academy Student Demo day--Michael blecher,the importance of clea...
Data Science Academy Student Demo day--Michael blecher,the importance of clea...Data Science Academy Student Demo day--Michael blecher,the importance of clea...
Data Science Academy Student Demo day--Michael blecher,the importance of clea...
 
Semantic Search and Result Presentation with Entity Cards
Semantic Search and Result Presentation with Entity CardsSemantic Search and Result Presentation with Entity Cards
Semantic Search and Result Presentation with Entity Cards
 
How Graph Databases used in Police Department?
How Graph Databases used in Police Department?How Graph Databases used in Police Department?
How Graph Databases used in Police Department?
 
Quicksort with Equal Keys
Quicksort with Equal KeysQuicksort with Equal Keys
Quicksort with Equal Keys
 
Follow the money with graphs
Follow the money with graphsFollow the money with graphs
Follow the money with graphs
 
Cluster Analysis - Keyword Clustering
Cluster Analysis -  Keyword ClusteringCluster Analysis -  Keyword Clustering
Cluster Analysis - Keyword Clustering
 
Text Mining
Text MiningText Mining
Text Mining
 
Elasticsearch - basics and beyond
Elasticsearch - basics and beyondElasticsearch - basics and beyond
Elasticsearch - basics and beyond
 
Data Mining Email SPam Detection PPT WITH Algorithms
Data Mining Email SPam Detection PPT WITH AlgorithmsData Mining Email SPam Detection PPT WITH Algorithms
Data Mining Email SPam Detection PPT WITH Algorithms
 
SQL For Programmers -- Boston Big Data Techcon April 27th
SQL For Programmers -- Boston Big Data Techcon April 27thSQL For Programmers -- Boston Big Data Techcon April 27th
SQL For Programmers -- Boston Big Data Techcon April 27th
 
Explainable AI
Explainable AIExplainable AI
Explainable AI
 
2013 11-06 lsr-dublin_m_hausenblas_solr as recommendation engine
2013 11-06 lsr-dublin_m_hausenblas_solr as recommendation engine2013 11-06 lsr-dublin_m_hausenblas_solr as recommendation engine
2013 11-06 lsr-dublin_m_hausenblas_solr as recommendation engine
 
Data Mining: Data Preprocessing
Data Mining: Data PreprocessingData Mining: Data Preprocessing
Data Mining: Data Preprocessing
 
How We Use Functional Programming to Find the Bad Guys
How We Use Functional Programming to Find the Bad GuysHow We Use Functional Programming to Find the Bad Guys
How We Use Functional Programming to Find the Bad Guys
 
Combining machine learning and search through learning to rank
Combining machine learning and search through learning to rankCombining machine learning and search through learning to rank
Combining machine learning and search through learning to rank
 
Table Retrieval and Generation
Table Retrieval and GenerationTable Retrieval and Generation
Table Retrieval and Generation
 
Performance By Design
Performance By DesignPerformance By Design
Performance By Design
 
04-Data-Analysis-Overview.pptx
04-Data-Analysis-Overview.pptx04-Data-Analysis-Overview.pptx
04-Data-Analysis-Overview.pptx
 
It's Not You. It's Your Data Model.
It's Not You. It's Your Data Model.It's Not You. It's Your Data Model.
It's Not You. It's Your Data Model.
 

Recently uploaded

Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
AndrzejJarynowski
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
soxrziqu
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
rwarrenll
 
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
g4dpvqap0
 
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
zsjl4mimo
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
v7oacc3l
 
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdfUdemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Fernanda Palhano
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
Timothy Spann
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
Bill641377
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
manishkhaire30
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
javier ramirez
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
aqzctr7x
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
kuntobimo2016
 
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
74nqk8xf
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
nuttdpt
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
jerlynmaetalle
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
bopyb
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Aggregage
 

Recently uploaded (20)

Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
 
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
 
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
 
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdfUdemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
 
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
 

What's "For Free" on Craigslist?

  • 1. SPRINGBOARD CAPSTONE PROJECT TWO WHAT'S FOR FREE ON CRAIGSLIST? Author: Josh Mayer Date: November 15, 2017
  • 2. OUTLINE • Introduction & Problem Statement • The Dataset • Analysis & Findings • Statistical Inference • Machine Learning • Conclusions 11/17/17 2
  • 3. INTRODUCTION & PROBLEM STATEMENT Craigslist was founded in 1995 and has become one of the most powerful sources for classified advertisements found on the web today. 11/17/17 3 Source: https://denver.craigslist.org/ THE PROBLEM Oftentimes productsare misclassified or the category itself is too broad to be useful. PRIMARY OBJECTIVE The goal is to explore the “For Sale/Free” category, leverage unsupervised learning to cluster and categorize products listed.
  • 4. DATA ACQUISITION & WRANGLING 11/17/17 4 Data was scraped fromthe Denver, Boulder, Colorado Springs, and Fort Collins (Greater Denver Metro) Craigslist sitesvia the Python package “Scrapy”. DATA WRANGLING • Duplicate Records – 133 records (3% of the acquired records) were dropped since they contained duplicate title AND timestamp. New timestamps are created if listings are “reposted”. • Non-letter Removal – remove numbers, punctuation, emojis, etc. in order to have only letters. • Lowercase Conversion – all letters were converted to be lowercase for comparison. 2) Grab the title for each listing on the page. 3) Parse the pages until all listings have been scraped. FINAL DATASET 3,927 Listings 1) Initiate a “Scrapy” spider/crawler to acquire data based on HTML tag.
  • 5. EXPLORATORY DATA ANALYSIS 11/17/17 5 The dataset acquired was highly skewedtowards infrequently found words. DATASET CHALLENGES • 2,585 unique words were found in 3,927 unique listings. • The average word is found 4.51 times with a min count of 1 and max count of 274 counts. • 57% of the words are found only one time throughout the many listings. • 82% of the words are found fewer than 5 times. FREQUENCY VS UNIQUE WORD COUNT Charts above: frequency of all words (left) along side the frequency of words found fewer than 5 times (right).Frequency
  • 6. FURTHER ANALYSIS 11/17/17 6 Analyzing the most (and least) frequent words begins to show some of the potential categories. MOST FREQUENT VS INFREQUENT WORDS COUNT TOP 10 WORDS 274 wood 267 couch 160 tv 143 boxes 132 chair 125 dirt 121 pallets 121 curb 115 moving 68 firewood COUNT BOTTOM 10 WORDS 1 ability 1 newspaper 1 newer 1 neutral 1 network 1 net 1 nepro 1 needing 1 necklaces 1 mtb Charts above: the most frequent word found in the listings was “wood” at 274 (left chart). There are many words with a count of w (right chart). We can already start to see some of the potential categories based on word frequency alone (e.g. “wood” and ”firewood” are amongst the top ten words found).
  • 7. MACHINE LEARNING 11/17/17 7 The primary goal of the machine learning (ML) section is to identify the best algorithm to properly classify the “free stuff” listings on Craigslist. MACHINE LEARNING APPROACH 1) Text analysis was performed with the term frequency- inverse document frequency (tf-idf) algorithm and the vectorizers were then transformed to the tf-idf matrix. 2) Parameter tuning included setting min/max term frequency, ”N grams”, stop words, and logarithmic data transformations. 3) Perform clustering with the ”K-Means” algorithm. 4) Understand model results and identify the best model to name the clusters. FOCUS ON UNSUPERVISED LEARNING This is an unsupervised ML exercise and focused on testing and understanding key parameters as relates to the exploratory data analysis findings.
  • 8. MACHINE LEARNING – TEXT ANALYSIS 11/17/17 8 Text analysis was performed with the term frequency-inverse document frequency (tf-idf) algorithm and the vectorizers were then transformed to the tf-idf matrix. STEPS TO CREATE THE TF-IDF MATRIX 1. Count word (term) occurrences by document. 2. Transform into a document-term matrix. 3. Apply the term frequency-inverse document frequency weighting. 4. Perform sensitivity analysis on frequency weighting as well as some of the other key parameters. VISUAL EXAMPLE TF-IDF MATRIX
  • 9. MACHINE LEARNING – PARAMETER TUNING 11/17/17 9 Sensitivity analysis was performed on key model parameters to determine the impact. KEY MODEL PARAMATERS • Min/Max Frequency – this is the frequency within the listings a given term can have to be used in the tf-idf matrix. E.g. If the term is in greater than X% of the listings it probably carries little meaning. • N grams: accounts for term relativity. As an example, setting this value equal to “2” would mean that I consider unigrams and bigrams, e.g. “leather” and “leather couch” which are both important to me. • Stop Words: some words are too common (e.g. the, and) to be considered (stop words) for analysis and need to be removed. The only customization made for this model was to include the word “free” in the stop words list as it was found in nearly all listings. • Logarithmic Transformation: setting this value to “true” provided a log10 transformation of the data, which in this case was sorely needed given the dataset was skewed towards infrequent words.
  • 10. MACHINE LEARNING – K-MEANS CLUSTERING 11/17/17 10 The clustering algorithm of choice for this model was “K-means” which required a pre-determined number of clusters for initialization. K Value Sum-of-Squares Score Silhouette Score 2 -2,273 0.134 3 -2,201 0.145 4 -2,128 0.163 5 -2,081 0.181 6 -1,981 0.202 7 -1,994 0.191 8 -1,893 0.223 9 -1,861 0.233 COMPARING VISUAL RESULTS OF K CLUSTERS Charts above: at k=6 (left) Interesting to see the supposed cluster on the right-hand side of the chart with multiple colors of dots next to each other. At k=3 (right) the clusters visually appear more relevant, however, yellow and green dots are still very close together on the right-hand side of the page. Looking back at the scores for k=3, the SS is 0.145 which means a weak structure. K = 6 (clusters) K = 3 (clusters) ITERATING OVER POTENTIAL K VALUES The silhouette score (SS) ranges from -1 (a poor clustering) to +1 (a very dense clustering) with 0 denoting the situation where clusters overlap. SS scores ranged from 0.134 to 0.233 depending on the K value.
  • 11. MACHINE LEARNING – MODEL RESULTS 11/17/17 11 Performing sensitivity analysis on key model parametersproduced a model with k=3 clusters and the following results. Cluster 1 “Moving Items” Cluster 2 “Scrap Materials” Cluster 3 “Furniture” Wood Firewood Couch Tv Wood Leather Pallets Pallets Leather couch Chair Today Loveseat Boxes Come Recliner Desk Scrap Chair Dirt Yard Reclining Stuff Small Sofa Table Mulch Brown Moving Pallet Bed NAMING THE CLUSTERSA few key observations: • As expected, some of the words are present across clusters. This was to be expected given the low silhouette scores and the visual of the many different colored dots close together. • Only one bigram is present, meaning that the non- unigram words were not as important as initially thought. • An attempt at naming the clusters might be: • Cluster 1 would be named “Moving Items”. • Cluster 2 would be named “Scrap Materials”. • Cluster 3 would be named “Furniture”. Chart above: top 10 terms in each cluster of the model results of k=3.
  • 12. RESULTS SUMMARY 11/17/17 12 KEY FINDINGS • The data set acquired for the Greater Denver Metro “Free Stuff” analysis was heavily skewed by infrequent terms. Even a log 10 transformation was not able to properly redistribute or normalize the data. • This does not mean it is “bad data”, only that the majority of the postings in the “free stuff” section represent very few terms. • With a Silhouette Score of 0.14 on the training set and 0.15 on the test set, the K-means clustering model provided weak (at best) structure for the designated clusters. • In this analysis of the “Greater Denver Metro” Craigslist sites, there was an abundance of items in the Craigslist “free stuff” that was classified into three categories: • Category 1 is “Moving Items”. • Category 2 is “Scrap Materials”. • Category 3 is “Furniture”.
  • 13. RECOMMENDATIONS & FUTURE IMPROVEMENTS 11/17/17 13 Key results show that the Craigslist team could implement a clustering model to help users filter and find the products. RECOMMENDATIONS & FUTURE IMPROVEMENTS • Craigslist site administrators should create and enable “tags” or “categories” to assist in filtering the “free stuff” section. • Perform unsupervised learning on the broader, national data set of all Craigslist site for better understanding of what is offered for free across the country. • There could be a similar analysis done regarding miscategorized or mislabeled listings where the goal would be to find the outlying data points to understand their value.