The document summarizes research analyzing factors that influence the success of Kickstarter campaigns. It describes the dataset of over 190,000 Kickstarter projects and identifies variables like funding goal, number of backers, and project category as potential predictors of success. Statistical techniques like random forests and ridge regression are used to determine the most important predictors and predict the probability of success for ongoing campaigns. The random forests analysis finds that funding goal is the most important variable, while ridge regression results match this finding.
Deciding factors of Kickstarter campaigns’ success
Team 6 - Baing, Nguyen, Song, Tran, Wang
May 2, 2018
Business problems
Kickstarter is one of the most popular global crowdfunding platforms. To date, this Brooklyn-based corporation
has helped to launch 400,171 projects in music, film, technology, design, fashion, theater, dance,
photography, art, journalism, comics and publishing, with an equivalent pledged amount of 3.66 billion US
dollars. Out of over 400,000 launched projects, only 36% were successfully funded. There are
currently over 14 million active backers who pledge money to bring these projects to life. However,
Kickstarter’s funding model is all-or-nothing: if a campaign doesn’t reach its goal, the pledged
amount is returned to backers. Therefore, it is important for creators to know what enhances or reduces their
likelihood of success when they launch their projects on Kickstarter.
There are many factors that can affect the chance of a project getting funded. These factors could be variables
like the category of the project, how long the campaign lives on the site, the goal, the number of backers, and whether
or not it is a popular project featured on the discovery page. We are interested in finding which
factors are significant in determining whether or not a project is successfully funded. Among those
variables, we also aim to find which have the greatest effect on project success (success=1). Using a
fitted Logistic Model with the selected variables, we then predict the likelihood of success for over 3,000 ongoing
campaigns.
Our goals are to encourage entrepreneurship and help entrepreneurs develop their ideas into a real
business by increasing the chance that their projects are funded. By identifying the important factors
for a project to be successfully funded, entrepreneurs can better allocate their time and energy to
aspects that drive up the rate of being fully funded, and better manage the risks and challenges.
Spreading entrepreneurs’ ideas and improving the success rate creates a win-win situation for both backers and
entrepreneurs.
Dataset introduction
The data used here is from https://s3.amazonaws.com/weruns/forfun/Kickstarter/Kickstarter_2018-02-15T03_20_44_743Z.zip.
Web Robots is a data scraping website that has been
collecting Kickstarter campaign data from 2009 to 2018. There are a total of 48 files in the zip file.
Combining all files amounts to around 190,000 observations. Dates and times are stored as Unix
timestamps, and certain columns contain unfiltered data that needs to be cleaned. Not all variables
are useful as predictors, since some carry no information about the response variable; we remove those
columns.
The date and time variables are converted into a standard date and time format in R, while the campaign
categories are extracted from a long string of unfiltered data using R packages. There are a total of 128
categories in the dataset, which are grouped into 15 major categories. The “Campaign name” variable is converted
to a “Campaign Name length” variable to find out the effect of name length on a campaign’s success. The response
variable is categorical with two outcomes, which are converted to a 0/1 variable (boolop) for the VIF calculation.
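The two conversions described above can be sketched in base R as follows. This is a minimal illustration, not the authors' cleaning script: the two sample rows and the raw column names (name, launched_at, deadline) are assumptions made for the example, while date.difference and name.length follow the variable names used later in this report.

```r
# Hypothetical two-row sample; values and raw column names are assumptions.
raw <- data.frame(
  name        = c("Handmade walnut desk lamp", "Tiny Robots: The Card Game"),
  launched_at = c(1514764800, 1517443200),  # seconds since 1970-01-01 UTC
  deadline    = c(1517356800, 1520035200),
  stringsAsFactors = FALSE
)

# Convert Unix epoch seconds to date-times.
raw$launched_at <- as.POSIXct(raw$launched_at, origin = "1970-01-01", tz = "UTC")
raw$deadline    <- as.POSIXct(raw$deadline,    origin = "1970-01-01", tz = "UTC")

# Campaign duration in whole days, and the length of the campaign name.
days <- as.numeric(difftime(raw$deadline, raw$launched_at, units = "days"))
raw$date.difference <- as.integer(round(days))
raw$name.length     <- nchar(raw$name)

raw[, c("launched_at", "deadline", "date.difference", "name.length")]
```

Fixing the time zone to UTC keeps the day difference from shifting with the machine's local settings.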
Our output variable is binary: a campaign can either succeed or fail. The following are the predictors which we
think are important according to our business knowledge:
1) Number of backers - the number of investors that have backed the project
2) Pledged Amount - the amount of money backers have pledged to fund the project if the campaign goal is met (integer)
3) Goal - the amount of money that the creators need to complete their projects (double)
4) Campaign Category (15 Categories) - the major category that a project falls into
5) Difference between deadline and launch date - the duration of the campaign on the Kickstarter site (integer)
6) Campaign Name Length (integer)
7) Staff Pick - whether or not the project was a Kickstarter “staff pick” project and published on the “Projects We Love” discover page (T/F - Binary)
We assess the importance of these variables, mainly the campaign categories, and try to predict the
probability of success of a live campaign.
Methods/specifications
1. Multicollinearity Test
We first test our predictors for multicollinearity. According to the plot, there can be no multicollinearity
among our categories as they are mutually exclusive: under Kickstarter’s requirements, each project
can be listed in one category only, so a project listed in photography cannot be listed
in another category.
Let’s run an OLS model just to check multicollinearity.
lm.fit=lm(boolop ~ goal + date.difference + name.length + backers_count + converted_pledged_amount + staff_pick, data=ops)
summary(lm.fit)
##
## Call:
## lm(formula = boolop ~ goal + date.difference + name.length +
## backers_count + converted_pledged_amount + staff_pick, data = ops)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0483 -0.5006 0.1395 0.4446 0.9652
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.027e-01 3.968e-03 126.663 < 2e-16 ***
## goal -1.336e-08 9.891e-10 -13.504 < 2e-16 ***
## date.difference -2.796e-03 8.678e-05 -32.223 < 2e-16 ***
## name.length 3.562e-03 7.030e-05 50.666 < 2e-16 ***
## backers_count 2.371e-05 1.737e-06 13.649 < 2e-16 ***
## converted_pledged_amount 1.133e-07 1.992e-08 5.688 0.0000000129 ***
## staff_pickTRUE 3.387e-01 3.257e-03 104.014 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4698 on 175815 degrees of freedom
## Multiple R-squared: 0.09073, Adjusted R-squared: 0.09069
## F-statistic: 2924 on 6 and 175815 DF, p-value: < 2.2e-16
Testing for Condition Index and VIF:
collin.diag = colldiag(mod=lm.fit, scale=T, center=T, add.intercept=F)
collin.diag
## Condition
## Index Variance Decomposition Proportions
## goal date.difference name.length backers_count
## 1 1.000 0.000 0.000 0.002 0.086
## 2 1.346 0.042 0.340 0.502 0.003
## 3 1.349 0.473 0.194 0.165 0.001
## 4 1.384 0.485 0.406 0.038 0.001
## 5 1.420 0.000 0.060 0.292 0.004
## 6 3.152 0.000 0.000 0.001 0.905
## converted_pledged_amount staff_pick
## 1 0.086 0.028
## 2 0.002 0.069
## 3 0.001 0.132
## 4 0.001 0.088
## 5 0.006 0.682
## 6 0.904 0.000
vif(lm.fit)
## goal date.difference name.length
## 1.000777 1.001391 1.004255
## backers_count converted_pledged_amount staff_pick
## 2.950967 2.945734 1.026015
Results: The CI and VIF tests show no sign of problematic multicollinearity among the predictors: the largest condition index is 3.152 and the largest VIF is 2.95 (backers_count).
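To make the VIF figures concrete, here is a small base-R sketch of the definition VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors. The data are simulated and the names x1, x2, x3 and vif_by_hand are hypothetical; this illustrates the calculation and is not a re-run of the analysis on the Kickstarter data.

```r
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)  # deliberately correlated with x1
x3 <- rnorm(n)                        # roughly independent of the others
X  <- data.frame(x1, x2, x3)

# For each predictor, regress it on the other predictors and convert R^2 to a VIF.
vif_by_hand <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})
round(vif_by_hand, 3)
```

A pair of correlated predictors, such as backers_count and converted_pledged_amount in the output above, shows up as two VIFs that are elevated together, while an unrelated predictor stays close to 1.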
2. Random Forests
To check how important each variable and category is for the output, we run Random Forests and rank the
variables from most to least important.
train = sample(1:nrow(ops), 0.7*nrow(ops))
rf.ops<-randomForest(boolop ~ backers_count + converted_pledged_amount +date.difference + staff_pick + g
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
varImpPlot(rf.ops)
[...] the least important.
3. Ridge Regression
Now we have a rough idea of the level of importance of each variable, so we run Ridge Regression to check whether
the results match. Additionally, photography, the least important categorical variable, is used as the reference
category.
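For reference, the ridge fit below corresponds to alpha = 0 in glmnet's elastic-net penalty. As a sketch, in notation that is ours rather than the authors' (N observations, predictor vector x_i, response y_i equal to boolop, penalty weight lambda), the binomial ridge estimate minimizes the penalized negative log-likelihood:

```latex
\min_{\beta_0,\,\beta}\;
-\frac{1}{N}\sum_{i=1}^{N}\Big[\, y_i\,(\beta_0 + x_i^{\top}\beta)
  - \log\!\big(1 + e^{\beta_0 + x_i^{\top}\beta}\big) \Big]
\;+\; \frac{\lambda}{2}\,\lVert\beta\rVert_2^2
```

The penalty shrinks coefficients towards zero without setting any of them exactly to zero, which is why every term stays in the model along the whole path.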
x<-model.matrix(boolop ~ goal + date.difference + name.length+ backers_count + converted_pledged_amount
y<-(ops$boolop)
ridge.mod<-glmnet(x,y,alpha=0,family="binomial")
plot(ridge.mod)
[Figure: ridge coefficient paths from plot(ridge.mod); x-axis: L1 Norm, y-axis: Coefficients; all 21 coefficients remain nonzero along the path]
cv.10Fold=cv.glmnet(x,y,alpha=0)
cbind("Lambda"=cv.10Fold$lambda, "10-Fold MSE"=cv.10Fold$cvm)
## Lambda 10-Fold MSE
## [1,] 126.64214876 0.2427229
## [2,] 115.39160931 0.2421852
## [3,] 105.14053677 0.2421289
## [4,] 95.80014126 0.2420714
## [5,] 87.28952075 0.2420084
## [6,] 79.53496030 0.2419395
## [7,] 72.46929364 0.2418641
## [8,] 66.03132132 0.2417816
## [9,] 60.16528073 0.2416914
## dance 1.2399057258460
## design 0.4052340914653
## fashion -0.5019103086745
## flim.video 0.3010883737987
## food -0.8090250738321
## games 0.2203254660493
## journalism -0.9622042123386
## music 0.5780717283194
## publishing 0.5946796011153
## technology -0.9767070739998
## theater 0.8625102834924
Let’s check out the coefficient plot:
ridge.mod2<-glmnet(x,y,alpha=0,family='binomial',lambda=best.lambda)
coefplot(ridge.mod2)
## Warning: Removed 22 rows containing missing values (geom_errorbarh).
## Warning: Removed 22 rows containing missing values (geom_errorbarh).
[Figure: Coefficient Plot from coefplot(ridge.mod2); x-axis: Value (−1.0 to 1.0), y-axis: Coefficient, listing (Intercept), goal, date.difference, name.length, backers_count, converted_pledged_amount, staff_pickFALSE, staff_pickTRUE and the 14 category indicators from art to theater]
Results: Ridge Regression gives the direction and size of each predictor’s effect on the response: positive
coefficients increase the probability of success while negative coefficients reduce it.
The following categories contribute positively towards the success of a campaign relative to the reference
category photography, in decreasing order of importance:
1) Dance 2) Theater 3) Comics 4) Publishing 5) Music 6) Design 7) Film & video 8) Games
The following categories contribute negatively towards the success of a campaign relative to the reference
category photography:
1) Technology 2) Journalism 3) Food 4) Fashion 5) Art 6) Crafts
The following general predictors contribute towards the success of a campaign:
1) Staff Pick - True 2) Staff Pick - False
“Staff Pick” is somewhat significant in contributing towards the success of a campaign, while other predictors
like “Pledged Amount”, “Backers Count”, “Name Length”, “Difference between launch date and deadline” and
“Goal Amount” have coefficients close to zero and so show no visible effect on the probability of success according to this method.
4. Logistic Model
Now, we fit a Logistic Model and test it using a Confusion Matrix.
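logistic.fit = glm(boolop ~ goal + date.difference + name.length + backers_count + converted_pledged_amount +
    staff_pick + art + comics + crafts + dance + design + fashion + flim.video + food + games +
    journalism + music + publishing + technology + theater,
    family=binomial(link="logit"), data=ops)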
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(logistic.fit)
##
## Call:
## glm(formula = boolop ~ goal + date.difference + name.length +
## backers_count + converted_pledged_amount + staff_pick + art +
## comics + crafts + dance + design + fashion + flim.video +
## food + games + journalism + music + publishing + technology +
## theater, family = binomial(link = "logit"), data = ops)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.4904 -0.4252 0.0015 0.4338 8.4904
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.424187243 0.041128923 -10.314 < 2e-16 ***
## goal -0.000185801 0.000001734 -107.121 < 2e-16 ***
## date.difference -0.011400929 0.000562606 -20.265 < 2e-16 ***
## name.length 0.006010506 0.000493338 12.183 < 2e-16 ***
## backers_count 0.039314573 0.000434350 90.513 < 2e-16 ***
## converted_pledged_amount 0.000275495 0.000004426 62.248 < 2e-16 ***
## staff_pickTRUE 0.057065520 0.034104189 1.673 0.094274 .
## art -0.085436709 0.039720898 -2.151 0.031482 *
## comics 0.033819395 0.051496494 0.657 0.511353
## crafts 0.058971934 0.047193891 1.250 0.211458
## dance 1.258774315 0.066437561 18.947 < 2e-16 ***
## design -0.204054953 0.055321204 -3.689 0.000226 ***
## fashion -0.507388561 0.048733470 -10.412 < 2e-16 ***
## flim.video 0.382472403 0.039006422 9.805 < 2e-16 ***
## food -0.759609780 0.046815586 -16.226 < 2e-16 ***
## games -0.436522026 0.049686105 -8.786 < 2e-16 ***
## journalism -0.632651126 0.054740419 -11.557 < 2e-16 ***
## music 0.342081944 0.038353244 8.919 < 2e-16 ***
## publishing 0.428364353 0.040587915 10.554 < 2e-16 ***
## technology -1.215911902 0.047985101 -25.339 < 2e-16 ***
## theater 0.862946027 0.051739868 16.679 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 238600 on 175821 degrees of freedom
## Residual deviance: 97839 on 175801 degrees of freedom
## AIC: 97881
##
## Number of Fisher Scoring iterations: 13
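Results: Logistic Regression shows a majority of the variables to be significant, and the residual deviance (97839) is well below the null deviance (238600).
The coefficients are on the log-odds scale: each one is the change in the log odds of success for a one-unit
increase in that predictor, with the other predictors held fixed. A negative coefficient lowers the odds of success, a positive
coefficient raises them, and a coefficient of zero leaves the odds unchanged, meaning no effect on the output.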
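Exponentiating a log-odds coefficient gives an odds ratio, which is often easier to read. The sketch below copies four estimates from the summary above; the object names log.odds and odds.ratio are ours, and the last line is only an illustration of how a complete linear predictor (here the intercept alone) maps to a probability.

```r
# Log-odds coefficients copied from the logistic regression summary above.
log.odds <- c(dance          = 1.258774315,
              theater        = 0.862946027,
              technology     = -1.215911902,
              staff_pickTRUE = 0.057065520)

# Odds ratio: multiplicative change in the odds of success relative to the
# reference (photography for categories, staff_pick = FALSE for staff pick).
odds.ratio <- exp(log.odds)
round(odds.ratio, 3)

# Only a full linear predictor converts to a probability. With every numeric
# predictor at zero and the reference category, that is the intercept alone.
plogis(-0.424187243)
```

An odds ratio above 1 (dance, theater) means higher odds of success than an otherwise identical photography campaign, and an odds ratio below 1 (technology) means lower odds.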
Cross-validation
1) Training sample of 70% of the data:
train=sample(1:nrow(ops), 0.7*nrow(ops))
2) Testing sample of 30% of the data:
test=seq(1:nrow(ops))[-train]
3) Fitting a Logistic Model using training sub-sample:
ops.fit.train = glm(boolop ~ goal + date.difference + name.length+ backers_count + converted_pledged_amo
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
4) Predicting probabilities using test data:
ops.probs.test=predict(ops.fit.train, ops, type="response")[test]
ops.pred.test = ifelse(ops.probs.test>0.5, 1,0)
5) Creating a Confusion Matrix and checking classification fit statistics:
conf.mat.ops <- table(ops.pred.test, "Actual"=ops$boolop[test]) # Cross-tab prediction vs. actual
conf.mat.ops # Display confusion matrix
## Actual
## ops.pred.test 0 1
## 0 20273 1905
## 1 1758 28811
TruN=conf.mat.ops[1,1]; TruP=conf.mat.ops[2,2] # True negatives and positives
FalN=conf.mat.ops[1,2]; FalP=conf.mat.ops[2,1] # False negatives and positives
TotN=conf.mat.ops[1,1] + conf.mat.ops[2,1] # Total negatives
TotP=conf.mat.ops[1,2] + conf.mat.ops[2,2]
Tot=TotN+TotP # Total observations
Accuracy.Rate=(TruN+TruP)/Tot; Error.Rate=(FalN+FalP)/Tot
Sensitivity=TruP/TotP; Specificity=TruN/TotN; FalP.Rate = 1- Specificity
logit.rates.50=c(Error.Rate, Sensitivity, Specificity, FalP.Rate)
names(logit.rates.50)=c("Error Rate", "Sensitivity", "Specificity", "False Positives")
logit.rates.50
## Error Rate Sensitivity Specificity False Positives
## 0.06944471 0.93798021 0.92020335 0.07979665
Results: The model is trained on 70% of the data and tested on the remaining 30%. Sensitivity (93.8%) and
Specificity (92.0%) are both high, while the False Positive rate (8.0%) and Error Rate (6.9%) are low. As such, this model
classifies the held-out campaigns well.
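The rates reported above can be re-derived directly from the four printed counts. A minimal sketch in base R; the object name conf.mat and the extra precision figure are additions for illustration, not part of the authors' code.

```r
# Counts copied from the confusion matrix above (rows = predicted, columns = actual).
conf.mat <- matrix(c(20273, 1758, 1905, 28811), nrow = 2,
                   dimnames = list(Predicted = c("0", "1"), Actual = c("0", "1")))

accuracy    <- sum(diag(conf.mat)) / sum(conf.mat)        # correct / all
sensitivity <- conf.mat["1", "1"] / sum(conf.mat[, "1"])  # true positives / actual positives
specificity <- conf.mat["0", "0"] / sum(conf.mat[, "0"])  # true negatives / actual negatives
precision   <- conf.mat["1", "1"] / sum(conf.mat["1", ])  # true positives / predicted positives

round(c(Accuracy = accuracy, Error = 1 - accuracy, Sensitivity = sensitivity,
        Specificity = specificity, Precision = precision), 4)
```

Indexing by dimension names rather than positions makes it harder to swap false negatives and false positives by accident.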
We use the previously created Logistic Model to predict the probability of success of live campaigns. The live
campaign data:
live<- read.csv("newlive.csv", header=T)
response<-predict(logistic.fit, newdata = live)
odds <- exp(response)
ProbOfSuccess<- odds/(odds+1)
live<-cbind(live, ProbOfSuccess)
live[1:5,]
## X.2 X.1 X backers_count converted_pledged_amount country fx_rate
## 1 1 215 215 21 644 MX 0.05393109
## 2 2 216 216 24 634 US 1.00000000
## 3 3 217 217 51 2182 CA 0.80059885
## 4 4 218 218 34 433 US 1.00000000
## 5 5 219 219 222 8299 US 1.00000000
## goal pledged staff_pick state usd_pledged Cleaned_Category name.length
## 1 5700 11950 FALSE live 633.9715 art_illustration 33
## 2 500 634 FALSE live 634.0000 art_illustration 48
## 3 2500 2726 FALSE live 2187.8624 art_illustration 16
## 4 360 433 FALSE live 433.0000 art_illustration 48
## 5 2000 8299 TRUE live 8299.0000 art_illustration 41
## date.difference art comics crafts dance design fashion flim.video food
## 1 30 1 0 0 0 0 0 0 0
## 2 30 1 0 0 0 0 0 0 0
## 3 35 1 0 0 0 0 0 0 0
## 4 60 1 0 0 0 0 0 0 0
## 5 60 1 0 0 0 0 0 0 0
## games journalism music photography publishing technology theater
## 1 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0
## Grouped.Category ProbOfSuccess
## 1 art 0.3297429
## 2 art 0.6135277
## 3 art 0.7907052
## 4 art 0.6186733
## 5 art 0.9999418
Here, we have shown only 5 predictions due to the paper’s space constraint, but the entire validation set has 6533
observations. The “ProbOfSuccess” variable shows the predicted probability of success of a campaign.
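The conversion from predicted log odds to probabilities used above (odds/(odds+1)) gives the same result as asking predict() for probabilities directly with type = "response". The sketch below checks this on simulated stand-in data; the data frames toy and new and the fitted object fit are hypothetical and unrelated to the Kickstarter data.

```r
set.seed(42)
# Simulated stand-in data, not the Kickstarter dataset.
toy   <- data.frame(x = rnorm(200))
toy$y <- rbinom(200, size = 1, prob = plogis(0.5 + 1.2 * toy$x))
fit   <- glm(y ~ x, family = binomial(link = "logit"), data = toy)
new   <- data.frame(x = c(-1, 0, 2))

log.odds <- predict(fit, newdata = new)                      # default type = "link"
p.manual <- exp(log.odds) / (1 + exp(log.odds))              # manual conversion
p.direct <- predict(fit, newdata = new, type = "response")   # probabilities directly

all.equal(unname(p.manual), unname(p.direct))
```

Using type = "response", or plogis() on the log odds, also avoids overflow in exp() when a predicted log odds value is very large.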
5. Classification Tree
The Logistic Model works well, but let’s also try fitting a Classification Tree model:
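tree.ops=tree(boolop ~ goal + date.difference + name.length + backers_count + converted_pledged_amount +
    staff_pick + art + comics + crafts + dance + design + fashion + flim.video + food + games +
    journalism + music + publishing + technology + theater, data=ops)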
summary(tree.ops)
##
## Regression tree:
## tree(formula = boolop ~ goal + date.difference + name.length +
## backers_count + converted_pledged_amount + staff_pick + art +
## comics + crafts + dance + design + fashion + flim.video +
## food + games + journalism + music + publishing + technology +
## theater, data = ops)
## Variables actually used in tree construction:
## [1] "backers_count" "goal"
## [3] "converted_pledged_amount"
## Number of terminal nodes: 11
## Residual mean deviance: 0.04777 = 8398 / 175800
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.953800 -0.001703 0.046180 0.000000 0.046180 0.998300
plot(tree.ops); text(tree.ops, pretty=0)
[Figure: fitted tree from plot(tree.ops). The root split is backers_count < 11.5; the remaining splits use goal, converted_pledged_amount and backers_count (< 29.5), giving 11 terminal nodes with predicted values between 0.001703 and 0.9538.]
The Classification Tree drops a few variables which are important according to business knowledge. Thus, we go
with Logistic Regression in this case.
Conclusion
For this project, we focused on finding which variables enhance or reduce the likelihood that
a campaign is successfully funded on Kickstarter. We tried two models: Classification Tree and
Logistic Regression. When we ran the Classification Tree, some of the variables were dropped. Based on
our business knowledge, we strongly believe all variables are significant and thus should be retained. We used
Random Forests and Ridge Regression for variable selection, Condition Index and variance inflation factors to test
for multicollinearity, and a Confusion Matrix to check classification fit statistics. We applied our fitted Logistic
Model to a new dataset of ongoing projects to predict their likelihood of success.
Two main challenges we faced were the large dataset and the binary response variable. The dataset contains
over 190,000 observations, which resulted in a time-consuming data cleaning process. Some of the time variables
were in Unix format instead of dd/mm/yyyy and there were a few missing data points. Moreover, with the
binary response variable, our choice of models was limited to Logistic Regression and Classification Tree
only.
References:
1. Kickstarter Stats. (n.d.). Retrieved May 02, 2018, from https://www.kickstarter.com/help/stats