Predictive modeling Paper-Team8 V0.1

July 26, 2016 Bike Sharing
Team 8
AUTHORS
Arpita Majumder
Jenny(Qian) Zhao
Alicia Ramharack
Rajarshi Das

Table of Contents
1. Project Objective
2. Description
3. Data Source
4. Data Definition
5. Project Approach
6. Data Preparation (Explore-Modify Phase): Adding new variables to the data set
7. Data Preparation (Explore-Modify Phase): Missing value check
8. Explore Phase: Distribution and outlier analysis and key observations
9. Explore Phase: Robust outlier analysis and decision to delete
10. Explore Phase: K-means Clustering
11. Explore Phase: Hierarchical Clustering
12. Modify Phase: Data set Split
13. Modeling phase: Multiple regression model
14. Modeling phase: (Single) Decision tree Model
15. Modeling phase: Boosted tree model
16. Modeling phase: Bootstrap forest model
17. Modeling phase: Neural network model
18. Assess Phase: Model comparison

1. Project Objective
The objective of this project is to predict bike sharing and rental demand using the data generated by the
kiosk system throughout a city. The project aims to predict the hourly bike demand from key available data
such as season (spring/summer/fall/winter), weather, temperature and wind speed. From a business
perspective, the model can be used to forecast customer demand so that the rental company can prepare its
rental inventory accordingly. The company can also use the demand data to promote its business by
showcasing its demand-handling capacity. If considerable demand is forecast, the company could in future
also promote ancillary products such as biking gear and biking attire, on the assumption that some repeat
customers will be willing to take up other offers as well.
2. Description
The project uses a publicly available data set containing the data for the first 19 days of each month in
2011 and 2012. Each record contains the number of bikes rented in a given hour, identified by date and
timestamp. Seasonal and weather-related details are also available in the data set. It also records how many
rentals were made by registered customers and how many by casual customers.
3. Data Source
The Bike Sharing Demand data set is available at the following link:
https://www.kaggle.com/c/bike-sharing-demand/data
4. Data Definition
The following are the high-level definitions of the attributes available in the data set used by the
project team.

Table 1:
Attribute Name | Attribute Definition | Sample value(s)
Datetime | Hourly date + timestamp | 1/20/2011 12:00:00 AM
Season | 1 = spring, 2 = summer, 3 = fall, 4 = winter | 1
Holiday | Whether the day is considered a holiday | 0
Workingday | Whether the day is neither a weekend nor a holiday | 1
Weather | 1: Clear, few clouds, partly cloudy; 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist; 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds; 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog | 1
Temperature | Actual temperature in Celsius | 10.66
Feels like | "Feels like" temperature in Celsius | 11.365
Humidity | Relative humidity | 56
Windspeed | Wind speed | 26.0027
Casual | Number of non-registered user rentals initiated | 3
Registered | Number of registered user rentals initiated | 13
Count | Number of total rentals (Casual + Registered) | 16

5. Project Approach
 This project follows the conventional SEMMA approach (Sample, Explore, Modify, Model, Assess) for
predictive analysis and modelling, to analyze the data and retrieve understandable information from the data set.
 The following is an overview of how the SEMMA approach is followed in this project and which
technical activities are executed under each constituent of the SEMMA process.
 The next few sections of this report describe, with the necessary graphical representations from JMP,
the different stages we executed under the SEMMA process.
 Sample:
The project team started with data sampling, surveying a wide variety of publicly available data sets
from domains such as healthcare insurance, scientific clinical trials, presidential elections and customer
demand (like the Bike Sharing rental). Based on our project timeline and scope, we decided at the end of
the sampling phase to select the ‘Bike Sharing and rental Demand’ data set. Its data volume is well
suited to analysis on a stringent schedule, and it lets us learn some aspects of consumer demand
analysis. We also did some minor data partitioning in this phase to make sure the data set has a
workable number of rows (neither too big nor too small).
 Explore:
In the explore phase, our project team worked to understand the data, digging a little deeper into the
data definitions and discovering the anticipated and unanticipated relationships between the
variables. We also explored the few abnormalities within the variables with the aid of some data
visualization techniques in JMP that we learned in class, and checked whether the data set contains
any missing values, so that we would be prepared to correct them as needed.
 Modify:
After the data exploration, our project team progressed to the modification phase, where we looked
closely again at each of the variables in the bike sharing demand data set and decided by team
consensus which variables to treat as key variables. Some of our team members explained the need
for ‘massaging’ and minor ‘transformation’ of certain data attributes, and for adding some new
variables as part of the data preparation. We followed this advice, since it gives the data more
adequate variability and ultimately enriches the predictor variables.
 Model:
In the modelling phase, our project team focused on applying various modelling techniques to the
prepared data set, for example regression, decision tree algorithms (including boosted tree and
bootstrap forest) and a neural network algorithm. From these we obtained candidate predictions of
our target variable (Count) to represent the bike rental demand.

 Assess:
In the assess phase, our team compared the predicted responses for our target variable obtained from
the different models described under the Model phase above. This comparison helped us evaluate the
effectiveness, reliability and usefulness of the different models that we used to forecast our
target variable.
6. Data Preparation (Explore-Modify Phase): Adding new variables to the data set
 The project team modified some of the existing data attributes and added the resulting new columns to
the data set.
 These seven derived attributes are added to the data set for better understanding and
interpretation of the data, so that we can use them effectively in our modelling.
 The following table shows how we modified the existing attributes. It lists the
following details.
o Existing available attribute
o Derived attribute
o Derivation formula used to create the resulting new variable.
o Note: For the detailed definition of each existing attribute, please refer to Table 1 above.
Table 2:
Existing Attribute (Available) | Derived Attribute (New) | Derivation Formula
Datetime | Date | Abbrev Date(:datetime)
Datetime | Time (hour of the day) | Hour(:datetime)
Date | Day number of Week | Day Of Week(Informat(:Date))
Day number of Week | Day of the week | If(:Day number of Week == 1, "Sunday", If(:Day number of Week == 2, "Monday", If(:Day number of Week == 3, "Tuesday", If(:Day number of Week == 4, "Wednesday", If(:Day number of Week == 5, "Thursday", If(:Day number of Week == 6, "Friday", "Saturday"))))))
season | Season elaborated | If(:season == 1, "Spring", If(:season == 2, "Summer", If(:season == 3, "Fall", "Winter")))
holiday | National Holiday | If(:holiday == 0, "Not Holiday", "National Holiday")
weather | Weather elaborated | If(:weather == 1, "Clear, few clouds, partly cloudy", If(:weather == 2, "Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist", If(:weather == 3, "Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds", "Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog")))
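For readers without JMP, the derivations in Table 2 can be approximated in Python. This is only a sketch under stated assumptions: the function name, the dictionary keys and the timestamp layout (taken from the Table 1 sample value) are illustrative choices of ours and are not part of the project's JMP work. The weather label is left out for brevity.

```python
from datetime import datetime

# Lookup tables mirroring the nested If() formulas in Table 2.
DAY_NAMES = {1: "Sunday", 2: "Monday", 3: "Tuesday", 4: "Wednesday",
             5: "Thursday", 6: "Friday", 7: "Saturday"}
SEASON_NAMES = {1: "Spring", 2: "Summer", 3: "Fall", 4: "Winter"}


def derive_attributes(record):
    """Return a copy of one raw record with derived attributes added."""
    # Assumed timestamp layout, matching the sample value in Table 1.
    stamp = datetime.strptime(record["datetime"], "%m/%d/%Y %I:%M:%S %p")
    # Table 2 maps 1 to Sunday and 7 to Saturday, whereas
    # isoweekday() returns Monday=1 .. Sunday=7, so shift by one day.
    day_number = stamp.isoweekday() % 7 + 1
    derived = dict(record)
    derived["date"] = stamp.date().isoformat()
    derived["hour"] = stamp.hour
    derived["day_number_of_week"] = day_number
    derived["day_of_week"] = DAY_NAMES[day_number]
    derived["season_elaborated"] = SEASON_NAMES[record["season"]]
    derived["national_holiday"] = (
        "Not Holiday" if record["holiday"] == 0 else "National Holiday"
    )
    return derived


# Sample values taken from Table 1.
sample = {"datetime": "1/20/2011 12:00:00 AM", "season": 1, "holiday": 0}
row = derive_attributes(sample)
```

A lookup table is used instead of nested conditionals because an unexpected code then raises a KeyError rather than silently falling into the last branch.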

7. Data Preparation (Explore-Modify Phase): Missing value check
 The project team also analyzed the data set to check whether it contains any missing values.
 The JMP missing value exploration did not find any missing values.
 Fig 1 below shows our missing value analysis in JMP.
Fig 1:
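The same kind of check can be sketched outside JMP. The Python snippet below counts missing cells per column. The helper names, the text markers treated as missing and the two example rows are assumptions made for illustration; the second row is deliberately given a missing value and is not taken from the bike data.

```python
import math

MISSING_TOKENS = {"", "NA"}  # assumed text markers for a missing cell


def is_missing(value):
    """Treat None, NaN and the assumed text markers as missing."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip() in MISSING_TOKENS


def count_missing(rows):
    """Count missing cells per column across a list of dict records."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + (1 if is_missing(value) else 0)
    return counts


# Illustrative rows only; the second row contains a made-up gap.
rows = [
    {"temp": 10.66, "humidity": 56, "windspeed": 26.0027},
    {"temp": 9.84, "humidity": None, "windspeed": 0.0},
]
missing = count_missing(rows)
```

A data set with no missing values, as reported above, would give a count of zero for every column.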
8. Explore Phase: Distribution and outlier analysis and key observations
 The bike data set used in the project is a mixture of continuous and nominal variables (documented
below by type of variable).
 Before starting our modelling, our team analyzed some of these variables in more depth and came up
with the observations described below, which helped us understand the data and the relationships
in detail. These are preliminary observations based on individual analysis of the data. Not all of
them directly affected the final prediction when we ran the modelling algorithms, but they are key to
understanding how the individual data items collectively influence the outcome. This exploration
let us analyze and predict informally without modelling, and enriched the analytical ability of each
project team member.
 List of nominal/ordinal variables available in the data set:
o Datetime
o Season
o Holiday
o Working day
o Weather
o Date
o Time
o Day number of the week
o Season elaborated
o National Holiday
o Weather elaborated
 A few nominal variables are derived from another nominal variable, as shown in Table 2 above.
 Below are a few observations on the nominal variables:

Fig 2:
Fig 2a:
 For example, the above tabulation (Fig 2) shows a propensity towards higher bike demand on
Saturdays.
 The graph (Fig 2a) also shows that the higher bike demand shifts towards late afternoon to early
evening.
 Similarly, the tabulation below (Fig 3) shows that people are most interested in renting bikes in
fall, and that demand is lowest in spring.
Fig 3:

Fig 3a:
Fig 3b:
Fig 3c:
 Fig 4 below also shows a pattern that people tend to rent bikes more in weeks where there are
no holidays.
 From the graphical representations (Fig 3a to 3c) we can also observe the following patterns in
the bike rental demand:
o Fall is the peak season for demand.
o Renters prefer higher temperatures together with low or moderate humidity. Days with
high humidity or extremely low temperatures see very low demand.
o These individual analyses also show that each variable affects the target on its own while
contributing, some more and some less, to the collective influence of all variables on the
target. For example, the individual results show that moderate temperature with moderate
humidity leads to high demand. This explains why fall appears as the season of high
demand: it has comfortable temperatures (not too high or low) and moderate humidity as
well.
Fig 4:
 Continuous variables:
 We also explored the three continuous variables: Temp, humidity and wind speed.
 The distributions of these variables are shown below.
 As the distributions below show, the ‘Temp’ variable does not have any outliers, whereas
‘humidity’ and ‘wind speed’ have a few outliers.
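The report does not state which rule was used to read outliers off the distributions. A common convention for outlier box plots is to flag points lying more than 1.5 times the interquartile range beyond the quartiles. Assuming that rule applies here, a minimal Python sketch looks as follows. The function name and all data values are made up for illustration, and the quartile method may differ slightly from the one JMP uses.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR beyond the quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Made-up readings, not rows from the bike data set.
windspeed = [0.0, 6.0, 7.0, 9.0, 11.0, 13.0, 15.0, 17.0, 19.0, 57.0]
temp = [9.0, 10.0, 12.0, 14.0, 20.0, 22.0, 25.0, 28.0, 30.0, 33.0]

windspeed_outliers = iqr_outliers(windspeed)
temp_outliers = iqr_outliers(temp)
```

With these illustrative values, only the single extreme wind speed reading falls outside the fences, while the temperature list has no flagged values.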

Fig 5
 The ‘Johnson Si’ transformation of the variables (humidity and wind speed), shown in Fig 6, represents
the outliers in more detail.

Fig 6:
9. Explore Phase: Robust outlier analysis and decision to delete
 Because the analysis above detected some outliers in the data set, the team went on to use robust
outlier analysis to assess the volume of outliers in the entire data set.
 As Fig 7-9 show, our robust outlier analysis explored the Mahalanobis distance with respect to the
correlation structure. Many points (rows) lie above the distance line (UCL = 3.75). These points
are considered outliers.
 The Mahalanobis distance is saved in the data set for each row, and rows whose distance is more
than 3.75 are marked. This is done to find the number of outlier rows.
 We found that 669 of the 10886 rows are outliers, which is around 6% of the data. As the outlier
percentage is very low, we decided as a team to delete these rows.

Fig 9:
10. Explore Phase: K-means Clustering
 The project team also ran the clustering methods learned in class on the data set (see section 10 and
section 11, which follows).
 This helped us understand the distribution of the data, but we did not have to take any further
action on data preparation or modification based on these clustering analyses.
Fig 10

11. Explore Phase: Hierarchical Clustering
Fig 11
12. Modify Phase: Data set Split
 After all the individual data exploration, modification and preparation, our team moved towards
modelling. Before modelling, however, we split the entire data set into three partitions as
follows.
o Training data set
o Validation data set
o Testing data set
 Although this is a forecasting model and NOT a classification model, we still used a stratified
partition, stratifying on the target variable, so that the partitions are well proportioned with
respect to the target, though it was not mandatory.
 All our subsequent modelling exercises were built on these partitioned data, so that we
could compare the modelling effects and efficiency on each partitioned data set.
 A figurative representation of the data set after the partition is given below.

Fig 12
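One way to stratify on a continuous target such as Count is to sort the rows by the target, cut them into bands, and split each band at random into the three partitions. The Python sketch below illustrates that idea. The report does not state the partition proportions or how JMP forms its strata, so the 60/20/20 split, the four bands, the seed and the function name are all assumptions, and the counts are made up.

```python
import random


def stratified_partition(counts, fractions=(0.6, 0.2, 0.2), n_strata=4, seed=8):
    """Label each row 'training', 'validation' or 'testing'."""
    rng = random.Random(seed)
    # Sort row indices by the target so each stratum covers one band of counts.
    order = sorted(range(len(counts)), key=lambda i: counts[i])
    stratum_size = -(-len(order) // n_strata)  # ceiling division
    labels = [None] * len(counts)
    for start in range(0, len(order), stratum_size):
        stratum = order[start:start + stratum_size]
        rng.shuffle(stratum)
        n_train = round(len(stratum) * fractions[0])
        n_valid = round(len(stratum) * fractions[1])
        for position, index in enumerate(stratum):
            if position < n_train:
                labels[index] = "training"
            elif position < n_train + n_valid:
                labels[index] = "validation"
            else:
                labels[index] = "testing"
    return labels


# Made-up hourly counts, not rows from the bike data set.
counts = [(7 * i) % 101 for i in range(100)]
labels = stratified_partition(counts)
```

Because every band of the target contributes to each partition in the same proportion, low-demand and high-demand hours are represented in training, validation and testing alike.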
13. Modeling phase: Multiple regression model
 Response variable:
o Bike rent count
 Predictor variables:
o Time (hour of the day)
o Day of the week
o Season elaborated
o National Holiday
o Atemp
o Humidity
o Windspeed
 Prediction model outcomes:

Fig 13
Fig 14
 In the initial model, National Holiday and Wind speed appeared to be less effective
predictors, as the p-values for these variables are very high.
 These two variables were therefore removed from the model.
 After removing these variables, we re-ran the regression model and obtained the
following outcome.

Fig 15
 The RSquare value for the current model is 0.378.
 The prediction profile is represented as below.
Fig 16:
Importance of the variables as per the prediction profiler analysis:
 Based on the prediction profiler analysis of the influence of the individual predictor variables, we
observed the following patterns from this model.
o Bike rental demand increases as the day progresses.
o Demand increases from noon through the evening and beyond.
o Saturday is the day of the week on which demand is very high, whereas on the other days of the
week demand does not vary much.
o The model shows that bike renting peaks from fall to early winter.
o Temperature and humidity are significant predictors of bike rental demand. Medium to
high temperature and moderate humidity are key to higher demand.
 The prediction formula is saved into the data set. The prediction formula for this model is shown
below.

Fig 17:
 The error for this model is calculated as below:

Fig 18:
14. Modeling phase: (Single) Decision tree Model
 Response variable:
o Bike rent count
 Predictor variables:
o Day of the week
o Season elaborated
o National Holiday
o Atemp
o Humidity
o Windspeed
 Predictive model outcomes:
Fig 19:

 The RSquare value for the data set is given below:
 The RSquare value for this model is higher than for the previous model.
Fig 20:

 Column contribution in this model is given below:
Fig 21:
 Model prediction is saved in the dataset.

Fig 22:
 Error is calculated for this dataset as well.

Fig 23:
15. Modeling phase: Boosted tree model
 Response variable: Bike rent count
 Predictor variables:
o Day of the week
o Season elaborated
o National Holiday
o Atemp
o Humidity
o Windspeed

Fig 24:
 Prediction Formula is saved in the dataset.
Fig 25:

 Error is calculated for this model:
Fig 26:
16. Modeling phase: Bootstrap forest model
 Predictor variables:
o Day of the week
o Season elaborated
o National Holiday
o Atemp
o Humidity
o Windspeed

Fig 27:
 Prediction model formula is saved in the dataset:
Fig 28:

Fig 29:
17. Modeling phase: Neural network model
 Predictor variables:
o Day of the week
o Season elaborated
o National Holiday
o Atemp
o Humidity
o Windspeed
Fig 30:

Fig 31:
 Prediction model formula is saved in the dataset:
Fig 31:

Fig 32:

18. Assess Phase: Model comparison
 After running multiple models on this data set and obtaining a different prediction of the bike rent
count from each model, we compared the results of the models to evaluate which prediction model
is the best one to employ on this data set.
 We performed the following steps as a team, using the available JMP software, to compare each of
our models across all of the partitioned data sets, i.e. training, validation and testing.
Modeling comparison outcome for training data:
Fig 33:
Modeling comparison outcome for validation data:
Fig 34:
Modeling comparison outcome for testing data:
Fig 35:
Fig 35:
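One statistic the report quotes when comparing models is RSquare (sections 13 and 14). As a hedged illustration of how such a comparison can be made on a single partition, the Python sketch below computes RSquare as one minus the ratio of the residual sum of squares to the total sum of squares, then ranks the models by it. The function names, the model names and all numbers are placeholders and are not results from this project.

```python
def r_square(actual, predicted):
    """RSquare = 1 - SSE / SST for one set of predictions."""
    mean_actual = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - sse / sst


def rank_models(actual, predictions_by_model):
    """Return (model name, RSquare) pairs, best fit first."""
    scores = {
        name: r_square(actual, predicted)
        for name, predicted in predictions_by_model.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Placeholder counts and predictions for one partition.
actual = [16, 40, 32, 13, 1]
predictions_by_model = {
    "model_a": [20, 35, 30, 15, 5],
    "model_b": [25, 25, 25, 25, 25],
}
ranking = rank_models(actual, predictions_by_model)
```

Repeating the ranking for the training, validation and testing partitions shows whether a model that fits the training data well also holds up on data it has not seen.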

Prediction Metrics – Numeric Distribution of Prediction error for each model
Fig 36:
Conclusion:
Based on the model comparison and the analysis of the prediction error distribution for each model that we
ran on this data set, we have come to the following conclusions.
 The comparison statistics show that the Decision tree model gives us the most efficient and effective
model for predicting rental demand.
 The next in order of ranking is the Boosted Tree model.
 The error distribution also shows that the decision tree model has the smallest error
% (Error Mean = 0.33). The Boosted Tree model gives a slightly higher error % (Error Mean = 0.44),
whereas the multiple regression model gives the highest error % (Error Mean = 0.82), which is why
we consider the regression model the least effective.
 However, an important lesson from our exploration phase is that individual analysis of the data can
also help us understand the prediction outcome. Even though the prediction error of the regression
model was high, its prediction profiler showed the same predictor variables, with the same influence
characteristics, that we had observed in the individual analysis. So even though the regression model
did not give us the most efficient and accurate result, it helped corroborate that our exploration and
analysis were going in the right direction in terms of understanding the influence of each variable.
We ultimately confirmed this with the column contributions of our decision tree model, which is the
best model as per our evaluation.
Business Solution:

In walking through SEMMA, we find that the data helps us draw conclusions that address business
problems. From the data, we find that casual customers and registered customers have different bike
rental habits. This is valuable information that can help grow the customer base of both populations.
Rental trends show that we can manage our inventory according to the seasons, offering more inventory
during the peak months to accommodate more users.
Casual customers include tourists and infrequent bike renters. For tourists to Washington DC, bike
rentals are a cost-effective way of getting around the city for exploring and sightseeing. As a company,
we can offer recommendations and coupons to visit other attractions which they can access by bike.
This type of incentive does not cut into profit the way a coupon that reduces the price of a bike rental
would. In order to attract new customers, a first-time renter’s discount can be offered, which allows
the user to try the bike rental with low risk. Our registered customers are the most valuable. In order
to retain them, accessory options can be offered. Registering makes a customer a member of the
loyalty program, with exclusive access to amenities such as cooling centers or coupons for related
products.
