July 26, 2016 Bike Sharing
Team 8
AUTHORS
Arpita Majumder
Jenny(Qian) Zhao
Alicia Ramharack
Rajarshi Das
Table of Contents
1. Project Objective
2. Description
3. Data Source
4. Data Definition
5. Project Approach
6. Data Preparation (Explore-Modify Phase): Adding new variables to the data set
7. Data Preparation (Explore-Modify Phase): Missing value check
8. Explore Phase: Distribution and outlier analysis and key observations
9. Explore Phase: Robust outlier analysis and decision to delete
10. Explore Phase: K-means Clustering
11. Explore Phase: Hierarchical Clustering
12. Modify Phase: Data set Split
13. Modeling phase: Multiple regression model
14. Modeling phase: (Single) Decision tree Model
15. Modeling phase: Boosted tree model
16. Modeling phase: Bootstrap forest model
17. Modeling phase: Neural network model
18. Assess Phase: Model comparison
1. Project Objective
The objective of this project is to predict bike sharing (rental) demand using the data generated by the
kiosk system throughout a city. The project aims to predict hourly bike demand from key available data
such as weather and associated factors like season (spring/summer/fall/winter), temperature and wind speed.
From a business perspective, the model can be used to forecast customer demand so that the rental company
can prepare its rental inventory accordingly. The company can also use the demand data to promote its
business by showcasing its demand-handling capacity. If considerable demand is forecast, the company can
in future promote ancillary products such as biking gear and biking attire, assuming that some repeat
customers will be willing to take up other offers as well.
2. Description
The project uses a publicly available data set containing data for the first 19 days of each month in 2011
and 2012. Each record contains the number of bikes rented in a given hour, identified by date and timestamp.
Seasonal and weather-related details are also available in the data set. Each record also shows how many
bikes were rented by registered customers and how many by casual customers.
3. Data Source
The Bike Sharing Demand data set is available at:
https://www.kaggle.com/c/bike-sharing-demand/data
4. Data Definition
The following are the high-level definitions of the attributes available in the data set used by the
project team.
Table 1:
Attribute-Name | Attribute Definition | Sample value(s)
Datetime | Hourly date + timestamp | 1/20/2011 12:00:00 AM
Season | 1 = spring, 2 = summer, 3 = fall, 4 = winter | 1
Holiday | Whether the day is considered a holiday | 0
Workingday | Whether the day is neither a weekend nor a holiday | 1
Weather | 1: Clear, few clouds, partly cloudy; 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist; 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds; 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog | 1
Temperature | Actual temperature in Celsius | 10.66
Feels like | "Feels like" temperature in Celsius | 11.365
Humidity | Relative humidity | 56
Windspeed | Wind speed | 26.0027
Casual | Number of non-registered user rentals initiated | 3
Registered | Number of registered user rentals initiated | 13
Count | Number of total rentals (Casual + Registered) | 16
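As a quick illustration of Table 1, the sketch below parses the sample row into typed values and checks the one relationship the table states, namely that Count is the sum of Casual and Registered. The column keys (`temp`, `atemp` and so on) and the helpers `parse_record` and `count_is_consistent` are assumptions made for this example; the timestamp format follows the sample value shown in the table.

```python
from datetime import datetime

# Sample row from Table 1. The keys are assumed column names, not confirmed by the report.
SAMPLE_ROW = {
    "datetime": "1/20/2011 12:00:00 AM",
    "season": "1", "holiday": "0", "workingday": "1", "weather": "1",
    "temp": "10.66", "atemp": "11.365", "humidity": "56", "windspeed": "26.0027",
    "casual": "3", "registered": "13", "count": "16",
}

INT_FIELDS = ("season", "holiday", "workingday", "weather",
              "humidity", "casual", "registered", "count")
FLOAT_FIELDS = ("temp", "atemp", "windspeed")


def parse_record(row):
    """Convert one row of strings into typed values."""
    record = {"datetime": datetime.strptime(row["datetime"], "%m/%d/%Y %I:%M:%S %p")}
    for name in INT_FIELDS:
        record[name] = int(row[name])
    for name in FLOAT_FIELDS:
        record[name] = float(row[name])
    return record


def count_is_consistent(record):
    """Table 1 defines Count as Casual + Registered."""
    return record["count"] == record["casual"] + record["registered"]


record = parse_record(SAMPLE_ROW)
print(record["datetime"].hour, count_is_consistent(record))
```

A check like this can be run over every row before modelling, since a row that fails it would point to a data entry or import problem.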
5. Project Approach
• This project follows the conventional SEMMA (Sample, Explore, Modify, Model, Assess) approach for
predictive analysis and modelling, to analyze the data and retrieve understandable information from the
data set.
• The following is a description of how the SEMMA approach was followed in this project and which technical
activities were executed under each constituent of the SEMMA process.
• In the next few sections of this report, we describe, with the necessary graphical representations from
JMP, the different stages we executed under the SEMMA process.
• Sample:
The project team started with data sampling, where we searched through a wide variety of publicly
available data sets from a range of domains, including healthcare insurance, scientific clinical trials,
presidential elections and customer demand (like the Bike Sharing rental). Based on our project timeline
and scope, we decided at the end of our sampling phase to select the ‘Bike Sharing and rental Demand’ data
set. Its data volume is well suited to a project with a stringent schedule, and it also lets us learn some
aspects of consumer demand analysis. We also did some minor data partitioning in this phase to make sure
we have a data set with a suitable number of rows (neither too big nor too small).
• Explore:
In the explore phase, our project team worked to understand the data, digging a little deeper into the
data definitions and discovering the anticipated and unanticipated relationships between the variables.
We also explored the few abnormalities within the variables with the aid of some data visualization
techniques in JMP that we learned in class. We also checked whether there are any missing values in the
data set, so that we are prepared to correct them as needed.
• Modify:
After the data exploration, our project team progressed to the modification phase, where we again looked
closely at each of the variables in the bike sharing demand data set and decided by team consensus to
select certain variables as key variables to watch. Some of our team members explained the need for
‘massaging’ and minor ‘transformation’ of certain data attributes, and for adding some new variables as
part of the data preparation. We adopted these changes because they give the data more adequate
variability and ultimately enrich the predictor variables.
• Model:
In the modelling phase, our project team focused on applying various modelling techniques to the prepared
data set: regression, decision tree algorithms including boosted tree and bootstrap forest, and a neural
network algorithm. From these we obtained possible outcomes of our target variable (Count) to demonstrate
the predicted values of the bike rental demand.
• Assess:
In the assess phase, our team compared the predicted responses of our target variable obtained from the
different models explained under the Model section above. This comparison helped us evaluate the
effectiveness, reliability and usefulness of the different models that we used to forecast our target
variable.
6. Data Preparation (Explore-Modify Phase): Adding new variables to the data set
• The project team modified some of the existing data attributes, came up with new derived columns and
added them to the data set.
• These seven new derived attributes were added to the data set for better understanding and
interpretation of the data, so that we can use them effectively in our modelling.
• The following table shows how we modified the existing attributes. It lists the following details.
o Existing available attribute
o Derived attribute
o Derivation formula used to create the resulting new variable.
o Note: For detailed definitions of the existing attributes, please refer to Table 1 above.
Table 2:
Existing Attribute (Available) | Derived Attribute (New) | Derivation Formula
Datetime | Date | AbbrevDate(:datetime)
Datetime | Time (hour of the day) | Hour(:datetime)
Date | Day number of Week | Day Of Week(Informat(:Date))
Day number of Week | Day of the week | If(:Day number of Week == 1, "Sunday", If(:Day number of Week == 2, "Monday", If(:Day number of Week == 3, "Tuesday", If(:Day number of Week == 4, "Wednesday", If(:Day number of Week == 5, "Thursday", If(:Day number of Week == 6, "Friday", "Saturday"))))))
season | Season elaborated | If(:season == 1, "Spring", If(:season == 2, "Summer", If(:season == 3, "Fall", "Winter")))
holiday | National Holiday | If(:holiday == 0, "Not Holiday", "National Holiday")
weather | Weather elaborated | If(:weather == 1, "Clear, few clouds, partly cloudy", If(:weather == 2, "Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist", If(:weather == 3, "Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds", "Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog")))
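For readers who do not use JMP, the derivations in Table 2 can be sketched in Python as follows. This is an illustrative translation rather than the team's implementation: the function name `derive_attributes` and the dictionary keys are assumptions, `Date` is returned as a plain date rather than JMP's abbreviated date format, and the labels are written with normal spacing.

```python
from datetime import datetime

DAY_NAMES = ("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
SEASON_NAMES = {1: "Spring", 2: "Summer", 3: "Fall"}
WEATHER_NAMES = {
    1: "Clear, few clouds, partly cloudy",
    2: "Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist",
    3: "Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds",
}
HEAVY_WEATHER = "Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog"


def derive_attributes(timestamp, season, holiday, weather):
    """Return the seven derived attributes of Table 2 for one record."""
    # Table 2 numbers the days 1 = Sunday ... 7 = Saturday, while
    # isoweekday() numbers them 1 = Monday ... 7 = Sunday.
    day_number = timestamp.isoweekday() % 7 + 1
    return {
        "Date": timestamp.date(),
        "Time": timestamp.hour,
        "Day number of Week": day_number,
        "Day of the week": DAY_NAMES[day_number - 1],
        # Like the nested If() formulas, any unlisted code falls through to the last label.
        "Season elaborated": SEASON_NAMES.get(season, "Winter"),
        "National Holiday": "Not Holiday" if holiday == 0 else "National Holiday",
        "Weather elaborated": WEATHER_NAMES.get(weather, HEAVY_WEATHER),
    }


print(derive_attributes(datetime(2011, 1, 20, 0, 0), season=1, holiday=0, weather=1))
```

The dictionary lookups with a default mirror the nested `If` chains: only the first codes are tested explicitly and everything else receives the final label.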
7. Data Preparation (Explore-Modify Phase): Missing value check
• The project team also analyzed the data set to check whether it contains any missing values.
• Based on the missing value exploration in JMP, we did not encounter any missing values.
• Fig 1 below represents our missing value analysis in JMP.
Fig 1:
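The same kind of check can be sketched outside JMP. The example below counts missing entries per column, treating `None` and blank strings as missing; that definition, the helper name `count_missing` and the sample rows are assumptions for illustration and are not taken from the data set.

```python
def count_missing(rows, columns):
    """Count missing entries (None or blank strings) per column."""
    missing = {name: 0 for name in columns}
    for row in rows:
        for name in columns:
            value = row.get(name)
            if value is None or (isinstance(value, str) and value.strip() == ""):
                missing[name] += 1
    return missing


# Made-up rows for illustration only.
rows = [
    {"humidity": 56, "windspeed": 26.0027},
    {"humidity": None, "windspeed": 0.0},
    {"humidity": 81, "windspeed": ""},
]
print(count_missing(rows, ["humidity", "windspeed"]))
```

A result of zero for every column corresponds to the outcome the team reports for the bike sharing data.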
8. Explore Phase: Distribution and outlier analysis and key observations
• The bike data set used in the project is a mixture of continuous and nominal variables (as documented
below in the section for each type of variable).
• Before starting our modelling, our team analyzed some of these variables in more depth and came up with
the observations described below, which helped us understand the data and the relationships in detail.
These are preliminary observations based on individual analysis of the data. Not all of them directly
affected the final prediction when we ran the data through the modelling algorithms; however, they are key
to understanding how the individual data items can collectively influence the outcome. This exploration
helped us analyze and predict informally without modelling, and enriched the analytical ability of each
project team member.
• List of nominal/ordinal variables available in the data set:
o Datetime
o Season
o Holiday
o Working day
o Weather
o Date
o Time
o Day number of the week
o Season elaborated
o National Holiday
o Weather elaborated
▪ A few nominal variables are derived from another nominal variable, as seen in Table 2 above.
▪ Below are a few observations on the nominal variables:
Fig 2:
Fig 2a:
• For example, the above tabulation (Fig 2) shows a propensity towards higher bike demand on Saturdays.
• The graph (Fig 2a) shows that higher bike demand also shifts towards late afternoon to early evening.
• Similarly, the tabulation below (Fig 3) shows that people are most interested in renting bikes in fall,
and that demand is lowest in spring.
Fig 3:
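The tabulations by day of the week and by season amount to summarizing the rental count within each category. A minimal sketch is given below, assuming the summary statistic is the mean hourly count (the report does not state which statistic the JMP tabulation shows); the helper name `mean_count_by` and the rows are made up for illustration.

```python
from collections import defaultdict


def mean_count_by(records, key):
    """Average the rental count within each category of `key`."""
    totals = defaultdict(lambda: [0, 0])  # category -> [sum of counts, number of rows]
    for record in records:
        bucket = totals[record[key]]
        bucket[0] += record["count"]
        bucket[1] += 1
    return {category: total / n for category, (total, n) in totals.items()}


# Made-up rows for illustration only; these are not values from the data set.
rows = [
    {"day": "Saturday", "season": "Fall", "count": 30},
    {"day": "Saturday", "season": "Spring", "count": 10},
    {"day": "Monday", "season": "Fall", "count": 20},
]
print(mean_count_by(rows, "day"))
print(mean_count_by(rows, "season"))
```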
Fig 3a:
Fig 3b:
Fig 3c:
• Fig 4 below also shows a pattern: people tend to rent bikes more in weeks where there are no holidays.
• From the graphical representations (Fig 3a to 3c) we can also observe the following patterns in bike
rental demand:
o Fall season is the peak of demand.
o Renters prefer higher temperatures together with low or moderate humidity; days with high humidity or
extremely low temperature see very low demand.
o One very important point from these individual analyses is that each variable affects the target
individually and also contributes (some more, some less) to the collective influence of all variables
on the target. For example, we know from the individual results that moderate temperature with
moderate humidity leads to high demand. This explains why fall shows as the season of high demand: it
has comfortable temperatures (not too high or low) and moderate humidity as well.
Fig 4:
• Continuous variables:
▪ We also explored the 3 continuous variables: Temp, humidity and wind speed.
▪ The distributions of these variables are shown below.
▪ As the distributions below show, the ‘Temp’ variable does not have any outliers, whereas ‘humidity’ and
‘wind speed’ have a few outliers.
Fig 5
• The ‘Johnson Si’ transformation of the variables Humidity and Wind speed (see Fig 6) shows the outliers
in more detail.
Fig 6:
9. Explore Phase: Robust outlier analysis and decision to delete
• As some outliers were detected in the data set in the analysis above, the team used robust outlier
analysis to assess the volume of outliers in the entire data set.
• As can be seen in Fig 7-9, we explored the Mahalanobis Distance with respect to the correlation
structure in our robust outlier analysis. Many points (rows) lie above the distance line (UCL = 3.75).
These points are considered outliers.
• The Mahalanobis Distance was saved in the data set for each row, and the rows where the distance is
more than 3.75 were marked. This was done to find the number of outlier rows.
• We found that 669 of the 10886 rows are outliers, which is around 6% of the data. As the outlier
percentage is low, we decided as a team to delete these rows.
Fig 7:
Fig 8:
Fig 9:
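The marking step described above can be sketched as follows. This is a simplified illustration that uses the classical sample mean and covariance, not the robust estimates of JMP's outlier analysis, so its distances would not match the figures exactly. The function names and the toy data are assumptions; only the cut-off of 3.75 and the row counts come from the text.

```python
import math

UCL = 3.75  # distance line quoted in the report


def column_means(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def covariance_matrix(rows, means):
    n, p = len(rows), len(means)
    cov = [[0.0] * p for _ in range(p)]
    for row in rows:
        dev = [x - m for x, m in zip(row, means)]
        for i in range(p):
            for j in range(p):
                cov[i][j] += dev[i] * dev[j]
    return [[value / (n - 1) for value in line] for line in cov]


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [value / scale for value in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def mahalanobis_distances(rows):
    """Distance of each row from the mean, scaled by the covariance structure."""
    means = column_means(rows)
    inverse = invert(covariance_matrix(rows, means))
    p = len(means)
    distances = []
    for row in rows:
        dev = [x - m for x, m in zip(row, means)]
        d2 = sum(dev[i] * inverse[i][j] * dev[j] for i in range(p) for j in range(p))
        distances.append(math.sqrt(d2))
    return distances


def flag_outliers(rows, ucl=UCL):
    return [distance > ucl for distance in mahalanobis_distances(rows)]


# Toy data: a tight cluster plus one far-away point (not taken from the data set).
toy = [(0, 0), (2, 0), (0, 2), (2, 2)] * 12 + [(100, 100)]
print(sum(flag_outliers(toy)), "row(s) flagged")

# Share of outlier rows reported in the text.
print(round(100 * 669 / 10886, 1), "% of rows")
```

The last line reproduces the arithmetic behind the "around 6%" figure: 669 of 10886 rows is about 6.1%.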
10. Explore Phase: K-means Clustering
• The project team also ran the different clustering methods learned in class on the data set (see this
section and section 11, which follows).
• This helped us understand the distribution of the data, but we did not have to take any further action
on data preparation or modification based on these clustering analyses.
Fig 10
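For readers unfamiliar with the method, the sketch below shows the basic k-means procedure (Lloyd's algorithm) on made-up two-dimensional points. It is only an illustration of the technique: the report does not state which variables or how many clusters were used in JMP, and the starting centroids here are supplied by hand to keep the example deterministic.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, initial_centroids, max_iter=100):
    """Lloyd's algorithm with caller-supplied starting centroids."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(old)  # keep an empty cluster's centroid in place
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels


# Made-up points forming two obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, labels = kmeans(points, [(0, 0), (10, 10)])
print(labels)
```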
11. Explore Phase: Hierarchical Clustering
Fig 11
12. Modify Phase: Data set Split
• After all the individual data exploration, modification and preparation, our team moved towards
modelling. Before modelling, however, we split our entire data set into 3 partitions as follows.
o Training Data Set
o Validation Data Set
o Testing Data Set
• Although this is a forecasting type of model and NOT classification, we still used a stratified
partition, stratifying on the target variable, so that the target is represented in similar proportion
in each partition; this was not mandatory.
• All our subsequent modelling was built on these partitioned data, so that we could compare the effect
and efficiency of each model on each partition.
• A representation of the data set after the partition is given below.
Fig 12
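One way to stratify a split on a continuous target is to sort the rows by the target, cut them into strata of similar values, and divide each stratum among the three partitions. The sketch below follows that idea. The 60/20/20 proportions, the number of strata and the function name are assumptions for illustration, since the report does not state the settings used in JMP.

```python
import random


def stratified_partition(targets, fractions=(0.6, 0.2, 0.2), n_strata=10, seed=0):
    """Label each row Training, Validation or Testing, stratified on the target value."""
    if not targets:
        return []
    names = ("Training", "Validation", "Testing")
    rng = random.Random(seed)
    order = sorted(range(len(targets)), key=lambda i: targets[i])
    labels = [None] * len(targets)
    size = -(-len(order) // n_strata)  # ceiling division: rows per stratum
    for start in range(0, len(order), size):
        stratum = order[start:start + size]
        rng.shuffle(stratum)  # random assignment within a stratum
        cut1 = round(len(stratum) * fractions[0])
        cut2 = cut1 + round(len(stratum) * fractions[1])
        for position, index in enumerate(stratum):
            if position < cut1:
                labels[index] = names[0]
            elif position < cut2:
                labels[index] = names[1]
            else:
                labels[index] = names[2]
    return labels


# Made-up target values for illustration.
counts = list(range(100))
labels = stratified_partition(counts)
print({name: labels.count(name) for name in ("Training", "Validation", "Testing")})
```

Because every stratum contributes to all three partitions, low-demand and high-demand hours appear in each of them in roughly the same proportion.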
13. Modeling phase: Multiple regression model
• Response variable:
o Bike rent count
• Predictor variables:
o Time (hour of the day)
o Day of the week
o Season elaborated
o National Holiday
o Atemp
o Humidity
o Windspeed
• Prediction model outcomes:
Fig 13
Fig 14
• Based on the primary modelling outcome, National Holiday and Wind speed appeared to be less effective
in prediction, as the p-values for these variables are very high.
• So these two variables were removed from the model.
• After removing these variables, we re-ran the regression model and obtained the following outcome.
Fig 15
• The RSquare value for the current model is 0.378.
• The prediction profile is shown below.
Fig 16:
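The RSquare value reported for the regression model is the share of the variation in the response that the model explains, so 0.378 means roughly 38% of the variation in Count. A minimal sketch of the computation is given below; the numbers are made up and the function name is an assumption.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SSE / SST."""
    mean = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # unexplained variation
    sst = sum((a - mean) ** 2 for a in actual)                  # total variation
    return 1.0 - sse / sst


# Made-up values for illustration only.
actual = [3, 5, 7, 9]
predicted = [4, 4, 8, 8]
print(r_squared(actual, predicted))
```

A model that always predicts the mean scores 0, and a model that reproduces every observation scores 1.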
Importance of the variables as per the prediction profiler analysis:
• Based on the prediction profiler analysis of the influence of the individual predictor variables, we
observed the following patterns from this model.
o Bike rental demand increases as the day progresses.
o Demand increases from noon to evening and beyond.
o Saturday is the day of the week on which demand is very high, whereas demand does not vary much
across the other days of the week.
o The model shows that bike renting peaks from fall to early winter.
o Temperature and humidity are also significant predictors of bike rental demand. Medium to high
temperature and moderate humidity are key to higher demand.
• The prediction formula is saved into the data set. The prediction formula for this model is depicted
below.
Fig 17:
• The error for this model is calculated as below:
Fig 18:
14. Modeling phase: (Single) Decision tree Model
• Response variable:
o Bike rent count
• Predictor variables:
o Time (hour of the day)
o Day of the week
o Season elaborated
o National Holiday
o Atemp
o Humidity
o Windspeed
• Predictive model outcomes:
Fig 19:
• The RSquare values for the data set are given below.
• The RSquare value is higher for this model than for the previous model.
Fig 20:
• Column contributions in this model are given below:
Fig 21:
• The model prediction is saved in the data set.
Fig 22:
• The error is calculated for this model as well.
Fig 23:
15. Modeling phase: Boosted tree model
• Response variable: Bike rent count
• Predictor variables:
o Time (hour of the day)
o Day of the week
o Season elaborated
o National Holiday
o Atemp
o Humidity
o Windspeed
• Prediction model outcomes:
Fig 24:
• The prediction formula is saved in the data set.
Fig 25:
• The error is calculated for this model:
Fig 26:
16. Modeling phase: Bootstrap forest model
• Response variable: Bike rent count
• Predictor variables:
o Time (hour of the day)
o Day of the week
o Season elaborated
o National Holiday
o Atemp
o Humidity
o Windspeed
• Prediction model outcomes:
Fig 27:
• The prediction formula is saved in the data set:
Fig 28:
• The error is calculated for this model:
Fig 29:
17. Modeling phase: Neural network model
• Response variable: Bike rent count
• Predictor variables:
o Time (hour of the day)
o Day of the week
o Season elaborated
o National Holiday
o Atemp
o Humidity
o Windspeed
• Prediction model outcomes:
Fig 30:
Fig 31:
• The prediction formula is saved in the data set:
Fig 31:
• The error is calculated for this model:
Fig 32:
18. Assess Phase: Model comparison
• After running multiple models on this data set and obtaining a different prediction of the bike rent
count from each model, we are now at a stage where we should compare the results of each model to
identify the best prediction model for this data set.
• The following are the steps we performed as a team, using the available JMP software, to compare each
of our models across all of the partitioned data sets, i.e. Training, Validation and Testing.
Modeling comparison outcome for training data:
Fig 33:
Modeling comparison outcome for validation data:
Fig 34:
Modeling comparison outcome for testing data:
Fig 35:
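The comparison across partitions can be sketched as computing an error summary for each model within each partition. The example below uses the root mean squared error and the mean absolute error as stand-ins, since the report does not state which statistics the JMP comparison shows; the column names, model names and values are made up for illustration.

```python
import math
from collections import defaultdict


def compare_models(rows, model_names):
    """Return RMSE and MAE per (partition, model) from saved predictions."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # squared error, absolute error, row count
    for row in rows:
        for name in model_names:
            error = row[name] - row["count"]
            acc = sums[(row["partition"], name)]
            acc[0] += error * error
            acc[1] += abs(error)
            acc[2] += 1
    return {key: {"rmse": math.sqrt(sq / n), "mae": ab / n}
            for key, (sq, ab, n) in sums.items()}


# Made-up predictions for illustration only.
rows = [
    {"partition": "Training", "count": 10, "tree": 12, "regression": 16},
    {"partition": "Training", "count": 20, "tree": 18, "regression": 12},
    {"partition": "Testing", "count": 30, "tree": 33, "regression": 40},
]
for key, stats in sorted(compare_models(rows, ["tree", "regression"]).items()):
    print(key, stats)
```

Comparing the same statistic on the Training, Validation and Testing rows shows whether a model that fits the training data well also holds up on data it has not seen.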
Prediction Metrics – Numeric Distribution of Prediction error for each model
Fig 36:
Conclusion:
Based on the modelling comparison and the analysis of the prediction error distribution for each model
that we ran on this data set, we reached the following conclusions.
• Based on the comparison statistics, the Decision tree model gives us the most efficient and effective
prediction of rental demand.
• Next in order of ranking is the Boosted Tree model.
• The error distribution also shows that the decision tree model has the smallest error % (Error Mean =
0.33), the Boosted Tree model gives a slightly higher error % (Error Mean = 0.44), and the multiple
regression model gives the highest error % (Error Mean = 0.82), which is why we consider the regression
model the least effective.
• However, an important learning from our exploration phase is that individual analysis of the data can
also help us understand the prediction outcome. Even though the prediction error of the regression model
was high, its prediction profiler showed the same influence characteristics for the predictor variables
that we had observed in the individual analysis. So even if the regression model did not give us the
most efficient and accurate result, it helped corroborate that our exploration and analysis were going
in the right direction in terms of understanding the influence of each variable. We ultimately confirmed
this with the column contributions in our decision tree model, which is the best model as per our
evaluation.
Business Solution:
In walking through SEMMA, we find that the data helps us draw conclusions that address business
problems. From the data, we find that casual customers and registered customers have different bike rental
habits. This is valuable information that can help grow the customer base of both populations. Rental
trends show that we can manage our inventory according to the seasons, offering more inventory during the
peak months to accommodate more users.
Casual customers include tourists and infrequent bike renters. For tourists to Washington DC, bike
rentals are a cost-effective way of getting around the city for exploring and sightseeing. As a company, we
can offer recommendations and coupons for other attractions that they can reach by bike. With this type of
incentive, we are not cutting into profit, as we would by reducing the rental price with a bike rental
coupon. In order to attract new customers, a first-time renter's discount can be offered. This allows the
user to try the bike rental at low risk. Our registered customers are the most valuable. In order to retain
them, accessory options can be offered. Registering makes a customer a member of the loyalty program, with
exclusive access to amenities such as cooling centers or coupons for related products.

More Related Content

Similar to Predictive modeling Paper-Team8 V0.1

Data Warehouse Project Report
Data Warehouse Project Report Data Warehouse Project Report
Data Warehouse Project Report Tom Donoghue
 
cloudworkloadanalysisandsimulation-140521153543-phpapp02
cloudworkloadanalysisandsimulation-140521153543-phpapp02cloudworkloadanalysisandsimulation-140521153543-phpapp02
cloudworkloadanalysisandsimulation-140521153543-phpapp02PRIYANKA MEHTA
 
Project template for projects looks like this
Project template for projects looks like thisProject template for projects looks like this
Project template for projects looks like thiskaniuppu
 
Agile Prediction Model EASE 2016 V2
Agile Prediction Model EASE 2016 V2Agile Prediction Model EASE 2016 V2
Agile Prediction Model EASE 2016 V2Mathieu Carsique
 
Info461ProjectCharterEskridgeAs08
Info461ProjectCharterEskridgeAs08Info461ProjectCharterEskridgeAs08
Info461ProjectCharterEskridgeAs08Greg Eskridge
 
MineDB Mineral Resource Evaluation White Paper
MineDB Mineral Resource Evaluation White PaperMineDB Mineral Resource Evaluation White Paper
MineDB Mineral Resource Evaluation White PaperDerek Diamond
 
Data Insight-Driven Project Delivery ACADIA 2017
Data Insight-Driven Project Delivery ACADIA 2017Data Insight-Driven Project Delivery ACADIA 2017
Data Insight-Driven Project Delivery ACADIA 2017gapariciojr
 
Data analytics - The CloudMiner Ltd.
Data analytics - The CloudMiner Ltd. Data analytics - The CloudMiner Ltd.
Data analytics - The CloudMiner Ltd. James Xue
 
IRJET- Foster Hashtag from Image and Text
IRJET-  	  Foster Hashtag from Image and TextIRJET-  	  Foster Hashtag from Image and Text
IRJET- Foster Hashtag from Image and TextIRJET Journal
 
Fall 15 Internship Poster- Namrata Nath
Fall 15 Internship Poster- Namrata NathFall 15 Internship Poster- Namrata Nath
Fall 15 Internship Poster- Namrata NathNamrata Nath
 
Challenges Faced by Novices While Developing and Designing the Visualization ...
Challenges Faced by Novices While Developing and Designing the Visualization ...Challenges Faced by Novices While Developing and Designing the Visualization ...
Challenges Faced by Novices While Developing and Designing the Visualization ...IRJET Journal
 
Detection of Behavior using Machine Learning
Detection of Behavior using Machine LearningDetection of Behavior using Machine Learning
Detection of Behavior using Machine LearningIRJET Journal
 
Profile Analysis of Users in Data Analytics Domain
Profile Analysis of   Users in Data Analytics DomainProfile Analysis of   Users in Data Analytics Domain
Profile Analysis of Users in Data Analytics DomainDrjabez
 
Supply Chain Network Strategy with SCOR
Supply Chain Network Strategy with SCORSupply Chain Network Strategy with SCOR
Supply Chain Network Strategy with SCORRichard Freggi
 
Semantic Web Based Sentiment Engine
Semantic Web Based Sentiment EngineSemantic Web Based Sentiment Engine
Semantic Web Based Sentiment EngineJames Dellinger
 

Similar to Predictive modeling Paper-Team8 V0.1 (20)

Data Warehouse Project Report
Data Warehouse Project Report Data Warehouse Project Report
Data Warehouse Project Report
 
cloudworkloadanalysisandsimulation-140521153543-phpapp02
cloudworkloadanalysisandsimulation-140521153543-phpapp02cloudworkloadanalysisandsimulation-140521153543-phpapp02
cloudworkloadanalysisandsimulation-140521153543-phpapp02
 
INTERNSHIP PPT.pptx
INTERNSHIP PPT.pptxINTERNSHIP PPT.pptx
INTERNSHIP PPT.pptx
 
Project template for projects looks like this
Project template for projects looks like thisProject template for projects looks like this
Project template for projects looks like this
 
Agile Prediction Model EASE 2016 V2
Agile Prediction Model EASE 2016 V2Agile Prediction Model EASE 2016 V2
Agile Prediction Model EASE 2016 V2
 
Info461ProjectCharterEskridgeAs08
Info461ProjectCharterEskridgeAs08Info461ProjectCharterEskridgeAs08
Info461ProjectCharterEskridgeAs08
 
MineDB Mineral Resource Evaluation White Paper
MineDB Mineral Resource Evaluation White PaperMineDB Mineral Resource Evaluation White Paper
MineDB Mineral Resource Evaluation White Paper
 
Internship Presentation.pdf
Internship Presentation.pdfInternship Presentation.pdf
Internship Presentation.pdf
 
Data Insight-Driven Project Delivery ACADIA 2017
Data Insight-Driven Project Delivery ACADIA 2017Data Insight-Driven Project Delivery ACADIA 2017
Data Insight-Driven Project Delivery ACADIA 2017
 
Data analytics - The CloudMiner Ltd.
Data analytics - The CloudMiner Ltd. Data analytics - The CloudMiner Ltd.
Data analytics - The CloudMiner Ltd.
 
IRJET- Foster Hashtag from Image and Text
IRJET-  	  Foster Hashtag from Image and TextIRJET-  	  Foster Hashtag from Image and Text
IRJET- Foster Hashtag from Image and Text
 
Fall 15 Internship Poster- Namrata Nath
Fall 15 Internship Poster- Namrata NathFall 15 Internship Poster- Namrata Nath
Fall 15 Internship Poster- Namrata Nath
 
Challenges Faced by Novices While Developing and Designing the Visualization ...
Challenges Faced by Novices While Developing and Designing the Visualization ...Challenges Faced by Novices While Developing and Designing the Visualization ...
Challenges Faced by Novices While Developing and Designing the Visualization ...
 
Dss project analytics writeup
Dss project analytics writeup Dss project analytics writeup
Dss project analytics writeup
 
Detection of Behavior using Machine Learning
Detection of Behavior using Machine LearningDetection of Behavior using Machine Learning
Detection of Behavior using Machine Learning
 
Master thesis
Master thesisMaster thesis
Master thesis
 
Profile Analysis of Users in Data Analytics Domain
Profile Analysis of   Users in Data Analytics DomainProfile Analysis of   Users in Data Analytics Domain
Profile Analysis of Users in Data Analytics Domain
 
Supply Chain Network Strategy with SCOR
Supply Chain Network Strategy with SCORSupply Chain Network Strategy with SCOR
Supply Chain Network Strategy with SCOR
 
rscript_paper-1
rscript_paper-1rscript_paper-1
rscript_paper-1
 
Semantic Web Based Sentiment Engine
Semantic Web Based Sentiment EngineSemantic Web Based Sentiment Engine
Semantic Web Based Sentiment Engine
 

Predictive modeling Paper-Team8 V0.1

  • 1. July 26, 2016 Bike Sharing Team 8 AUTHORS Arpita Majumder Jenny(Qian) Zhao Alicia Ramharack Rajarshi Das
  • 2. 1 | P a g e Table of Contents 1. Project Objective........................................................................................................................................2 2. Description................................................................................................................................................2 3. Data Source...............................................................................................................................................2 4. Data Definition ..........................................................................................................................................2 5. Project Approach .......................................................................................................................................4 6. Data Preparation (Explore-Modify Phase): Adding new variables to the date set..............................................5 7. Data Preparation (Explore-Modify Phase): Missing value check......................................................................6 8. Explore Phase: Distribution and outlier analysis and key observations ............................................................6 9. Explore Phase: Robust outlier analysis and decision to delete....................................................................... 11 10. Explore Phase: K-means Clustering ............................................................................................................ 13 11. Explore Phase: Hierarchical Clustering........................................................................................................ 14 12. Modify Phase: Data set Split...................................................................................................................... 14 13. Modeling phase: Multiple regression model ............................................................................................... 15 14. 
Modeling phase: (Single) Decision tree Model............................................................................................. 19 15. Modeling phase: Boosted tree model......................................................................................................... 23 16. Modeling phase: Bootstrap forest model.................................................................................................... 25 17. Modeling phase: Neural network model..................................................................................................... 27 18. Assess Phase: Model comparison............................................................................................................... 30
  • 3. 1. Project Objective
      The objective of this project is to predict bike sharing and rental demand using data generated by a kiosk system throughout a city. The project aims to predict hourly bike demand from key available data such as season (spring/summer/fall/winter), weather, temperature and wind speed.
      From a business perspective, the model can be used to forecast customer demand so that the rental inventory can be prepared in advance. The rental company can also use the demand data to promote its business by showcasing its capacity to handle considerable demand. If considerable demand is forecast, the company can also consider promoting ancillary services such as biking gear and biking attire, on the assumption that some repeat customers will be willing to take up other offers in the future.
      2. Description
      The project uses a publicly available data set containing data for the first 19 days of each month in 2011 and 2012. Each record contains the number of bikes rented in one hour, identified by date and timestamp. Seasonal and weather-related details are also available, and each record shows how many rentals were made by registered customers and how many by casual customers.
      3. Data Source
      The Bike Sharing Demand data set is available at: https://www.kaggle.com/c/bike-sharing-demand/data
      4. Data Definition
      The following are high-level definitions of the attributes available in the data set used by the project team.
  • 4. Table 1:
      Attribute Name | Attribute Definition | Sample value(s)
      Datetime | Hourly date + timestamp | 1/20/2011 12:00:00 AM
      Season | 1 = spring, 2 = summer, 3 = fall, 4 = winter | 1
      Holiday | Whether the day is considered a holiday | 0
      Workingday | Whether the day is neither a weekend nor a holiday | 1
      Weather | 1: Clear, few clouds, partly cloudy; 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist; 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds; 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog | 1
      Temperature | Actual temperature in Celsius | 10.66
      Feelslike | "Feels like" temperature in Celsius | 11.365
      Humidity | Relative humidity | 56
      Windspeed | Wind speed | 26.0027
      Casual | Number of non-registered user rentals initiated | 3
      Registered | Number of registered user rentals initiated | 13
      Count | Number of total rentals (Casual + Registered) | 16
  • 5. 5. Project Approach
      • This project follows the conventional SEMMA approach (Sample, Explore, Modify, Model, Assess) for predictive analysis and modelling, that is, for analyzing the data and retrieving understandable information from the data set.
      • The following is an overview of how the SEMMA approach was applied in this project and which technical activities were carried out under each constituent of the SEMMA process.
      • The next few sections of this report describe, with graphical representations from JMP, the stages we executed under the SEMMA process.
      • Sample: The project team started with data sampling, reviewing a wide variety of publicly available data sets from domains including healthcare insurance, scientific clinical trials, presidential elections and customer demand (such as bike sharing rental). Based on our project timeline and scope, we decided at the end of the sampling phase to select the 'Bike Sharing and Rental Demand' data set. Its data volume is well suited to analysis on a stringent schedule, and it lets us learn some aspects of consumer demand analysis. We also did some minor data partitioning in this phase to make sure the data set has a workable number of rows (neither too big nor too small).
      • Explore: In the explore phase, the team worked to understand the data: digging deeper into the data definitions, discovering anticipated and unanticipated relationships between the variables, and examining the few abnormalities within the variables using the data visualization techniques in JMP that we learned in class. We also checked whether the data set contains any missing values, so that we could correct them as needed.
      • Modify: After the data exploration, the team moved on to the modification phase. We looked closely again at each variable in the bike sharing demand data set and agreed, by team consensus, on certain key variables to watch. As part of the data preparation we also applied minor transformations to certain attributes and added some new variables, since this gives the data more variability and enriches the predictor variables.
      • Model: In the modelling phase, the team applied several modelling techniques to the prepared data set: regression, decision tree algorithms (including boosted tree and bootstrap forest) and a neural network. Each produced predicted values of our target variable (Count), that is, of the bike rental demand.
  • 6. • Assess: In the assess phase, the team compared the predicted responses for our target variable obtained from the different models described under the Model step above. This comparison helped us evaluate the effectiveness, reliability and usefulness of the models used to forecast the target variable.
      6. Data Preparation (Explore-Modify Phase): Adding new variables to the data set
      • The project team modified some of the existing data attributes and added the resulting new columns to the data set.
      • These seven derived attributes were added for better understanding and interpretation of the data, so that we can use them effectively in our modelling.
      • The following table shows how the existing attributes were modified. It lists:
          o Existing available attribute
          o Derived attribute
          o Derivation formula used to create the new variable
          o Note: For detailed definitions of the existing attributes, please refer to Table 1 above.
      Table 2:
      Existing Attribute (Available) | Derived Attribute (New) | Derivation Formula
      Datetime | Date | AbbrevDate(:datetime)
      Datetime | Time (hour of the day) | Hour(:datetime)
      Date | Day number of Week | Day Of Week(Informat(:Date))
      Day number of Week | Day of the week | If(:Day number of Week == 1, "Sunday", If(:Day number of Week == 2, "Monday", If(:Day number of Week == 3, "Tuesday", If(:Day number of Week == 4, "Wednesday", If(:Day number of Week == 5, "Thursday", If(:Day number of Week == 6, "Friday", "Saturday"))))))
      season | Season elaborated | If(:season == 1, "Spring", If(:season == 2, "Summer", If(:season == 3, "Fall", "Winter")))
      holiday | National Holiday | If(:holiday == 0, "Not Holiday", "National Holiday")
      weather | Weather elaborated | If(:weather == 1, "Clear, few clouds, partly cloudy", If(:weather == 2, "Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist", If(:weather == 3, "Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds", "Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog")))
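The derivations in Table 2 are JMP formulas. Purely as an illustration, the same logic could be sketched in Python along the following lines. The function name `derive_attributes`, the dictionary keys and the timestamp format string are assumptions made for this sketch and are not part of the project; the sample record reuses the sample values from Table 1.

```python
from datetime import datetime

# Lookup tables mirroring the nested If() formulas in Table 2.
WEEKDAYS = ["Sunday", "Monday", "Tuesday", "Wednesday",
            "Thursday", "Friday", "Saturday"]
SEASONS = {1: "Spring", 2: "Summer", 3: "Fall", 4: "Winter"}
WEATHER = {
    1: "Clear, few clouds, partly cloudy",
    2: "Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist",
    3: "Light Snow, Light Rain + Thunderstorm + Scattered clouds, "
       "Light Rain + Scattered clouds",
    4: "Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog",
}


def derive_attributes(row):
    """Return the seven derived attributes for one raw record (a dict)."""
    # Assumed timestamp layout, taken from the Table 1 sample value.
    ts = datetime.strptime(row["datetime"], "%m/%d/%Y %I:%M:%S %p")
    # Table 2 numbers Sunday as 1; isoweekday() numbers Monday as 1
    # and Sunday as 7, so shift the numbering by one day.
    day_number = ts.isoweekday() % 7 + 1
    return {
        "Date": ts.date().isoformat(),
        "Time": ts.hour,
        "Day number of Week": day_number,
        "Day of the week": WEEKDAYS[day_number - 1],
        "Season elaborated": SEASONS[row["season"]],
        "National Holiday": ("Not Holiday" if row["holiday"] == 0
                             else "National Holiday"),
        "Weather elaborated": WEATHER[row["weather"]],
    }


sample = {"datetime": "1/20/2011 12:00:00 AM",
          "season": 1, "holiday": 0, "weather": 1}
derived = derive_attributes(sample)
```

Dictionary lookups are used here in place of nested conditionals only because they are easier to read; the mapping from codes to labels is the same as in Table 2.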
  • 7. 7. Data Preparation (Explore-Modify Phase): Missing value check
      • The project team also analyzed the data set to check whether it contains any missing values.
      • Based on the missing value exploration in JMP, we did not encounter any missing values.
      • Fig 1 below shows our missing value analysis in JMP.
      Fig 1:
      8. Explore Phase: Distribution and outlier analysis and key observations
      • The data set used in the project is a mixture of continuous and nominal variables (documented below by variable type).
      • Before starting our modelling, the team analyzed some of these variables more deeply and arrived at the observations below, which helped us understand the data and its relationships in detail. These are preliminary observations based on individual analysis of the data. Not all of them directly affected the final prediction when we ran the modelling algorithms, but they are key to understanding how the individual data items can collectively influence demand. This exploration let us analyze and predict informally without modelling, and it strengthened the analytical ability of each project team member.
      • List of nominal/ordinal variables available in the data set:
          o Datetime
          o Season
          o Holiday
          o Working day
          o Weather
          o Date
          o Time
          o Day number of the week
          o Season elaborated
          o National Holiday
          o Weather elaborated
      • A few nominal variables are derived from another nominal variable, as shown in Table 2 above.
      • Below are a few observations on the nominal variables:
  • 8. Fig 2: Fig 2a:
      • For example, the tabulation above (Fig 2) shows a propensity towards higher bike demand on Saturdays.
      • The graph (Fig 2a) also shows that bike demand is higher from late afternoon to early evening.
      • Similarly, the tabulation below (Fig 3) shows that people are most interested in renting bikes in fall, and that demand is lowest in spring.
      Fig 3:
  • 9. Fig 3a: Fig 3b: Fig 3c:
      • Fig 4 below also shows that people tend to rent bikes more in weeks where there are no holidays.
      • From the graphs (Fig 3a to 3c) we can also observe the following patterns in bike rental demand:
          o Fall is the peak season for demand.
          o Renters prefer higher temperatures together with low or moderate humidity. Days with high humidity or extremely low temperature see very weak demand.
          o Each variable affects the target individually, but it also contributes, to a greater or lesser degree, to the collective influence of all variables on the target. For example, the individual results show that moderate temperature with moderate humidity leads to high demand, which explains why fall is also the season of high demand: it has comfortable temperatures (neither too high nor too low) and moderate humidity.
  • 10. Fig 4:
      • Continuous variables:
      • We also explored the three continuous variables: Temp, humidity and wind speed.
      • The distributions of these variables are shown below.
      • As the distributions below show, the 'Temp' variable has no outliers, whereas 'humidity' and 'wind speed' have a few.
  • 11. Fig 5
      • The 'Johnson Sl' transformation of the variables Humidity and Wind speed (see Fig 6) shows the outliers in more detail.
  • 12. Fig 6:
      9. Explore Phase: Robust outlier analysis and decision to delete
      • Since the analysis above detected some outliers, the team used robust outlier analysis to assess the volume of outliers in the entire data set.
      • As Figs 7-9 show, we examined the Mahalanobis distance, which takes the correlation structure into account. Many points (rows) lie above the distance line (UCL = 3.75); these points are considered outliers.
      • The Mahalanobis distance was saved in the data set for each row, and rows with a distance greater than 3.75 were marked in order to count the outlier rows.
      • We found 669 outlier rows among 10886 rows, which is around 6% of the data. As the outlier percentage is low, we decided as a team to delete these rows.
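The flag-and-delete step described above can be illustrated with a short Python sketch. It is not the JMP procedure used in the project: it assumes NumPy is available, uses the ordinary sample mean and covariance rather than JMP's robust estimates, and runs on synthetic data with one planted outlier. The function names are hypothetical; only the threshold of 3.75 comes from the report.

```python
import numpy as np


def mahalanobis_distances(X):
    """Mahalanobis distance of each row of X from the column means."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    # d_i = sqrt((x_i - mean)^T S^-1 (x_i - mean)), one value per row
    return np.sqrt(np.einsum("ij,jk,ik->i", centered, cov_inv, centered))


def flag_outliers(X, ucl=3.75):
    """Boolean mask that is True where the distance exceeds the limit."""
    return mahalanobis_distances(X) > ucl


# Synthetic stand-in for three continuous columns (not the project data).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
X[0] = [12.0, -12.0, 12.0]  # planted outlier

mask = flag_outliers(X)
clean = X[~mask]            # rows kept after deleting flagged outliers
outlier_share = mask.mean() # fraction of rows flagged
```

Because the classical covariance is itself inflated by extreme rows, a robust estimate such as the one JMP applies would generally flag outliers more reliably than this simplified version.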
  • 13. Fig 7: Fig 8:
  • 14. Fig 9:
      10. Explore Phase: K-means Clustering
      • The project team also ran the clustering methods learned in class on the data set (see this section and section 11, which follows).
      • This helped us understand the distribution of the data, but we did not need to take any further data preparation or modification action based on the clustering analysis.
      Fig 10
  • 15. 11. Explore Phase: Hierarchical Clustering
      Fig 11
      12. Modify Phase: Data set Split
      • After the data exploration, modification and preparation, the team moved towards modelling. Before modelling, we split the entire data set into three partitions:
          o Training data set
          o Validation data set
          o Testing data set
      • Although this is a forecasting problem and not a classification problem, we still used a stratified partition, stratifying on the target variable so that the partitions are balanced with respect to the target. This was not mandatory.
      • All subsequent modelling was built on these partitions, so that we could compare the modelling effects and efficiency on each partition.
      • Fig 12 shows the data set after the partition.
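One way to stratify on a continuous target such as Count is to rank the rows by target value, cut them into equal-sized strata, and split each stratum separately. The sketch below illustrates that idea only. The 60/20/20 proportions, the four strata, the function name and the synthetic counts are all assumptions, since the report does not state how JMP was configured.

```python
import random


def stratified_split(targets, fractions=(0.6, 0.2, 0.2), n_bins=4, seed=42):
    """Label each row 'Training', 'Validation' or 'Test'.

    Rows are ranked by target value and cut into n_bins strata of equal
    size; each stratum is shuffled and divided according to fractions.
    """
    rng = random.Random(seed)
    order = sorted(range(len(targets)), key=lambda i: targets[i])
    labels = [None] * len(targets)
    bin_size = -(-len(order) // n_bins)  # ceiling division
    for start in range(0, len(order), bin_size):
        stratum = order[start:start + bin_size]
        rng.shuffle(stratum)
        n_train = round(fractions[0] * len(stratum))
        n_valid = round(fractions[1] * len(stratum))
        for pos, idx in enumerate(stratum):
            if pos < n_train:
                labels[idx] = "Training"
            elif pos < n_train + n_valid:
                labels[idx] = "Validation"
            else:
                labels[idx] = "Test"
    return labels


# Synthetic hourly counts standing in for the Count column.
demo_rng = random.Random(1)
counts = [demo_rng.randint(1, 900) for _ in range(1000)]
labels = stratified_split(counts)
```

Splitting within each stratum keeps low-demand and high-demand hours represented in similar proportions across the three partitions, which is the usual motivation for stratifying on the target.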
  • 16. Fig 12
      13. Modeling phase: Multiple regression model
      • Response variable:
          o Bike rent count
      • Predictor variables:
          o Time (hour of the day)
          o Day of the week
          o Season elaborated
          o National Holiday
          o Atemp
          o Humidity
          o Windspeed
      • Prediction model outcomes:
  • 17. Fig 13 Fig 14
      • In the initial model, National Holiday and Wind speed appeared to contribute little to the prediction, as the p-values for these variables are very high.
      • These two variables were therefore removed from the model.
      • After removing them, we re-ran the regression model and obtained the following outcome.
  • 18. Fig 15
      • The RSquare value for the current model is 0.378.
      • The prediction profiler is shown below.
      Fig 16:
      Importance of the variables as per the prediction profiler analysis:
      • From the prediction profiler's view of the influence of each predictor variable, we observed the following patterns in this model:
          o Bike rental demand increases as the day progresses.
          o Demand increases from noon to evening and beyond.
          o Saturday is the day of the week with very high demand, whereas on the other days demand does not vary much.
          o Bike renting peaks from fall to early winter.
          o Temperature and humidity are significant predictors of bike rental demand. Medium to high temperature and moderate humidity are key to higher demand.
      • The prediction formula is saved into the data set. The prediction formula for this model is shown below.
  • 19. Fig 17:
      • The error for this model is calculated as below:
  • 20. Fig 18:
      14. Modeling phase: (Single) Decision tree model
      • Response variable:
          o Bike rent count
      • Predictor variables:
          o Time (hour of the day)
          o Day of the week
          o Season elaborated
          o National Holiday
          o Atemp
          o Humidity
          o Windspeed
      • Predictive model outcomes:
      Fig 19:
  • 21. • The RSquare value for the data set is given below.
      • The RSquare value is higher for this model than for the previous model.
      Fig 20:
  • 22. • Column contributions in this model are given below:
      Fig 21:
      • The model prediction is saved in the data set.
  • 23. Fig 22:
      • The error is calculated for this model as well.
  • 24. Fig 23:
      15. Modeling phase: Boosted tree model
      • Response variable: Bike rent count
      • Predictor variables:
          o Time (hour of the day)
          o Day of the week
          o Season elaborated
          o National Holiday
          o Atemp
          o Humidity
          o Windspeed
      • Prediction model outcomes:
  • 25. Fig 24:
      • The prediction formula is saved in the data set.
      Fig 25:
  • 26. • The error is calculated for this model:
      Fig 26:
      16. Modeling phase: Bootstrap forest model
      • Response variable: Bike rent count
      • Predictor variables:
          o Time (hour of the day)
          o Day of the week
          o Season elaborated
          o National Holiday
          o Atemp
          o Humidity
          o Windspeed
      • Prediction model outcomes:
  • 27. Fig 27:
      • The prediction formula is saved in the data set:
      Fig 28:
  • 28. • The error is calculated for this model:
      Fig 29:
      17. Modeling phase: Neural network model
      • Response variable: Bike rent count
      • Predictor variables:
          o Time (hour of the day)
          o Day of the week
          o Season elaborated
          o National Holiday
          o Atemp
          o Humidity
          o Windspeed
      • Prediction model outcomes:
      Fig 30:
  • 29. Fig 31:
      • The prediction formula is saved in the data set:
      Fig 31:
  • 30. • The error is calculated for this model:
      Fig 32:
  • 31. 18. Assess Phase: Model comparison
      • After running several models on this data set and obtaining a different prediction of the bike rent count from each, we now compare the results to identify the best prediction model for this data set.
      • The following are the steps we performed as a team, using JMP, to compare the models across each partition of the data set (training, validation and testing).
      Model comparison outcome for training data:
      Fig 33:
      Model comparison outcome for validation data:
      Fig 34:
      Model comparison outcome for testing data:
      Fig 35:
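The comparison above was produced in JMP. As a rough illustration of what such a comparison computes, the sketch below scores two sets of predictions against the actual counts and ranks them. The metric labels (RSquare, RASE for root average squared error, AAE for average absolute error) are assumed here to match those in a typical model comparison report; the function names, model names and numbers are invented for the example and are not project results.

```python
import math


def comparison_metrics(actual, predicted):
    """Return fit statistics for one model on one partition."""
    n = len(actual)
    mean_actual = sum(actual) / n
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean_actual) ** 2 for a in actual)
    return {
        "RSquare": 1 - sse / sst,
        "RASE": math.sqrt(sse / n),
        "AAE": sum(abs(a - p) for a, p in zip(actual, predicted)) / n,
    }


def rank_models(actual, predictions_by_model):
    """Sort models from lowest to highest RASE."""
    scored = {name: comparison_metrics(actual, preds)
              for name, preds in predictions_by_model.items()}
    return sorted(scored.items(), key=lambda item: item[1]["RASE"])


# Toy values for illustration only.
actual = [16, 40, 32, 13, 1]
predictions = {
    "Model A": [15, 38, 30, 14, 3],
    "Model B": [20, 30, 25, 20, 10],
}
ranking = rank_models(actual, predictions)
best_name, best_metrics = ranking[0]
```

Running the same function separately on the training, validation and testing rows would give one ranking per partition, which makes it easier to spot a model that fits the training data well but degrades on unseen data.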
  • 32. Prediction Metrics – Numeric distribution of prediction error for each model
      Fig 36:
      Conclusion:
      Based on the model comparison and the analysis of the prediction error distribution for each model run on this data set, we reached the following conclusions.
      • The comparison statistics show that the decision tree model gives the most efficient and effective prediction of rental demand.
      • Next in the ranking is the boosted tree model.
      • The error distribution also shows that the decision tree model has the smallest error % (Error Mean = 0.33), the boosted tree model has a slightly higher error % (Error Mean = 0.44), and the multiple regression model has the highest error % (Error Mean = 0.82). We therefore consider the regression model the least effective.
      • However, an important lesson from the exploration phase is that individual analysis of the data also helps in understanding the prediction outcome. Although the regression model's prediction error was high, its prediction profiler showed the same influential predictor variables that we had observed in the individual analysis. So even though the regression model did not give the most accurate result, it corroborated that our exploration and analysis were heading in the right direction in understanding the influence of each variable. This was ultimately confirmed by the column contributions in our decision tree model, which is the best model in our evaluation.
      Business Solution:
  • 33. In walking through SEMMA, we find that the data helps us draw conclusions that address business problems. From the data, we find that casual customers and registered customers have different bike rental habits. This is valuable information that can help grow the customer base of both populations. Rental trends show that we can manage our inventory according to the seasons, offering more inventory during the peak months to accommodate more users.
      Casual customers include tourists and infrequent bike renters. For tourists to Washington DC, bike rentals are a cost-effective way of getting around the city for exploring and sightseeing. As a company, we can offer recommendations and coupons for other attractions that they can reach by bike. This type of incentive does not cut into profit the way a bike rental coupon would, because it does not reduce the price of a rental. To attract new customers, a first-time renter's discount can be offered, which lets the user try the bike rental at low risk.
      Our registered customers are the most valuable. To retain them, accessory options can be offered. On registering, a customer becomes a member of the loyalty program, with exclusive access to amenities such as cooling centers or coupons for related products.