Here we have a fairly small sample dataset containing the count of candies from 2008 to 2018 in the city of Hamilton. It is interesting to see whether we can identify any independent variable that may affect the count of candies. Remember, Halloween is celebrated on October 31st, so it may fall on a weekday or a weekend.
I chose "Weather" conditions to identify the impact of rain, cold, or severe wind. Wind seems to have more impact on people, because we can protect ourselves from rain and cold but not from the wind.
3. Problem Statement:
We have been given a Halloween-day dataset covering 10 years, from 2008 to 2018. The dataset contains the total count of
candy distributed each year between 6:00 PM and 8:15 PM.
We have to analyze the given dataset and provide meaningful visualizations that reveal trends and hidden
patterns.
Here is the snapshot of the given dataset.
Solution:
The given dataset is already cleansed, so I do not have to add or delete any of the given data.
However, to examine the total candy count for each year, we need to append a related
independent variable to compare against, and then analyze whether the dependent variable Candy (Count) has any
proportionate relationship with that independent variable.
Thus, I have added Weather condition as the independent variable. It takes values such as
Cloudy, Very Cloudy, Partly Cloudy, Fair, Light Rain, and Windy.
4. The source of the weather data is: https://www.wunderground.com/history/monthly/ca/toronto/CYTZ/date/2013-10
Here is a preview of the dataset after adding the independent variables to the original dataset for final analysis and
visualization.
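As a rough illustration of the append step described above, the sketch below attaches a weather label to each year's candy count. It uses only values quoted in this report (the 2013, 2014, and 2015 counts, and the weather labels for 2010 and 2014); the names `candy_counts`, `weather_by_year`, and `append_weather` are hypothetical and not part of the original workflow, which may have been done in a spreadsheet or charting tool.

```python
# Candy totals per year: only the three totals quoted in this report.
candy_counts = {2013: 391, 2014: 673, 2015: 869}

# Weather label per year: only the labels stated in this report.
# Years without a stated label are simply absent from the mapping.
weather_by_year = {2010: "Fair", 2014: "Light Rain"}


def append_weather(counts, weather):
    """Return one row per counted year, with the weather label or None if unknown."""
    rows = []
    for year in sorted(counts):
        rows.append(
            {"Year": year, "Count": counts[year], "Weather": weather.get(year)}
        )
    return rows


combined = append_weather(candy_counts, weather_by_year)
for row in combined:
    print(row)
```

Keeping the count table as the driving side means a year with a weather label but no count (here 2010) does not create an empty row, and a year with no known label shows `None` rather than a guessed value.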
6. Bar Chart: 01
The chart shows Count on the Y-axis and Year on the X-axis, to visualize the total count of candy for each year
together with the weather condition during the hours between 06:00 and 08:30 PM.
1. 2015 has the highest count (869), while 2013 has the lowest count (358 + 33 = 391).
2. 2014 is the year with light rain in the 10 years, yet it still has a high count (673 candies).
3. 2018 has frequent changes in weather between those hours, from Cloudy to Mostly Cloudy to Partly Cloudy.
4. Only 2010 has fair weather in the 10 years of history.
7. Tree Map: 02
This tree map shows the hierarchy of the weather conditions and their impact on the count through the size and color of each box,
as well as which weather condition dominates.
1. Most of the time the weather appears to have been either Mostly Cloudy or Cloudy.
8. Lollipop Chart: 03
The lollipop chart shows the highest and lowest counts for their respective years.
1. 2015 and 2010 have the darkest and longest bars, while 2013 and 2010 have the lowest counts and therefore shorter,
lighter-colored bars.
9. Heat Map: 04
The heat map is another interesting chart, showing the most active years and times.
1. Clearly, 7:00 PM to 7:30 PM is the most active period; by 8:00 PM the activity cools down.
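The kind of aggregation behind a heat map reading like this can be sketched as follows: sum the counts per time slot across years and pick the slot with the largest total. The records below are hypothetical placeholder values, not the report's data, and the names `records` and `totals_by_slot` are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical placeholder records (year, time slot, count); NOT the report's data.
records = [
    (2016, "6:30 PM", 40), (2016, "7:00 PM", 95),
    (2016, "7:30 PM", 80), (2016, "8:00 PM", 30),
    (2017, "6:30 PM", 55), (2017, "7:00 PM", 110),
    (2017, "7:30 PM", 70), (2017, "8:00 PM", 25),
]


def totals_by_slot(rows):
    """Sum the counts for each time slot across all years."""
    totals = defaultdict(int)
    for _year, slot, count in rows:
        totals[slot] += count
    return dict(totals)


slot_totals = totals_by_slot(records)

# The busiest slot is the one with the largest summed count.
busiest_slot = max(slot_totals, key=slot_totals.get)
print(busiest_slot, slot_totals[busiest_slot])
```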
10. Side-by-Side Bars: 05
The side-by-side bars give the best comparison for seeing how the count relates to wind speed and temperature.
1. We can see the wind speed and temperature alongside the corresponding count and hour.
2. We can observe a small impact of wind speed on the count in 2013 at 6 PM; once the wind
slowed down, candy collection increased again.
11. Scatter Plot & Regression Line: 06
The scatter plot and regression line primarily show that there is no correlation between wind and the count of
candies.
1. The R-squared is not strong; ideally it should be 0.80 (80%) or above to indicate a strong relationship.
2. The p-value is greater than 0.05, which suggests that at the 95% confidence level we fail to reject the null
hypothesis. This means there is no significant evidence that wind affects the count.
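To make the two statistics above concrete, here is a minimal sketch that computes R-squared for a one-predictor linear fit and estimates a p-value. It is not how the report's values were obtained: the p-value here comes from a permutation test, swapped in because it needs only the Python standard library, whereas regression output normally reports a t-test p-value. The input numbers are hypothetical placeholders, not the report's data, and the function names are assumptions.

```python
import random


def r_squared(x, y):
    """Squared Pearson correlation; equals R-squared for a one-predictor linear fit."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


def permutation_p_value(x, y, n_permutations=5000, seed=0):
    """Share of random shufflings of y whose R-squared is at least the observed one."""
    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    observed = r_squared(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if r_squared(x, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical placeholder values, NOT the report's data.
wind_speed = [10, 14, 9, 22, 17, 12, 25, 8, 15, 19]
candy_count = [60, 48, 55, 52, 70, 41, 58, 63, 45, 66]

r2 = r_squared(wind_speed, candy_count)
p = permutation_p_value(wind_speed, candy_count)
print(f"R-squared = {r2:.3f}, p-value = {p:.3f}")
print("reject H0" if p < 0.05 else "fail to reject H0")
```

The decision rule in the last line is the one applied in this section: a p-value at or above 0.05 means the null hypothesis of no relationship is not rejected.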
12. Summary:
1. The external weather data appears to show an impact of wind on the count of candy,
but the R-squared and p-value do not provide any evidence to support the alternative hypothesis. On
the other hand, weekends versus weekdays do not have any significant impact, as seen in
Visualization No. 05.
2. The most active hours were 7:00 PM to 7:30 PM.
3. The p-value (0.228346) is greater than 0.05, which means we fail to reject the null hypothesis. However, the
sample size in our case is fairly small for hypothesis testing.