Exploring Australian economy and diversity

Task A: Investigating Job Vacancy and Unemployment Rate Data
A1. Investigating the Population Data
Have a look at the resident population data. You will see many columns. We are
interested only in the total values for each state (marked "Persons"), so you can drop
the other columns and rename the columns for each state if you wish.
(HINT: The file isn't very big so you can make the changes in Excel if you want.)
1. In Python (or R) plot the population of Victoria, New South Wales and Queensland over time.
(HINT: You don't need to put the dates on the x-axis, just showing the index of each quarter is
fine)
a) Are the population values increasing or decreasing over time?
b) Does the population data exhibit a trend and if so, what type?
Answer: The plot traces the resident population of the three states, namely Victoria, New South
Wales and Queensland, over time.
The plot shows that the population of all three states increases steadily over the period.
Queensland has the smallest population of the three, while New South Wales has the largest. The
trend is approximately linear with a positive slope.

2. Fit a linear regression using Python (or R) to the Victorian population data and plot the linear fit.
(HINT: In Python, you can use "range(1, n + 1)" to generate the sequence of integer
values 1, 2, ..., n)
a) Does the linear fit look good?
b) Use the linear fit to predict the resident population in Victoria for the dates: 1/9/15,
1/12/15, 1/12/16, and 1/12/17.
Answer: The Victorian population values are first drawn as a scatter plot, and a linear regression
is then fitted to the data to obtain the line of best fit. The linear fit looks good. The fitted line
is shown in the accompanying graph.
The predicted populations for the requested dates are listed in the Date/Population table.
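The fit-and-extrapolate step described in this answer can be sketched as follows. This is a minimal illustration using only the standard library, not the code behind the reported figures; the helper names `fit_line` and `predict` and the sample values are hypothetical.

```python
def fit_line(values):
    """Least-squares line through (1, values[0]), (2, values[1]), ..."""
    n = len(values)
    xs = range(1, n + 1)  # quarter index 1..n
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, index):
    """Value of the fitted line at a (possibly future) quarter index."""
    return intercept + slope * index


# Hypothetical quarterly population figures, for illustration only.
population = [5_000_000, 5_024_000, 5_048_500, 5_072_000, 5_096_500]
slope, intercept = fit_line(population)

# Quarters beyond the data are predicted by extending the index:
# with 5 observations, index 6 is the next quarter.
next_quarter = predict(slope, intercept, len(population) + 1)
five_ahead = predict(slope, intercept, len(population) + 5)
```

Because the index advances by one per quarter, predictions one year apart differ by exactly four times the slope, and predictions one quarter apart differ by the slope itself. That gives a quick sanity check on any table of forecasts.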
A2. Investigating the Job Vacancies Data
Now have a look at the job vacancies data.
1. Use Python (or R) to plot the job vacancy counts for Victoria over time. (HINT: Pandas contains
a "transpose ()" method and Excel can also be used to transpose data.)
a) What are the maximum and minimum values for job vacancies in Victoria over the time
period?
Predicted resident population of Victoria (answer to A1, question 2b):
Date       Population
1/9/15     5739516.54838
1/12/15    5979953.5504
1/12/16    6076128.35121
1/12/17    6172303.15202

Answer: The vacancy count for Victoria is plotted over time. The graph is as follows:
The maximum and minimum job vacancy values are 71971 and 32322 respectively.
2. Fit a linear regression to the data and plot it.
a) Does it look like a good fit to you? Would you believe the predictions of the linear model
going forward?
b) Instead of fitting the linear regression to all of the data, try fitting it to just the most recent
data points (say from the 85th data point onwards). How is the fit? Which model would
give better predictions of future vacancies do you think?
Answer: First, a linear regression is fitted to the full Victorian job vacancy series. Then a second
linear regression is fitted to the data from the 85th point onwards. The graphs below are obtained.

The line fitted to all of the data is not a good fit. The data follows a curve that is closer to a
polynomial than to a straight line, so a linear fit cannot provide reliable estimates. Hence, the
linear model based on all the data is not suitable for prediction.
Restricting the data to the 85th row onwards gives an approximately linear arrangement of points,
for which a linear fit is appropriate. The plotted line lies close to all of these points. Hence, to
estimate a value WITHIN the interval from the 85th row to the 130th row, the second model is the
better choice.
However, neither model is well suited to predicting FUTURE data. The history shows that the series
is linear only over certain intervals, with both positive and negative slopes. The interval from the
131st row onwards might well show a linear trend with a downward slope, in which case the second
model would also predict incorrectly. A polynomial regression is therefore likely to do better here
than either linear model.
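The comparison between a fit on all of the data and a fit on the most recent points can be sketched as below. This is an illustrative standard-library sketch under assumptions: the helper names `fit_line`, `r_squared` and `fit_from` are hypothetical, the series is made up, and R² is used here as one possible measure of fit quality (the answer above judges the fit visually).

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys, slope, intercept):
    """Share of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


def fit_from(series, start):
    """Fit a line to the points from 1-based position `start` onwards."""
    xs = list(range(start, len(series) + 1))
    ys = series[start - 1:]
    slope, intercept = fit_line(xs, ys)
    return slope, intercept, r_squared(xs, ys, slope, intercept)


# Hypothetical series: falls for six quarters, then rises steadily.
vacancies = [60, 55, 50, 45, 40, 35, 30, 36, 42, 48, 54, 60]
full_fit = fit_from(vacancies, 1)    # all points
recent_fit = fit_from(vacancies, 7)  # most recent points only
```

In this made-up series the recent window is perfectly linear, so its R² is close to 1 while the full-series R² is close to 0. A high R² inside the window still says nothing about what happens after the last observation, which is the caution raised in the answer.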
A3. Investigating the Unemployment Data
Now have a look at the unemployment data.
1. Use Python (or R) to plot the Unemployment Rate for Victoria over time.
a) It looks like the rate has been very high at times in the past. What was the maximum
unemployment rate in Victoria recorded in the dataset and when did that occur?
Answer: The maximum unemployment rate was 12.5533377, recorded in August 1993.
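Finding both the extreme value and the date on which it occurred can be sketched as follows. The records are hypothetical (date, rate) pairs chosen for illustration, not values read from the dataset.

```python
# Hypothetical (date, rate) pairs; ISO date strings sort chronologically.
unemployment = [
    ("1993-06-01", 12.1),
    ("1993-08-01", 12.55),
    ("1993-10-01", 12.2),
    ("2008-03-01", 4.3),
]

# max() with a key returns the whole record, so the date comes with the rate.
peak_date, peak_rate = max(unemployment, key=lambda record: record[1])
low_date, low_rate = min(unemployment, key=lambda record: record[1])
```

The same pattern applies to the job vacancy series when the maximum and minimum are needed together with their dates.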
A4. Visualising the Relationship between Unemployment and Job Vacancies
Now let's look at the relationship between unemployment levels and job vacancies.
1. Use Python (or R) to combine the data from the different files into a single table. The table
should contain population values, job vacancy counts and unemployment rates for the different
dates and different States/Territories.
a) What is the first date and last date for the combined data?
Answer: The first and last dates of the combined data are listed in the Min Date/Max Date table.
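Combining the three sources and reading off the date range can be sketched as an inner join on (date, state). This is a minimal sketch under assumptions: the dictionaries, state codes and values are hypothetical, and an inner join is assumed because the report does not say how the files were merged.

```python
from datetime import date

# Hypothetical inputs, each keyed by (date, state).
population = {
    (date(2009, 12, 1), "VIC"): 5_400_000,
    (date(2010, 3, 1), "VIC"): 5_420_000,
    (date(2010, 6, 1), "VIC"): 5_440_000,
}
vacancies = {
    (date(2010, 3, 1), "VIC"): 40.1,
    (date(2010, 6, 1), "VIC"): 41.3,
    (date(2010, 9, 1), "VIC"): 42.0,
}
unemployment = {
    (date(2010, 3, 1), "VIC"): 5.6,
    (date(2010, 6, 1), "VIC"): 5.4,
}

# Inner join: keep only the keys present in all three tables.
shared_keys = set(population) & set(vacancies) & set(unemployment)
combined = [
    {
        "date": d,
        "state": state,
        "population": population[(d, state)],
        "vacancies": vacancies[(d, state)],
        "unemployment_rate": unemployment[(d, state)],
    }
    for d, state in sorted(shared_keys)
]

first_date = min(row["date"] for row in combined)
last_date = max(row["date"] for row in combined)
```

An inner join shrinks the date range to the overlap of the inputs, so an unexpectedly narrow first/last date range usually points to the shortest input file or to keys that do not match exactly (for example, differently formatted dates). In pandas, `DataFrame.merge` with `how="inner"` performs the same kind of join.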
2. Now that you have the data aggregated, we can see whether there is a relationship between
unemployment and the number of job vacancies. Plot the values against each other.
a) Can you see a relationship there?
Answer: The merged data is used to plot unemployment against vacancies for all the states.
A scatter plot is used instead of a line plot because it is more legible in this case. The graph is
as below:
First and last dates of the combined data (answer to question 1a):
Argument   Value
Min Date   2015/03/01
Max Date   2015/06/01

The above picture shows that vacancies are quite high when the unemployment rate is between
4 and 6. However, the graph does not yield any further meaningful insight. This is probably because
the plotted data mixes the vacancy counts and unemployment rates of all the states for all quarters
in an unstructured way, without anything that relates them.
A more meaningful relation between the unemployment rate and vacancies can be obtained by
aggregating the values across all states for each quarter. Plotting the aggregated data produces
the following graph:
This graph clearly shows that vacancies and unemployment are inversely related: as vacancies
increase, unemployment decreases. This agrees with what one would expect in practice.
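The per-quarter aggregation can be sketched as below, together with a Pearson correlation as one way to put a number on the inverse relation. The rows are hypothetical, and the aggregation rule is an assumption: vacancies are summed across states and the unemployment rate is averaged, since the report does not state how the rates were combined.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical rows: (quarter, state, vacancies, unemployment rate).
rows = [
    ("2010-Q1", "VIC", 40.0, 5.6), ("2010-Q1", "NSW", 55.0, 5.8),
    ("2010-Q2", "VIC", 43.0, 5.3), ("2010-Q2", "NSW", 58.0, 5.5),
    ("2010-Q3", "VIC", 47.0, 5.0), ("2010-Q3", "NSW", 62.0, 5.1),
]

# Assumption: vacancies are summed across states and the rate is averaged.
groups = defaultdict(lambda: {"vacancies": 0.0, "rates": []})
for quarter, _state, count, rate in rows:
    groups[quarter]["vacancies"] += count
    groups[quarter]["rates"].append(rate)

quarters = sorted(groups)
total_vacancies = [groups[q]["vacancies"] for q in quarters]
mean_rate = [sum(groups[q]["rates"]) / len(groups[q]["rates"]) for q in quarters]


def pearson(xs, ys):
    """Pearson correlation; negative values indicate an inverse relation."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


correlation = pearson(total_vacancies, mean_rate)
```

Note that an unweighted mean of state rates is not the same as a national unemployment rate, which would weight each state by the size of its labour force.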

3. Try selecting and plotting only the data from Victoria.
a) Can you see a relationship now? If so, what relationship is there?
Answer: Unlike the previous graph, which covered all states, here unemployment and vacancies are
plotted for the state of Victoria only. The graph below is obtained.
The graph agrees with the previous finding from the grouped data: vacancies in Victoria decrease
gradually as the unemployment rate grows. Notably, vacancies in Victoria stay quite high and
seemingly unaffected until the unemployment rate reaches about 5.
4. The different populations across the states will influence the number of job vacancies in each.
Remove this effect by introducing a new column called 'Vacancy Rate' which contains the
vacancy count divided by the population size, multiplied by 100.
a) Is there a relationship between the unemployment rate and the job vacancy rate across all
the data?
Answer: The column is added to the source data, and the vacancy rate is plotted against the
unemployment rate for both the grouped and the ungrouped data.

Both plots suggest that the vacancy rate is inversely related to the unemployment rate. Using the
vacancy rate removes the effect of population size and brings out a more clearly linear decreasing
trend.
Notably, in all of the above cases vacancies do not appear to be affected by the unemployment
rate until it reaches a threshold of around 4.5.
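The 'Vacancy Rate' column defined in question 4 is the vacancy count divided by the population, multiplied by 100. A minimal sketch follows; the function name `vacancy_rate` and the figures are hypothetical.

```python
def vacancy_rate(vacancy_count, population):
    """Vacancies per 100 residents: count / population * 100."""
    if population <= 0:
        raise ValueError("population must be positive")
    return vacancy_count / population * 100


# Hypothetical figures. Both inputs must be in the same units: if the vacancy
# file reports thousands of vacancies, scale it to a head count first.
rows = [
    {"state": "VIC", "vacancies": 40_000, "population": 5_900_000},
    {"state": "NT", "vacancies": 3_000, "population": 240_000},
]
for row in rows:
    row["vacancy_rate"] = vacancy_rate(row["vacancies"], row["population"])
```

Normalising by population lets a small jurisdiction with few vacancies in absolute terms be compared directly with a large one.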

A5. Visualising the Relationship over Time
Now let's look at the relationship between unemployment levels and job vacancies
over time.
1. Use Python (or R) to build a Motion Chart comparing the job vacancy rate, the unemployment
rate, and the population of each state over time. The motion chart should show the job vacancy
rate on the x-axis, the unemployment rate on the y-axis, and the bubble size should depend on
the population. (HINT: A Jupyter notebook containing a tutorial on building motion charts in
Python is available here.)
Answer: The motion chart is in the video below:
2. Run the visualisation from start to finish. (Hint: In Python, to speed up the animation, set timer
bar next to the play/pause button to the minimum value.) And then answer the following
questions:
a) Which state generally has the lowest job vacancy rate?
b) Is the economy generally getting better or worse? I.e. was the Australian economy better in
2006/7 or 2014/5? Explain your answer.
c) Compared to the states, does the Northern Territory generally have higher or lower
unemployment and higher or lower job vacancy rates? What might cause this? Would it
make sense economically to move to NT?
d) According to the graph what happened at the end of 2008 and start of 2009? What might
have caused this?
e) Any other interesting things you notice in the data?

Answer:
a) Tasmania generally has the lowest job vacancy rate.
b) A high unemployment rate does not necessarily mean a bad economy, and a low
unemployment rate does not necessarily signify a strong one. The Australian economy is
stable rather than volatile. According to the motion chart, Australia began 2006 with an
average unemployment rate below 5%, which then fell further to around 4% by the end of
2008. From 2009 onwards the unemployment rate rose gradually to between 5.5% and 7.0%,
a trend that continued until 2015. As per the OECD, an unemployment rate between 5.5% and
8.3% is good for an economy to thrive and sustain itself. On that basis, the data suggests
that the Australian economy was doing better in 2015 than earlier and was getting stronger.
References:
http://www.adamhoward.com.au/blog/2015/3/31/unemployment-when-is-it-good-and-when-is-it-bad
https://www.focus-economics.com/country-indicator/australia/gdp
c) The Northern Territory has a lower unemployment rate and a higher vacancy rate than the
states. This may be due to the size of its population. Because it has one of the smallest
populations, most individuals are employed within the available opportunities, leading to a
lower unemployment rate. At the same time, its population may not be able to meet the
demand for labour, creating more vacancies than elsewhere.
The population of the Northern Territory has not increased much, and its unemployment
rate has remained more or less the same while vacancies have fallen over the period. This
implies that people from other states have already migrated to the Northern Territory and
filled the vacancies. Although its unemployment rate is no higher than that of the states,
its vacancies have already been reduced, so it would not make much economic sense to move
there now.
d) At the end of 2008 and the start of 2009 there was a spike in the unemployment rate. This
was probably caused by the major financial crisis that hit the world economy during this
period. The spike in the unemployment rate and the fall in the vacancy rate are indicative
of the Great Recession.
e) New South Wales, Victoria and Queensland form the major part of the Australian economy.

Task B: Exploratory Analysis on Big Data
B1. Summarising the Data
Load the InsuranceRates.csv data in Python (or R) and answer the following questions:
1. How many rows and columns are there?
2. How many years does the data cover? (Hint: pandas provides functionality to see 'unique'
values.)
3. What are the possible values for 'Age'?
4. How many states are there?
5. How many insurance providers are there?
6. What are the average, maximum and minimum values for the monthly insurance premium cost
for an individual? Do those values seem reasonable to you?
7. How much more on average do plans for smokers cost?
Answer:
1) There are 12694445 rows and 7 columns
2) The data covers 3 years: 2014, 2015 and 2016
3) The possible values of ages are: '0-20', 'Family Option', '21', '22', '23', '24', '25', '26', '27',
'28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46',
'47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65 and
over'
4) There are 39 states
5) There are 910 insurance providers
6) The mean, maximum and minimum values are listed in the Key/Insurance Cost table.
The maximum and minimum values are not plausible, as they are too extreme at both ends.
They are probably junk records.
7) Plans for smokers cost 88.90566067009055 more on average.
Summary of the individual insurance cost (answer to question 6):
Key    Insurance Cost
Mean   4098.026458581588
Max    999999
Min    0.0
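The summary questions in B1 can be sketched with the standard library as below. This is a hedged illustration on a tiny in-memory stand-in for InsuranceRates.csv: 'IndividualRate' is named in the task, but the other column names are assumptions about the file's header, and the smoker surcharge shown is one possible definition (the report does not state how its figure was computed).

```python
import csv
import io
from statistics import mean

# Tiny stand-in for InsuranceRates.csv. 'IndividualRate' is named in the task;
# the other column names here are assumptions about the file's header.
sample = io.StringIO(
    "BusinessYear,StateCode,IssuerId,Age,IndividualRate,IndividualTobaccoRate\n"
    "2014,AK,101,21,300.0,360.0\n"
    "2014,AK,101,64,800.0,\n"
    "2015,TX,202,21,250.0,310.0\n"
    "2016,TX,303,Family Option,0.0,\n"
    "2016,MO,303,40,999999,\n"
)
rows = list(csv.DictReader(sample))

n_rows, n_cols = len(rows), len(rows[0])
years = sorted({r["BusinessYear"] for r in rows})
n_states = len({r["StateCode"] for r in rows})
n_issuers = len({r["IssuerId"] for r in rows})

rates = [float(r["IndividualRate"]) for r in rows]
summary = {"mean": mean(rates), "max": max(rates), "min": min(rates)}

# Smoker surcharge: average (tobacco rate - standard rate) over the rows
# that actually quote a tobacco rate.
surcharges = [
    float(r["IndividualTobaccoRate"]) - float(r["IndividualRate"])
    for r in rows
    if r["IndividualTobaccoRate"]
]
smoker_extra = mean(surcharges)
```

Even in this small sample, a single placeholder value such as 999999 dominates the mean, which is why the extreme maximum and minimum are treated as suspect in the answer.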

B2. Investigating Individual Insurance Costs
Now let's look more in detail at the individual insurance costs.
1. Show the distribution of ‘IndividualRate’ values using a histogram.
a) Does the distribution make sense to you? What might be going on?
Answer: The distribution of 'IndividualRate' is shown below using a histogram:
The histogram is hard to interpret because it covers all of the insurance rates, including the
extreme values. Almost all of the rates fall in the first bar, while a barely visible outlier appears
at the far end. The outlier cannot be a plausible value, as the rate is too high to be true. To gain
proper insight we must look more closely at the data in the first bar.
2. Remove rows with insurance premiums of 0 (or less) and over 2000. (Use this data from now
on). Generate a new histogram with a larger number of bins (say 200).
a) Does this data look more sensible?
b) Describe the data. How many groups can you see?
Answer: The distribution of 'IndividualRate' after filtering is shown below using a histogram:

The histogram makes more sense now, as the distribution of insurance rates is clearly visible once
the extreme values are excluded.
There are three groups of data in the histogram, which can be categorised as Low, Medium and
High insurance rates. A significantly large number of users pay Low insurance rates but have fewer
options to choose from. The Medium insurance rates offer the widest variety of rates to choose
from. There is a small spike in the High insurance rates, indicating that only a very small section
of people pay the higher rates.
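The filtering and re-binning step in question 2 can be sketched as follows. This is a standard-library illustration with hypothetical premiums; the helper names `filter_rates` and `histogram` are made up for this sketch, and equal-width bins spanning the minimum to the maximum are assumed.

```python
def filter_rates(rates, low=0.0, high=2000.0):
    """Keep premiums strictly above `low` and at most `high`."""
    return [r for r in rates if low < r <= high]


def histogram(values, bins):
    """Count values in `bins` equal-width intervals spanning min..max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid zero width when all values are equal
    counts = [0] * bins
    for v in values:
        # The maximum value belongs to the last bin rather than a new one.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts


# Hypothetical premiums, including the kinds of extreme values described above.
raw = [0.0, 120.0, 130.0, 310.0, 320.0, 330.0, 1500.0, 9999.0, 999999.0]
clean = filter_rates(raw)
counts = histogram(clean, bins=4)
```

With a plotting library the binning is normally left to the library, for example `matplotlib.pyplot.hist(clean, bins=200)`; the manual version here only makes the bin arithmetic explicit.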
B3. Variation in Costs across States
How do insurance costs vary across states?
1. Generate a graph containing boxplots summarising the distribution of values for each state.
a) Which state has the lowest median insurance rates and which one has the highest? (Hint:
you may need to rotate the state labels to be able to read the plot.)
b) Is there much variation in costs across states?
Answer: The insurance rates for the various states are shown in the graph below as box plots.

The state of ‘MO’ has the lowest median insurance rate, while ‘AK’ has the highest. There is not
much variation in the median insurance rate across states: most states have a similar median,
roughly between 250 and 350. However, the outliers show wide variation in the highest insurance
rate across states. For example, the highest insurance rate in ‘HI’ is around 1000, whereas in
‘NC’ it is around 1800.
2. Does the number of insurance issuers vary greatly across states?
a) Create a bar chart of the number of insurance companies in each state to see. (Hint: you will
need to aggregate the data by state to do this.)
Answer: The number of insurance companies in each state is plotted in the graph below:

The bar graph clearly shows that ‘TX’ has the highest number of issuers and ‘HI’ the lowest. When
the states are sorted in descending order, the number of issuers changes only gradually from one
state to the next.
3. Could competition explain the difference in insurance premiums across states?
a) Use a scatterplot to plot the number of insurance issuers against the median insurance cost
for each state.
b) Do you observe a relationship?
Answer: The scatter plot of median insurance rate against issuer count is shown below:

In most states the median insurance rate lies roughly between 250 and 350, and this is where
competition among insurance issuers is strongest. Most issuers offer insurance in this range, with
only minute differences from state to state, attracting customers according to their needs.
Median rates above 350 or below 250 are associated with the least competition among issuers.
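The per-state aggregation behind the box plot, bar chart and scatter plot in B3 can be sketched as below. The records are hypothetical, and counting distinct issuer ids per state is an assumption about how "number of insurance companies" was measured.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (state, issuer id, individual rate) records.
records = [
    ("AK", 1, 520.0), ("AK", 1, 480.0), ("AK", 2, 610.0),
    ("MO", 3, 210.0), ("MO", 4, 230.0), ("MO", 5, 250.0),
    ("TX", 6, 300.0), ("TX", 7, 310.0), ("TX", 8, 290.0), ("TX", 9, 320.0),
]

rates_by_state = defaultdict(list)
issuers_by_state = defaultdict(set)
for state, issuer, rate in records:
    rates_by_state[state].append(rate)
    issuers_by_state[state].add(issuer)  # a set counts each issuer once

# One (issuer count, median rate) point per state, ready for a scatter plot.
points = {
    state: (len(issuers_by_state[state]), median(rates_by_state[state]))
    for state in rates_by_state
}

lowest = min(points, key=lambda s: points[s][1])
highest = max(points, key=lambda s: points[s][1])
```

Counting rows instead of distinct issuers would overstate competition in states where a few issuers file many plans, which is why a set is used here.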
B4. Variation in Costs over Time and with Age
Generate boxplots (or other plots) of insurance costs versus year and age to answer
the following questions:
1. Are insurance policies becoming cheaper or more expensive over time?
a) Is the median insurance cost increasing or decreasing?
Answer: The insurance cost is plotted by year, yielding the box plot below:
The box plot shows that the median insurance cost is roughly the same across the years, while the
number of policies with a high insurance rate increases gradually. On closer analysis, however,
the median is also increasing by a small margin each year. The values are as follows:
Year   Median Rate
2014   299.31
2015   307.51
2016   317.37

Hence the data suggests that insurance policies are becoming more expensive over time.
2. How do insurance costs vary with the age of the person being insured? (Hint: filter out the
value 'Family Option' before plotting the data.)
a) Do older people pay more or less for insurance than younger people? How much more/less
do they pay?
Answer: The insurance cost is shown as a box plot for each age, and the graph below is obtained:
From the graph, it is evident that older people pay a higher insurance rate than younger people.
The youngest group [age: 0-20] pays an average insurance rate of 122.333209, while the oldest
group [age: 65 and over] pays an average insurance rate of 584.594017. Thus, on average, the
oldest group pays 462.26 more than the youngest.
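The age comparison can be sketched as a grouped mean with the 'Family Option' rows filtered out, as the hint suggests. The records below are hypothetical and only illustrate the calculation.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (age band, individual rate) records.
records = [
    ("0-20", 120.0), ("0-20", 130.0),
    ("40", 300.0),
    ("65 and over", 580.0), ("65 and over", 600.0),
    ("Family Option", 25.0),
]

rates_by_age = defaultdict(list)
for age, rate in records:
    if age == "Family Option":  # not an age band, so excluded as the hint suggests
        continue
    rates_by_age[age].append(rate)

mean_by_age = {age: mean(rates) for age, rates in rates_by_age.items()}
gap = mean_by_age["65 and over"] - mean_by_age["0-20"]
ratio = mean_by_age["65 and over"] / mean_by_age["0-20"]
```

Reporting the ratio alongside the difference can be informative: with the averages quoted in the answer, 584.594017 / 122.333209 is about 4.8, so the oldest group pays close to five times the youngest group's rate.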
Task C: Exploratory Analysis on Other Data
Find some publicly available data and repeat some of the analysis performed in Tasks
A and B above. Good sources of data are government websites, such as: data.gov.au,
data.gov, data.gov.in, data.gov.uk, ...
Data source: “All STATS19 data (accident, casualties and vehicle tables) for 2005 to
2014 in England” [Download the data here]
C. Summary and Analysis:
The number of accidents is plotted against each day of the week.

It can be seen that the highest number of accidents occurs at the start of the weekend, on Friday,
while the lowest number occurs on Sunday. This might be because many people travel home after
Friday night recreation or parties, leading to a higher number of accidents, while on Sunday most
people prefer to stay at home, reducing the number of accidents.
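Counting accidents per day of the week can be sketched as below. The dates are hypothetical, and the day/month/year text format is an assumption about how the accident table stores dates; the weekday is derived from the date rather than from any coded day-of-week column.

```python
from collections import Counter
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Hypothetical accident dates; the day/month/year format is an assumption.
accident_dates = ["03/01/2014", "03/01/2014", "04/01/2014",
                  "05/01/2014", "10/01/2014"]

# weekday() returns 0 for Monday through 6 for Sunday.
counts = Counter(
    DAY_NAMES[datetime.strptime(d, "%d/%m/%Y").weekday()]
    for d in accident_dates
)
busiest_day, busiest_count = counts.most_common(1)[0]
```

The resulting counts can be passed directly to a bar chart, one bar per day name.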
The total number of accidents has gradually decreased over the years; however, 2014 saw an
increase in the number of accidents.

The number of fatal injuries has been consistent over the years. However, the count of the least
severe injuries has gradually fallen over the years.
The graph below shows the top 20 UK cities by number of accidents:
Clearly Birmingham, Leeds and Manchester account for the largest numbers of accidents in the UK
and would therefore likely require more police than other districts.

The following visualisation shows the number of accident calls handled by each police force in
the UK.
The Metropolitan Police, West Midlands and Greater Manchester forces have handled the three
highest numbers of accident cases over the years. The high number for the Metropolitan Police
is due to its operations across all the suburbs around London, which account for a considerable
number of accidents every year. However, Birmingham may require more police to address its high
number of accidents (analysed later).
To look for root causes of the accidents, the light conditions are analysed for the top 20
accident-prone districts.
Accidents due to NO LIGHTING:

This box plot clearly shows a high number of accidents in the districts of Doncaster,
Edinburgh, Leeds and Sheffield due to NO LIGHTING. This insight can be used to install more lights
along the streets in those districts to reduce similar accidents.
Accidents due to LIGHTS UNLIT:
The above graph shows that the districts of Edinburgh, Bristol, Glasgow and Birmingham had more
accidents than others due to unlit lights. The most affected district is Edinburgh. These four
districts need repairs to their road lighting to prevent similar accidents.
Overall, the city of Edinburgh is the most affected by accidents related to darkness. The analysis
shows that district administrators should focus on street lighting in Edinburgh more than
anywhere else.

The above histogram shows the distribution of driver age across accidents. The spread shows
that drivers close to the ages of 30 and 47 have the most accidents. Teenagers form the third
largest group of drivers involved in accidents.
