1) The document is a final assignment analyzing Olympic sports stats data over 120 years to investigate how factors like age, sex, height, weight, and country affect athlete performance and medal chances.
2) Three main questions are posed: how age and sex impact medals won, which countries have best performance over time, and if height/weight correlate with sport/medals.
3) Initial findings show differences in medals won by sex and age, and heights and weights vary by sport as expected. Deeper analysis is then described.
2. CLIENT
■ For this module assignment I chose the Sports Stats database, which comprises 120
years of Olympic data.
■ This dataset was selected because it is an area of personal interest and because a
dataset this broad is likely to yield some interesting and curious observations.
■ This project investigates the statistics of athlete events and whether factors such as
age, height, weight or country are statistically decisive in whether or not an athlete
wins a medal. The investigation might be of general interest to sports fans and also to
trainers or sports brands.
3. QUESTIONS, HYPOTHESES AND APPROACH
1. Do the sex and age of the participants statistically affect their performance or their probability of
winning a medal?
- Age might be a factor that influences the probability of a participant winning a medal.
- I will mainly analyze the Age, Sex and Medal columns to see how these three columns
are related. To take the investigation further, I plan to include the Year and Event
columns.
2. Which countries show the best athlete performance throughout the years?
- Some countries, such as the USA or Germany, might have the best athletes.
- I will be looking primarily at the noc_regions table and the Year, Team, Medal, Season
and NOC columns from the athlete_events table.
3. Is there any relationship between the height and weight of athletes and the sport they practice?
- There must be a relationship between the height and weight of athletes and the sport
they practice.
- I will use the Height, Weight, Sport, Medal and Event columns to test the
hypothesis. I will average the height and weight of the athletes and choose some sports with
very different averages to analyze in detail the effect of the columns considered.
4. TECHNICAL CHALLENGES
■ To make the data practical to manipulate, a Jupyter Notebook was used, coding in Python 3 with
the pandas and sqlite3 libraries to import and analyze the data.
■ Initial observations of the dataset showed that some rows had null Height, Weight, Age or
Medal values. The data was cleaned by deleting rows with null Height, Weight or Age values;
however, I decided to keep the rows with null values in the ‘Medal’ column, since not all
participants win a medal and this could be an aspect to analyze later in the assignment.
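The import and cleaning step described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the in-memory database and the four sample rows are invented stand-ins for the real athlete_events table, and the Name column is assumed only to make the rows identifiable.

```python
import sqlite3

import pandas as pd

# Hypothetical sample rows standing in for the real athlete_events table.
sample = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Sex": ["M", "F", "M", "F"],
    "Age": [24.0, None, 31.0, 22.0],
    "Height": [180.0, 165.0, None, 170.0],
    "Weight": [75.0, 55.0, 80.0, 60.0],
    "Medal": ["Gold", None, "Silver", None],
})

# Load the sample into an in-memory SQLite database (assumed setup).
conn = sqlite3.connect(":memory:")
sample.to_sql("athlete_events", conn, index=False)

# Import the table into pandas through sqlite3.
athletes = pd.read_sql_query("SELECT * FROM athlete_events", conn)
conn.close()

# Drop rows with a null Age, Height or Weight, but keep null Medal values,
# since a null Medal simply means the athlete did not win one.
cleaned = athletes.dropna(subset=["Age", "Height", "Weight"])
print(cleaned)
```

Restricting `dropna` to a `subset` is what keeps the non-medalists in the data; calling it without arguments would remove every athlete who did not win a medal.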
6. INITIAL FINDINGS
■ A simple query was run to make an initial
observation of the differences in medals
won between the sexes.
■ Some initial observations were made on the
height and weight of the athletes, grouped
by sport.
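The two initial queries might look something like the sketch below. The SQL is an assumption about how such queries could be written, and the rows are invented purely for illustration; only the column names come from the assignment.

```python
import sqlite3

# Hypothetical rows: (Sex, Sport, Height, Weight, Medal).
rows = [
    ("M", "Basketball", 200.0, 100.0, "Gold"),
    ("M", "Gymnastics", 168.0, 64.0, None),
    ("F", "Gymnastics", 155.0, 48.0, "Silver"),
    ("F", "Basketball", 185.0, 76.0, None),
    ("M", "Basketball", 204.0, 104.0, "Bronze"),
    ("F", "Gymnastics", 157.0, 50.0, None),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE athlete_events "
    "(Sex TEXT, Sport TEXT, Height REAL, Weight REAL, Medal TEXT)"
)
conn.executemany("INSERT INTO athlete_events VALUES (?, ?, ?, ?, ?)", rows)

# COUNT(Medal) ignores NULLs, so it counts only rows where a medal was won.
medals_by_sex = conn.execute(
    "SELECT Sex, COUNT(Medal) FROM athlete_events GROUP BY Sex ORDER BY Sex"
).fetchall()

# Average height and weight of the athletes, grouped by sport.
body_by_sport = conn.execute(
    "SELECT Sport, AVG(Height), AVG(Weight) "
    "FROM athlete_events GROUP BY Sport ORDER BY Sport"
).fetchall()
conn.close()

print(medals_by_sex)
print(body_by_sport)
```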
7. DEEPER ANALYSIS
■ For the first question, I separated the data into
two DataFrames, one for male and one for female
athletes. For each one I made a plot showing the
number of medals won grouped by age. After that, I
evaluated how long each group has been
participating in events, how many events in
total they have participated in and the
total number of medals won.
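A rough sketch of this per-sex breakdown is shown below. The `summarize` helper is a hypothetical name introduced here, and the ages, years and medals are invented sample values rather than results from the dataset.

```python
import pandas as pd

# Hypothetical sample of athlete-event rows (values are invented).
events = pd.DataFrame({
    "Sex": ["M", "M", "M", "F", "F", "F"],
    "Age": [23, 27, 35, 21, 24, 24],
    "Year": [1912, 1924, 1960, 1952, 1960, 1964],
    "Medal": ["Gold", "Silver", None, "Gold", None, "Bronze"],
})


def summarize(frame):
    """Summarize one group: medals per age, first year, entries, medals."""
    medalists = frame.dropna(subset=["Medal"])
    return {
        # Series indexed by age; this is the data a per-age plot would show.
        "medals_by_age": medalists.groupby("Age").size(),
        "first_year": int(frame["Year"].min()),
        "entries": len(frame),
        "medals": len(medalists),
    }


# Split the data into one frame per sex and summarize each.
male = summarize(events[events["Sex"] == "M"])
female = summarize(events[events["Sex"] == "F"])

print(male["first_year"], male["entries"], male["medals"])
print(female["first_year"], female["entries"], female["medals"])
```

The `medals_by_age` series is the data a medals-per-age plot would be drawn from, for example as a bar chart.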
8. ■ It was discovered that for both sexes, most medals were won by
athletes between the ages of 20 and 30. It was also noticed that male athletes
have won more medals; however, the first year with female athletes was almost 50
years after the first event with male athletes. Men have therefore been
participating in events for longer than women, which translates into more medals
won. But when the number of events participated in is compared with the number of
medals won for each sex, women actually show a better overall performance
at winning medals.
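One way to compare events participated in against medals won is a medals-per-entry rate, sketched below. The rate is an assumed formulation of that comparison, and the counts are invented so that one group has more medals while the other has the higher rate.

```python
import pandas as pd

# Hypothetical entries: 10 male rows with 3 medals, 5 female rows with 2 medals.
events = pd.DataFrame({
    "Sex": ["M"] * 10 + ["F"] * 5,
    "Medal": ["Gold", "Silver", "Bronze"] + [None] * 7
             + ["Gold", "Silver"] + [None] * 3,
})

by_sex = events.groupby("Sex")
summary = pd.DataFrame({
    "entries": by_sex.size(),            # all rows, medal or not
    "medals": by_sex["Medal"].count(),   # count() skips null medals
})

# Medals won per event entry (assumed definition of "overall performance").
summary["medal_rate"] = summary["medals"] / summary["entries"]
print(summary)
```

In this invented sample the male group holds more medals in absolute terms, while the female group has the higher rate per entry, which is the kind of contrast described above.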
9. ■ For the second question, a simple query was run to find the countries
with the most medals.
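A query of this kind could be sketched as below. The join on NOC and the `region` column of noc_regions are assumptions about the schema, and all rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE athlete_events (NOC TEXT, Year INTEGER, Medal TEXT);
    -- The region column is an assumed name for the country label.
    CREATE TABLE noc_regions (NOC TEXT, region TEXT);
""")

# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO noc_regions VALUES (?, ?)",
    [("USA", "USA"), ("GER", "Germany"), ("MEX", "Mexico")],
)
conn.executemany(
    "INSERT INTO athlete_events VALUES (?, ?, ?)",
    [
        ("USA", 1996, "Gold"), ("USA", 2000, "Silver"), ("USA", 2004, "Bronze"),
        ("GER", 1996, "Gold"), ("GER", 2000, "Silver"), ("GER", 2004, None),
        ("MEX", 2000, "Bronze"), ("MEX", 2004, None),
    ],
)

# Count medal-winning rows per region, most medals first.
top_countries = conn.execute("""
    SELECT r.region, COUNT(e.Medal) AS medals
    FROM athlete_events AS e
    JOIN noc_regions AS r ON r.NOC = e.NOC
    WHERE e.Medal IS NOT NULL
    GROUP BY r.region
    ORDER BY medals DESC, r.region
""").fetchall()
conn.close()

print(top_countries)
```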
10. METRIC CREATED
■ For the third question, I developed a new metric to help determine whether there is any correlation between
the height and weight of the athletes and the probability that they win a medal.
■ For this, I averaged the height and weight of all athletes grouped by Sport, then did the
same for only the athletes who have won a medal. After this, two new columns were added that
compare the results and show whether there is any correlation between height and weight and the
chance of winning a medal. Finally, these two columns were averaged to find the general
correlation considering all sports.
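The metric can be sketched roughly as follows. The comparison is assumed here to be a simple difference (medalist average minus overall average); the assignment does not state the exact formula. The column names ending in `_all`, `_medal` and `_diff` are hypothetical, and the data is invented.

```python
import pandas as pd

# Hypothetical sample rows (values are invented).
events = pd.DataFrame({
    "Sport": ["Basketball"] * 3 + ["Gymnastics"] * 3,
    "Height": [198.0, 202.0, 206.0, 160.0, 156.0, 158.0],
    "Weight": [95.0, 100.0, 105.0, 54.0, 50.0, 52.0],
    "Medal": ["Gold", None, "Silver", None, "Bronze", None],
})

cols = ["Height", "Weight"]

# Average height and weight per sport for all athletes...
all_avg = events.groupby("Sport")[cols].mean()
# ...and for medal winners only.
medal_avg = events.dropna(subset=["Medal"]).groupby("Sport")[cols].mean()

# Put both sets of averages side by side, one row per sport.
metric = all_avg.join(medal_avg, lsuffix="_all", rsuffix="_medal")

# Two new columns comparing medalists with all athletes (assumed: difference).
metric["Height_diff"] = metric["Height_medal"] - metric["Height_all"]
metric["Weight_diff"] = metric["Weight_medal"] - metric["Weight_all"]

# Average the two comparison columns across all sports.
overall = metric[["Height_diff", "Weight_diff"]].mean()
print(metric)
print(overall)
```

Averaging across sports lets positive and negative differences cancel each other out, so a small overall value does not rule out a larger effect within an individual sport.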
11.
12. ■ To broaden the analysis, the age variable was also added to the query; however, the results
showed that the average difference in height, weight and age between regular
athletes and athletes who have won a medal is so small that there is apparently no
significant correlation.
13. FINAL FINDINGS
■ The first hypothesis was supported, as a correlation was found between
age and winning medals. It was also discovered which sex
shows the best overall performance.
■ For the second question, it was observed which countries hold the most medals.
■ For the last question, the initial hypothesis was incorrect. The metric
created showed that the average overall difference in height, weight and age
between regular athletes and athletes who have won a medal is too small to indicate
any apparent correlation. However, as this is a general observation and result, an
investigation focusing more precisely on a specific sport might show different
results.