by Devon Ankar for class DSE 6000 @ Wayne State University
Mean age at first marriage of women from gapminder.org
Contraceptive prevalence (% of women ages 15-49) from gapminder.org via World Bank - percentage of women who are practicing, or whose sexual partners are practicing, any form of contraception
Question: Does mean age at first marriage correlate with contraceptive prevalence? Is it a positive or negative correlation?
Initial Hypothesis: Mean age at first marriage correlates positively with contraceptive prevalence. That is, countries with higher mean age at first marriage will also have higher contraceptive prevalence.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
age_at_marriage = pd.read_csv('age_at_marriage.csv')
contraceptive_prevalence = pd.read_csv('contraceptive_prevalence.csv')
age_at_marriage.head(5)
contraceptive_prevalence.head(5)
age_at_marriage.shape
contraceptive_prevalence.shape
age_at_marriage.info()
contraceptive_prevalence.info()
Data is missing for many years and countries. Epidemiological data can be difficult to obtain, so these gaps are understandable. Since we are only looking for overall trends, some missing data is acceptable. We will see whether it becomes a problem later.
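To put a number on "a lot" of missing data, one option is to compute the share of empty cells per country column. The sketch below is illustrative only: the `toy` frame and its country names are made up, and `missing_summary` is a hypothetical helper. It assumes the same wide layout as the CSV files, with one row per year and one column per country.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the wide layout (one row per year, one column per country).
# All values and country names here are made up for illustration.
toy = pd.DataFrame({
    'Year': [2000, 2001, 2002, 2003],
    'Country A': [22.1, np.nan, 22.4, np.nan],
    'Country B': [np.nan, np.nan, np.nan, 25.0],
    'Country C': [27.3, 27.5, 27.6, 27.8],
})

def missing_summary(df, id_col='Year'):
    """Return the fraction of missing values per column, largest first."""
    values = df.drop(columns=[id_col])
    return values.isna().mean().sort_values(ascending=False)

summary = missing_summary(toy)
print(summary)

# Share of all (year, country) cells that are empty
overall_missing = toy.drop(columns=['Year']).isna().to_numpy().mean()
print(f'Overall share of missing cells: {overall_missing:.0%}')
```

Calling `missing_summary(age_at_marriage)` on the real data would also report the `MEAN` column, which can be dropped first if it is not of interest.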
#age_at_marriage = age_at_marriage[age_at_marriage['MEAN'] > 0]
#contraceptive_prevalence = contraceptive_prevalence[contraceptive_prevalence['MEAN'] > 0]
Create a plot of year vs. mean age at first marriage across all countries. This gives a quick visual check of whether there is a trend over time.
sns.lmplot(x='Year', y='MEAN', data=age_at_marriage, fit_reg=True)
As expected, the mean age at first marriage increases with the year. That is, as the years progress, women tend to marry at later ages. This is true as an average across all countries, though it may not be true for individual countries. Let's check the US, since that is where we live. I expect this trend will also be present in the US.
sns.lmplot(x='Year', y='United States', data=age_at_marriage, fit_reg=True)
Indeed, this trend is markedly present in the US.
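The regression line gives a visual impression of the trend, but it can also be expressed as a slope. The following is a minimal sketch, not part of the original analysis: it fits a straight line with `np.polyfit` after dropping rows with missing values. The `toy` frame is made up, and `linear_trend` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration; the real notebook would pass age_at_marriage.
toy = pd.DataFrame({
    'Year': [1970, 1980, 1990, 2000],
    'United States': [21.0, np.nan, 24.0, 25.5],
})

def linear_trend(df, x_col, y_col):
    """Least-squares fit of y = slope * x + intercept, ignoring rows with missing values."""
    clean = df[[x_col, y_col]].dropna()
    slope, intercept = np.polyfit(clean[x_col].to_numpy(), clean[y_col].to_numpy(), deg=1)
    return slope, intercept

slope, intercept = linear_trend(toy, 'Year', 'United States')
print(f'Estimated change: {slope:.3f} years of age per calendar year')
```

The same helper could be applied to the `MEAN` column or to any other country column to compare how steep the trends are.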
Create a plot of year vs. contraceptive prevalence across all countries. Again, this is to get a quick visual.
sns.lmplot(x='Year', y='MEAN', data=contraceptive_prevalence, fit_reg=True)
We do see a weak upward trend in contraceptive prevalence as the years progress. It is not as strong as the previous trend in age at first marriage.
Since we live in the US, I am curious how this plot would look for the US alone. Create a plot of year vs. contraceptive prevalence for the US alone.
sns.lmplot(x='Year', y='United States', data=contraceptive_prevalence, fit_reg=True)
As a global average, contraceptive prevalence has increased slightly over time, so the correlation with year is positive but weak.
For the US alone, the correlation is more apparent: as the years have gone by, contraceptive prevalence has increased in the US.
The next step is to check whether mean age at first marriage correlates with contraceptive prevalence. This tests the initial hypothesis.
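# Join on Year so each point pairs means from the same year (assumes one row per Year in each file)
merged = pd.merge(
    age_at_marriage[['Year', 'MEAN']],
    contraceptive_prevalence[['Year', 'MEAN']],
    on='Year', suffixes=('_age', '_contraceptive'),
).dropna()
x = merged['MEAN_age']
y = merged['MEAN_contraceptive']
plt.scatter(x, y, c='b', alpha=0.5)
plt.xlabel('Mean age at first marriage')
plt.ylabel('Contraceptive prevalence (% of women ages 15-49)')
plt.show()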
Each point in this plot is the cross-country mean for one year. Contraceptive prevalence stays in a similar band of about 35-60% on average, almost irrespective of the mean age at first marriage. The correlation appears to be very weak, if there is one at all. My hypothesis was that as mean age at first marriage increased, so would contraceptive prevalence, but this plot does not support it. I cannot conclude that there is a positive correlation between mean age at first marriage and contraceptive prevalence.
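The scatter plot is judged by eye, so a correlation coefficient would make the conclusion easier to check. The sketch below is an assumption about how that could be done, not something computed in this analysis: it joins the two tables on `Year`, drops years where either mean is missing, and computes Pearson's r with `Series.corr`. The toy frames are made up, and `yearly_correlation` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Toy stand-ins with made-up values; the real notebook would pass
# age_at_marriage and contraceptive_prevalence instead.
toy_age = pd.DataFrame({
    'Year': [2000, 2001, 2002, 2003, 2004],
    'MEAN': [22.0, 23.0, np.nan, 25.0, 26.0],
})
toy_contra = pd.DataFrame({
    'Year': [2001, 2002, 2003, 2004, 2005],
    'MEAN': [50.0, 52.0, 55.0, 53.0, 51.0],
})

def yearly_correlation(left, right, on='Year', value='MEAN'):
    """Pearson correlation of two yearly series, using only years present in both."""
    merged = pd.merge(
        left[[on, value]], right[[on, value]],
        on=on, suffixes=('_left', '_right'),
    ).dropna()
    r = merged[f'{value}_left'].corr(merged[f'{value}_right'])  # Pearson by default
    return r, len(merged)

r, n = yearly_correlation(toy_age, toy_contra)
print(f'Pearson r = {r:.2f} over {n} years with data in both tables')
```

Reporting the number of paired years alongside r matters here, because with this much missing data the coefficient may rest on only a handful of points.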
I would have more confidence in this analysis if there were less missing data. Both of the CSV files had a lot of missing data.
Also, the most recent data in the set is from 2005, which is now 13 years ago; it is possible that age at first marriage and/or contraceptive prevalence have changed since then.
Lastly, the mean age at first marriage data was compiled by Gapminder using several sources, including their own estimates. According to Gapminder, "The data are based on multple sources and definitions might vary." This could indicate the data itself is unreliable or inaccurate. Gapminder themselves warn "We discourage the use of this dataset for statistical analysis." It may be enough to get a rough estimate here, but I would have more confidence in this analysis if we had better data to begin with.