The World Bank's sustainable development goals aim to improve of measures reflecting the environment, social factors, and governance of a state. The purpose of these goals is to improve the well-being of populations.
For me, it is fascinating that this myriad of measures can have such a significant impact on population well-being. How does one measure well-being? One rough proxy is lifespan. Obviously, factors such as clean water and good healthcare can affect lifespan; but we are discovering that even factors such as intergenerational stress can have an effect on lifespan.
Project goals & Dataset¶
In this project, I explored the relationship between sustainable development indicators reported by the World Bank and life expectancy. I perform multiple linear regression and create a model to predict country-wide mean life expectancy using variables reported in the Environment, Social, And Governance (ESG) dataset.
Summary of findings¶
The final linear regression model significantly predicts life expectancy (F(8,120) = 130.7, p < 0.01) and accounts for 89% of variance in life expectancy.
The following predictors had a main effects on life expectancy:
- renewable electricity output (beta=-0.05, p=0.05)
- agricultural land (beta=0.08, p=0.02)
- CO2 emissions (beta=-0.22, p=<0.01)
- government effectiveness score (beta=0.41, p<0.01)
- proportion of seats held by women (beta=0.011, p=0.02)
- unemployment rate (beta=-0.011, p<0.01)
- cause of death by communicabel diseases, and maternal and prenatal nutrition conditions (beta=0.55, p<0.01)
Some of these coefficients relate the predictor to life expectancy in a linear manner, while others modify the relationship between life expectancy and the square root of the predictor or the square root of the negative of the predictor.
Full project¶
Import and check the data¶
header<- read.csv("data/ESGData.csv", nrows = 1) # load the header separately
wb <- read.csv("data/ESGData.csv", header= FALSE, skip = 3084) # load the data itself
colnames(wb)<-colnames(header) # assign column names
rm(header) #clean up by removing extra assignment
print(paste("The WB dataframe has", dim(wb)[1], "rows and", dim(wb)[2], "columns"))
head(wb)
First step, as always: clean the data¶
To do:¶
- I only want to look at current years. In this dataframe, the columns represent a measure of an Indicator from the 'Indicator.Name' column, every year from 1960 onwards.
- Since there are many missing values in the data, I decided to average measurements per country over 2013-2017.
library(dplyr)
library(tidyr)
wb_tidy<-wb%>%
select(Country.Name, Indicator.Name, X2013, X2014, X2015, X2016, X2017)%>% # select recent years
droplevels()%>% # reset index
gather(Year, Value, -Country.Name, -Indicator.Name)%>% # make 1 row for each year for each indicator
group_by(Country.Name, Indicator.Name)%>% # group by country and indicator and average the value over all years listed
summarise(Avg.Value=mean(Value, na.rm = TRUE))%>%
spread(Indicator.Name, Avg.Value) # make each row one country and each column a different indicator
To get a sense of all the variables in this data set...¶
colnames(wb_tidy)
I don't want to include all these variables in my model because many of them are filled in sparcely (many NAs)¶
So, I selected a subset that had minimul NA values and created a new data frame with the data selected for regression analysis.
wb_select <- wb_tidy%>%
select(c(Country.Name, #select columns
`Renewable electricity output (% of total electricity output)`,
`Access to electricity (% of population)`,
`Renewable energy consumption (% of total final energy consumption)`,
`Terrestrial and marine protected areas (% of total territorial area)`,
`Agricultural land (% of land area)`,
`Individuals using the Internet (% of population)`,
`CO2 emissions (metric tons per capita)`,
`GDP growth (annual %)`,
`Political Stability and Absence of Violence/Terrorism: Estimate`,
`Rule of Law: Estimate`,
`Government Effectiveness: Estimate`,
`PM2.5 air pollution, mean annual exposure (micrograms per cubic meter)`,
`Proportion of seats held by women in national parliaments (%)`,
`Access to clean fuels and technologies for cooking (% of population)`,
`Labor force participation rate, total (% of total population ages 15-64) (modeled ILO estimate)`,
`Unemployment, total (% of total labor force) (modeled ILO estimate)`,
`School enrollment, primary (% gross)`,
`Cause of death, by communicable diseases and maternal, prenatal and nutrition conditions (% of total)`,
`Adjusted savings: natural resources depletion (% of GNI)`,
`Prevalence of undernourishment (% of population)`,
`Life expectancy at birth, total (years)`))%>%
na.omit()%>% # remove NAs
ungroup()
Time for visual exploration¶
Next, we can take a peek at the data and check for things like skewness, which affect linear modeling
library(ggplot2)
check_skew<-gather(wb_select, Measure, Value, -Country) # reshape the dataframe for plotting
# Create named list for facet lables on plot (shortened predictor names)
labels<-substr(unique(check_skew$Measure), 1, 20)
names(labels)<-unique(check_skew$Measure)
ggplot(check_skew, aes(Value))+ # make histograms
geom_histogram(bins = 200)+
facet_wrap(~Measure,strip.position = "bottom", labeller = labeller(Measure=labels))+
coord_cartesian(ylim = c(0,50))+
theme(strip.text=element_text(size=9))
rm(check_skew) #remove this dataframe so it doesnt clutter our environment
Next, it's time to prepare the data for regression¶
First, that means transforming the data¶
Some of these variables are skewed, so they should be transformed for regression to fit a normal distribution. The right skewed ones can be transformed with a natural log transformation. Others are left skewed, so they can be log-transformed after subtracting from a constant.
Here is the transformation:
wb_log<-wb_select%>%
mutate_at(vars(Renewable.electricity.output, # these are the variable to natual log transform
Renewable.energy.consumption,
Terrestrial.and.marine.protected.areas,
CO2.emissions,
PM2.5.air.pollution,
Unemployment,
Cause.of.death.communicable.diseases.and.maternal.prenatal.and.nutrition.conditions,
natural.resources.depletion,
Prevalence.of.undernourishment),
~log1p(1+.))%>% # natural log after adding 1 to make all data positive
mutate_at(vars(Access.to.electricity, # these are the left-skewed variables
Access.to.clean.fuels.and.technologies.for.cooking),
~log1p(101-.)) # left skewed variables can be log transformed after being subtracted from 101
#to turn it into a left skew
Next, we must scale the data¶
The data is scaled between 0 and 1 because regression performs better on data that is uniformly scaled:
wb_scaled<-wb_log%>%
mutate_at(vars(Renewable.electricity.output, # all variables, to be scaled
Access.to.electricity,
Renewable.energy.consumption,
Terrestrial.and.marine.protected.areas,
Agricultural.land,
Individuals.using.the.Internet,
CO2.emissions,
GDP.growth,
Political.Stability.and.Absence.of.Terrorism,
Rule.of.Law,
Government.Effectiveness,
PM2.5.air.pollution,
Proportion.of.seats.held.by.women.in.parliaments,
Access.to.clean.fuels.and.technologies.for.cooking,
Labor.force.participation.rate,
Unemployment,
School.enrollment,
Cause.of.death.communicable.diseases.and.maternal.prenatal.and.nutrition.conditions,
natural.resources.depletion,
Prevalence.of.undernourishment,
Life.expectancy.at.birth),
~(.-min(.))/(max(.)-min(.))) # formula to rescale
Visualizing again...¶
to make sure the transformation worked as anticipated
check_scaling<-wb_scaled%>%
select(-c(Access.to.clean.fuels.and.technologies.for.cooking, Access.to.electricity, Prevalence.of.undernourishment))%>%
gather(Measure, Value, -Country) # re-shape for plotting
check_scaling$Measure<-as.factor(check_scaling$Measure) # change measure to factor class for ggplot
ggplot(check_scaling, aes(Value))+ # make histograms
geom_histogram(bins = 100)+
facet_wrap(~Measure,strip.position = "bottom", labeller = labeller(Measure=labels))+
theme(strip.text=element_text(size=9))
rm(check_scaling) #remove this dataframe so it doesnt clutter our environment
Now the data looks way more normally distributed
Can we quickly discover some relationships in the data?¶
I looked at correlations between each variable because that can give a hint at what we will find with regression
options(repr.plot.width=7, repr.plot.height=6) #resize plot
wb_transformed<-read.csv("data/wb_transformed.csv", row.names = 1)
library(psych)
pairs.panels(wb_transformed[,2:10],
method = "pearson",
hist.col = "#00AFBB")
Here is another way to visualize correlations that I like to use:
library(corrplot)
corrplot(cor(wb_transformed[2:10]),order="hclust",tl.col="black",tl.cex=.9)
I wanted to peek directly at the relationships between life expectancy and the predictors¶
because it is the relationship I am interested in
# re-shape the data for plotting
wb_long<-gather(wb_transformed, Measure, Value, -Country, -Life.expectancy.at.birth)
wb_long$Measure<-as.factor(wb_long$Measure)
options(repr.plot.width=8) #resize plot
ggplot(wb_long, aes(x=Value,y=Life.expectancy.at.birth))+
geom_point(size=1,color='forestgreen')+
facet_wrap(~Measure,strip.position = "bottom", labeller = labeller(Measure=labels))+
theme(strip.text=element_text(size=6))
Some of these variable clearly have a nonlinear relationship with life expectancy. Specifically, for CO2.emissions, Rule.of.law, Indiviuals.using.the.internet, and Government.effectiveness, their relationship with life expectancy follows a square root function.
Also, for Access.to.electricity, Renewable.energy.consumption, Access.to.clean.fuels.and.technologies.for.cooking, and Cause.of.death.communicable.diseases.and.maternal.prenatal.and.nutrition.conditions, they follow a negative square root relationship.
These relationships can be added into the linear model.
Finally, it's time to make the linear regression model!¶
I start by constructing a full multiple linear regression model with all the variables:
fullmodel<-lm(Life.expectancy.at.birth ~
Renewable.electricity.output +
I(sqrt(-(Access.to.electricity)+1)) + # square root of the negative plus 1 (move to positive side of x axis)
I(sqrt(-(Renewable.energy.consumption)+1)) +
Terrestrial.and.marine.protected.areas +
Agricultural.land +
I(sqrt(Individuals.using.the.Internet)) +
I(sqrt(CO2.emissions)) +
GDP.growth +
Political.Stability.and.Absence.of.Terrorism +
I(sqrt(Rule.of.Law)) +
I(sqrt(Government.Effectiveness)) +
PM2.5.air.pollution +
Proportion.of.seats.held.by.women.in.parliaments +
I(sqrt(-(Access.to.clean.fuels.and.technologies.for.cooking)+1)) +
Labor.force.participation.rate +
Unemployment +
School.enrollment +
I(sqrt(-(Cause.of.death.communicable.diseases.and.maternal.prenatal.and.nutrition.conditions)+1)) +
natural.resources.depletion +
Prevalence.of.undernourishment, data=wb_transformed)
Backward elimination¶
I used backward elimination to select the variables that are predictive of life expectancy.
Backward elimination uses the model selection strategy of minimizing AIC. Akaike Information Criteria is based in information theory. It estimates how much information is lost in a model and thereby gives us information about the quality of a model. AIC deals with the trade off between model complexity and generalizability (overfitting v underfitting).
During each iteration, backward elimination drops the variable that gives the highest AIC until there is no significant drop in AIC. The model that yields the lowest AIC is retained in 'model'.
model <- step(fullmodel,direction="backward",trace=10)
Next, we inspect the model closer:
library(jtools)
summ(model)$coeftable
print(paste("R-squared:", summary(model)$r.squared))
print(paste("F-statistic:", summary(model)$fstatistic[1], "on", summary(model)$fstatistic[2], "and", summary(model)$fstatistic[3], "DF, p-value:", pf(summary(model)$fstatistic[1],summary(model)$fstatistic[2],summary(model)$fstatistic[3],lower=F) ))
This output shows is that the model of life expectancy as a function of the above variables was significant (F(11,126) = 99.71, p < 0.01). In addition, R-squared tells us that 89% of variance in life expectancy is accounted for by the variables in this model. This is quite high considering the messy/complicated system of country-wide social and economic factors. This seems to be a good model.
FYI I ran a similar model without square root relationship accounted for, and including the square roots improved the model (F(10,127) = 79.1, p < 0.01, R-squared for 0.86).
This isn't the end of the story¶
We need to make sure that the assumptions of linear models are satisfied in order to be sure that the model is meaningful.
One assumption to inspect is absence of collinearity¶
If predictors are highly correlated (collinearity) the statistical significance of the model is undermined. We can look at this by looking at the variance inflation factor (VIF) of each predictor. VIF measures how easily the predictor is predicted from a linear regression using the other predictors. Any predictors with a VIF above 4 should be removed from the model.
library(car)
vifs<-dplyr::bind_cols(as.data.frame(names(vif(model))), as.data.frame(unname(vif(model))));vifs
Since several of these variables have a large VIF, some should be removed.
While there is no rule in terms of what is a good VIF threshold, the following code removes the variable with the largest VIF and re-builds the model until no variables have VIF above 10.
while(any(vifs$`unname(vif(model))`>10)){ #while any VIFS are greater than 10eac
vifs<-filter(vifs, `unname(vif(model))`!=max(`unname(vif(model))`)) #take out the variable with the largest VIF
formula<-paste("Life.expectancy.at.birth ~", paste(vifs$`names(vif(model))`, collapse = ' + '), sep=' ') #create a new formula without that variables
newmodel<-lm(formula, data = wb_transformed) #create a new model
vifs=bind_cols(as.data.frame(names(vif(model))), as.data.frame(unname(vif(model)))) #create a new dataframe with the VIFs from the new model
}
finalmodel<-step(newmodel,direction="backward",trace=10)
This new model satisfies the assumption of no collinearity.
Another assumption of that of homoskedasticity¶
This means that the residuals of the model have a similar amount of deviation from predicted values. To assess this assumption, we can look at a residual plot. This plot should look like a blob; if it doesn't, for instance if higher values have higher residuals, the variance would not be homoskedastic.
resid <- bind_cols(as.data.frame(fitted(finalmodel)), as.data.frame(residuals(finalmodel)))
p1 <- ggplot(resid,aes(x= fitted(finalmodel), y=residuals(finalmodel))) + geom_point() + labs(title='Residual Plot', x='Fitted Values', y='Residuals');p1
# with this plot, we are look for absence of a pattern
Last, there is the assumption of the normality of residuals¶
The residuals should follow a normal distribution. If they don't, this indicates that the model is not fit optimally.
p2 <- ggplot(resid, aes(residuals(finalmodel))) + geom_histogram(bins=100) + labs(title = 'Histogram of Residuals')
# with the histogram, we're looking for a normal distribution
p3 <- ggplot(resid, aes(sample=residuals(finalmodel))) + geom_qq() + geom_qq_line() + labs(title = "Normal Q-Q Plot of Residuals")
# with the qqnorm plot, we are looking for a straight line
gridExtra::grid.arrange(p3, p2, nrow=1)
Summary of the final model and plot:¶
summ(finalmodel)$coeftable
print(paste("R-squared:", summary(finalmodel)$r.squared))
print(paste("F-statistic:", summary(finalmodel)$fstatistic[1], "on", summary(finalmodel)$fstatistic[2], "and", summary(finalmodel)$fstatistic[3], "DF, p-value:", pf(summary(finalmodel)$fstatistic[1],summary(finalmodel)$fstatistic[2],summary(finalmodel)$fstatistic[3],lower=F) ))
The final model significantly predicts life expectancy (F(8,120) = 130.7, p < 0.01) and accounts for 89% of variance in life expectancy.
This output also tells at that there are main effects on life expectancy of renewable electricity output (beta=-0.05, p=0.05), agricultural land (beta=0.08, p=0.02), CO2 emissions (beta=-0.22, p=<0.01), government effectiveness score (beta=0.41, p<0.01), proportion of seats held by women (beta=0.011, p=0.02), unemployment rate (beta=-0.011, p<0.01), and cause of death by communicabel diseases, and maternal and prenatal nutrition conditions (beta=0.55, p<0.01).
Some of these coefficients relate the predictor to life expectancy in a linear manner, while others modify the relationship between life expectancy and the square root of the predictor or the square root of the negative of the predictor.
Plotting the model predictors and outcome¶
library(plotly)
# store outcome
wb_transformed<-read.csv("../../../r_projects/world-bank-sustainability-themes/data/wb_transformed.csv",row.names = 1)
wb_outcome<-read.csv("../../../r_projects/world-bank-sustainability-themes/data/wb_tidy.csv", row.names = 1)%>%
select(Country, Life.expectancy.at.birth)
Preparing the data for plotting:
# merge dataframe with scaled predictors and dataframe with outcome in original scaling,
# and select significant predictors
for_plot<-left_join(wb_transformed, wb_outcome, by='Country')%>%
select(Country,
Renewable.electricity.output,
Agricultural.land,
CO2.emissions,
Government.Effectiveness,
Proportion.of.seats.held.by.women.in.parliaments,
Access.to.clean.fuels.and.technologies.for.cooking,
Unemployment,
Cause.of.death.communicable.diseases.and.maternal.prenatal.and.nutrition.conditions,
Life.expectancy.at.birth.y)
# Pretty column names for graph
colnames(for_plot) <- c('Country',
'Renewable energy consumption (% of total)',
'Agricultural land (% of land area)',
'CO2 emissions (metric tons per capita)',
'Government Effectiveness: Estimate',
'Seats held by women in national parliaments (%)',
'Access to clean fuels and technologies for cooking (%)',
'Unemployment (% of total labor force)',
'CoD by communicable diseases, maternal, and prenatal cond (%)',
'Life expectancy at birth, total (years)')
# Re-shape into long form for ggplot2
for_plot<-gather(for_plot, Measure, Value, -Country, -`Life expectancy at birth, total (years)`)
for_plot$Measure<-as.factor(for_plot$Measure)
Using plotly to render the ggplot2 graph as an interactive object!
p <-ggplotly(ggplot(for_plot, # ggplot2 aes assignment:
aes(x = Value,
y = `Life expectancy at birth, total (years)`,
group = Measure, # text assignment for hovering pointer for plotly:
text = paste("Country:",Country,'<br>', Measure, ':', Value, '<br>Life Expectancy:', `Life expectancy at birth, total (years)`, ' years')))+
geom_point(aes(color=Measure))+ # adding the points and lines
geom_smooth(aes(color=Measure), method=lm,size=.7,se=F)+
labs(title= "Normalized predictors of life expectancy", x="Normalized Measure Value", y="Life expectancy at birth, total (years)")+
theme_minimal(), tooltip = "text")%>% # tooltip makes sure the "text" above appears on hover
layout(legend = list(font=list(size=10)))
p$x$layout$legend$x=1
p$x$layout$legend$y=-.2
p$x$layout$legend$orientation='h'
p$x$layout$height=700
p
It is important to keep in mind that several transformations have been done to this data before running the linear model. Looking back at the correlations in the original data can therefore be informative:
for_cor<-read.csv("../../../r_projects/world-bank-sustainability-themes/data/wb_tidy.csv", row.names = 1)%>% #re-import tidy
select(Country, # select columns that were selected through linear modelling
Renewable.electricity.output,
Agricultural.land,
CO2.emissions,
Government.Effectiveness,
Proportion.of.seats.held.by.women.in.parliaments,
Access.to.clean.fuels.and.technologies.for.cooking,
Unemployment,
Cause.of.death.communicable.diseases.and.maternal.prenatal.and.nutrition.conditions,
Life.expectancy.at.birth)
# assign correlation of each predictor and life expectancy to a named list
corlist<- list()
for (i in 2:length(finalmodel$coefficients)){
var<-colnames(for_cor)[i]
col<-select(for_cor, var)
corlist[i-1]<-cor(col, for_cor$Life.expectancy.at.birth)
names(corlist)[i-1]<-var
}
# print dataframe of correlations and coefficients
bind_cols(gather(as.data.frame(corlist), variable, corr), as.data.frame(finalmodel$coefficients[2:9]))