The World Bank's sustainable development goals aim to improve of measures reflecting the environment, social factors, and governance of a state. The purpose of these goals is to improve the well-being of populations.

For me, it is fascinating that this myriad of measures can have such a significant impact on population well-being. How does one measure well-being? One rough proxy is lifespan. Obviously, factors such as clean water and good healthcare can affect lifespan; but we are discovering that even factors such as intergenerational stress can have an effect on lifespan.

Project goals & Dataset¶

In this project, I explored the relationship between sustainable development indicators reported by the World Bank and life expectancy. I perform multiple linear regression and create a model to predict country-wide mean life expectancy using variables reported in the Environment, Social, And Governance (ESG) dataset.

Summary of findings¶

The final linear regression model significantly predicts life expectancy (F(8,120) = 130.7, p < 0.01) and accounts for 89% of variance in life expectancy.

The following predictors had a main effects on life expectancy:

renewable electricity output (beta=-0.05, p=0.05)
agricultural land (beta=0.08, p=0.02)
CO2 emissions (beta=-0.22, p=<0.01)
government effectiveness score (beta=0.41, p<0.01)
proportion of seats held by women (beta=0.011, p=0.02)
unemployment rate (beta=-0.011, p<0.01)
cause of death by communicabel diseases, and maternal and prenatal nutrition conditions (beta=0.55, p<0.01)

Some of these coefficients relate the predictor to life expectancy in a linear manner, while others modify the relationship between life expectancy and the square root of the predictor or the square root of the negative of the predictor.

Full project¶

Import and check the data¶

In []:

header<- read.csv("data/ESGData.csv", nrows = 1) # load the header separately
wb <- read.csv("data/ESGData.csv", header= FALSE, skip = 3084) # load the data itself
colnames(wb)<-colnames(header) # assign column names
rm(header) #clean up by removing extra assignment

print(paste("The WB dataframe has", dim(wb)[1], "rows and", dim(wb)[2], "columns"))
head(wb)

Out[]:

[1] "The WB dataframe has 14538 rows and 65 columns"

Out[]:

Country.Name	Country.Code	Indicator.Name	Indicator.Code	X1960	X1961	X1962	X1963	X1964	X1965	...	X2011	X2012	X2013	X2014	X2015	X2016	X2017	X2018	X2040	X
Afghanistan	AFG	Access to electricity (% of population)	EG.ELC.ACCS.ZS	NA	NA	NA	NA	NA	NA	...	43.2220189	69.1000000	70.1534805	89.5000000	71.5000000	97.7000000	97.7000000	NA	NA	NA
Afghanistan	AFG	Adjusted savings: natural resources depletion (% of GNI)	NY.ADJ.DRES.GN.ZS	NA	NA	NA	NA	NA	NA	...	0.3211679	0.3220366	0.2231001	0.2259098	0.2309967	0.2991603	0.3169646	NA	NA	NA
Afghanistan	AFG	Adjusted savings: net forest depletion (% of GNI)	NY.ADJ.DFOR.GN.ZS	NA	NA	NA	NA	NA	NA	...	0.1806410	0.1556279	0.1529173	0.1586292	0.1814527	0.2291977	0.2263378	NA	NA	NA
Afghanistan	AFG	Agricultural land (% of land area)	AG.LND.AGRI.ZS	NA	57.74592	57.83782	57.91441	58.01091	58.01397	...	58.0675796	58.0675796	58.0675796	58.0675796	58.0675796	58.0675796	NA	NA	NA	NA
Afghanistan	AFG	Agriculture, forestry, and fishing, value added (% of GDP)	NV.AGR.TOTL.ZS	NA	NA	NA	NA	NA	NA	...	23.7436640	24.3908736	22.8106627	22.1370414	20.6343227	21.0810862	20.4665052	NA	NA	NA
Afghanistan	AFG	Annual freshwater withdrawals, total (% of internal resources)	ER.H2O.FWTL.ZS	NA	NA	NA	NA	NA	NA	...	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA

First step, as always: clean the data¶

To do:¶

I only want to look at current years. In this dataframe, the columns represent a measure of an Indicator from the 'Indicator.Name' column, every year from 1960 onwards.
Since there are many missing values in the data, I decided to average measurements per country over 2013-2017.

In []:

library(dplyr)
library(tidyr)
wb_tidy<-wb%>% 
  select(Country.Name, Indicator.Name, X2013, X2014, X2015, X2016, X2017)%>% # select recent years
  droplevels()%>% # reset index
  gather(Year, Value, -Country.Name, -Indicator.Name)%>%    # make 1 row for each year for each indicator
  group_by(Country.Name, Indicator.Name)%>%  # group by country and indicator and average the value over all years listed 
  summarise(Avg.Value=mean(Value, na.rm = TRUE))%>%
  spread(Indicator.Name, Avg.Value) # make each row one country and each column a different indicator

To get a sense of all the variables in this data set...¶

In []:

colnames(wb_tidy)

Out[]:

'Country.Name'
'Access to clean fuels and technologies for cooking (% of population)'
'Access to electricity (% of population)'
'Adjusted savings: natural resources depletion (% of GNI)'
'Adjusted savings: net forest depletion (% of GNI)'
'Agricultural land (% of land area)'
'Agriculture, forestry, and fishing, value added (% of GDP)'
'Annual freshwater withdrawals, total (% of internal resources)'
'Annualized average growth rate in per capita real survey mean consumption or income, total population (%)'
'Cause of death, by communicable diseases and maternal, prenatal and nutrition conditions (% of total)'
'Children in employment, total (% of children ages 7-14)'
'CO2 emissions (metric tons per capita)'
'Control of Corruption: Estimate'
'Cooling Degree Days'
'Droughts, floods, extreme temperatures (% of population, average 1990-2009)'
'Ease of doing business index (1=most business-friendly regulations)'
'Electricity production from coal sources (% of total)'
'Energy imports, net (% of energy use)'
'Energy intensity level of primary energy (MJ/$2011 PPP GDP)'
'Energy use (kg of oil equivalent per capita)'
'Fertility rate, total (births per woman)'
'Food production index (2004-2006 = 100)'
'Forest area (% of land area)'
'Fossil fuel energy consumption (% of total)'
'GDP growth (annual %)'
'GHG net emissions/removals by LUCF (Mt of CO2 equivalent)'
'GINI index (World Bank estimate)'
'Government Effectiveness: Estimate'
'Government expenditure on education, total (% of government expenditure)'
'Health Index 35'
'Hospital beds (per 1,000 people)'
'Income share held by lowest 20%'
'Individuals using the Internet (% of population)'
'Labor force participation rate, total (% of total population ages 15-64) (modeled ILO estimate)'
'Life expectancy at birth, total (years)'
'Literacy rate, adult total (% of people ages 15 and above)'
'Mammal species, threatened'
'Maximum 5-day Rainfall (25-yr RL)'
'Mean Drought Index (SPEI)'
'Methane emissions (metric tons of CO2 equivalent per capita)'
'Mortality rate, under-5 (per 1,000 live births)'
'Net migration'
'Nitrous oxide emissions (metric tons of CO2 equivalent per capita)'
'Patent applications, residents'
'People using safely managed drinking water services (% of population)'
'People using safely managed sanitation services (% of population)'
'PM2.5 air pollution, mean annual exposure (micrograms per cubic meter)'
'Political Stability and Absence of Violence/Terrorism: Estimate'
'Population ages 65 and above (% of total population)'
'Population density (people per sq. km of land area)'
'Poverty headcount ratio at national poverty lines (% of population)'
'Prevalence of overweight (% of adults)'
'Prevalence of undernourishment (% of population)'
'Proportion of seats held by women in national parliaments (%)'
'Ratio of female to male labor force participation rate (%) (modeled ILO estimate)'
'Regulatory Quality: Estimate'
'Renewable electricity output (% of total electricity output)'
'Renewable energy consumption (% of total final energy consumption)'
'Research and development expenditure (% of GDP)'
'Rule of Law: Estimate'
'School enrollment, primary (% gross)'
'School enrollment, primary and secondary (gross), gender parity index (GPI)'
'Scientific and technical journal articles'
'Strength of legal rights index (0=weak to 12=strong)'
'Terrestrial and marine protected areas (% of total territorial area)'
'Unemployment, total (% of total labor force) (modeled ILO estimate)'
'Unmet need for contraception (% of married women ages 15-49)'
'Voice and Accountability: Estimate'

I don't want to include all these variables in my model because many of them are filled in sparcely (many NAs)¶

So, I selected a subset that had minimul NA values and created a new data frame with the data selected for regression analysis.

In []:

wb_select <- wb_tidy%>%
  select(c(Country.Name,  #select columns
           `Renewable electricity output (% of total electricity output)`,
           `Access to electricity (% of population)`, 
           `Renewable energy consumption (% of total final energy consumption)`,
           `Terrestrial and marine protected areas (% of total territorial area)`, 
           `Agricultural land (% of land area)`,  
           `Individuals using the Internet (% of population)`,
           `CO2 emissions (metric tons per capita)`, 
           `GDP growth (annual %)`,
           `Political Stability and Absence of Violence/Terrorism: Estimate`,
           `Rule of Law: Estimate`,
           `Government Effectiveness: Estimate`,
           `PM2.5 air pollution, mean annual exposure (micrograms per cubic meter)`,
           `Proportion of seats held by women in national parliaments (%)`,
           `Access to clean fuels and technologies for cooking (% of population)`,
           `Labor force participation rate, total (% of total population ages 15-64) (modeled ILO estimate)`,
           `Unemployment, total (% of total labor force) (modeled ILO estimate)`,
           `School enrollment, primary (% gross)`,
           `Cause of death, by communicable diseases and maternal, prenatal and nutrition conditions (% of total)`,
           `Adjusted savings: natural resources depletion (% of GNI)`,
           `Prevalence of undernourishment (% of population)`,
           `Life expectancy at birth, total (years)`))%>%
  na.omit()%>%  # remove NAs
  ungroup()

Time for visual exploration¶

Next, we can take a peek at the data and check for things like skewness, which affect linear modeling

In []:

library(ggplot2)

check_skew<-gather(wb_select, Measure, Value, -Country)  # reshape the dataframe for plotting

# Create named list for facet lables on plot (shortened predictor names)
labels<-substr(unique(check_skew$Measure), 1, 20)
names(labels)<-unique(check_skew$Measure)

ggplot(check_skew, aes(Value))+ # make histograms
  geom_histogram(bins = 200)+
  facet_wrap(~Measure,strip.position = "bottom", labeller = labeller(Measure=labels))+
  coord_cartesian(ylim = c(0,50))+
  theme(strip.text=element_text(size=9))


rm(check_skew) #remove this dataframe so it doesnt clutter our environment

Out[]:

Next, it's time to prepare the data for regression¶

First, that means transforming the data¶

Some of these variables are skewed, so they should be transformed for regression to fit a normal distribution. The right skewed ones can be transformed with a natural log transformation. Others are left skewed, so they can be log-transformed after subtracting from a constant.

Here is the transformation:

In []:

wb_log<-wb_select%>%
  mutate_at(vars(Renewable.electricity.output, # these are the variable to natual log transform
                 Renewable.energy.consumption, 
                 Terrestrial.and.marine.protected.areas, 
                 CO2.emissions, 
                 PM2.5.air.pollution, 
                 Unemployment, 
                 Cause.of.death.communicable.diseases.and.maternal.prenatal.and.nutrition.conditions, 
                 natural.resources.depletion,
                 Prevalence.of.undernourishment), 
            ~log1p(1+.))%>% # natural log after adding 1 to make all data positive
  mutate_at(vars(Access.to.electricity, # these are the left-skewed variables
                 Access.to.clean.fuels.and.technologies.for.cooking), 
            ~log1p(101-.)) # left skewed variables can be log transformed after being subtracted from 101
#to turn it into a left skew

Next, we must scale the data¶

The data is scaled between 0 and 1 because regression performs better on data that is uniformly scaled:

In []:

wb_scaled<-wb_log%>%
  mutate_at(vars(Renewable.electricity.output,  # all variables, to be scaled
               Access.to.electricity, 
               Renewable.energy.consumption,
               Terrestrial.and.marine.protected.areas, 
               Agricultural.land,  
               Individuals.using.the.Internet,
               CO2.emissions, 
               GDP.growth,
               Political.Stability.and.Absence.of.Terrorism,
               Rule.of.Law,
               Government.Effectiveness,
               PM2.5.air.pollution,
               Proportion.of.seats.held.by.women.in.parliaments,
               Access.to.clean.fuels.and.technologies.for.cooking,
               Labor.force.participation.rate,
               Unemployment,
               School.enrollment,
               Cause.of.death.communicable.diseases.and.maternal.prenatal.and.nutrition.conditions,
               natural.resources.depletion,
               Prevalence.of.undernourishment,
               Life.expectancy.at.birth),
            ~(.-min(.))/(max(.)-min(.)))  # formula to rescale

Visualizing again...¶

to make sure the transformation worked as anticipated

In []:

check_scaling<-wb_scaled%>%
  select(-c(Access.to.clean.fuels.and.technologies.for.cooking, Access.to.electricity, Prevalence.of.undernourishment))%>%
  gather(Measure, Value, -Country) # re-shape for plotting
check_scaling$Measure<-as.factor(check_scaling$Measure) # change measure to factor class for ggplot

ggplot(check_scaling, aes(Value))+ # make histograms
  geom_histogram(bins = 100)+
  facet_wrap(~Measure,strip.position = "bottom", labeller = labeller(Measure=labels))+
  theme(strip.text=element_text(size=9))
rm(check_scaling) #remove this dataframe so it doesnt clutter our environment

Out[]:

Now the data looks way more normally distributed

Can we quickly discover some relationships in the data?¶

I looked at correlations between each variable because that can give a hint at what we will find with regression

In []:

options(repr.plot.width=7, repr.plot.height=6) #resize plot
wb_transformed<-read.csv("data/wb_transformed.csv", row.names = 1)
library(psych)
pairs.panels(wb_transformed[,2:10],
             method = "pearson",
             hist.col = "#00AFBB")

Out[]:

Here is another way to visualize correlations that I like to use:

In []:

library(corrplot)
corrplot(cor(wb_transformed[2:10]),order="hclust",tl.col="black",tl.cex=.9)

Out[]:

I wanted to peek directly at the relationships between life expectancy and the predictors¶

because it is the relationship I am interested in

In []:

# re-shape the data for plotting
wb_long<-gather(wb_transformed, Measure, Value, -Country, -Life.expectancy.at.birth)
wb_long$Measure<-as.factor(wb_long$Measure)

options(repr.plot.width=8) #resize plot
ggplot(wb_long, aes(x=Value,y=Life.expectancy.at.birth))+
  geom_point(size=1,color='forestgreen')+
  facet_wrap(~Measure,strip.position = "bottom", labeller = labeller(Measure=labels))+
  theme(strip.text=element_text(size=6))

Out[]:

Some of these variable clearly have a nonlinear relationship with life expectancy. Specifically, for CO2.emissions, Rule.of.law, Indiviuals.using.the.internet, and Government.effectiveness, their relationship with life expectancy follows a square root function.

Also, for Access.to.electricity, Renewable.energy.consumption, Access.to.clean.fuels.and.technologies.for.cooking, and Cause.of.death.communicable.diseases.and.maternal.prenatal.and.nutrition.conditions, they follow a negative square root relationship.

These relationships can be added into the linear model.

Finally, it's time to make the linear regression model!¶

I start by constructing a full multiple linear regression model with all the variables:

In []:

fullmodel<-lm(Life.expectancy.at.birth ~ 
                Renewable.electricity.output + 
                I(sqrt(-(Access.to.electricity)+1)) + # square root of the negative plus 1 (move to positive side of x axis)
                I(sqrt(-(Renewable.energy.consumption)+1)) + 
                Terrestrial.and.marine.protected.areas +
                Agricultural.land +
                I(sqrt(Individuals.using.the.Internet)) +
                I(sqrt(CO2.emissions)) +
                GDP.growth +
                Political.Stability.and.Absence.of.Terrorism +
                I(sqrt(Rule.of.Law)) +
                I(sqrt(Government.Effectiveness)) +
                PM2.5.air.pollution +
                Proportion.of.seats.held.by.women.in.parliaments +
                I(sqrt(-(Access.to.clean.fuels.and.technologies.for.cooking)+1)) +
                Labor.force.participation.rate +
                Unemployment +
                School.enrollment +
                I(sqrt(-(Cause.of.death.communicable.diseases.and.maternal.prenatal.and.nutrition.conditions)+1)) +
                natural.resources.depletion +
                Prevalence.of.undernourishment, data=wb_transformed)

Backward elimination¶

I used backward elimination to select the variables that are predictive of life expectancy.

Backward elimination uses the model selection strategy of minimizing AIC. Akaike Information Criteria is based in information theory. It estimates how much information is lost in a model and thereby gives us information about the quality of a model. AIC deals with the trade off between model complexity and generalizability (overfitting v underfitting).

During each iteration, backward elimination drops the variable that gives the highest AIC until there is no significant drop in AIC. The model that yields the lowest AIC is retained in 'model'.

In []:

model <- step(fullmodel,direction="backward",trace=10)

Next, we inspect the model closer:

In []:

library(jtools)
summ(model)$coeftable
print(paste("R-squared:", summary(model)$r.squared))
print(paste("F-statistic:", summary(model)$fstatistic[1], "on", summary(model)$fstatistic[2], "and", summary(model)$fstatistic[3], "DF, p-value:", pf(summary(model)$fstatistic[1],summary(model)$fstatistic[2],summary(model)$fstatistic[3],lower=F) ))

Out[]:

	Est.	S.E.	t val.	p
(Intercept)	0.04633471	0.07520928	0.6160771	5.389550e-01
Renewable.electricity.output	-0.07103790	0.02741859	-2.5908661	1.070224e-02
I(sqrt(-(Access.to.electricity) + 1))	0.19896412	0.08548893	2.3273671	2.154064e-02
Agricultural.land	-0.07051728	0.02995597	-2.3540308	2.011757e-02
I(sqrt(CO2.emissions))	-0.24298834	0.07327317	-3.3161982	1.192448e-03
I(sqrt(Government.Effectiveness))	0.42352928	0.07209363	5.8747113	3.543014e-08
PM2.5.air.pollution	-0.07073071	0.04340424	-1.6295806	1.056877e-01
Proportion.of.seats.held.by.women.in.parliaments	0.10665761	0.04406113	2.4206736	1.691768e-02
I(sqrt(-(Access.to.clean.fuels.and.technologies.for.cooking) + 1))	0.18494202	0.06069852	3.0468949	2.817212e-03
Unemployment	-0.10846097	0.03688891	-2.9402052	3.903386e-03
I(sqrt(-(Cause.of.death.communicable.diseases.and.maternal.prenatal.and.nutrition.conditions) + 1))	0.45358686	0.07595024	5.9721579	2.228745e-08
Prevalence.of.undernourishment	0.08774696	0.05196308	1.6886406	9.376173e-02

Out[]:

[1] "R-squared: 0.896961510504357"
[1] "F-statistic: 99.7130921012224 on 11 and 126 DF, p-value: 1.17835571290943e-56"

This output shows is that the model of life expectancy as a function of the above variables was significant (F(11,126) = 99.71, p < 0.01). In addition, R-squared tells us that 89% of variance in life expectancy is accounted for by the variables in this model. This is quite high considering the messy/complicated system of country-wide social and economic factors. This seems to be a good model.

FYI I ran a similar model without square root relationship accounted for, and including the square roots improved the model (F(10,127) = 79.1, p < 0.01, R-squared for 0.86).

This isn't the end of the story¶

We need to make sure that the assumptions of linear models are satisfied in order to be sure that the model is meaningful.

One assumption to inspect is absence of collinearity¶

If predictors are highly correlated (collinearity) the statistical significance of the model is undermined. We can look at this by looking at the variance inflation factor (VIF) of each predictor. VIF measures how easily the predictor is predicted from a linear regression using the other predictors. Any predictors with a VIF above 4 should be removed from the model.

In []:

library(car)
vifs<-dplyr::bind_cols(as.data.frame(names(vif(model))), as.data.frame(unname(vif(model))));vifs

Out[]:

names(vif(model))	unname(vif(model))
Renewable.electricity.output	1.461414
I(sqrt(-(Access.to.electricity) + 1))	12.897719
Agricultural.land	1.134781
I(sqrt(CO2.emissions))	6.435395
I(sqrt(Government.Effectiveness))	3.510702
PM2.5.air.pollution	1.803665
Proportion.of.seats.held.by.women.in.parliaments	1.332606
I(sqrt(-(Access.to.clean.fuels.and.technologies.for.cooking) + 1))	8.326179
Unemployment	1.250752
I(sqrt(-(Cause.of.death.communicable.diseases.and.maternal.prenatal.and.nutrition.conditions) + 1))	7.956509
Prevalence.of.undernourishment	4.153037

Since several of these variables have a large VIF, some should be removed.

While there is no rule in terms of what is a good VIF threshold, the following code removes the variable with the largest VIF and re-builds the model until no variables have VIF above 10.

In []:

while(any(vifs$`unname(vif(model))`>10)){  #while any VIFS are greater than 10eac
  vifs<-filter(vifs,  `unname(vif(model))`!=max(`unname(vif(model))`))  #take out the variable with the largest VIF
  formula<-paste("Life.expectancy.at.birth ~", paste(vifs$`names(vif(model))`, collapse = ' + '), sep=' ')  #create a new formula without that variables
  newmodel<-lm(formula, data = wb_transformed) #create a new model
  vifs=bind_cols(as.data.frame(names(vif(model))), as.data.frame(unname(vif(model))))  #create a new dataframe with the VIFs from the new model
}
finalmodel<-step(newmodel,direction="backward",trace=10)

This new model satisfies the assumption of no collinearity.

Another assumption of that of homoskedasticity¶

This means that the residuals of the model have a similar amount of deviation from predicted values. To assess this assumption, we can look at a residual plot. This plot should look like a blob; if it doesn't, for instance if higher values have higher residuals, the variance would not be homoskedastic.

In []:

resid <- bind_cols(as.data.frame(fitted(finalmodel)), as.data.frame(residuals(finalmodel))) 

p1 <- ggplot(resid,aes(x= fitted(finalmodel), y=residuals(finalmodel))) + geom_point() + labs(title='Residual Plot', x='Fitted Values', y='Residuals');p1
# with this plot, we are look for absence of a pattern

Out[]:

Last, there is the assumption of the normality of residuals¶

The residuals should follow a normal distribution. If they don't, this indicates that the model is not fit optimally.

In []:

p2 <- ggplot(resid, aes(residuals(finalmodel))) + geom_histogram(bins=100) + labs(title = 'Histogram of Residuals')
# with the histogram, we're looking for a normal distribution
p3 <- ggplot(resid, aes(sample=residuals(finalmodel))) + geom_qq() + geom_qq_line() + labs(title = "Normal Q-Q Plot of Residuals") 
# with the qqnorm plot, we are looking for a straight line

gridExtra::grid.arrange(p3, p2, nrow=1)

Out[]:

Summary of the final model and plot:¶

In []:

summ(finalmodel)$coeftable
print(paste("R-squared:", summary(finalmodel)$r.squared))
print(paste("F-statistic:", summary(finalmodel)$fstatistic[1], "on", summary(finalmodel)$fstatistic[2], "and", summary(finalmodel)$fstatistic[3], "DF, p-value:", pf(summary(finalmodel)$fstatistic[1],summary(finalmodel)$fstatistic[2],summary(finalmodel)$fstatistic[3],lower=F) ))

Out[]:

	Est.	S.E.	t val.	p
(Intercept)	0.09308954	0.03872990	2.403558	1.765881e-02
Renewable.electricity.output	-0.05392426	0.02685338	-2.008099	4.672125e-02
Agricultural.land	-0.07850695	0.03038741	-2.583536	1.089329e-02
I(sqrt(CO2.emissions))	-0.21989181	0.07364001	-2.986037	3.382733e-03
I(sqrt(Government.Effectiveness))	0.40667231	0.06612301	6.150239	9.005362e-09
Proportion.of.seats.held.by.women.in.parliaments	0.10579308	0.04449862	2.377446	1.890156e-02
I(sqrt(-(Access.to.clean.fuels.and.technologies.for.cooking) + 1))	0.23561031	0.05516212	4.271233	3.741580e-05
Unemployment	-0.11075051	0.03755025	-2.949395	3.781645e-03
I(sqrt(-(Cause.of.death.communicable.diseases.and.maternal.prenatal.and.nutrition.conditions) + 1))	0.55441091	0.05238094	10.584210	3.150733e-19

Out[]:

[1] "R-squared: 0.890197329125791"
[1] "F-statistic: 130.729351279605 on 8 and 129 DF, p-value: 4.57785820176136e-58"

The final model significantly predicts life expectancy (F(8,120) = 130.7, p < 0.01) and accounts for 89% of variance in life expectancy.

This output also tells at that there are main effects on life expectancy of renewable electricity output (beta=-0.05, p=0.05), agricultural land (beta=0.08, p=0.02), CO2 emissions (beta=-0.22, p=<0.01), government effectiveness score (beta=0.41, p<0.01), proportion of seats held by women (beta=0.011, p=0.02), unemployment rate (beta=-0.011, p<0.01), and cause of death by communicabel diseases, and maternal and prenatal nutrition conditions (beta=0.55, p<0.01).

Some of these coefficients relate the predictor to life expectancy in a linear manner, while others modify the relationship between life expectancy and the square root of the predictor or the square root of the negative of the predictor.

Plotting the model predictors and outcome¶

In []:

library(plotly)
# store outcome
wb_transformed<-read.csv("../../../r_projects/world-bank-sustainability-themes/data/wb_transformed.csv",row.names = 1)
wb_outcome<-read.csv("../../../r_projects/world-bank-sustainability-themes/data/wb_tidy.csv", row.names = 1)%>%  
  select(Country, Life.expectancy.at.birth)

Preparing the data for plotting:

In []:

# merge dataframe with scaled predictors and dataframe with outcome in original scaling, 
# and select significant predictors
for_plot<-left_join(wb_transformed, wb_outcome, by='Country')%>%
  select(Country, 
         Renewable.electricity.output, 
         Agricultural.land, 
         CO2.emissions, 
         Government.Effectiveness, 
         Proportion.of.seats.held.by.women.in.parliaments, 
         Access.to.clean.fuels.and.technologies.for.cooking,
         Unemployment, 
         Cause.of.death.communicable.diseases.and.maternal.prenatal.and.nutrition.conditions,
         Life.expectancy.at.birth.y)

# Pretty column names for graph
colnames(for_plot) <- c('Country',
                        'Renewable energy consumption (% of total)',
                        'Agricultural land (% of land area)', 
                        'CO2 emissions (metric tons per capita)',
                        'Government Effectiveness: Estimate',
                        'Seats held by women in national parliaments (%)',
                        'Access to clean fuels and technologies for cooking (%)',
                        'Unemployment (% of total labor force)',
                        'CoD by communicable diseases, maternal, and prenatal cond (%)',
                        'Life expectancy at birth, total (years)')

# Re-shape into long form for ggplot2
for_plot<-gather(for_plot, Measure, Value, -Country, -`Life expectancy at birth, total (years)`)
for_plot$Measure<-as.factor(for_plot$Measure)

Using plotly to render the ggplot2 graph as an interactive object!

In []:

p <-ggplotly(ggplot(for_plot, # ggplot2 aes assignment: 
                    aes(x = Value,  
                        y = `Life expectancy at birth, total (years)`, 
                        group = Measure,    # text assignment for hovering pointer for plotly:
                        text = paste("Country:",Country,'<br>', Measure, ':', Value, '<br>Life Expectancy:', `Life expectancy at birth, total (years)`, ' years')))+
               geom_point(aes(color=Measure))+ # adding the points and lines
               geom_smooth(aes(color=Measure), method=lm,size=.7,se=F)+
               labs(title= "Normalized predictors of life expectancy", x="Normalized Measure Value", y="Life expectancy at birth, total (years)")+
               theme_minimal(), tooltip = "text")%>%  # tooltip makes sure the "text" above appears on hover
  layout(legend = list(font=list(size=10)))
p$x$layout$legend$x=1
p$x$layout$legend$y=-.2
p$x$layout$legend$orientation='h'
p$x$layout$height=700
p

Out[]:

It is important to keep in mind that several transformations have been done to this data before running the linear model. Looking back at the correlations in the original data can therefore be informative:

In []:

for_cor<-read.csv("../../../r_projects/world-bank-sustainability-themes/data/wb_tidy.csv", row.names = 1)%>%  #re-import tidy
  select(Country,                # select columns that were selected through linear modelling
         Renewable.electricity.output, 
         Agricultural.land, 
         CO2.emissions, 
         Government.Effectiveness, 
         Proportion.of.seats.held.by.women.in.parliaments, 
         Access.to.clean.fuels.and.technologies.for.cooking,
         Unemployment, 
         Cause.of.death.communicable.diseases.and.maternal.prenatal.and.nutrition.conditions,
         Life.expectancy.at.birth)

# assign correlation of each predictor and life expectancy to a named list
corlist<- list()
for (i in 2:length(finalmodel$coefficients)){
  var<-colnames(for_cor)[i]
  col<-select(for_cor, var)
  corlist[i-1]<-cor(col, for_cor$Life.expectancy.at.birth)
  names(corlist)[i-1]<-var
}

# print dataframe of correlations and coefficients
bind_cols(gather(as.data.frame(corlist), variable, corr), as.data.frame(finalmodel$coefficients[2:9]))

Out[]:

variable	corr	finalmodel$coefficients[2:9]
Renewable.electricity.output	-0.1556605	-0.05392426
Agricultural.land	-0.2421077	-0.07850695
CO2.emissions	0.5193392	-0.21989181
Government.Effectiveness	0.7877637	0.40667231
Proportion.of.seats.held.by.women.in.parliaments	0.2432809	0.10579308
Access.to.clean.fuels.and.technologies.for.cooking	0.8263873	0.23561031
Unemployment	0.0925241	-0.11075051
Cause.of.death.communicable.diseases.and.maternal.prenatal.and.nutrition.conditions	-0.9009446	0.55441091

Predicting life expectancy from World Bank data

Project goals & Dataset¶

Summary of findings¶

Full project¶

Import and check the data¶

First step, as always: clean the data¶

To do:¶

To get a sense of all the variables in this data set...¶

I don't want to include all these variables in my model because many of them are filled in sparcely (many NAs)¶

Time for visual exploration¶

Next, it's time to prepare the data for regression¶

First, that means transforming the data¶

Next, we must scale the data¶

Visualizing again...¶

Can we quickly discover some relationships in the data?¶

I wanted to peek directly at the relationships between life expectancy and the predictors¶

Finally, it's time to make the linear regression model!¶

Backward elimination¶

This isn't the end of the story¶

One assumption to inspect is absence of collinearity¶

Another assumption of that of homoskedasticity¶

Last, there is the assumption of the normality of residuals¶

Summary of the final model and plot:¶

Plotting the model predictors and outcome¶

Return to the top