Midterm Project

Author
Affiliation

Ben Akyureklier

George Washington University

Published

October 20, 2025

Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import altair as alt
from sklearn.datasets import load_iris
import plotly.express as px
import plotly.io as pio
pio.renderers.default='plotly_mimetype+notebook_connected'
import warnings
warnings.filterwarnings('ignore')

Dataset:

Source: World Bank (World Development Indicators Database)

Timeframe: 2005-2024

Reasearch Questions:

How do different factors of population (size, growth, life expectancy, migration) relate to GDP per capita?

Can we learn anything about wealth inequality by looking at any metrics of population or GDP?

Are there any trends between countries or time periods? Are there any deviations from ususal patterns?

Code
wb=pd.read_csv('data/wbMT.csv')
wb.columns.values[-20:]=wb.columns[-20:].str[:-9]
wb=wb.dropna()
wb2=pd.read_csv("Data/WBnew.csv")
wb2.columns.values[-20:]=wb2.columns[-20:].str[:-9]
wb2=wb2.dropna()
cc=wb['Country Code'].unique()
wb2=wb2[wb2['Country Code'].isin(cc)]

def mp(x):
    years=[]
    for i in range (2005, 2025):
        x[str(i)]=pd.to_numeric(x[str(i)], errors='coerce')
        years.append(str(i))
    x=pd.melt(x, id_vars=['Country Name','Series Name'], value_vars=years, var_name='Year', value_name='Value')
    x=x.pivot(index=['Country Name', 'Year'], columns='Series Name', values='Value').reset_index()
    return x

wb=mp(wb)
wb2=mp(wb2)
dt=pd.merge(wb, wb2, on=['Country Name', 'Year'])
dt
Series Name Country Name Year GNI per capita growth (annual %) Gini index Population growth (annual %) Population, total Unemployment, total (% of total labor force) (national estimate) GDP per capita (current US$) Hospital beds (per 1,000 people) Income share held by highest 10% Life expectancy at birth, total (years) Net migration Real interest rate (%) Researchers in R&D (per million people) Secure Internet servers (per 1 million people)
0 Argentina 2005 6.447488 47.8 1.027458 39216789.0 11.506 5067.653423 4.00 35.0 75.231000 -22068.0 NaN 816.841187 NaN
1 Argentina 2006 17.601678 46.4 1.028248 39622115.0 10.078 5869.380290 NaN 33.8 75.279000 -20109.0 NaN 888.891663 NaN
2 Argentina 2007 9.531659 46.3 0.991102 40016763.0 8.470 7185.251551 NaN 33.7 74.783000 -18099.0 NaN 971.290100 NaN
3 Argentina 2008 5.235697 45.0 1.012889 40424148.0 7.837 8944.110266 NaN 32.5 75.428000 -12517.0 NaN 1032.674805 NaN
4 Argentina 2009 -8.149816 43.8 1.059775 40854831.0 8.645 8150.235270 NaN 31.3 75.577000 -7882.0 NaN 1031.641357 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
235 United States 2020 -2.866961 40.0 0.408428 331577720.0 8.055 64401.507435 2.74 29.3 76.980488 329769.0 2.186282 4464.094727 140775.773475
236 United States 2021 5.528107 39.7 0.157317 332099760.0 5.349 71307.401728 NaN 29.7 76.329268 674787.0 -1.258522 4825.180176 156949.432905
237 United States 2022 1.852199 41.7 0.575745 334017321.0 3.650 77860.911291 NaN 30.4 77.434146 1319009.0 NaN NaN 180213.052484
238 United States 2023 1.821216 41.8 0.831493 336806231.0 3.638 82304.620427 NaN 30.4 78.385366 1322668.0 NaN NaN 186692.908897
239 United States 2024 1.633877 NaN 0.976422 340110988.0 4.022 85809.900385 NaN NaN NaN 1286132.0 NaN NaN 196554.134852

240 rows × 15 columns

First, lets take a look at the make-up of the dataset. We will want to see the years and countries that are included, and a visaulization of GDP per Capita.

Code
plt.figure(figsize=(12, 6))
sns.lineplot(data=dt, x='Year', y='GDP per capita (current US$)', hue='Country Name', palette='tab10')
plt.title('GDP per Capita (2005-2024)')
plt.xlabel('Year')
plt.ylabel('GDP per Capita (current US$)')
plt.tight_layout()
plt.legend(title='Country', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()

Code
dy=dt.groupby('Country Name')['GDP per capita (current US$)'].mean()
filter=dy[dy>dy.mean()]
dt['Wealthy']=dt['Country Name'].isin(filter.index)
wdy=dt.groupby('Wealthy')['GDP per capita (current US$)'].mean()

plt.figure(figsize=(12, 6))
p=sns.lineplot(data=dt, x='Year', y='GDP per capita (current US$)', hue='Country Name', palette='tab10')
p.axhline(y=wdy[0], color='black', linestyle='--', linewidth=1.1)
p.axhline(y=wdy[1], color='black', linestyle='--', linewidth=1.1)
p.text(x='2005', y=wdy[0]+7000, s='Average GDP per capita (Poorer) $'+str(round(wdy[0], 2)), color='black', va='top')
p.text(x='2005', y=wdy[1]+7000, s='Average GDP per capita (Wealthy) $'+str(round(wdy[1], 2)), color='black', va='top')
plt.title('GDP per Capita (2005-2024)')
plt.xlabel('Year')
plt.ylabel('GDP per Capita (current US$)')
plt.tight_layout()
plt.legend(title='Country', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()

There are clearly two groups of countries in this dataset, with the dividing factor being economies. We are going to want to seperate the strong and weak economies so we can take a closer look at each group. Hopefully this will allow is to see more details and organize the data for comparisons.

Code
dy=dt.groupby('Country Name')['GDP per capita (current US$)'].mean()
filter=dy[dy>dy.mean()]
dt['Wealthy']=dt['Country Name'].isin(filter.index)

filtered_dt=dt[dt['Wealthy']==True]

plt.figure(figsize=(12, 6))
sns.lineplot(data=filtered_dt, x='Year', y='GDP per capita (current US$)', hue='Country Name', palette='tab10')
plt.title('GDP per Capita of Wealthy Nations (2004-2025)')
plt.xlabel('Year')
plt.ylabel('GDP per Capita (current US$)')
plt.legend(title='Country', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

One part that stands out is the United States growth following 2013, as they had a GDP per Capita $10,000 higher than all other countries every year after. Looking at the other 4 countries, the dips and rises seem to follow a pattern; this gives us a key insight that France, Germany, the U.K, and Canada will have similar variance throughout the years as they seem to have equal impact from world events.

Code
wc=dt[dt["Country Name"].isin(["Germany", "Canada", "United Kingdom"])]
pc=dt[dt["Country Name"].isin(["China", "Mexico", "Brazil"])]
i=0
fig, axes=plt.subplots(1, 3, figsize=(15, 5), sharey=True)
for c in wc['Country Name'].unique():
    cur=wc[wc['Country Name']==c]
    axes[i].plot(cur['Year'], cur['GDP per capita (current US$)'], marker='o')
    axes[i].set_xticks(cur['Year'][::3])
    axes[i].set_title(c)
    axes[i].set_xlabel("Year")
    axes[0].set_ylabel("GDP per Capita (current US$)")
    i+=1
fig.suptitle('GDP per Capita (Wealthy nations 2005-2024)')
plt.tight_layout()
plt.show()
i=0
fig, axes = plt.subplots(1, 3, figsize=(15, 5), sharey=True)
for c in pc['Country Name'].unique():
    cur=pc[pc['Country Name']==c]
    axes[i].plot(cur['Year'], cur['GDP per capita (current US$)'], marker='o')
    axes[i].set_xticks(cur['Year'][::3])
    axes[i].set_title(c)
    axes[i].set_xlabel("Year")
    axes[0].set_ylabel("GDP per Capita (current US$)")
    i+=1
fig.suptitle('GDP per Capita (Poorer nations 2005-2024)')
plt.tight_layout()
plt.show()

For this visualization I chose countries that shared a similar range of values in order to increase readability. This graph shows us the trends that each economy had over the last 20 years. China is the only country that was able to keep consistent growth, which may be due to size of their economy because the United States also saw stable growth. It seems the nations that are not major powers have a lot more volatility throughout the years, and are impacted more by world events like the 2008 financial crash. Next, lets look at how GDP per capita relates to another measurement in the dataset.

Code
fig=px.scatter(dt, x='Net migration', y='GDP per capita (current US$)', trendline='ols', hover_data=['Country Name', 'Year'], 
                  title='Net Migration vs GDP per Capita (2005-2024)')
fig.update_layout(xaxis_title='Net Migration', yaxis_title='GDP per Capita (current US$)', width=950, height=600)
fig

From this visualization we can see a trend of people migrating from poorer nations to wealthier ones in the dataset. Although we do not have many countries in the sample, it is reasonable to state that migration is positively correlated with GDP per capita; this is understandable as people often migrate for economic reasons. For the next comparison, we will start by analyzing the distribution for life expectancy in our dataset, and then life expectancy with GDP per capita.

Code
dl=dt.sort_values('Life expectancy at birth, total (years)')
plt.figure(figsize=(15, 7))
sns.boxplot(x='Country Name', y='Life expectancy at birth, total (years)', data=dl)
plt.title('Life Expectancy Distribution (2005-2024)')
plt.show()

It looks as though the distribution is fairly uniform for all the countries asides from South Africa. In order to get a clean comparison of life expectancy and GDP per capita, it would be best to drop South Africa from the final chart as the results may be skewed and uneven if we keep it.

Code
dl=dl[dl["Country Name"]!="South Africa"]
fig=px.scatter(dl, x='Life expectancy at birth, total (years)', y='GDP per capita (current US$)', trendline='lowess', hover_data=['Country Name', 'Year'], 
                  title='Life Expectancy vs GDP per Capita (2005-2024)')
fig.update_layout(xaxis_title='Life expectancy at birth, total (years)', yaxis_title='GDP per Capita (current US$)', width=950, height=600)
fig

We can once again see our two groups of countries seperated by GDP per capita. The obvious takeaway from this chart is countries with higher life expectancies are likely to have a high GDP per capita. Another insight is the high variance in poorer countries, with some having 10 years less life expectancy than other countries with similar GDP per capita. Meanwhile, the wealthy countries all have a life expectancy around 80 years, with the only deviation being the United States. Maybe looking at income inequality can tell us more about the different relationships of life expectancy and wealth.

Code
dlf=dl[~dl['Year'].isin(['2024', '2023', '2022', '2021', '2020'])]
plt.figure(figsize=(12, 6))
sns.scatterplot(data=dlf, x='Life expectancy at birth, total (years)', y='Gini index', hue='Wealthy', palette='tab10')
sns.regplot(data=dlf, x='Life expectancy at birth, total (years)', y='Gini index', color='grey', scatter=False, ci=None)
plt.title('Life Expectancy vs Gini Index (2005-2019)')
plt.xlabel('Life expectancy at birth, total (years)')
plt.ylabel('Gini Index (%)')
plt.legend(title='Above average GDP', loc='upper right')
plt.show()

For context, the Gini Index measures income inequality, with higher values indicating more disparity. The 5 countries that have an above average GDP per capita all have relatively low Gini scores, with the United States being the likely outlier. Looking at all the other countries, we see a lot of variance in Gini scores compared to life expectance. It seems as though Russia and India, the countries with the lowest life expectancy in the dataset, have lower Gini scores than expected, while most of the other non-wealthy nations have relatively high income inequality. While the correlation between income inequality and life expectancy are ambiguous for countries with low GDPs, the wealthy nations follow a trend of negative correlation between income inequality and life expectancy.

Findings:

  • The United Kingdom, United States, France, Germany, and Canada have drastically higher GDP per capita than the other countries in the dataset.
    • The United States has had much more GDP growth over the last 10 years.
  • The largest economies are able to have stable GDP growth during global market fluctuations.
  • On average, people migrate from low to high GDP per capita countries.
  • South Africa has seen lots of change in life expectancy over the last 20 years and very low totals overall.
  • Life expectancy is varied in poorer nations, but consistantly high in countries with high GDP per capita.
  • Wealthy nations with less income inequality are likely to have higher life expectancies.

Reflection:

For this project, I wanted to find some relationships between some measurements of important aspects of a nation such as wealth, health, and equality. Although the sample size is not very large, by looking at various countries we can see trends emerging from the visualizations. I decided to revise my first model of GDP per capita to include averages of the two groups I focused on throughout the research. I think adding that extra layer of information is great for setting up the rest of my work because it is emphasizing the divide in the dataset. The small multiples chart was adequate in showing the variation within the two groups of wealth, however, I do think the overall layout is a bit messy and can be hard for readers to decipher. Nonetheless, it does the job it is meant to so I decided to keep it in the project. The last point I’d like to touch on are the trendlines in the data. I will be using trendlines more often in my visaulizations as I believe it does a great job of helping viewers classify the information and get a general sense of the trends formed by each point. Overall, the data wrangling for this project was very managable once I was able to get everything setup and I hope the visualizations do a good job at representing the dataset and global trends.