Exploration and Visualization of the 2023 World Population Dataset

Visualization and Data Exploration using Python

In this article, we'll explore and visualize the 2023 world population dataset as part of the AI Saturdays Lagos Cohort 8 assignment for the Data Visualization and Exploration lecture and lab.

About the Dataset

The World Population by Country 2023 dataset, which contains 2023 population figures for the countries of the world, is available on Kaggle.

The list includes both countries and dependent territories. The data is based on the latest United Nations Population Division estimates.

  • Country - Name of countries and dependent territories.

  • Population2023 - Population in the year 2023

  • YearlyChange - Percentage Yearly Change in Population

  • NetChange - Net Change in Population

  • Density(P/Km²) - Population density (population per square km)

  • Land Area(Km²) - Land area of countries / dependent territories.

  • Migrants(net) - Net number of migrants

  • Fert.Rate - Fertility rate

  • Med.Age - Median age of the population

  • UrbanPop% - Percentage of urban population

  • WorldShare - Share of the world population

The visualization and data exploration will be done using this dataset.

Population affects economic growth and development, positively or negatively, which is why a dataset of world population by country is a useful example to work with.

Tools

This exploration and visualization uses Python with the pandas, Matplotlib, seaborn, and Plotly libraries.

For this article, I'll be using Google Colab, but any integrated development environment or code editor can be used, such as PyCharm, Anaconda, or VS Code.

Data Exploration

  • Dataset overview

Here's what the Colab environment looks like.

Let's start by importing the modules needed for the data exploration.

import pandas as pd                   # for data manipulation and cleaning

import matplotlib.pyplot as plt         # for data visualization
import seaborn as sns                   # for data visualization
import plotly.express as px             # for data visualization

Click the play button at the top left of the cell to execute the block of code. A tick sign means the libraries have been imported and are ready for use.

Each piece of code to be executed should go in a new code block.

I added the downloaded Kaggle dataset of the 2023 world population to my Google Drive folder. This could be any location, depending on your choice of tools.

Next, mount the drive in Colab and read the file using the pandas module.

# Read the WorldPopulation2023.csv dataset into a dataframe

from google.colab import drive
drive.mount('/content/drive')

# absolute path, so the read works regardless of the current working directory
df = pd.read_csv('/content/drive/MyDrive/WorldPopulation2023.csv')

After you click the execute button, Colab will ask to connect and request permission to access the Google Drive folder.

If the drive is mounted successfully, it shows the message Mounted at /content/drive.

Next, let's show the first 5 records using the head() method in pandas.

df.head()

# df.head(10) shows the first 10 records
# df.tail() shows the last 5 records of the file

The sample records show that some data cleaning is needed to ensure smooth visualization.

  • To view the columns in the dataframe
df.columns

  • To view the size of the dataset
df.size

  • To view the shape of the dataset
#Check the shape of the data
df.shape

  • To view more features of the dataset
df.info(verbose = False)

  • To view the data types
# Data types for each column in dataframe
df.dtypes
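The difference between size and shape is easy to miss, so here is a minimal sketch on a small made-up dataframe (the rows below are invented for illustration and are not taken from the Kaggle file). shape gives (rows, columns), while size gives the total number of cells.

```python
import pandas as pd

# Made-up sample rows, for illustration only
sample = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Population2023": [1000, 2000, 3000],
    "Density": [10, 20, 30],
    "WorldShare": ["0.1%", "0.2%", "0.3%"],
})

n_rows, n_cols = sample.shape   # (number of rows, number of columns)
n_cells = sample.size           # total number of cells = rows * columns

print(sample.shape)
print(sample.size)
```

For any dataframe, size equals the number of rows multiplied by the number of columns.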

Looking at the output of df.head(), you'll notice a % sign attached to some values, an N.A. value in one column, and numeric values stored as int or object rather than float.
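Rather than spotting these markers by eye, they can also be searched for programmatically. The sketch below is one possible approach on made-up rows; the helper name find_text_markers is hypothetical and is not part of pandas or of the original notebook.

```python
import pandas as pd

# Made-up sample rows, for illustration only
sample = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "YearlyChange": ["0.81 %", "-0.02 %", "0.50 %"],
    "UrbanPop%": ["36 %", "N.A.", "88 %"],
    "Population2023": [1000, 2000, 3000],
})

def find_text_markers(frame, markers=("%", "N.A.")):
    """Return {column: [markers found]} for the non-numeric columns."""
    found = {}
    for column in frame.columns:
        # numeric columns cannot contain text markers, so skip them
        if pd.api.types.is_numeric_dtype(frame[column]):
            continue
        values = frame[column].astype(str)
        # regex=False matches the marker literally (the dots in N.A. are not wildcards)
        hits = [m for m in markers if values.str.contains(m, regex=False).any()]
        if hits:
            found[column] = hits
    return found

flagged = find_text_markers(sample)
print(flagged)
```

The columns reported here are the ones that need cleaning before they can be converted to numbers.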

Exploratory Data Analysis (EDA)

The following steps are taken during exploratory data analysis to understand the dataset properly.

  • To describe the dataset
# Dataframe description
df.describe()

This shows the count, mean, standard deviation, minimum and maximum values, and the 25th, 50th, and 75th percentiles of each numeric column.

In Colab, you can use the two icons at the top right of the output to display more information, including visualizations.

  • To show the info
# Dataframe information
df.info()

Data Preprocessing and Cleaning

Data preprocessing and cleaning is the stage of reviewing the accuracy of the data, cleaning up unwanted values and modifying column data types to ensure proper data exploration and visualization.

  • To check for columns with null values
print("Number of missing values")
df.isnull().sum()

The output shows that Fert.Rate and MedianAge have missing values.
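It can also help to see missing values as a percentage of the rows. The following is a small sketch on made-up data (the column names follow the ones used in this article, the values are invented).

```python
import numpy as np
import pandas as pd

# Made-up sample rows, for illustration only
sample = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Fert.Rate": [1.7, np.nan, 2.4, np.nan],
    "MedianAge": [38.0, 29.0, np.nan, 41.0],
})

# count of missing values and share of rows affected, per column
missing = pd.DataFrame({
    "missing": sample.isnull().sum(),
    "percent": (sample.isnull().mean() * 100).round(1),
})

# show only the columns that actually have missing values
print(missing[missing["missing"] > 0])
```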

  • To check if there are duplicate records
# Check for duplicates
duplicates = df.duplicated()
duplicates.sum()

There are no duplicate records.

Duplicate records can be removed (dropped) during data cleaning using the code below.

# Drop duplicate rows from the dataframe
df = df.drop_duplicates()

  • To keep a copy of the original dataset use the code below
# save a copy of the data
df_data_copy = df.copy()
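Since the real dataset has no duplicates, the effect of these calls is easier to see on a small made-up dataframe that does contain one. This is only an illustrative sketch.

```python
import pandas as pd

# Made-up sample rows with one repeated record, for illustration only
sample = pd.DataFrame({
    "Country": ["A", "B", "B", "C"],
    "Population2023": [1000, 2000, 2000, 3000],
})

original = sample.copy()                       # untouched backup of the data
n_duplicates = int(sample.duplicated().sum())  # rows that repeat an earlier row

# drop_duplicates returns a new dataframe; sample itself is not modified
deduplicated = sample.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(len(original), len(deduplicated))
```

Note that drop_duplicates() does not change the dataframe in place, so the result has to be assigned to a variable.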

To keep the code clean, we'll define a wrangle_data function for the preprocessing and cleaning of data.

The wrangle_data function does the following:

  • removes the % sign from the columns that contain it

  • replaces the N.A. value in the UrbanPop% column with 0

  • changes the datatype of columns according to their values

  • replaces missing values with 0.0

def wrangle_data(data):

  # remove the % sign to allow for easy calculation
  data['YearlyChange'] = data['YearlyChange'].str.replace('%', '', regex=False)
  data['WorldShare'] = data['WorldShare'].str.replace('%', '', regex=False)
  data['UrbanPop%'] = data['UrbanPop%'].str.replace('%', '', regex=False)

  # find and replace the N.A. string (regex=False so the dots are matched literally)
  data['UrbanPop%'] = data['UrbanPop%'].str.replace('N.A.', '0', regex=False)

  # Convert YearlyChange from string to float data type
  data['YearlyChange'] = data['YearlyChange'].astype(float)

  # Convert UrbanPop% from string to integer data type
  data['UrbanPop%'] = data['UrbanPop%'].astype(int)

  # Convert WorldShare from string to float data type
  data['WorldShare'] = data['WorldShare'].astype(float)

  # fill the remaining NaN entries in every column with 0.0
  for column in data.columns:
      data[column] = data[column].fillna(0.0)

  return data

Next, execute the wrangle_data function by passing the dataframe as an argument, then use the pandas head() method to show the cleaned data.

# execute the function
df_data = wrangle_data(df)

df_data.head()

You can see that the N.A. values, % signs, and data types have been taken care of.

# Check the dtypes of the cleaned dataframe to confirm the data types are in order
df_data.dtypes
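As a possible variation on the cleaning step, the sketch below works on a copy and uses pd.to_numeric with errors="coerce", which turns unparseable text such as N.A. into NaN instead of 0. The helper name clean_percent_columns and the sample rows are made up for illustration. This is an alternative, not the approach used in this article.

```python
import pandas as pd

# Made-up sample rows, for illustration only
sample = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "YearlyChange": ["0.81 %", "-0.02 %", "0.50 %"],
    "UrbanPop%": ["36 %", "N.A.", "88 %"],
    "WorldShare": ["17.76 %", "0.10 %", "1.50 %"],
})

def clean_percent_columns(frame, columns):
    """Return a cleaned copy; the input dataframe is left unchanged."""
    cleaned = frame.copy()
    for column in columns:
        # strip the % sign and surrounding spaces
        text = cleaned[column].astype(str).str.replace("%", "", regex=False).str.strip()
        # values that are not numbers (such as "N.A.") become NaN
        cleaned[column] = pd.to_numeric(text, errors="coerce")
    return cleaned

cleaned = clean_percent_columns(sample, ["YearlyChange", "UrbanPop%", "WorldShare"])
print(cleaned.dtypes)
print(cleaned["UrbanPop%"].isna().sum())
```

Keeping unknown values as NaN means they are skipped by methods like mean() and corr(), whereas a 0 would be treated as a real measurement. Which behaviour is preferable depends on the analysis.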

Data Visualization

Top 10 Most Populated Countries

most_populated_countries = df_data.sort_values('Population2023', ascending=False)
most_populated_countries = most_populated_countries[["Country", "Population2023"]]
print(most_populated_countries.head(10))

10 Least Populated Countries

least_populated_countries = df_data.sort_values('Population2023', ascending=True)
least_populated_countries = least_populated_countries[["Country", "Population2023"]]
print(least_populated_countries.head(10))
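pandas also offers nlargest() and nsmallest(), which select the top or bottom rows without sorting the whole dataframe first. Here is a minimal sketch on made-up rows.

```python
import pandas as pd

# Made-up sample rows, for illustration only
sample = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E"],
    "Population2023": [500, 50, 5000, 5, 50000],
})

columns = ["Country", "Population2023"]

# the 2 most and the 2 least populated rows
top_2 = sample.nlargest(2, "Population2023")[columns]
bottom_2 = sample.nsmallest(2, "Population2023")[columns]

print(top_2)
print(bottom_2)
```

With the real dataset, the first argument would be 10 instead of 2.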

Correlation Analysis

We use a heatmap to visualize the correlation between the numeric features in the dataset.

# Using heatmap
# numeric_only=True leaves out text columns such as Country

plt.figure(figsize=(10,8))
plt.title("World population Heatmap");
dataplot = sns.heatmap(df_data.corr(numeric_only=True), cmap="YlGnBu", annot=True)
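A heatmap shows every pair twice, so it can be handy to list the pairs once, ordered by strength. The sketch below does this for a small made-up dataframe by keeping only the upper triangle of the correlation matrix. The values are invented and the approach is just one way of doing it.

```python
import numpy as np
import pandas as pd

# Made-up sample rows, for illustration only
sample = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E"],
    "Fert.Rate": [1.5, 2.0, 3.0, 4.5, 5.0],
    "MedianAge": [45, 38, 30, 20, 18],
    "YearlyChange": [0.1, 0.5, 1.2, 2.4, 2.6],
})

corr = sample.corr(numeric_only=True)

# True above the diagonal, so each pair is kept once and self-correlations are dropped
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

pairs = (
    corr.where(mask)   # blank out the diagonal and the lower triangle
        .stack()       # matrix -> Series indexed by (row, column)
        .dropna()
        .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(pairs)
```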

Relationships Between Features

  • Migrants vs. Population Density
plt.figure(figsize=(10,8))

# scatter plot of Migrants vs. Population Density
plt.title("Relationship between Migrants and Density");
sns.scatterplot(data=df_data, x='Migrants(net)', y='Density(P/Km²)')

# correlation coefficient between the two columns
print(df_data['Migrants(net)'].corr(df_data['Density(P/Km²)']))

  • Relationship between MedianAge and Yearly Change
#Relationship between MedianAge and Yearly Change
print(df_data['MedianAge'].corr(df_data['YearlyChange']))

  • Relationship between Fert.Rate and Yearly Change
#Relationship between Fert.Rate and Yearly Change
print(df_data['Fert.Rate'].corr(df_data['YearlyChange']))
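By default, Series.corr() computes the Pearson correlation coefficient, a value between -1 and 1. To make that concrete, the sketch below recomputes it by hand on made-up numbers, and also derives a rank-based (Spearman) value as the Pearson correlation of the ranks.

```python
import numpy as np
import pandas as pd

# Made-up values, for illustration only
x = pd.Series([1.5, 2.0, 3.0, 4.5, 5.0], name="Fert.Rate")
y = pd.Series([0.1, 0.5, 1.2, 2.4, 2.6], name="YearlyChange")

# pandas default: Pearson correlation
r_pandas = x.corr(y)

# the same value from the definition:
# sum of products of deviations / sqrt(product of the sums of squared deviations)
x_dev = x - x.mean()
y_dev = y - y.mean()
r_manual = (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

# Spearman correlation computed as the Pearson correlation of the ranks
r_spearman = x.rank().corr(y.rank())

print(r_pandas, r_manual, r_spearman)
```

A value close to 1 or -1 indicates a strong linear relationship, while a value close to 0 indicates a weak one. Correlation alone does not show that one feature causes the other.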

Additional Insights

  • Bar chart representation of the 10 most populated countries
population_in_countries = df_data.nlargest(10, 'Population2023').set_index('Country')['Population2023']
plt.figure(figsize=(15, 6))

# bar chart of population per country
population_in_countries.plot(kind='bar');

  • Using seaborn to display the data
#using seaborn
fig, ax = plt.subplots()
fig.set_size_inches(15, 6)

ax = sns.barplot(x = population_in_countries.index, y = population_in_countries.values)
ax.tick_params(axis='x', rotation=90);

In my view, the seaborn chart is more visually presentable than the default pandas plot.
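When country names are long, a horizontal bar chart can be easier to read than rotated labels. The sketch below uses plain Matplotlib on made-up figures. The populations are invented, and showing them in millions is an assumption made for readability.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample rows, for illustration only
sample = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Population2023": [1_400_000_000, 340_000_000, 220_000_000],
})

# pick the largest rows, then sort ascending so the biggest bar ends up on top
top = sample.nlargest(3, "Population2023").sort_values("Population2023")

fig, ax = plt.subplots(figsize=(8, 4))
bars = ax.barh(top["Country"], top["Population2023"] / 1e6)
ax.set_xlabel("Population (millions)")
ax.set_title("Most populated countries (sample data)")
ax.bar_label(bars, fmt="%.0f")   # write the value at the end of each bar
fig.tight_layout()
```

In a notebook the figure is displayed automatically after the cell runs.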

Conclusion

In this article, we have covered data exploration and visualization using the world population 2023 dataset from Kaggle. With Python and its libraries, data can be analyzed, visualized, and turned into a story.

The learning experience and the writing of this article have been amazing, and the learning will continue.

Article Resources

Acknowledgments

The AI Saturdays Lagos Cohort 8 organizing team has made my artificial intelligence learning an awesome experience, with great tutoring sessions in the Data Visualization and Exploration class and lab by Aseda Addai-Deseh and Oluwaseun Ajayi.

My motivation for this assignment is the impact data science has in aiding decision-making and visualizing datasets, along with my passion for artificial intelligence, machine learning, and data science.

Follow X: Alemsbaja | YouTube: Tech with Alemsbaja to stay updated on more articles.

Found this helpful or resourceful? Kindly share, and feel free to use the comment section for questions, answers, and contributions.

Support Alemoh Rapheal Baja by becoming a sponsor. Any amount is appreciated!