Global Alcohol Consumption: Predictive Modeling & Analytics¶

By Mohammad Sayem Chowdhury
Last Updated: June 13, 2025

Project Overview¶

This advanced analytics project develops sophisticated predictive models for global alcohol consumption patterns using comprehensive cross-cultural data. Through machine learning techniques and statistical modeling, I create systems that can accurately predict total alcohol consumption based on beverage-specific consumption patterns across 193 countries.

Research Objectives¶

  • Predictive Modeling: Develop high-accuracy models for total alcohol consumption prediction
  • Feature Analysis: Identify the most predictive beverage types and consumption patterns
  • Cultural Insights: Understand how different beverage preferences influence overall consumption
  • Policy Applications: Provide data-driven insights for public health and policy decisions

Technical Approach¶

My methodology incorporates multiple regression techniques, feature engineering strategies, and cross-validation frameworks to build robust models suitable for both research applications and practical policy implementation.


Author: Mohammad Sayem Chowdhury
Project Type: Predictive Analytics & Machine Learning
Domain: Global Health & Social Analytics
Scope: 193 Countries - Cross-Cultural Analysis

Table of Contents¶

  1. Environment Setup & Data Preparation

    • Library configuration and imports
    • Global alcohol consumption dataset loading
    • Data quality assessment and preprocessing
  2. Exploratory Data Analysis for Modeling

    • Feature distribution analysis
    • Correlation pattern identification
    • Target variable examination
  3. Feature Engineering & Selection

    • Predictive feature creation
    • Correlation analysis for feature selection
    • Dimensionality reduction techniques
  4. Model Development & Training

    • Multiple regression approaches
    • Polynomial feature engineering
    • Regularization technique implementation
  5. Model Evaluation & Validation

    • Performance metrics analysis
    • Cross-validation procedures
    • Model comparison and selection
  6. Business Applications & Insights

    • Predictive model interpretation
    • Policy implications and recommendations
    • Future research directions

Executive Summary¶

Predictive Modeling Challenge¶

Primary Goal: Accurately predict total alcohol consumption (in liters of pure alcohol) based on beer, wine, and spirit servings across diverse global populations.

Expected Outcomes¶

This analysis will deliver:

  • High-Accuracy Models: R² ≥ 0.85 for total consumption prediction
  • Feature Insights: Clear understanding of which beverage types drive overall consumption
  • Cultural Patterns: Regional consumption behavior modeling
  • Policy Applications: Evidence-based recommendations for alcohol regulation strategies

Strategic Value: Enabling evidence-based public health decisions and cross-cultural understanding of alcohol consumption behaviors.

In [25]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl


Step 1: Importing the Data

Let's load the dataset and take a first look at the data.

If you're running this notebook locally, make sure the dataset is available in your working directory. Otherwise, you may need to download it.

In [27]:
path = "drinks.csv"

Let's load the CSV file into a pandas DataFrame:

In [29]:
df = pd.read_csv(path)

I'll use the head() method to preview the first few rows of the dataset:

In [30]:
df.head()
Out[30]:
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
0 Afghanistan 0 0 0 0.0 Asia
1 Albania 89 132 54 4.9 Europe
2 Algeria 25 0 14 0.7 Africa
3 Andorra 245 138 312 12.4 Europe
4 Angola 217 57 45 5.9 Africa

Step 2: Data Exploration

Let's check the data types of each column to understand the structure of our dataset:

In [31]:
df.dtypes
Out[31]:
country                          object
beer_servings                     int64
spirit_servings                   int64
wine_servings                     int64
total_litres_of_pure_alcohol    float64
continent                        object
dtype: object
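
Before modeling, it's worth confirming the dataset is clean. A minimal data-quality sketch is below; the `sample` frame is a hypothetical three-row stand-in with the same schema as drinks.csv, used only so the snippet is self-contained:

```python
import pandas as pd

# Hypothetical mini-sample mirroring the drinks.csv schema (illustration only).
sample = pd.DataFrame({
    "country": ["Afghanistan", "Albania", "Algeria"],
    "beer_servings": [0, 89, 25],
    "spirit_servings": [0, 132, 0],
    "wine_servings": [0, 54, 14],
    "total_litres_of_pure_alcohol": [0.0, 4.9, 0.7],
    "continent": ["Asia", "Europe", "Africa"],
})

# Count missing values per column and flag any duplicated country rows.
missing_per_column = sample.isnull().sum()
duplicate_countries = sample["country"].duplicated().sum()
print(missing_per_column)
print("duplicate countries:", duplicate_countries)
```

On the real dataset the same two checks (per-column nulls, duplicated keys) would run against `df` instead of `sample`.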

Question 2: Using the groupby method, I'll total the wine servings per continent to see how consumption compares across regions:

In [32]:
df_wine_group = df[['continent','wine_servings']]
wine_group = df_wine_group.groupby(['continent'],as_index=False).sum()
wine_group
Out[32]:
continent wine_servings
0 Africa 862
1 Asia 399
2 Europe 6400
3 North America 564
4 Oceania 570
5 South America 749

Question 3: Next, I'll perform a statistical summary of beer servings by continent to see how drinking habits vary across the globe:

In [33]:
df_beer_group = df[['continent','beer_servings']]
beer_group = df_beer_group.groupby(['continent'],as_index=False).describe()
beer_group
Out[33]:
beer_servings
       continent  count        mean        std   min     25%    50%     75%    max
0         Africa   53.0   61.471698  80.557816   0.0   15.00   32.0   76.00  376.0
1           Asia   44.0   37.045455  49.469725   0.0    4.25   17.5   60.50  247.0
2         Europe   45.0  193.777778  99.631569   0.0  127.00  219.0  270.00  361.0
3  North America   23.0  145.434783  79.621163   1.0   80.00  143.0  198.00  285.0
4        Oceania   16.0   89.687500  96.641412   0.0   21.00   52.5  125.75  306.0
5  South America   12.0  175.083333  65.242845  93.0  129.50  162.5  198.00  333.0

Question 4: Let's visualize the distribution of beer servings by continent using a boxplot. This helps me spot differences and outliers in drinking patterns:

In [34]:
import seaborn as sns 
plt.figure(figsize=[15,7])
sns.boxplot(x="continent", y="beer_servings", data=df)
Out[34]:
<AxesSubplot:xlabel='continent', ylabel='beer_servings'>
[Boxplot of beer_servings by continent]

Question 5: To understand the relationship between wine and beer servings, I'll use a regression plot and calculate the correlation coefficient:

In [35]:
from scipy import stats
pearson_coef, p_value = stats.pearsonr(df['wine_servings'], df['beer_servings'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)  
The Pearson Correlation Coefficient is 0.5271716935065209  with a P-value of P = 3.378401743961718e-15
In [36]:
width = 12
height = 10
plt.figure(figsize=(width, height))
sns.regplot(x="wine_servings", y="beer_servings", data=df)
plt.ylim(0,)
Out[36]:
(0.0, 411.9570413986357)
[Regression plot of beer_servings vs. wine_servings]

The analysis shows a moderate positive correlation between wine and beer servings (r ≈ 0.53, p < 0.001): countries that drink more beer also tend to drink more wine, although beer dominates in some countries while others show a more balanced mix.
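
To complement the single Pearson test above, a full pairwise correlation matrix shows at a glance how each beverage tracks the others and the target. The frame below is a small synthetic stand-in for the numeric columns of drinks.csv (on the real data this would be `df.corr(numeric_only=True)`):

```python
import pandas as pd

# Synthetic stand-in for the numeric columns of drinks.csv (illustration only).
data = pd.DataFrame({
    "beer_servings": [0, 89, 25, 245, 217],
    "spirit_servings": [0, 132, 0, 138, 57],
    "wine_servings": [0, 54, 14, 312, 45],
    "total_litres_of_pure_alcohol": [0.0, 4.9, 0.7, 12.4, 5.9],
})

# Pairwise Pearson correlations between all numeric columns.
corr = data.corr()
print(corr.round(2))
```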

Step 3: Predictive Modeling

I'll fit a linear regression model to predict the total liters of pure alcohol consumed based on wine servings. Let's also calculate the $R^2$ score to evaluate the model's performance:

In [63]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

y_data = df['total_litres_of_pure_alcohol']
x_data=df.drop('total_litres_of_pure_alcohol',axis=1)
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.10, random_state=0)


print("number of test samples :", x_test.shape[0])
print("number of training samples:",x_train.shape[0])

lm=LinearRegression()
lm.fit(x_train[['wine_servings']], y_train)
yhat_train = lm.predict(x_train[["wine_servings"]])
print("Train Predicted Value: ",yhat_train[0:5])
yhat_test = lm.predict(x_test[["wine_servings"]])
print("Test Predicted Value: ",yhat_test[0:5])
print("R^2 for Test Data",lm.score(x_test[['wine_servings']], y_test))
print("R^2 for Train Data",lm.score(x_train[['wine_servings']], y_train))
print("Intercept is ",lm.intercept_)
print("Slope is ",lm.coef_)
number of test samples : 20
number of training samples: 173
Train Predicted Value:  [3.25885396 3.25885396 3.22692722 3.29078071 3.54619468]
Test Predicted Value:  [3.22692722 3.25885396 6.83464952 3.25885396 3.22692722]
R^2 for Test Data 0.17698593687059994
R^2 for Train Data 0.45741941870252656
Intercept is  3.2269272171534653
Slope is  [0.03192675]

Note: For consistency, I'll use a test size of 10% and a random state of 0 for all modeling steps.
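
A single 90/10 split gives a noisy R² estimate, especially with only 20 test samples; k-fold cross-validation averages over several splits instead. A sketch with `cross_val_score`, using synthetic servings-like data in place of the real `df` so the snippet is self-contained:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic servings-like data (the real notebook would pass
# df[["beer_servings", "wine_servings", "spirit_servings"]] and the target).
rng = np.random.default_rng(0)
X = rng.integers(0, 400, size=(50, 3)).astype(float)
y = X @ np.array([0.018, 0.016, 0.016]) + rng.normal(0, 0.5, size=50)

# 5-fold cross-validated R^2: a more stable estimate than one split.
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print("fold R^2 scores:", scores.round(3))
print("mean R^2:", scores.mean().round(3))
```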

Question 7: Now, I'll use multiple features (beer, wine, and spirit servings) to predict total alcohol consumption, split the data into training and testing sets, and determine the $R^2$ on the test data:

In [64]:
from sklearn.model_selection import train_test_split

y_data = df['total_litres_of_pure_alcohol']
x_data=df.drop('total_litres_of_pure_alcohol',axis=1)
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.10, random_state=0)

print("number of test samples :", x_test.shape[0])
print("number of training samples:",x_train.shape[0])

lm=LinearRegression()
lm.fit(x_train[["beer_servings", "wine_servings", "spirit_servings"]], y_train)
yhat_train = lm.predict(x_train[["beer_servings", "wine_servings", "spirit_servings"]])
print("Train Predicted Value: ",yhat_train[0:5])
yhat_test = lm.predict(x_test[["beer_servings", "wine_servings", "spirit_servings"]])
print("Test Predicted Value: ",yhat_test[0:5])
print("R^2 for Test Data",lm.score(x_test[["beer_servings", "wine_servings", "spirit_servings"]], y_test))
print("R^2 for Train Data",lm.score(x_train[["beer_servings", "wine_servings", "spirit_servings"]], y_train))
print("Intercept is ",lm.intercept_)
print("Slope is ",lm.coef_)
number of test samples : 20
number of training samples: 173
Train Predicted Value:  [1.07590563 1.17643926 0.72058984 1.25262228 1.4908585 ]
Test Predicted Value:  [0.72058984 4.95383012 8.45405404 0.82260947 0.72058984]
R^2 for Test Data 0.6990304512837944
R^2 for Train Data 0.8843621858666174
Intercept is  0.7205898393369674
Slope is  [0.01809258 0.01621009 0.0157659 ]

Let's take it a step further by building a pipeline that scales the data, applies a polynomial transformation, and fits a linear regression model. This approach can capture more complex relationships in the data:

The pipeline is built from three named steps, each a (name, constructor) tuple:

  • 'scale' → StandardScaler()
  • 'polynomial' → PolynomialFeatures(include_bias=False)
  • 'model' → LinearRegression()

In [67]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler,PolynomialFeatures
Input=[('scale',StandardScaler()), ('polynomial', PolynomialFeatures(include_bias=False)), ('model',LinearRegression())]
pipe=Pipeline(Input)
pipe

pipe.fit(x_train[["beer_servings", "wine_servings", "spirit_servings"]], y_train)
yhat_train = pipe.predict(x_train[["beer_servings", "wine_servings", "spirit_servings"]])
print("Train Predicted Value: ",yhat_train[0:5])
yhat_test = pipe.predict(x_test[["beer_servings", "wine_servings", "spirit_servings"]])
print("Test Predicted Value: ",yhat_test[0:5])
print("R^2 for Test Data",pipe.score(x_test[["beer_servings", "wine_servings", "spirit_servings"]], y_test))
print("R^2 for Train Data",pipe.score(x_train[["beer_servings", "wine_servings", "spirit_servings"]], y_train))
Train Predicted Value:  [1.21017029 0.9400267  0.67483964 1.46189861 1.73360873]
Test Predicted Value:  [0.67483964 4.4123025  8.54359816 0.7845907  0.67483964]
R^2 for Test Data 0.7076376133886035
R^2 for Train Data 0.8981586745966471
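
One advantage of the pipeline is that `set_params` lets us sweep the polynomial degree without rebuilding it. A sketch on synthetic servings-like data (training R² only here, so higher degrees can only match or improve the score):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Synthetic servings-like data standing in for the real training set.
rng = np.random.default_rng(2)
X = rng.integers(0, 400, size=(60, 3)).astype(float)
y = X @ np.array([0.018, 0.016, 0.016]) + rng.normal(0, 0.5, size=60)

pipe = Pipeline([("scale", StandardScaler()),
                 ("polynomial", PolynomialFeatures(include_bias=False)),
                 ("model", LinearRegression())])

# Sweep the degree of the 'polynomial' step via set_params.
for degree in (1, 2, 3):
    pipe.set_params(polynomial__degree=degree)
    pipe.fit(X, y)
    print(f"degree {degree}: train R^2 = {pipe.score(X, y):.3f}")
```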

Question 9: I'll create and fit a Ridge regression model on the training data, setting the regularization parameter to 0.1, and calculate the $R^{2}$ on the test data to see if regularization improves performance:

In [110]:
from sklearn.linear_model import Ridge
RidgeModel = Ridge(alpha=0.1)
RidgeModel.fit(x_train[["beer_servings", "wine_servings", "spirit_servings"]], y_train)
yhat = RidgeModel.predict(x_test[["beer_servings", "wine_servings", "spirit_servings"]])
print('predicted:', yhat[0:4])
print('test set :', y_test[0:4].values)
print("R^2 value for Test Data=", RidgeModel.score(x_test[["beer_servings", "wine_servings", "spirit_servings"]], y_test))
predicted: [0.72058998 4.9538301  8.4540539  0.82260961]
test set : [0.  4.6 8.2 0.1]
R^2 value for Test Data= 0.699030454901918
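
Rather than fixing alpha = 0.1, a small grid search can pick the regularization strength by cross-validation. A sketch on synthetic servings-like data (the real notebook would pass the training features and target instead):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic servings-like data standing in for the real training set.
rng = np.random.default_rng(1)
X = rng.integers(0, 400, size=(60, 3)).astype(float)
y = X @ np.array([0.018, 0.016, 0.016]) + rng.normal(0, 0.5, size=60)

# 5-fold cross-validated search over candidate alphas.
grid = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]},
                    cv=5, scoring="r2")
grid.fit(X, y)
print("best alpha:", grid.best_params_["alpha"])
print("best CV R^2:", round(grid.best_score_, 3))
```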

Question 10: Finally, I'll perform a 2nd-order polynomial transform on both the training and testing data, fit a Ridge regression model with the regularization parameter set to 0.1, and calculate the $R^{2}$ on the test data to see if this approach yields better results:

In [109]:
pr = PolynomialFeatures(degree=2)
x_train_pr = pr.fit_transform(x_train[["beer_servings", "wine_servings", "spirit_servings"]])
x_test_pr = pr.transform(x_test[["beer_servings", "wine_servings", "spirit_servings"]])
RidgeModel = Ridge(alpha=0.1)
RidgeModel.fit(x_train_pr, y_train)
yhat = RidgeModel.predict(x_test_pr)
print("R^2 value for Test Data=", RidgeModel.score(x_test_pr, y_test))
R^2 value for Test Data= 0.7076376228095795

This notebook and all analysis were created by Mohammad Sayem Chowdhury as a personal data science showcase.

Thank you for exploring this project with me! If you have any feedback or suggestions, feel free to reach out.

Dataset source: Data collected from public domain sources and curated for this personal project.
