Global Alcohol Consumption: Predictive Modeling & Analytics¶
By Mohammad Sayem Chowdhury
Last Updated: June 13, 2025
Project Overview¶
This project develops predictive models for global alcohol consumption patterns using cross-cultural data covering 193 countries. Through machine learning and statistical modeling, I build systems that predict total alcohol consumption from beverage-specific consumption patterns.
Research Objectives¶
- Predictive Modeling: Develop high-accuracy models for total alcohol consumption prediction
- Feature Analysis: Identify the most predictive beverage types and consumption patterns
- Cultural Insights: Understand how different beverage preferences influence overall consumption
- Policy Applications: Provide data-driven insights for public health and policy decisions
Technical Approach¶
My methodology incorporates multiple regression techniques, feature engineering strategies, and cross-validation frameworks to build robust models suitable for both research applications and practical policy implementation.
Author: Mohammad Sayem Chowdhury
Project Type: Predictive Analytics & Machine Learning
Domain: Global Health & Social Analytics
Scope: 193 Countries - Cross-Cultural Analysis
Table of Contents¶
Environment Setup & Data Preparation
- Library configuration and imports
- Global alcohol consumption dataset loading
- Data quality assessment and preprocessing
Exploratory Data Analysis for Modeling
- Feature distribution analysis
- Correlation pattern identification
- Target variable examination
Feature Engineering & Selection
- Predictive feature creation
- Correlation analysis for feature selection
- Dimensionality reduction techniques
Model Development & Training
- Multiple regression approaches
- Polynomial feature engineering
- Regularization technique implementation
Model Evaluation & Validation
- Performance metrics analysis
- Cross-validation procedures
- Model comparison and selection
Business Applications & Insights
- Predictive model interpretation
- Policy implications and recommendations
- Future research directions
Executive Summary¶
Predictive Modeling Challenge¶
Primary Goal: Accurately predict total alcohol consumption (in liters of pure alcohol) based on beer, wine, and spirit servings across diverse global populations.
Expected Outcomes¶
This analysis will deliver:
- High-Accuracy Models: R² ≥ 0.85 for total consumption prediction
- Feature Insights: Clear understanding of which beverage types drive overall consumption
- Cultural Patterns: Regional consumption behavior modeling
- Policy Applications: Evidence-based recommendations for alcohol regulation strategies
Strategic Value: Enabling evidence-based public health decisions and cross-cultural understanding of alcohol consumption behaviors.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
Step 1: Importing the Data
Let's load the dataset and take a first look at the data.
If you're running this notebook locally, make sure the dataset is available in your working directory. Otherwise, you may need to download it.
path="drinks.csv"
Let's load the CSV file into a pandas DataFrame:
df= pd.read_csv(path)
I'll use the head() method to preview the first few rows of the dataset:
df.head()
| | country | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol | continent |
|---|---|---|---|---|---|---|
| 0 | Afghanistan | 0 | 0 | 0 | 0.0 | Asia |
| 1 | Albania | 89 | 132 | 54 | 4.9 | Europe |
| 2 | Algeria | 25 | 0 | 14 | 0.7 | Africa |
| 3 | Andorra | 245 | 138 | 312 | 12.4 | Europe |
| 4 | Angola | 217 | 57 | 45 | 5.9 | Africa |
Step 2: Data Exploration
Let's check the data types of each column to understand the structure of our dataset:
df.dtypes
country                          object
beer_servings                     int64
spirit_servings                   int64
wine_servings                     int64
total_litres_of_pure_alcohol    float64
continent                        object
dtype: object
Question 2: Using the groupby method, I'll group the data by continent and sum the wine servings for each region:
df_wine_group = df[['continent','wine_servings']]
wine_group = df_wine_group.groupby(['continent'],as_index=False).sum()
wine_group
| | continent | wine_servings |
|---|---|---|
| 0 | Africa | 862 |
| 1 | Asia | 399 |
| 2 | Europe | 6400 |
| 3 | North America | 564 |
| 4 | Oceania | 570 |
| 5 | South America | 749 |
Question 3: Next, I'll perform a statistical summary of beer servings by continent to see how drinking habits vary across the globe:
df_beer_group = df[['continent','beer_servings']]
beer_group = df_beer_group.groupby(['continent'],as_index=False).describe()
beer_group
Summary statistics for beer_servings (continents in the same alphabetical order as the wine table above):

| continent | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Africa | 53.0 | 61.471698 | 80.557816 | 0.0 | 15.00 | 32.0 | 76.00 | 376.0 |
| Asia | 44.0 | 37.045455 | 49.469725 | 0.0 | 4.25 | 17.5 | 60.50 | 247.0 |
| Europe | 45.0 | 193.777778 | 99.631569 | 0.0 | 127.00 | 219.0 | 270.00 | 361.0 |
| North America | 23.0 | 145.434783 | 79.621163 | 1.0 | 80.00 | 143.0 | 198.00 | 285.0 |
| Oceania | 16.0 | 89.687500 | 96.641412 | 0.0 | 21.00 | 52.5 | 125.75 | 306.0 |
| South America | 12.0 | 175.083333 | 65.242845 | 93.0 | 129.50 | 162.5 | 198.00 | 333.0 |
Question 4: Let's visualize the distribution of beer servings by continent using a boxplot. This helps me spot differences and outliers in drinking patterns:
import seaborn as sns
plt.figure(figsize=[15,7])
sns.boxplot(x="continent", y="beer_servings", data=df)
(Boxplot output: beer_servings by continent)
Question 5: To understand the relationship between wine and beer servings, I'll use a regression plot and calculate the correlation coefficient:
from scipy import stats
pearson_coef, p_value = stats.pearsonr(df['wine_servings'], df['beer_servings'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)
The Pearson Correlation Coefficient is 0.5271716935065209 with a P-value of P = 3.378401743961718e-15
width = 12
height = 10
plt.figure(figsize=(width, height))
sns.regplot(x="wine_servings", y="beer_servings", data=df)
plt.ylim(0,)
(Regression plot output: wine_servings vs. beer_servings)
From the analysis, it appears that wine and beer servings are positively correlated. In some countries, beer is the primary alcoholic beverage, while others have a more balanced consumption.
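Beyond the single wine–beer pairing, the full correlation matrix across all three beverage columns gives a quicker overview of which features move together. As a self-contained sketch, the snippet below reuses the five rows previewed in the head() output above (on the full dataset, `df[["beer_servings", "spirit_servings", "wine_servings"]].corr()` does the same):

```python
import pandas as pd

# Mini-sample: the five countries previewed earlier (Afghanistan through Angola)
sample = pd.DataFrame({
    "beer_servings":   [0, 89, 25, 245, 217],
    "spirit_servings": [0, 132, 0, 138, 57],
    "wine_servings":   [0, 54, 14, 312, 45],
})

# Pairwise Pearson correlations between the beverage columns
corr = sample.corr()
print(corr.round(2))
```

The matrix is symmetric with ones on the diagonal; off-diagonal entries show how strongly each beverage pair co-varies, which feeds directly into the feature-selection step outlined in the table of contents.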
Step 3: Predictive Modeling
I'll fit a linear regression model to predict the total liters of pure alcohol consumed based on wine servings. Let's also calculate the $R^2$ score to evaluate the model's performance:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
y_data = df['total_litres_of_pure_alcohol']
x_data=df.drop('total_litres_of_pure_alcohol',axis=1)
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.10, random_state=0)
print("number of test samples :", x_test.shape[0])
print("number of training samples:",x_train.shape[0])
lm=LinearRegression()
lm.fit(x_train[['wine_servings']], y_train)
yhat_train = lm.predict(x_train[["wine_servings"]])
print("Train Predicted Value: ",yhat_train[0:5])
yhat_test = lm.predict(x_test[["wine_servings"]])
print("Test Predicted Value: ",yhat_test[0:5])
print("R^2 for Test Data",lm.score(x_test[['wine_servings']], y_test))
print("R^2 for Train Data",lm.score(x_train[['wine_servings']], y_train))
print("Intercept is ",lm.intercept_)
print("Slope is ",lm.coef_)
number of test samples : 20
number of training samples: 173
Train Predicted Value:  [3.25885396 3.25885396 3.22692722 3.29078071 3.54619468]
Test Predicted Value:  [3.22692722 3.25885396 6.83464952 3.25885396 3.22692722]
R^2 for Test Data 0.17698593687059994
R^2 for Train Data 0.45741941870252656
Intercept is  3.2269272171534653
Slope is  [0.03192675]
Note: For consistency, I'll use a test size of 10% and a random state of 0 for all modeling steps.
Question 7: Using a list of features (beer, wine, and spirit servings), I'll predict 'total_litres_of_pure_alcohol', split the data into training and testing sets, and determine the $R^2$ on the test data:
from sklearn.model_selection import train_test_split
y_data = df['total_litres_of_pure_alcohol']
x_data=df.drop('total_litres_of_pure_alcohol',axis=1)
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.10, random_state=0)
print("number of test samples :", x_test.shape[0])
print("number of training samples:",x_train.shape[0])
lm=LinearRegression()
lm.fit(x_train[["beer_servings", "wine_servings", "spirit_servings"]], y_train)
yhat_train = lm.predict(x_train[["beer_servings", "wine_servings", "spirit_servings"]])
print("Train Predicted Value: ",yhat_train[0:5])
yhat_test = lm.predict(x_test[["beer_servings", "wine_servings", "spirit_servings"]])
print("Test Predicted Value: ",yhat_test[0:5])
print("R^2 for Test Data",lm.score(x_test[["beer_servings", "wine_servings", "spirit_servings"]], y_test))
print("R^2 for Train Data",lm.score(x_train[["beer_servings", "wine_servings", "spirit_servings"]], y_train))
print("Intercept is ",lm.intercept_)
print("Slope is ",lm.coef_)
number of test samples : 20
number of training samples: 173
Train Predicted Value:  [1.07590563 1.17643926 0.72058984 1.25262228 1.4908585 ]
Test Predicted Value:  [0.72058984 4.95383012 8.45405404 0.82260947 0.72058984]
R^2 for Test Data 0.6990304512837944
R^2 for Train Data 0.8843621858666174
Intercept is  0.7205898393369674
Slope is  [0.01809258 0.01621009 0.0157659 ]
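With the intercept and slopes printed above, any single prediction is just a dot product, which makes the model easy to interpret and audit by hand. As a sanity check, here is the model's estimate for a hypothetical country with 100 beer, 50 wine, and 50 spirit servings (the country and serving counts are illustrative, not from the dataset):

```python
import numpy as np

# Intercept and coefficients reported by the multi-feature model above,
# in the fitted column order: beer, wine, spirit
intercept = 0.7205898393369674
coefs = np.array([0.01809258, 0.01621009, 0.0157659])

# Hypothetical country: 100 beer, 50 wine, 50 spirit servings
servings = np.array([100.0, 50.0, 50.0])

prediction = intercept + coefs @ servings
print(f"Predicted litres of pure alcohol: {prediction:.2f}")  # prints roughly 4.13
```

The coefficients are all of similar magnitude (~0.016–0.018 litres per serving), so no single beverage dominates; beer servings carry slightly more weight per serving than wine or spirits in this fit.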
Let's take it a step further by building a pipeline that scales the data, applies a polynomial transformation, and fits a linear regression model. This approach can capture more complex relationships in the data:
Each pipeline step is a (name, constructor) tuple; the second element in the tuple contains the model constructor:

| Step name | Constructor |
|---|---|
| 'scale' | StandardScaler() |
| 'polynomial' | PolynomialFeatures(include_bias=False) |
| 'model' | LinearRegression() |
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler,PolynomialFeatures
input_steps = [('scale', StandardScaler()),
               ('polynomial', PolynomialFeatures(include_bias=False)),
               ('model', LinearRegression())]
pipe = Pipeline(input_steps)
pipe
pipe.fit(x_train[["beer_servings", "wine_servings", "spirit_servings"]], y_train)
yhat_train = pipe.predict(x_train[["beer_servings", "wine_servings", "spirit_servings"]])
print("Train Predicted Value: ",yhat_train[0:5])
yhat_test = pipe.predict(x_test[["beer_servings", "wine_servings", "spirit_servings"]])
print("Test Predicted Value: ",yhat_test[0:5])
print("R^2 for Test Data",pipe.score(x_test[["beer_servings", "wine_servings", "spirit_servings"]], y_test))
print("R^2 for Train Data",pipe.score(x_train[["beer_servings", "wine_servings", "spirit_servings"]], y_train))
Train Predicted Value:  [1.21017029 0.9400267  0.67483964 1.46189861 1.73360873]
Test Predicted Value:  [0.67483964 4.4123025  8.54359816 0.7845907  0.67483964]
R^2 for Test Data 0.7076376133886035
R^2 for Train Data 0.8981586745966471
Question 9: Next, I'll create and fit a Ridge regression model on the training data, setting the regularization parameter to 0.1, and calculate the $R^{2}$ on the test data to see if regularization improves the model's performance:
from sklearn.linear_model import Ridge

ridge_model = Ridge(alpha=0.1)
ridge_model.fit(x_train[["beer_servings", "wine_servings", "spirit_servings"]], y_train)
yhat = ridge_model.predict(x_test[["beer_servings", "wine_servings", "spirit_servings"]])
print('predicted:', yhat[0:4])
print('test set :', y_test[0:4].values)
print("R^2 value for Test Data=", ridge_model.score(x_test[["beer_servings", "wine_servings", "spirit_servings"]], y_test))
predicted: [0.72058998 4.9538301  8.4540539  0.82260961]
test set : [0.  4.6 8.2 0.1]
R^2 value for Test Data= 0.699030454901918
Question 10: Finally, I'll perform a 2nd order polynomial transform on both the training and testing data, fit a Ridge regression model with the regularization parameter set to 0.1, and calculate the $R^{2}$ on the test data to see if this approach yields better results:
pr = PolynomialFeatures(degree=2)
x_train_pr = pr.fit_transform(x_train[["beer_servings", "wine_servings", "spirit_servings"]])
# Use transform (not fit_transform) on the test set so it reuses the training fit
x_test_pr = pr.transform(x_test[["beer_servings", "wine_servings", "spirit_servings"]])

ridge_model = Ridge(alpha=0.1)
ridge_model.fit(x_train_pr, y_train)
yhat = ridge_model.predict(x_test_pr)
print("R^2 value for Test Data=", ridge_model.score(x_test_pr, y_test))
R^2 value for Test Data= 0.7076376228095795
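Every model above was scored on a single 90/10 hold-out split, so each R² is one draw from one partition. The cross-validation procedures listed in the table of contents would average over several splits instead. A minimal sketch follows; it runs on synthetic data with roughly the fitted slopes above so it is self-contained, and on the real frame you would pass `df[["beer_servings", "wine_servings", "spirit_servings"]]` and `df['total_litres_of_pure_alcohol']` instead:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the three serving columns (193 rows, like the dataset)
rng = np.random.default_rng(0)
X = rng.uniform(0, 400, size=(193, 3))
# Target loosely mimics the coefficients fitted earlier, plus noise
y = 0.7 + X @ np.array([0.018, 0.016, 0.016]) + rng.normal(0, 1.0, size=193)

model = LinearRegression()
# 5-fold cross-validated R^2: five fits, each scored on a held-out fold
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print("Per-fold R^2:", scores.round(3))
print("Mean R^2:", scores.mean().round(3))
```

The spread across folds indicates how sensitive the hold-out R² values reported above are to the particular 90/10 split chosen with random_state=0.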
This notebook and all analysis were created by Mohammad Sayem Chowdhury as a personal data science showcase.
Thank you for exploring this project with me! If you have any feedback or suggestions, feel free to reach out.
Dataset source: Data collected from public domain sources and curated for this personal project.