About the Data¶
Chicago's open data portal provides a dataset with six socioeconomic indicators and a hardship index for each community area (2008–2012). The hardship index ranges from 1 to 100, with higher values indicating more hardship. Here are the main variables I explored:
- Community Area Number (
ca) - Community Area Name (
community_area_name) - Percent of Housing Crowded (
percent_of_housing_crowded) - Percent Households Below Poverty (
percent_households_below_poverty) - Percent Aged 16+ Unemployed (
percent_aged_16_unemployed) - Percent Aged 25+ without High School Diploma (
percent_aged_25_without_high_school_diploma) - Percent Aged Under 18 or Over 64 (
percent_aged_under_18_or_over_64) - Per Capita Income (
per_capita_income_) - Hardship Index (
hardship_index)
In this notebook, I use SQL and Python to analyze these variables and uncover patterns in Chicago's neighborhoods.
Connecting to the Database¶
I started by loading the SQL extension and connecting to my database. If you're running this elsewhere, make sure to install the required libraries.
# If you're running this outside my environment, you may need to install these libraries:
# !pip install ibm_db ibm_db_sa ipython-sql
%load_ext sql
# Format: %sql ibm_db_sa://username:password@hostname:port/dbname?security=SSL
# Enter your connection string below
%sql ibm_db_sa://
Loading the Data¶
I loaded the dataset into a table called chicago_socioeconomic_data for my analysis.
import pandas
chicago_socioeconomic_data = pandas.read_csv('https://data.cityofchicago.org/resource/jcxq-k9xf.csv')
%sql --persist chicago_socioeconomic_data
To verify the table was created, I ran a simple query to preview the data.
%sql SELECT * FROM chicago_socioeconomic_data limit 5;
My Analysis Challenges¶
Here are the questions I set out to answer using SQL and Python:
1. How many rows are in the dataset?¶
%sql SELECT COUNT(*) FROM chicago_socioeconomic_data;
2. How many community areas in Chicago have a hardship index greater than 50?¶
%sql SELECT count(*) FROM chicago_socioeconomic_data WHERE hardship_index>50;
3. What is the maximum value of hardship index in this dataset?¶
%sql SELECT max(hardship_index) FROM chicago_socioeconomic_data;
4. Which community area has the highest hardship index?¶
%sql SELECT community_area_name FROM chicago_socioeconomic_data ORDER BY hardship_index DESC NULLS LAST FETCH FIRST ROW ONLY;
5. Which Chicago community areas have per-capita incomes greater than $60,000?¶
%sql SELECT community_area_name FROM chicago_socioeconomic_data where per_capita_income_ > 60000;
6. Visualizing the Relationship: Per Capita Income vs. Hardship Index¶
I wanted to see if there's a correlation between income and hardship. Here's a scatter plot I created:
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sb
income_vs_hardship = %sql SELECT per_capita_income_,hardship_index FROM CHICAGO_SOCIOECONOMIC_DATA;
plot = sb.jointplot(x='per_capita_income_',y='hardship_index',data=income_vs_hardship.DataFrame())
# As per capita income rises, hardship index tends to decrease. The negative correlation is visible in the plot.
Conclusion¶
From my analysis, I observed that neighborhoods with higher per capita income generally have lower hardship indices. This negative correlation is clear in the scatter plot. There are also a few community areas with exceptionally high income or hardship, which could be interesting for further study.
Summary¶
In this project, I explored socioeconomic data for Chicago using SQL and Python. I loaded the data, ran queries to answer key questions, and visualized relationships between variables. This approach helped me better understand the challenges and disparities across Chicago's neighborhoods.
Notebook and analysis by Mohammad Sayem Chowdhury.
All code and explanations are my own. This notebook is shared under the MIT License.