About the Dataset¶
This project uses data from Chicago Public Schools for the 2011-2012 school year. The dataset includes a wide range of metrics about school performance, safety, and student demographics. I downloaded a cleaned version of the data for easier analysis.
Loading the Data¶
I loaded the dataset into a table called SCHOOLS for my analysis. I used the database's import tool to ensure the data types matched my needs.
Connecting to the Database¶
I used the ipython-sql extension to connect to my database. If you're running this notebook elsewhere, make sure to install the required libraries.
%load_ext sql
# Format: %sql ibm_db_sa://username:password@hostname:port/dbname?security=SSL
# Enter your connection string below
%sql ibm_db_sa://
Verifying the Table¶
To confirm the table was created, I queried the database system catalog for the list of tables in my schema.
%sql select TABSCHEMA,TABNAME,CREATE_TIME from SYSCAT.TABLES WHERE TABSCHEMA = 'YOUR_SCHEMA'
Exploring Table Columns¶
I checked how many columns are in the SCHOOLS table and explored their names, types, and lengths.
%sql select count(*) from SYSCAT.COLUMNS where TABNAME = 'SCHOOLS'
* ibm_db_sa://pgs19748:***@6667d8e9-9d4d-4ccb-ba32-21da3bb5aafc.c1ogj3sd0tgtu0lqde00.databases.appdomain.cloud:30376/BLUDB Done.
| 1 |
|---|
| 78 |
%sql select COLNAME, TYPENAME, LENGTH from SYSCAT.COLUMNS where TABNAME = 'SCHOOLS'
* ibm_db_sa://pgs19748:***@6667d8e9-9d4d-4ccb-ba32-21da3bb5aafc.c1ogj3sd0tgtu0lqde00.databases.appdomain.cloud:30376/BLUDB Done.
| colname | typename | length |
|---|---|---|
| ENVIRONMENT_SCORE | SMALLINT | 2 |
| INSTRUCTION_ICON | VARCHAR | 11 |
| INSTRUCTION_SCORE | SMALLINT | 2 |
| LEADERS_ICON | VARCHAR | 11 |
| LEADERS_SCORE | VARCHAR | 3 |
| TEACHERS_ICON | VARCHAR | 11 |
| TEACHERS_SCORE | VARCHAR | 3 |
| PARENT_ENGAGEMENT_ICON | VARCHAR | 7 |
| PARENT_ENGAGEMENT_SCORE | VARCHAR | 3 |
| PARENT_ENVIRONMENT_ICON | VARCHAR | 7 |
| PARENT_ENVIRONMENT_SCORE | VARCHAR | 3 |
| AVERAGE_STUDENT_ATTENDANCE | VARCHAR | 6 |
| RATE_OF_MISCONDUCTS__PER_100_STUDENTS_ | DECIMAL | 5 |
| AVERAGE_TEACHER_ATTENDANCE | VARCHAR | 6 |
| INDIVIDUALIZED_EDUCATION_PROGRAM_COMPLIANCE_RATE | VARCHAR | 7 |
| PK_2_LITERACY__ | VARCHAR | 4 |
| PK_2_MATH__ | VARCHAR | 4 |
| GR3_5_GRADE_LEVEL_MATH__ | VARCHAR | 4 |
| GR3_5_GRADE_LEVEL_READ__ | VARCHAR | 4 |
| SCHOOL_ID | INTEGER | 4 |
| NAME_OF_SCHOOL | VARCHAR | 64 |
| Elementary, Middle, or High School | VARCHAR | 2 |
| STREET_ADDRESS | VARCHAR | 30 |
| CITY | VARCHAR | 7 |
| STATE | VARCHAR | 2 |
| ZIP_CODE | INTEGER | 4 |
| PHONE_NUMBER | VARCHAR | 14 |
| LINK | VARCHAR | 78 |
| NETWORK_MANAGER | VARCHAR | 40 |
| COLLABORATIVE_NAME | VARCHAR | 34 |
| ADEQUATE_YEARLY_PROGRESS_MADE_ | VARCHAR | 3 |
| TRACK_SCHEDULE | VARCHAR | 12 |
| CPS_PERFORMANCE_POLICY_STATUS | VARCHAR | 16 |
| CPS_PERFORMANCE_POLICY_LEVEL | VARCHAR | 15 |
| HEALTHY_SCHOOL_CERTIFIED | VARCHAR | 3 |
| SAFETY_ICON | VARCHAR | 11 |
| SAFETY_SCORE | SMALLINT | 2 |
| FAMILY_INVOLVEMENT_ICON | VARCHAR | 11 |
| FAMILY_INVOLVEMENT_SCORE | VARCHAR | 3 |
| ENVIRONMENT_ICON | VARCHAR | 11 |
| GR3_5_KEEP_PACE_READ__ | VARCHAR | 4 |
| GR3_5_KEEP_PACE_MATH__ | VARCHAR | 4 |
| GR6_8_GRADE_LEVEL_MATH__ | VARCHAR | 4 |
| GR6_8_GRADE_LEVEL_READ__ | VARCHAR | 4 |
| GR6_8_KEEP_PACE_MATH_ | VARCHAR | 4 |
| GR6_8_KEEP_PACE_READ__ | VARCHAR | 4 |
| GR_8_EXPLORE_MATH__ | VARCHAR | 4 |
| GR_8_EXPLORE_READ__ | VARCHAR | 4 |
| ISAT_EXCEEDING_MATH__ | DECIMAL | 4 |
| ISAT_EXCEEDING_READING__ | DECIMAL | 4 |
| ISAT_VALUE_ADD_MATH | DECIMAL | 3 |
| ISAT_VALUE_ADD_READ | DECIMAL | 3 |
| ISAT_VALUE_ADD_COLOR_MATH | VARCHAR | 6 |
| ISAT_VALUE_ADD_COLOR_READ | VARCHAR | 6 |
| STUDENTS_TAKING__ALGEBRA__ | VARCHAR | 4 |
| STUDENTS_PASSING__ALGEBRA__ | VARCHAR | 4 |
| 9th Grade EXPLORE (2009) | VARCHAR | 4 |
| 9th Grade EXPLORE (2010) | VARCHAR | 4 |
| 10th Grade PLAN (2009) | VARCHAR | 4 |
| 10th Grade PLAN (2010) | VARCHAR | 4 |
| NET_CHANGE_EXPLORE_AND_PLAN | VARCHAR | 3 |
| 11th Grade Average ACT (2011) | VARCHAR | 4 |
| NET_CHANGE_PLAN_AND_ACT | VARCHAR | 3 |
| COLLEGE_ELIGIBILITY__ | VARCHAR | 4 |
| GRADUATION_RATE__ | VARCHAR | 4 |
| COLLEGE_ENROLLMENT_RATE__ | VARCHAR | 4 |
| COLLEGE_ENROLLMENT | SMALLINT | 2 |
| GENERAL_SERVICES_ROUTE | SMALLINT | 2 |
| FRESHMAN_ON_TRACK_RATE__ | VARCHAR | 4 |
| X_COORDINATE | DECIMAL | 13 |
| Y_COORDINATE | DECIMAL | 13 |
| LATITUDE | DECIMAL | 18 |
| LONGITUDE | DECIMAL | 18 |
| COMMUNITY_AREA_NUMBER | SMALLINT | 2 |
| COMMUNITY_AREA_NAME | VARCHAR | 22 |
| WARD | SMALLINT | 2 |
| POLICE_DISTRICT | SMALLINT | 2 |
| LOCATION | VARCHAR | 27 |
Data Exploration Questions¶
- Is the column name for the "SCHOOL ID" attribute in upper or mixed case?
- What is the name of the "Community Area Name" column in your table? Does it have spaces?
- Are there any columns whose names have spaces or parentheses replaced by underscores?
My Analysis Challenges¶
Here are the questions I set out to answer using SQL and Python:
%sql SELECT COUNT(*) FROM SCHOOLS WHERE "Elementary, Middle, or High School" = 'ES'
* ibm_db_sa://pgs19748:***@6667d8e9-9d4d-4ccb-ba32-21da3bb5aafc.c1ogj3sd0tgtu0lqde00.databases.appdomain.cloud:30376/BLUDB Done.
| 1 |
|---|
| 462 |
What is the highest Safety Score?¶
%sql SELECT MAX(safety_score) FROM SCHOOLS;
* ibm_db_sa://pgs19748:***@6667d8e9-9d4d-4ccb-ba32-21da3bb5aafc.c1ogj3sd0tgtu0lqde00.databases.appdomain.cloud:30376/BLUDB Done.
| 1 |
|---|
| 99 |
Which schools have the highest Safety Score?¶
%sql SELECT NAME_OF_SCHOOL FROM SCHOOLS WHERE safety_score=99;
* ibm_db_sa://pgs19748:***@6667d8e9-9d4d-4ccb-ba32-21da3bb5aafc.c1ogj3sd0tgtu0lqde00.databases.appdomain.cloud:30376/BLUDB Done.
| name_of_school |
|---|
| Abraham Lincoln Elementary School |
| Alexander Graham Bell Elementary School |
| Annie Keller Elementary Gifted Magnet School |
| Augustus H Burley Elementary School |
| Edgar Allan Poe Elementary Classical School |
| Edgebrook Elementary School |
| Ellen Mitchell Elementary School |
| James E McDade Elementary Classical School |
| James G Blaine Elementary School |
| LaSalle Elementary Language Academy |
| Mary E Courtenay Elementary Language Arts Center |
| Northside College Preparatory High School |
| Northside Learning Center High School |
| Norwood Park Elementary School |
| Oriole Park Elementary School |
| Sauganash Elementary School |
| Stephen Decatur Classical Elementary School |
| Talman Elementary School |
| Wildwood Elementary School |
What are the top 10 schools with the highest Average Student Attendance?¶
%sql select Name_of_School, Average_Student_Attendance from SCHOOLS \
order by Average_Student_Attendance desc nulls last limit 10
* ibm_db_sa://pgs19748:***@6667d8e9-9d4d-4ccb-ba32-21da3bb5aafc.c1ogj3sd0tgtu0lqde00.databases.appdomain.cloud:30376/BLUDB Done.
| name_of_school | average_student_attendance |
|---|---|
| John Charles Haines Elementary School | 98.40% |
| James Ward Elementary School | 97.80% |
| Edgar Allan Poe Elementary Classical School | 97.60% |
| Orozco Fine Arts & Sciences Elementary School | 97.60% |
| Rachel Carson Elementary School | 97.60% |
| Annie Keller Elementary Gifted Magnet School | 97.50% |
| Andrew Jackson Elementary Language Academy | 97.40% |
| Lenart Elementary Regional Gifted Center | 97.40% |
| Disney II Magnet School | 97.30% |
| John H Vanderpoel Elementary Magnet School | 97.20% |
Retrieve the list of 5 Schools with the lowest Average Student Attendance sorted in ascending order¶
%sql SELECT Name_of_School, Average_Student_Attendance \
from SCHOOLS \
order by Average_Student_Attendance \
fetch first 5 rows only
* ibm_db_sa://pgs19748:***@6667d8e9-9d4d-4ccb-ba32-21da3bb5aafc.c1ogj3sd0tgtu0lqde00.databases.appdomain.cloud:30376/BLUDB Done.
| name_of_school | average_student_attendance |
|---|---|
| Richard T Crane Technical Preparatory High School | 57.90% |
| Barbara Vick Early Childhood & Family Center | 60.90% |
| Dyett High School | 62.50% |
| Wendell Phillips Academy High School | 63.00% |
| Orr Academy High School | 66.30% |
Remove the '%' sign from the Average Student Attendance column¶
%sql SELECT Name_of_School, REPLACE(Average_Student_Attendance,'%','') as Average_Student_Attendance\
from SCHOOLS \
order by Average_Student_Attendance \
fetch first 5 rows only
* ibm_db_sa://pgs19748:***@6667d8e9-9d4d-4ccb-ba32-21da3bb5aafc.c1ogj3sd0tgtu0lqde00.databases.appdomain.cloud:30376/BLUDB Done.
| name_of_school | average_student_attendance |
|---|---|
| Richard T Crane Technical Preparatory High School | 57.90 |
| Barbara Vick Early Childhood & Family Center | 60.90 |
| Dyett High School | 62.50 |
| Wendell Phillips Academy High School | 63.00 |
| Orr Academy High School | 66.30 |
Which Schools have Average Student Attendance lower than 70%?¶
%sql SELECT Name_of_School, Average_Student_Attendance \
from SCHOOLS \
where DECIMAL ( REPLACE(Average_Student_Attendance, '%', '') ) < 70 \
order by Average_Student_Attendance
* ibm_db_sa://pgs19748:***@6667d8e9-9d4d-4ccb-ba32-21da3bb5aafc.c1ogj3sd0tgtu0lqde00.databases.appdomain.cloud:30376/BLUDB Done.
| name_of_school | average_student_attendance |
|---|---|
| Richard T Crane Technical Preparatory High School | 57.90% |
| Barbara Vick Early Childhood & Family Center | 60.90% |
| Dyett High School | 62.50% |
| Wendell Phillips Academy High School | 63.00% |
| Orr Academy High School | 66.30% |
| Manley Career Academy High School | 66.80% |
| Chicago Vocational Career Academy High School | 68.80% |
| Roberto Clemente Community Academy High School | 69.60% |
Get the total College Enrollment for each Community Area¶
%sql select Community_Area_Name, sum(College_Enrollment) AS TOTAL_ENROLLMENT \
from SCHOOLS \
group by Community_Area_Name
* ibm_db_sa://pgs19748:***@6667d8e9-9d4d-4ccb-ba32-21da3bb5aafc.c1ogj3sd0tgtu0lqde00.databases.appdomain.cloud:30376/BLUDB Done.
| community_area_name | total_enrollment |
|---|---|
| ALBANY PARK | 6864 |
| ARCHER HEIGHTS | 4823 |
| ARMOUR SQUARE | 1458 |
| ASHBURN | 6483 |
| AUBURN GRESHAM | 4175 |
| AUSTIN | 10933 |
| AVALON PARK | 1522 |
| AVONDALE | 3640 |
| BELMONT CRAGIN | 14386 |
| BEVERLY | 1636 |
| BRIDGEPORT | 3167 |
| BRIGHTON PARK | 9647 |
| BURNSIDE | 549 |
| CALUMET HEIGHTS | 1568 |
| CHATHAM | 5042 |
| CHICAGO LAWN | 7086 |
| CLEARING | 2085 |
| DOUGLAS | 4670 |
| DUNNING | 4568 |
| EAST GARFIELD PARK | 5337 |
| EAST SIDE | 5305 |
| EDGEWATER | 4600 |
| EDISON PARK | 910 |
| ENGLEWOOD | 6832 |
| FOREST GLEN | 1431 |
| FULLER PARK | 531 |
| GAGE PARK | 9915 |
| GARFIELD RIDGE | 4552 |
| GRAND BOULEVARD | 2809 |
| GREATER GRAND CROSSING | 4051 |
| HEGEWISCH | 963 |
| HERMOSA | 3975 |
| HUMBOLDT PARK | 8620 |
| HYDE PARK | 1930 |
| IRVING PARK | 7764 |
| JEFFERSON PARK | 1755 |
| KENWOOD | 4287 |
| LAKE VIEW | 7055 |
| LINCOLN PARK | 5615 |
| LINCOLN SQUARE | 4132 |
| LOGAN SQUARE | 7351 |
| LOOP | 871 |
| LOWER WEST SIDE | 7257 |
| MCKINLEY PARK | 1552 |
| MONTCLARE | 1317 |
| MORGAN PARK | 3271 |
| MOUNT GREENWOOD | 2091 |
| NEAR NORTH SIDE | 3362 |
| NEAR SOUTH SIDE | 1378 |
| NEAR WEST SIDE | 7975 |
| NEW CITY | 7922 |
| NORTH CENTER | 7541 |
| NORTH LAWNDALE | 5146 |
| NORTH PARK | 4210 |
| NORWOOD PARK | 6469 |
| OAKLAND | 140 |
| OHARE | 786 |
| PORTAGE PARK | 6954 |
| PULLMAN | 1620 |
| RIVERDALE | 1547 |
| ROGERS PARK | 4068 |
| ROSELAND | 7020 |
| SOUTH CHICAGO | 4043 |
| SOUTH DEERING | 1859 |
| SOUTH LAWNDALE | 14793 |
| SOUTH SHORE | 4543 |
| UPTOWN | 4388 |
| WASHINGTON HEIGHTS | 4006 |
| WASHINGTON PARK | 2648 |
| WEST ELSDON | 3700 |
| WEST ENGLEWOOD | 5946 |
| WEST GARFIELD PARK | 2622 |
| WEST LAWN | 4207 |
| WEST PULLMAN | 3240 |
| WEST RIDGE | 8197 |
| WEST TOWN | 9429 |
| WOODLAWN | 4206 |
Get the 5 Community Areas with the least total College Enrollment sorted in ascending order¶
%sql select Community_Area_Name, sum(College_Enrollment) AS TOTAL_ENROLLMENT \
from SCHOOLS \
group by Community_Area_Name \
order by TOTAL_ENROLLMENT asc \
fetch first 5 rows only
* ibm_db_sa://pgs19748:***@6667d8e9-9d4d-4ccb-ba32-21da3bb5aafc.c1ogj3sd0tgtu0lqde00.databases.appdomain.cloud:30376/BLUDB Done.
| community_area_name | total_enrollment |
|---|---|
| OAKLAND | 140 |
| FULLER PARK | 531 |
| BURNSIDE | 549 |
| OHARE | 786 |
| LOOP | 871 |
List 5 schools with lowest safety score.¶
%sql select Name_of_School,safety_score\
from SCHOOLS \
order by safety_score asc\
fetch first 5 rows only
* ibm_db_sa://pgs19748:***@6667d8e9-9d4d-4ccb-ba32-21da3bb5aafc.c1ogj3sd0tgtu0lqde00.databases.appdomain.cloud:30376/BLUDB Done.
| name_of_school | safety_score |
|---|---|
| Edmond Burke Elementary School | 1 |
| Luke O'Toole Elementary School | 5 |
| George W Tilton Elementary School | 6 |
| Foster Park Elementary School | 11 |
| Emil G Hirsch Metropolitan High School | 13 |
Get the hardship index for the community area which has College Enrollment of 4368¶
%%sql
select hardship_index
from chicago_socioeconomic_data CD, schools CPS
where CD.ca = CPS.community_area_number
and college_enrollment = 4368
* ibm_db_sa://pgs19748:***@6667d8e9-9d4d-4ccb-ba32-21da3bb5aafc.c1ogj3sd0tgtu0lqde00.databases.appdomain.cloud:30376/BLUDB Done.
| hardship_index |
|---|
| 6.0 |
Get the hardship index for the community area which has the school with the highest enrollment.¶
%sql select ca, community_area_name, hardship_index from chicago_socioeconomic_data \
where ca in \
( select community_area_number from schools order by college_enrollment desc limit 1 )
* ibm_db_sa://pgs19748:***@6667d8e9-9d4d-4ccb-ba32-21da3bb5aafc.c1ogj3sd0tgtu0lqde00.databases.appdomain.cloud:30376/BLUDB Done.
| ca | community_area_name | hardship_index |
|---|---|---|
| 5.0 | North Center | 6.0 |
Summary¶
In this project, I explored Chicago Public Schools data using SQL and Python. I loaded the data, ran queries to answer key questions, and practiced working with columns that have spaces or special characters. This approach helped me better understand the structure and challenges in Chicago's schools.
Notebook and analysis by Mohammad Sayem Chowdhury.
All code and explanations are my own. This notebook is shared under the MIT License.