Dataset Overview¶
I use a dataset of US domestic airline flights covering flight times, delays, and on-time performance (27,000 rows and 110 columns once loaded). My goal is to explore and visualize this data with Plotly's charting tools.
Why Plotly?¶
I'm interested in interactive visualizations that go beyond static charts. Plotly offers both a high-level interface (plotly.express) and a lower-level one (plotly.graph_objects), and I'm using this notebook to get hands-on experience with both and see what I can create.
In [ ]:
# Author: Mohammad Sayem Chowdhury
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
In [ ]:
# Load the airline data into a pandas DataFrame
airline_data = pd.read_csv(
    'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork/Data%20Files/airline_data.csv',
    encoding="ISO-8859-1",
    dtype={'Div1Airport': str, 'Div1TailNum': str,
           'Div2Airport': str, 'Div2TailNum': str})
In [ ]:
# Preview the first 5 rows
airline_data.head()
Out[ ]:
| | Unnamed: 0 | Year | Quarter | Month | DayofMonth | DayOfWeek | FlightDate | Reporting_Airline | DOT_ID_Reporting_Airline | IATA_CODE_Reporting_Airline | ... | Div4WheelsOff | Div4TailNum | Div5Airport | Div5AirportID | Div5AirportSeqID | Div5WheelsOn | Div5TotalGTime | Div5LongestGTime | Div5WheelsOff | Div5TailNum |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1295781 | 1998 | 2 | 4 | 2 | 4 | 1998-04-02 | AS | 19930 | AS | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 1125375 | 2013 | 2 | 5 | 13 | 1 | 2013-05-13 | EV | 20366 | EV | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 118824 | 1993 | 3 | 9 | 25 | 6 | 1993-09-25 | UA | 19977 | UA | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 634825 | 1994 | 4 | 11 | 12 | 6 | 1994-11-12 | HP | 19991 | HP | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 1888125 | 2017 | 3 | 8 | 17 | 4 | 2017-08-17 | UA | 19977 | UA | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 110 columns
In [ ]:
# Check the shape of the data
airline_data.shape
Out[ ]:
(27000, 110)
In [ ]:
# Randomly sample 500 data points for faster plotting
sampled_data = airline_data.sample(n=500, random_state=42)
In [ ]:
# Check the shape of the sampled data
sampled_data.shape
Out[ ]:
(500, 110)
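The `random_state=42` argument is what makes this sample repeatable: rerunning the cell picks the same 500 rows. A minimal sketch of that behaviour, using a small made-up frame (`toy` is hypothetical and only stands in for `airline_data`):

```python
import pandas as pd

# Hypothetical toy frame standing in for airline_data (values are made up).
toy = pd.DataFrame({"Distance": range(100, 1100, 100)})

# Sampling twice with the same seed should select the same rows in the same order.
first = toy.sample(n=4, random_state=42)
second = toy.sample(n=4, random_state=42)

same_rows = first.index.tolist() == second.index.tolist()
print(same_rows)
```

Dropping `random_state` gives a fresh sample on every run, which changes every chart that depends on `sampled_data`.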
In [ ]:
fig = go.Figure(data=go.Scatter(x=sampled_data['Distance'], y=sampled_data['DepTime'], mode='markers', marker=dict(color='red')))
fig.update_layout(title='Distance vs Departure Time', xaxis_title='Distance', yaxis_title='DepTime')
fig.show()
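When reading the y-axis of this scatter plot, it may help to remember that `DepTime` is not a continuous clock value. Assuming the column follows the usual hhmm convention for this kind of on-time data (for example, 1745 meaning 17:45), it can be converted to decimal hours. The sketch below uses made-up values and a hypothetical `dep_hours` name:

```python
import pandas as pd

# Made-up departure times in hhmm form; None marks a missing value.
dep = pd.Series([5.0, 930.0, 1745.0, None], name="DepTime")

# Split hhmm into whole hours and minutes, then express as decimal hours.
# Assumption: values are hhmm-encoded local times, not minutes since midnight.
dep_hours = dep // 100 + (dep % 100) / 60
print(dep_hours)
```

Plotting the converted values instead of raw `DepTime` would remove the gaps between, say, 1759 and 1800 that the hhmm encoding introduces.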
Reflections¶
This section is for my thoughts on using Plotly, including what I found useful and what I want to explore further.
In [ ]:
# Group by Month and compute average arrival delay
delay_by_month = sampled_data.groupby('Month')['ArrDelay'].mean().reset_index()
In [ ]:
delay_by_month
Out[ ]:
| | Month | ArrDelay |
|---|---|---|
| 0 | 1 | 2.232558 |
| 1 | 2 | 2.687500 |
| 2 | 3 | 10.868421 |
| 3 | 4 | 6.229167 |
| 4 | 5 | -0.279070 |
| 5 | 6 | 17.310345 |
| 6 | 7 | 5.088889 |
| 7 | 8 | 3.121951 |
| 8 | 9 | 9.081081 |
| 9 | 10 | 1.200000 |
| 10 | 11 | -3.975000 |
| 11 | 12 | 3.240741 |
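With only 500 sampled flights, each monthly average rests on a few dozen rows, so it can be useful to look at the count next to the mean. A minimal sketch on made-up rows (the `toy` frame and the output column names are hypothetical):

```python
import pandas as pd

# Made-up toy data, not taken from the airline dataset.
toy = pd.DataFrame({
    "Month": [1, 1, 1, 2, 2, 3],
    "ArrDelay": [10.0, -4.0, None, 6.0, 0.0, 30.0],
})

# Named aggregation: the mean ignores missing delays,
# and count reports how many non-missing delays fed each mean.
summary = (
    toy.groupby("Month")["ArrDelay"]
       .agg(mean_delay="mean", flights_with_delay="count")
       .reset_index()
)
print(summary)
```

Applied to `sampled_data`, the same pattern would show how many flights stand behind each point in the monthly line chart.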
In [ ]:
# mode='lines' draws no markers, so set the colour on the line itself
fig = go.Figure(data=go.Scatter(x=delay_by_month['Month'], y=delay_by_month['ArrDelay'], mode='lines', line=dict(color='blue')))
fig.update_layout(title='Month vs Average Delay Time', xaxis_title='Month', yaxis_title='Average Delay Time')
fig.show()
In [ ]:
# Total number of sampled flights arriving in each destination state
bar_data = sampled_data.groupby('DestState')['Flights'].sum().reset_index()
In [ ]:
bar_data
Out[ ]:
| | DestState | Flights |
|---|---|---|
| 0 | AK | 4.0 |
| 1 | AL | 3.0 |
| 2 | AZ | 8.0 |
| 3 | CA | 68.0 |
| 4 | CO | 20.0 |
| 5 | CT | 5.0 |
| 6 | FL | 32.0 |
| 7 | GA | 27.0 |
| 8 | HI | 5.0 |
| 9 | IA | 1.0 |
| 10 | ID | 1.0 |
| 11 | IL | 33.0 |
| 12 | IN | 6.0 |
| 13 | KS | 1.0 |
| 14 | KY | 14.0 |
| 15 | LA | 4.0 |
| 16 | MA | 10.0 |
| 17 | MD | 7.0 |
| 18 | MI | 16.0 |
| 19 | MN | 11.0 |
| 20 | MO | 18.0 |
| 21 | MT | 3.0 |
| 22 | NC | 13.0 |
| 23 | NE | 2.0 |
| 24 | NH | 1.0 |
| 25 | NJ | 5.0 |
| 26 | NM | 1.0 |
| 27 | NV | 13.0 |
| 28 | NY | 21.0 |
| 29 | OH | 9.0 |
| 30 | OK | 6.0 |
| 31 | OR | 3.0 |
| 32 | PA | 14.0 |
| 33 | PR | 2.0 |
| 34 | RI | 1.0 |
| 35 | SC | 1.0 |
| 36 | TN | 14.0 |
| 37 | TX | 60.0 |
| 38 | UT | 7.0 |
| 39 | VA | 11.0 |
| 40 | VI | 1.0 |
| 41 | WA | 10.0 |
| 42 | WI | 8.0 |
In [ ]:
fig = px.bar(bar_data, x="DestState", y="Flights", title='Total number of flights to each destination state')
fig.show()
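The bars above follow the alphabetical order that `groupby` produces. If a ranking is easier to read, the frame can be sorted before plotting. A small sketch, using five rows copied from the `bar_data` output above (the names `bar_subset` and `ranked` are hypothetical):

```python
import pandas as pd

# Five rows copied from the bar_data table above, in their original order.
bar_subset = pd.DataFrame({
    "DestState": ["AK", "CA", "FL", "IL", "TX"],
    "Flights": [4.0, 68.0, 32.0, 33.0, 60.0],
})

# Sort so the busiest destination states come first.
ranked = bar_subset.sort_values("Flights", ascending=False).reset_index(drop=True)
print(ranked)
```

Passing a frame sorted this way to `px.bar` should draw the states from busiest to quietest, since a categorical x-axis normally keeps the order in which values appear.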
In [ ]:
# Total number of sampled flights per reporting airline (used as the bubble size)
bub_data = sampled_data.groupby('Reporting_Airline')['Flights'].sum().reset_index()
In [ ]:
bub_data
Out[ ]:
| | Reporting_Airline | Flights |
|---|---|---|
| 0 | 9E | 5.0 |
| 1 | AA | 57.0 |
| 2 | AS | 14.0 |
| 3 | B6 | 10.0 |
| 4 | CO | 12.0 |
| 5 | DL | 66.0 |
| 6 | EA | 4.0 |
| 7 | EV | 11.0 |
| 8 | F9 | 4.0 |
| 9 | FL | 3.0 |
| 10 | HA | 3.0 |
| 11 | HP | 7.0 |
| 12 | KH | 1.0 |
| 13 | MQ | 27.0 |
| 14 | NK | 3.0 |
| 15 | NW | 26.0 |
| 16 | OH | 8.0 |
| 17 | OO | 28.0 |
| 18 | PA (1) | 1.0 |
| 19 | PI | 1.0 |
| 20 | PS | 1.0 |
| 21 | TW | 14.0 |
| 22 | UA | 51.0 |
| 23 | US | 43.0 |
| 24 | VX | 1.0 |
| 25 | WN | 86.0 |
| 26 | XE | 6.0 |
| 27 | YV | 6.0 |
| 28 | YX | 1.0 |
In [ ]:
fig = px.scatter(bub_data, x="Reporting_Airline", y="Flights", size="Flights",
hover_name="Reporting_Airline", title='Reporting Airline vs Number of Flights', size_max=60)
fig.show()
In [ ]:
sampled_data['ArrDelay'] = sampled_data['ArrDelay'].fillna(0)
sampled_data['ArrDelay']
Out[ ]:
5312 32.0
18357 -1.0
6428 -5.0
15414 -2.0
10610 -11.0
...
18946 8.0
16291 -5.0
21818 -14.0
24116 88.0
16705 4.0
Name: ArrDelay, Length: 500, dtype: float64
In [ ]:
fig = px.histogram(sampled_data, x="ArrDelay", title='Distribution of Arrival Delays')
fig.show()
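Filling missing delays with 0 keeps every flight in the histogram, but it also adds weight to the bin around zero. A minimal sketch of the difference between filling and dropping; the four delays are taken from the output above, while the two missing entries are made up for illustration:

```python
import pandas as pd

# Four delays from the output above plus two hypothetical missing values.
toy = pd.Series([32.0, -1.0, None, -5.0, None, 88.0], name="ArrDelay")

filled = toy.fillna(0)   # missing delays become 0
dropped = toy.dropna()   # missing delays are left out

# How many extra zeros the fill introduced, and how the mean shifts.
zeros_added = int((filled == 0).sum() - (toy == 0).sum())
print(zeros_added, filled.mean(), dropped.mean())
```

Which option is better depends on the question: `fillna(0)` treats an unknown delay as an on-time arrival, while `dropna()` describes only flights with a recorded delay.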
Additional Notes¶
Any extra observations or ideas for future projects will go here.
In [ ]:
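# Size each slice by number of flights; summing month numbers (values='Month') has no clear meaning
fig = px.pie(sampled_data, values='Flights', names='DistanceGroup', title='Proportion of flights by distance group')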
fig.show()
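`px.pie` adds up the `values` column for every distinct label in `names`, so each slice is a group sum and the choice of `values` column decides what the slices mean. The pandas sketch below reproduces that aggregation on made-up rows (the `toy` frame is hypothetical) and compares summing `Flights` with summing `Month`:

```python
import pandas as pd

# Made-up rows: two flights in distance group 1, three in group 2.
toy = pd.DataFrame({
    "DistanceGroup": [1, 1, 2, 2, 2],
    "Month": [1, 12, 6, 6, 6],
    "Flights": [1.0, 1.0, 1.0, 1.0, 1.0],
})

# Slice sizes are sums of the chosen column within each group.
by_flights = toy.groupby("DistanceGroup")["Flights"].sum()
by_month = toy.groupby("DistanceGroup")["Month"].sum()

# Convert the sums to shares of the whole pie.
share_by_flights = (by_flights / by_flights.sum()).round(3)
share_by_month = (by_month / by_month.sum()).round(3)
print(share_by_flights.tolist(), share_by_month.tolist())
```

Summing `Flights` gives each group's share of flights; summing `Month` weights every flight by its month number, so flights late in the year count for more than flights early in the year.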
Appendix¶
Supporting code, references, or resources for my Plotly experiments.
In [ ]:
fig = px.sunburst(sampled_data, path=['Month', 'DestStateName'], values='Flights', title='Flights by Month and Destination State')
fig.show()
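In a sunburst built from `path=['Month', 'DestStateName']`, the outer ring holds one wedge per month and state pair, and each inner month wedge should, as far as I understand the `path` behaviour, equal the sum of its children. A pandas sketch of the same two-level aggregation on made-up rows (`toy`, `leaves`, and `inner` are hypothetical names):

```python
import pandas as pd

# Made-up rows: one flight per row, as in the Flights column.
toy = pd.DataFrame({
    "Month": [1, 1, 1, 2],
    "DestStateName": ["Texas", "Texas", "California", "Texas"],
    "Flights": [1.0, 1.0, 1.0, 1.0],
})

# Outer ring: flights per (month, state) pair.
leaves = toy.groupby(["Month", "DestStateName"])["Flights"].sum()

# Inner ring: each month is the total of its states.
inner = leaves.groupby(level="Month").sum()
print(leaves)
print(inner)
```

Checking these sums against the hover values in the chart is a quick way to confirm the hierarchy is being read as intended.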