SpaceX Falcon 9 Launch Data: End-to-End Personal Data Science Project¶
Welcome to my personal data science project! This notebook is the first step in a progressive, end-to-end workflow where I explore, collect, clean, and analyze SpaceX Falcon 9 and Falcon Heavy launch data. My goal is to build a robust, reproducible pipeline for real-world data science, from raw data acquisition to actionable insights and machine learning.
Project Overview¶
This project is structured as a series of notebooks, each representing a key stage in the data science lifecycle:
- Web Scraping & Data Collection (this notebook): Gather launch records from Wikipedia and SpaceX APIs.
- Data Wrangling & Cleaning: Prepare and clean the raw data for analysis.
- Exploratory Data Analysis (EDA): Visualize and understand the data, uncovering trends and patterns.
- Feature Engineering & SQL Analysis: Transform data and use SQL for deeper insights.
- Machine Learning & Prediction: Build, tune, and compare models to predict Falcon 9 first stage landing outcomes.
- Dashboarding & Communication: Present findings with interactive dashboards and clear visualizations.
Each notebook builds on the previous, creating a seamless, reproducible workflow.
Step 1: Web Scraping & Data Collection¶
In this notebook, I scrape Falcon 9 and Falcon Heavy launch records from Wikipedia, parse the HTML tables, and prepare the data for further analysis. This forms the foundation for all subsequent steps in the project.
Import Required Libraries¶
import requests
from bs4 import BeautifulSoup
import re
import unicodedata
import pandas as pd
Helper Functions¶
These functions assist in extracting and cleaning data from the HTML tables.
def date_time(table_cells):
    """Return the date and time strings (first two strings) from an HTML table cell."""
    return [dt.strip() for dt in list(table_cells.strings)][0:2]
def booster_version(table_cells):
    """Return the booster version string from an HTML table cell."""
    return ''.join([bv for i, bv in enumerate(table_cells.strings) if i % 2 == 0][0:-1])
def landing_status(table_cells):
    """Return the landing status (first string) from an HTML table cell."""
    return list(table_cells.strings)[0]
def get_mass(table_cells):
    """Return the payload mass (e.g. '15,600 kg') from an HTML table cell, or 0 if empty."""
    mass = unicodedata.normalize("NFKD", table_cells.text).strip()
    if mass:
        new_mass = mass[0:mass.find("kg") + 2]
    else:
        new_mass = 0
    return new_mass
def extract_column_from_header(row):
    """Return a clean column name from a header cell, or None for purely numeric headers."""
    if row.br:
        row.br.extract()
    if row.a:
        row.a.extract()
    if row.sup:
        row.sup.extract()
    column_name = ' '.join(row.contents)
    # Year-grouping headers are purely numeric; return None so they are skipped
    if not column_name.strip().isdigit():
        return column_name.strip()
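The `get_mass` helper normalizes the cell text before slicing out the `kg` value. The reason is that Wikipedia mass cells typically separate the number and unit with a non-breaking space (U+00A0), which a plain string search would still match but which is awkward to carry downstream. A minimal sketch of the normalization step in isolation (the sample value is hypothetical):

```python
import unicodedata

# Wikipedia mass cells often use a non-breaking space (U+00A0) between
# number and unit; NFKD normalization decomposes it into a plain space,
# so the "kg" slicing used in get_mass yields ordinary text.
raw = "15,600\u00a0kg"
clean = unicodedata.normalize("NFKD", raw).strip()
mass = clean[0:clean.find("kg") + 2]
print(mass)  # 15,600 kg
```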
Data Collection¶
Scrape the Falcon 9 and Falcon Heavy launch records from Wikipedia (snapshot as of June 9, 2021).
static_url = "https://en.wikipedia.org/w/index.php?title=List_of_Falcon_9_and_Falcon_Heavy_launches&oldid=1027686922"
response = requests.get(static_url)
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
else:
    raise Exception(f"Failed to retrieve data. Status code: {response.status_code}")
# Find all tables and select the third one (launch records)
html_tables = soup.find_all('table')
first_launch_table = html_tables[2]
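Hard-coding index `[2]` is brittle if the page layout ever changes, which is one reason the project pins a fixed `oldid` snapshot. A cheap extra guard is to check that the selected table actually contains a known header. A toy sketch with a made-up three-table page mirroring that structure:

```python
from bs4 import BeautifulSoup

# Toy page with three tables, like the snapshot where the launch records
# live in the third table. Asserting on a known header catches the case
# where the tables get reordered.
html = """
<table><tr><th>Rocket configurations</th></tr></table>
<table><tr><th>Launch statistics</th></tr></table>
<table class="wikitable plainrowheaders collapsible">
  <tr><th>Flight No.</th><th>Launch site</th></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
launch_table = soup.find_all("table")[2]
headers = [th.get_text(strip=True) for th in launch_table.find_all("th")]
assert "Flight No." in headers, "unexpected table layout"
print(headers)
```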
# Extract column names from the header cells
column_names = []
th_elements = first_launch_table.find_all('th')
for th in th_elements:
    name = extract_column_from_header(th)
    if name is not None and len(name) > 0:
        column_names.append(name)
# Initialize launch_dict
launch_dict = dict.fromkeys(column_names)
# Remove irrelevant column if present
if 'Date and time ( )' in launch_dict:
    del launch_dict['Date and time ( )']
launch_dict['Flight No.'] = []
launch_dict['Launch site'] = []
launch_dict['Payload'] = []
launch_dict['Payload mass'] = []
launch_dict['Orbit'] = []
launch_dict['Customer'] = []
launch_dict['Launch outcome'] = []
launch_dict['Version Booster'] = []
launch_dict['Booster landing'] = []
launch_dict['Date'] = []
launch_dict['Time'] = []
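Each column list is assigned individually above rather than with something like `dict.fromkeys(column_names, [])`. That is deliberate: `dict.fromkeys` with a mutable default would make every key share a single list. A quick sketch of the pitfall (toy keys):

```python
# dict.fromkeys with a mutable default shares ONE list across every key
shared = dict.fromkeys(["Flight No.", "Payload"], [])
shared["Flight No."].append("1")
print(shared["Payload"])  # ['1'] -- both keys point at the same list object

# A dict comprehension (or per-key assignment, as above) gives each key its own list
safe = {key: [] for key in ["Flight No.", "Payload"]}
safe["Flight No."].append("1")
print(safe["Payload"])  # []
```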
# Parse the launch records table and fill launch_dict
for table_number, table in enumerate(soup.find_all('table', "wikitable plainrowheaders collapsible")):
for rows in table.find_all("tr"):
if rows.th and rows.th.string:
flight_number = rows.th.string.strip()
flag = flight_number.isdigit()
else:
flag = False
row = rows.find_all('td')
if flag:
datatimelist = date_time(row[0])
date = datatimelist[0].strip(',')
time = datatimelist[1]
bv = booster_version(row[1])
if not bv and row[1].a:
bv = row[1].a.string
launch_site = row[2].a.string if row[2].a else None
payload = row[3].a.string if row[3].a else None
payload_mass = get_mass(row[4])
orbit = row[5].a.string if row[5].a else None
customer = row[6].a.string if row[6].a else None
launch_outcome = list(row[7].strings)[0] if row[7].strings else None
booster_landing = landing_status(row[8]) if len(row) > 8 else None
launch_dict['Flight No.'].append(flight_number)
launch_dict['Date'].append(date)
launch_dict['Time'].append(time)
launch_dict['Version Booster'].append(bv)
launch_dict['Launch site'].append(launch_site)
launch_dict['Payload'].append(payload)
launch_dict['Payload mass'].append(payload_mass)
launch_dict['Orbit'].append(orbit)
launch_dict['Customer'].append(customer)
launch_dict['Launch outcome'].append(launch_outcome)
launch_dict['Booster landing'].append(booster_landing)
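The `isdigit()` flag in the loop above is what filters real launch rows from separator and note rows: only header cells whose text is purely numeric are treated as flight numbers. A small illustration with hypothetical header-cell values:

```python
# Hypothetical header-cell text: real flight numbers are purely numeric,
# while separator rows carry labels, empty cells, or mixed text.
header_cells = ["1", "112", "FH 3", "", "N/A"]
flags = [text.isdigit() for text in header_cells]
print(flags)
```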
# Create the DataFrame; wrapping each list in pd.Series pads shorter columns with NaN
df = pd.DataFrame({key: pd.Series(value) for key, value in launch_dict.items()})
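Wrapping each list in `pd.Series` guards against unequal column lengths, which can happen when a row is missing a trailing cell (e.g. no booster-landing entry). Passing the raw lists straight to `pd.DataFrame` would raise a `ValueError` in that case; the `Series` construction aligns on the index and pads with `NaN` instead. A small sketch with toy data:

```python
import pandas as pd

# Columns of unequal length: pd.DataFrame(dict of lists) would raise
# "ValueError: All arrays must be of the same length"; wrapping each
# list in pd.Series pads the shorter column with NaN instead.
data = {"Flight No.": ["1", "2", "3"], "Booster landing": ["Failure", "Success"]}
df = pd.DataFrame({key: pd.Series(value) for key, value in data.items()})
print(df.shape)                             # (3, 2)
print(int(df["Booster landing"].isna().sum()))  # 1
```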
# Save to CSV
csv_path = 'spacex-web-scraped.csv'
df.to_csv(csv_path, index=False)
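Since the later notebooks reload this CSV, the round-trip behavior matters: `index=False` keeps the row index out of the file, so re-reading does not introduce a spurious `Unnamed: 0` column. A quick sketch using an in-memory buffer and hypothetical values:

```python
import io
import pandas as pd

# Write with index=False, then read back: the columns survive unchanged
# and no "Unnamed: 0" index column appears.
df = pd.DataFrame({"Flight No.": ["1"], "Payload mass": ["15,600 kg"]})
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer, dtype=str)
print(list(reloaded.columns))  # ['Flight No.', 'Payload mass']
```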
# Preview the DataFrame
df.head()