My Approach to Tweet Preprocessing¶
By Mohammad Sayem Chowdhury
In this notebook, I share my personal workflow for preprocessing tweets, a crucial step in my sentiment analysis projects. I believe understanding and customizing each step of the pipeline helps me get the most out of my data. Here, I walk through how I use Python and NLTK to clean and prepare Twitter data for analysis.
Getting Started¶
For my sentiment analysis work, I rely on the Natural Language Toolkit (NLTK) to handle and process Twitter data. NLTK provides convenient modules for collecting, cleaning, and analyzing tweets. In this notebook, I use a sample Twitter dataset included with NLTK, which is already labeled for positive and negative sentiment. This helps me quickly test and refine my preprocessing pipeline.
import nltk # My go-to library for NLP tasks
from nltk.corpus import twitter_samples # Sample Twitter dataset from NLTK
import matplotlib.pyplot as plt # For visualizing data
import random # For selecting random samples
About the Twitter Dataset¶
The NLTK sample dataset contains 5,000 positive and 5,000 negative tweets, making it perfectly balanced for testing. While real-world data is rarely this balanced, I find it useful for developing and evaluating my preprocessing steps. Later, I can adapt these methods to more complex, imbalanced datasets.
You can download the dataset to your workspace (or your local machine) with:
# Download the sample Twitter dataset (if not already present)
nltk.download('twitter_samples')
[nltk_data] Downloading package twitter_samples to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!
True
I load the positive and negative tweets using NLTK's strings() method. This gives me two lists of tweets, ready for exploration and cleaning.
# Load positive and negative tweets from the dataset
positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')
I like to check the number of tweets in each category and confirm the data structure before diving deeper.
print('Number of positive tweets:', len(positive_tweets))
print('Number of negative tweets:', len(negative_tweets))
print('\nType of positive_tweets:', type(positive_tweets))
print('Type of a tweet entry:', type(negative_tweets[0]))
Number of positive tweets: 5000
Number of negative tweets: 5000

Type of positive_tweets: <class 'list'>
Type of a tweet entry: <class 'str'>
The tweets are stored as lists of strings. To get a quick sense of the data balance, I visualize the counts using a pie chart. This is a simple but effective way to check class distribution before moving on.
# Create a pie chart to visualize class distribution
fig = plt.figure(figsize=(5, 5))
# labels for the two classes
labels = ['Positive', 'Negative']
# Sizes for each slice
sizes = [len(positive_tweets), len(negative_tweets)]
# Declare pie chart, where the slices will be ordered and plotted counter-clockwise:
plt.pie(sizes, labels=labels, autopct='%1.1f%%',
shadow=True, startangle=90)
# Equal aspect ratio ensures that pie is drawn as a circle.
plt.axis('equal')
# Display the chart
plt.show()
Exploring Raw Tweets¶
Before preprocessing, I always take a look at a few sample tweets. This helps me spot common patterns, quirks, or issues that might need special handling. Here, I print a random positive and negative tweet to get a feel for the data. (Note: Tweets are real and may contain explicit content.)
# Print a random positive tweet in green
print('\033[92m' + positive_tweets[random.randint(0, 4999)])
# Print a random negative tweet in red
print('\033[91m' + negative_tweets[random.randint(0, 4999)])
@steer_michael Dare you to run in the corridor :-)
@cooldigangana @DiganganaS I want to attend ur birthday plssssssssssssssss :(
Notice the emoticons and URLs that appear in many of the tweets; knowing these patterns will come in handy in the next steps.
My Preprocessing Pipeline for Sentiment Analysis¶
For any NLP project, I find that careful data preprocessing is essential. My typical steps include:
- Tokenizing the text
- Converting to lowercase
- Removing stop words and punctuation
- Stemming words to their root form
I'll walk through each step using a sample tweet from the dataset, showing how I transform the raw text into something ready for analysis.
# Select a sample tweet to demonstrate preprocessing steps
sample_tweet = positive_tweets[2277]
print(sample_tweet)
My beautiful sunflowers on a sunny Friday morning off :) #sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i
Next, I import a few more libraries to help with text cleaning and tokenization.
# download the stopwords from NLTK
nltk.download('stopwords')
[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
True
import re # library for regular expression operations
import string # for string operations
from nltk.corpus import stopwords # module for stop words that come with NLTK
from nltk.stem import PorterStemmer # module for stemming
from nltk.tokenize import TweetTokenizer # module for tokenizing strings
Removing Twitter-Specific Text¶
Tweets often contain hashtags, retweet marks, and links. I use regular expressions to clean these out, making the text easier to analyze.
print('\033[92m' + sample_tweet)
print('\033[94m')
# Remove retweet text "RT"
cleaned_tweet = re.sub(r'^RT[\s]+', '', sample_tweet)
# Remove hyperlinks
cleaned_tweet = re.sub(r'https?://[^\s\n\r]+', '', cleaned_tweet)
# Remove hashtags (just the # symbol)
cleaned_tweet = re.sub(r'#', '', cleaned_tweet)
print(cleaned_tweet)
My beautiful sunflowers on a sunny Friday morning off :) #sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i

My beautiful sunflowers on a sunny Friday morning off :) sunflowers favourites happy Friday off…
Tokenizing the Text¶
Tokenization splits the tweet into individual words. I also convert everything to lowercase at this stage. NLTK's TweetTokenizer makes this process straightforward.
print()
print('\033[92m' + cleaned_tweet)
print('\033[94m')
# Initialize the tokenizer: lowercase tokens, strip @handles, and shorten repeated characters (e.g. 'plssssss' -> 'plsss')
my_tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
# Tokenize the cleaned tweet
tokens = my_tokenizer.tokenize(cleaned_tweet)
print()
print('Tokenized string:')
print(tokens)
My beautiful sunflowers on a sunny Friday morning off :) sunflowers favourites happy Friday off…

Tokenized string:
['my', 'beautiful', 'sunflowers', 'on', 'a', 'sunny', 'friday', 'morning', 'off', ':)', 'sunflowers', 'favourites', 'happy', 'friday', 'off', '…']
Removing Stop Words and Punctuation¶
Next, I filter out common stop words and punctuation. These words don't add much meaning and can clutter the analysis. NLTK provides a handy list of stop words, but I sometimes customize it for specific projects.
# Import the English stop word list from NLTK
stopwords_english = stopwords.words('english')
print('Stop words\n')
print(stopwords_english)
print('\nPunctuation\n')
print(string.punctuation)
Stop words
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
Punctuation
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
The stop word list above contains words that can carry important meaning in some contexts, words like i, not, between, because, won, and against. For some projects I customize the list; for this walkthrough I use it in full.
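For example, if negation matters for a project (distinguishing 'not happy' from 'happy'), a small tweak keeps those words in the vocabulary. This is just an illustrative sketch, and the set of negation words below is my own choice, not something I apply later in this notebook:
# Keep negation words, which can flip the sentiment of a tweet
# (the exact set to keep is a hypothetical choice for this illustration)
negations = {'no', 'not', 'nor', "don't", "isn't", "wasn't", "won't"}
custom_stopwords = [w for w in stopwords.words('english') if w not in negations]
print(len(stopwords.words('english')), 'stop words ->', len(custom_stopwords), 'after keeping negations')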
For punctuation, I noted earlier that groupings like ':)' and '...' are worth keeping in tweets because they express emotion. In other domains, such as medical text, I would remove them along with the rest of the punctuation.
Time to clean up the tokenized tweet!
print()
print('\033[92m')
print(tokens)
print('\033[94m')
clean_tokens = []
for word in tokens:  # go through every word in the tokens list
    if (word not in stopwords_english and   # remove stop words
            word not in string.punctuation):  # remove punctuation
        clean_tokens.append(word)
print('Removed stop words and punctuation:')
print(clean_tokens)
['my', 'beautiful', 'sunflowers', 'on', 'a', 'sunny', 'friday', 'morning', 'off', ':)', 'sunflowers', 'favourites', 'happy', 'friday', 'off', '…']

Removed stop words and punctuation:
['beautiful', 'sunflowers', 'sunny', 'friday', 'morning', ':)', 'sunflowers', 'favourites', 'happy', 'friday', '…']
Notice that words like happy and sunny are preserved after cleaning. This step helps focus on the most meaningful parts of each tweet.
Stemming¶
Stemming reduces words to their root form, which helps group similar words together. For example, 'learning', 'learned', and 'learnt' all become 'learn'. I use NLTK's PorterStemmer for this step. Sometimes, the stemmed words aren't real words (like 'happi'), but they still help reduce vocabulary size and improve analysis.
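As a quick illustrative aside before stemming the tweet itself, here is how the Porter stemmer maps a few related word forms (the word list is my own choice for demonstration):
# Quick standalone check of how the Porter stemmer groups word forms
demo_stemmer = PorterStemmer()
for word in ['learning', 'learned', 'happy', 'sunflowers', 'beautiful']:
    print(word, '->', demo_stemmer.stem(word))
# 'learning' and 'learned' both map to 'learn'; 'happy' becomes the non-word 'happi'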
print()
print('\033[92m')
print(clean_tokens)
print('\033[94m')
# Initialize the stemmer
my_stemmer = PorterStemmer()
stemmed_tokens = []
for word in clean_tokens:
    stemmed_word = my_stemmer.stem(word)
    stemmed_tokens.append(stemmed_word)
print('Stemmed words:')
print(stemmed_tokens)
['beautiful', 'sunflowers', 'sunny', 'friday', 'morning', ':)', 'sunflowers', 'favourites', 'happy', 'friday', '…']

Stemmed words:
['beauti', 'sunflow', 'sunni', 'friday', 'morn', ':)', 'sunflow', 'favourit', 'happi', 'friday', '…']
That's it! Now I have a clean set of words ready for the next stage of my sentiment analysis project.
My process_tweet() Function¶
To streamline preprocessing, I use a helper function called process_tweet(), which combines all the steps above. You can find its implementation in my utils.py file. This function makes it easy to preprocess any tweet with a single call.
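The actual implementation lives in utils.py; as a rough sketch that simply combines the steps shown above (the real file may differ in details), it might look something like this:
import re
import string

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

def process_tweet(tweet):
    """Clean, tokenize, de-noise, and stem a tweet; returns a list of tokens."""
    # Strip retweet marks, hyperlinks, and the '#' symbol (same regexes as above)
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    tweet = re.sub(r'https?://[^\s\n\r]+', '', tweet)
    tweet = re.sub(r'#', '', tweet)

    # Tokenize: lowercase, drop @handles, shorten repeated characters
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
    tokens = tokenizer.tokenize(tweet)

    # Remove stop words and punctuation, then stem what remains
    stemmer = PorterStemmer()
    stopwords_english = stopwords.words('english')
    return [stemmer.stem(word) for word in tokens
            if word not in stopwords_english and word not in string.punctuation]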
from utils import process_tweet # Import my custom tweet preprocessing function
# Use the same sample tweet
sample_tweet = positive_tweets[2277]
print()
print('\033[92m')
print(sample_tweet)
print('\033[94m')
# Call my helper function
tweet_processed = process_tweet(sample_tweet)
print('Preprocessed tweet:')
print(tweet_processed)
My beautiful sunflowers on a sunny Friday morning off :) #sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i

Preprocessed tweet:
['beauti', 'sunflow', 'sunni', 'friday', 'morn', ':)', 'sunflow', 'favourit', 'happi', 'friday', '…']
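From here, applying the function to the full dataset is a one-liner per class. A minimal sketch, assuming process_tweet is imported as above:
# Preprocess both classes in one pass (can take a moment for 10,000 tweets)
processed_positive = [process_tweet(t) for t in positive_tweets]
processed_negative = [process_tweet(t) for t in negative_tweets]

print(processed_positive[2277])  # the same sample tweet, now fully preprocessed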
Thank you for following along with my tweet preprocessing workflow! I hope this gives you insight into my approach and inspires you to customize your own pipeline.
Notebook by Mohammad Sayem Chowdhury, June 2025