Case Study: Sentiment Analysis (Personal Project)¶
By Mohammad Sayem Chowdhury
Welcome to my personal notebook on sentiment analysis. Here, I walk through my approach to preparing data, building models, and evaluating results for sentiment classification. All code, analysis, and commentary are my own.
Note: Some of the cells in this notebook are computationally expensive. To reduce runtime, you can work with a 10% sample of the data by uncommenting the sampling line in the first code cell.
Data Prep¶
In [2]:
import pandas as pd
import numpy as np
# Read in the data
df = pd.read_csv('Amazon_Unlocked_Mobile.csv')
# Optionally sample 10% of the data to speed up computation
# (uncomment the line below to work with the smaller subset)
# df = df.sample(frac=0.1, random_state=10)
df.head()
Out[2]:
| | Product Name | Brand Name | Price | Rating | Reviews | Review Votes |
|---|---|---|---|---|---|---|
| 0 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 5 | I feel so LUCKY to have found this used (phone... | 1.0 |
| 1 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 4 | nice phone, nice up grade from my pantach revu... | 0.0 |
| 2 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 5 | Very pleased | 0.0 |
| 3 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 4 | It works good but it goes slow sometimes but i... | 0.0 |
| 4 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 4 | Great phone to replace my lost phone. The only... | 0.0 |
In [3]:
# Drop missing values
df.dropna(inplace=True)
# Remove neutral reviews (rating equal to 3)
df = df[df['Rating'] != 3]
# Encode 4s and 5s as 1 (rated positively)
# Encode 1s and 2s as 0 (rated poorly)
df['Positively Rated'] = np.where(df['Rating'] > 3, 1, 0)
df.head(10)
Out[3]:
| | Product Name | Brand Name | Price | Rating | Reviews | Review Votes | Positively Rated |
|---|---|---|---|---|---|---|---|
| 0 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 5 | I feel so LUCKY to have found this used (phone... | 1.0 | 1 |
| 1 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 4 | nice phone, nice up grade from my pantach revu... | 0.0 | 1 |
| 2 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 5 | Very pleased | 0.0 | 1 |
| 3 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 4 | It works good but it goes slow sometimes but i... | 0.0 | 1 |
| 4 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 4 | Great phone to replace my lost phone. The only... | 0.0 | 1 |
| 5 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 1 | I already had a phone with problems... I know ... | 1.0 | 0 |
| 6 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 2 | The charging port was loose. I got that solder... | 0.0 | 0 |
| 7 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 2 | Phone looks good but wouldn't stay charged, ha... | 0.0 | 0 |
| 8 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 5 | I originally was using the Samsung S2 Galaxy f... | 0.0 | 1 |
| 11 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 5 | This is a great product it came after two days... | 0.0 | 1 |
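The encoding step above can be sketched on toy data (the ratings below are made up, with the neutral 3s already removed):

```python
import numpy as np
import pandas as pd

# Hypothetical ratings standing in for df['Rating']
ratings = pd.Series([5, 4, 1, 2, 5])

# Ratings above 3 become 1 (positive), the rest 0 (negative)
labels = np.where(ratings > 3, 1, 0)
print(labels)  # [1 1 0 0 1]
```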
In [4]:
# Most ratings are positive
df['Positively Rated'].mean()
Out[4]:
0.74826860258793226
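The mean of a 0/1 column equals the fraction of positive labels, which is why this single number summarizes the class balance. Equivalently, `value_counts(normalize=True)` reports both class proportions at once. A minimal sketch on made-up labels:

```python
import pandas as pd

# Made-up labels standing in for the 'Positively Rated' column
labels = pd.Series([1, 1, 1, 0, 1, 0, 1, 1])

# The mean of a 0/1 column is the fraction of positive labels
positive_rate = labels.mean()

# value_counts(normalize=True) shows both class proportions at once
proportions = labels.value_counts(normalize=True)

print(positive_rate)   # 0.75
print(proportions[1])  # 0.75
```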
In [5]:
from sklearn.model_selection import train_test_split
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df['Reviews'],
df['Positively Rated'],
random_state=0)
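By default `train_test_split` holds out 25% of the rows for testing. With imbalanced labels like these, passing `stratify=y` additionally preserves the class ratio in both splits; a small sketch on synthetic data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 80% positive labels, mimicking an imbalanced dataset
X = np.arange(100)
y = np.array([0] * 20 + [1] * 80)

# Default split holds out 25% of the rows for testing;
# stratify=y keeps the class ratio identical in both splits
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)
print(len(X_tr), len(X_te))  # 75 25
print(y_te.mean())           # 0.8
```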
In [6]:
print('X_train first entry:\n\n', X_train.iloc[0])
print('\n\nX_train shape: ', X_train.shape)
X_train first entry:

 I bought a BB Black and was deliveried a White BB.Really is not a serious provider...Next time is better to cancel the order.


X_train shape:  (231207,)
CountVectorizer¶
In [7]:
from sklearn.feature_extraction.text import CountVectorizer
# Fit the CountVectorizer to the training data
vect = CountVectorizer().fit(X_train)
In [8]:
vect.get_feature_names()[::2000]
Out[8]:
['00', '4less', 'adr6275', 'assignment', 'blazingly', 'cassettes', 'condishion', 'debi', 'dollarsshipping', 'esteem', 'flashy', 'gorila', 'human', 'irullu', 'like', 'microsaudered', 'nightmarish', 'p770', 'poori', 'quirky', 'responseive', 'send', 'sos', 'synch', 'trace', 'utiles', 'withstanding']
In [9]:
len(vect.get_feature_names())
Out[9]:
53216
In [12]:
# transform the documents in the training data to a document-term matrix
X_train_vectorized = vect.transform(X_train)
X_train_vectorized  # bag-of-words representation: the count of each word in each document
Out[12]:
<231207x53216 sparse matrix of type '<class 'numpy.int64'>' with 6117776 stored elements in Compressed Sparse Row format>
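The CSR format stores only the nonzero counts, which is essential here: with 231,207 documents and 53,216 terms, only about 6.1 million of the 12.3 billion cells are nonzero. A toy illustration of the same idea (the corpus below is made up):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (made up) to illustrate the document-term matrix
docs = ['great phone great price', 'terrible battery', 'phone works great']

vect = CountVectorizer().fit(docs)
X = vect.transform(docs)  # CSR sparse matrix: rows = documents, columns = terms

# Only nonzero counts are stored; most of the matrix is empty
sparsity = 1 - X.nnz / (X.shape[0] * X.shape[1])
print(X.shape, X.nnz)     # (3, 6) 8
print(round(sparsity, 2)) # 0.56
```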
In [13]:
from sklearn.linear_model import LogisticRegression
# Train the model
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)
Out[13]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)
In [14]:
from sklearn.metrics import roc_auc_score
# Predict the transformed test documents
predictions = model.predict(vect.transform(X_test))
print('AUC: ', roc_auc_score(y_test, predictions))
AUC: 0.92648398605
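Note that `roc_auc_score` is computed here on hard 0/1 predictions, which evaluates only a single decision threshold. Passing continuous scores (e.g. from `decision_function`) uses the classifier's full ranking and generally yields a higher, more informative AUC. A minimal sketch with made-up scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1, 1])

# Continuous classifier scores (made up), e.g. decision_function output
scores = np.array([-1.0, 0.2, 0.5, 1.0, 2.0])

# The same scores collapsed to hard 0/1 predictions at threshold 0
hard = (scores > 0).astype(int)

print(roc_auc_score(y_true, scores))  # 1.0  -- every positive outranks every negative
print(roc_auc_score(y_true, hard))    # 0.75 -- ranking information is lost
```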
In [15]:
# get the feature names as numpy array
feature_names = np.array(vect.get_feature_names())
# Sort the coefficients from the model
sorted_coef_index = model.coef_[0].argsort()
# Find the 10 smallest and 10 largest coefficients
# The 10 largest coefficients are being indexed using [:-11:-1]
# so the list returned is in order of largest to smallest
print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))
Smallest Coefs:
['worst' 'false' 'worthless' 'junk' 'garbage' 'mony' 'useless' 'messing' 'unusable' 'horrible']

Largest Coefs:
['excelent' 'excelente' 'exelente' 'excellent' 'loving' 'loves' 'efficient' 'perfecto' 'amazing' 'love']
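The `argsort` slicing used above can be seen on a tiny array: `[:10]` takes the indices of the smallest coefficients, while `[:-11:-1]` walks the sorted indices backwards to get the largest ones, largest first.

```python
import numpy as np

# Toy coefficient vector (made up)
coefs = np.array([0.5, -2.0, 3.0, -0.1, 1.0])
order = coefs.argsort()  # indices from most negative to most positive

print(coefs[order[:2]])      # two smallest coefficients
print(coefs[order[:-3:-1]])  # two largest, ordered largest to smallest
```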
Tfidf¶
In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer
# Fit the TfidfVectorizer to the training data, specifying a minimum document frequency of 5
vect = TfidfVectorizer(min_df=5).fit(X_train)
len(vect.get_feature_names())
Out[16]:
17951
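`min_df=5` discards any term that appears in fewer than 5 documents, which is what shrinks the vocabulary from 53,216 to 17,951. The effect on a hypothetical mini-corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus: 'good' appears in three documents,
# every other term in exactly one
docs = ['good phone', 'good price', 'good battery', 'rare typo']

full = TfidfVectorizer().fit(docs)            # no frequency floor
pruned = TfidfVectorizer(min_df=2).fit(docs)  # drop terms in < 2 documents

print(sorted(full.vocabulary_))    # all six terms
print(sorted(pruned.vocabulary_))  # only 'good' survives
```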
In [17]:
X_train_vectorized = vect.transform(X_train)
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)
predictions = model.predict(vect.transform(X_test))
print('AUC: ', roc_auc_score(y_test, predictions))
AUC: 0.926610066675
In [18]:
feature_names = np.array(vect.get_feature_names())
sorted_tfidf_index = X_train_vectorized.max(0).toarray()[0].argsort()
print('Smallest tfidf:\n{}\n'.format(feature_names[sorted_tfidf_index[:10]]))
print('Largest tfidf: \n{}'.format(feature_names[sorted_tfidf_index[:-11:-1]]))
Smallest tfidf:
['commenter' 'pthalo' 'warmness' 'storageso' 'aggregration' '1300' '625nits' 'a10' 'submarket' 'brawns']

Largest tfidf:
['defective' 'batteries' 'gooood' 'epic' 'luis' 'goood' 'basico' 'aceptable' 'problems' 'excellant']
In [19]:
sorted_coef_index = model.coef_[0].argsort()
print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))
Smallest Coefs:
['not' 'worst' 'useless' 'disappointed' 'terrible' 'return' 'waste' 'poor' 'horrible' 'doesn']

Largest Coefs:
['love' 'great' 'excellent' 'perfect' 'amazing' 'awesome' 'perfectly' 'easy' 'best' 'loves']
In [20]:
# These reviews are treated the same by our current model
print(model.predict(vect.transform(['not an issue, phone is working',
'an issue, phone is not working'])))
[0 0]
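This failure is inherent to unigram bag-of-words: the two reviews contain exactly the same multiset of words, so they map to identical count vectors and no classifier on top can distinguish them. A quick check:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

reviews = ['not an issue, phone is working',
           'an issue, phone is not working']

# Unigram bag-of-words discards word order: both reviews contain
# exactly the same words, so their vectors are identical
vect = CountVectorizer().fit(reviews)
a, b = vect.transform(reviews).toarray()
print(np.array_equal(a, b))  # True
```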
n-grams¶
In [21]:
# Fit the CountVectorizer to the training data, specifying a minimum
# document frequency of 5 and extracting 1-grams and 2-grams
vect = CountVectorizer(min_df=5, ngram_range=(1,2)).fit(X_train)
X_train_vectorized = vect.transform(X_train)
len(vect.get_feature_names())
Out[21]:
198917
In [22]:
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)
predictions = model.predict(vect.transform(X_test))
print('AUC: ', roc_auc_score(y_test, predictions))
AUC: 0.967143758101
In [23]:
feature_names = np.array(vect.get_feature_names())
sorted_coef_index = model.coef_[0].argsort()
print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))
Smallest Coefs:
['no good' 'worst' 'junk' 'not good' 'not happy' 'horrible' 'garbage' 'terrible' 'looks ok' 'nope']

Largest Coefs:
['not bad' 'excelent' 'excelente' 'excellent' 'perfect' 'no problems' 'exelente' 'awesome' 'no issues' 'great']
In [24]:
# These reviews are now correctly identified
print(model.predict(vect.transform(['not an issue, phone is working',
'an issue, phone is not working'])))
[1 0]
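One way to package the final vectorizer-plus-classifier setup (a sketch on a tiny made-up corpus, not the notebook's own code) is an sklearn `Pipeline`, which keeps the fitted vocabulary and the model together so the same transformation is reused at predict time:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny made-up corpus standing in for the review data
texts = ['great phone, love it', 'works great', 'terrible, broke fast',
         'waste of money', 'love this phone', 'horrible battery']
labels = [1, 1, 0, 0, 1, 0]

# The pipeline fits the vectorizer and classifier together and reuses
# the same fitted vocabulary when predicting on new text
clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)

preds = clf.predict(['love it', 'terrible waste'])
print(preds)
```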
This notebook and all analysis were created by Mohammad Sayem Chowdhury as a personal data science showcase.
Thank you for exploring my approach to sentiment analysis! If you have any feedback or suggestions, feel free to reach out.