Case Study: Sentiment Analysis (Personal Project)¶

By Mohammad Sayem Chowdhury

Welcome to my personal notebook on sentiment analysis. Here, I walk through my approach to preparing data, building models, and evaluating results for sentiment classification. All code, analysis, and commentary are my own.


Note: Some of the cells in this notebook are computationally expensive. To reduce runtime, you can uncomment the sampling line in the first cell to work with a 10% subset of the data.

Data Prep¶

In [2]:
import pandas as pd
import numpy as np

# Read in the data
df = pd.read_csv('Amazon_Unlocked_Mobile.csv')

# Optionally sample 10% of the rows to speed up computation
# (uncomment the next line to use the subset)
# df = df.sample(frac=0.1, random_state=10)

df.head()
Out[2]:
Product Name Brand Name Price Rating Reviews Review Votes
0 "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... Samsung 199.99 5 I feel so LUCKY to have found this used (phone... 1.0
1 "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... Samsung 199.99 4 nice phone, nice up grade from my pantach revu... 0.0
2 "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... Samsung 199.99 5 Very pleased 0.0
3 "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... Samsung 199.99 4 It works good but it goes slow sometimes but i... 0.0
4 "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... Samsung 199.99 4 Great phone to replace my lost phone. The only... 0.0
In [3]:
# Drop missing values
df.dropna(inplace=True)

# Drop neutral reviews (Rating == 3)
df = df[df['Rating'] != 3]

# Encode 4s and 5s as 1 (rated positively)
# Encode 1s and 2s as 0 (rated poorly)
df['Positively Rated'] = np.where(df['Rating'] > 3, 1, 0)
df.head(10)
Out[3]:
Product Name Brand Name Price Rating Reviews Review Votes Positively Rated
0 "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... Samsung 199.99 5 I feel so LUCKY to have found this used (phone... 1.0 1
1 "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... Samsung 199.99 4 nice phone, nice up grade from my pantach revu... 0.0 1
2 "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... Samsung 199.99 5 Very pleased 0.0 1
3 "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... Samsung 199.99 4 It works good but it goes slow sometimes but i... 0.0 1
4 "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... Samsung 199.99 4 Great phone to replace my lost phone. The only... 0.0 1
5 "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... Samsung 199.99 1 I already had a phone with problems... I know ... 1.0 0
6 "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... Samsung 199.99 2 The charging port was loose. I got that solder... 0.0 0
7 "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... Samsung 199.99 2 Phone looks good but wouldn't stay charged, ha... 0.0 0
8 "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... Samsung 199.99 5 I originally was using the Samsung S2 Galaxy f... 0.0 1
11 "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... Samsung 199.99 5 This is a great product it came after two days... 0.0 1
In [4]:
# Most ratings are positive
df['Positively Rated'].mean()
Out[4]:
0.74826860258793226
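
Since roughly 75% of the labels are positive, accuracy alone would be misleading: a model that always predicts "positive" is already right three times out of four, while its AUC is only chance level (0.5). A quick sketch of that majority-class baseline (my own illustrative check, not part of the original analysis):

import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# A constant "always positive" prediction scores ~75% accuracy,
# but its AUC stays at 0.5, which is why AUC is used below
y_all = df['Positively Rated']
always_positive = np.ones(len(y_all), dtype=int)
print('Baseline accuracy:', accuracy_score(y_all, always_positive))
print('Baseline AUC:', roc_auc_score(y_all, always_positive))
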
In [5]:
from sklearn.model_selection import train_test_split

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df['Reviews'], 
                                                    df['Positively Rated'], 
                                                    random_state=0)
In [6]:
print('X_train first entry:\n\n', X_train.iloc[0])
print('\n\nX_train shape: ', X_train.shape)
X_train first entry:

 I bought a BB Black and was deliveried a White BB.Really is not a serious provider...Next time is better to cancel the order.


X_train shape:  (231207,)
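
By default train_test_split holds out 25% of the rows. One optional refinement (not used above, so the numbers in this notebook are unaffected) is to stratify the split so the training and test sets keep the same ~75/25 label ratio:

# Optional variant: a stratified split preserves the label ratio
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    df['Reviews'], df['Positively Rated'],
    random_state=0, stratify=df['Positively Rated'])
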

CountVectorizer¶

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

# Fit the CountVectorizer to the training data
vect = CountVectorizer().fit(X_train)
In [8]:
vect.get_feature_names()[::2000]
Out[8]:
['00',
 '4less',
 'adr6275',
 'assignment',
 'blazingly',
 'cassettes',
 'condishion',
 'debi',
 'dollarsshipping',
 'esteem',
 'flashy',
 'gorila',
 'human',
 'irullu',
 'like',
 'microsaudered',
 'nightmarish',
 'p770',
 'poori',
 'quirky',
 'responseive',
 'send',
 'sos',
 'synch',
 'trace',
 'utiles',
 'withstanding']
In [9]:
len(vect.get_feature_names())
Out[9]:
53216
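
A compatibility note: these cells were run on an older scikit-learn. In newer releases, get_feature_names() has been removed from the vectorizers in favor of get_feature_names_out(), which returns a numpy array directly:

# Equivalent call on newer scikit-learn (1.0+)
feature_names = vect.get_feature_names_out()
print(len(feature_names))
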
In [12]:
# transform the documents in the training data to a document-term matrix
X_train_vectorized = vect.transform(X_train)

X_train_vectorized  # bag-of-words representation: counts of how often each word appears in each document
Out[12]:
<231207x53216 sparse matrix of type '<class 'numpy.int64'>'
	with 6117776 stored elements in Compressed Sparse Row format>
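
To make the bag-of-words idea concrete, here is a tiny self-contained example on two toy sentences of my own (not from the dataset). Each row of the matrix is a document, each column a vocabulary word, and each entry a count:

from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ['good phone good battery', 'bad battery']
toy_vect = CountVectorizer().fit(toy_docs)
print(toy_vect.get_feature_names())            # ['bad', 'battery', 'good', 'phone']
print(toy_vect.transform(toy_docs).toarray())  # [[0 1 2 1]
                                               #  [1 1 0 0]]
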
In [13]:
from sklearn.linear_model import LogisticRegression

# Train the model
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)
Out[13]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
In [14]:
from sklearn.metrics import roc_auc_score

# Predict the transformed test documents
predictions = model.predict(vect.transform(X_test))

print('AUC: ', roc_auc_score(y_test, predictions))
AUC:  0.92648398605
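
One caveat on this number: roc_auc_score is being given hard 0/1 predictions here, but AUC is really a ranking metric, so it is usually computed from the model's continuous scores instead. A small variant that does so (same model, same test set):

# AUC from continuous decision scores rather than hard labels
scores = model.decision_function(vect.transform(X_test))
print('AUC (scores):', roc_auc_score(y_test, scores))
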
In [15]:
# Get the feature names as a numpy array
feature_names = np.array(vect.get_feature_names())

# Sort the coefficients from the model
sorted_coef_index = model.coef_[0].argsort()

# Find the 10 smallest and 10 largest coefficients
# The 10 largest coefficients are being indexed using [:-11:-1] 
# so the list returned is in order of largest to smallest
print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))
Smallest Coefs:
['worst' 'false' 'worthless' 'junk' 'garbage' 'mony' 'useless' 'messing'
 'unusable' 'horrible']

Largest Coefs: 
['excelent' 'excelente' 'exelente' 'excellent' 'loving' 'loves' 'efficient'
 'perfecto' 'amazing' 'love']
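
Note how spelling variants and Spanish forms such as 'excelent', 'excelente', and 'exelente' show up as separate features: the bag-of-words model has no notion that they are the same word, yet each is common enough in positive reviews to earn a large coefficient on its own.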

Tfidf¶

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit the TfidfVectorizer to the training data, specifying a minimum document frequency of 5
vect = TfidfVectorizer(min_df=5).fit(X_train)
len(vect.get_feature_names())
Out[16]:
17951
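
Setting min_df=5 drops any term that appears in fewer than five training documents, shrinking the vocabulary from 53,216 to 17,951 features. Tf-idf then weights each remaining term's count by how rare the term is across the corpus, so ubiquitous words are down-weighted. A tiny illustration on toy sentences of my own:

from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = ['phone works great', 'phone broke', 'great battery']
toy_tfidf = TfidfVectorizer().fit(toy_docs)
# 'phone' and 'great' occur in two documents each, so they receive
# lower idf weights than the rarer 'works', 'broke', and 'battery'
print(list(zip(toy_tfidf.get_feature_names(), toy_tfidf.idf_)))
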
In [17]:
X_train_vectorized = vect.transform(X_train)

model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

predictions = model.predict(vect.transform(X_test))

print('AUC: ', roc_auc_score(y_test, predictions))
AUC:  0.926610066675
In [18]:
feature_names = np.array(vect.get_feature_names())

sorted_tfidf_index = X_train_vectorized.max(0).toarray()[0].argsort()

print('Smallest tfidf:\n{}\n'.format(feature_names[sorted_tfidf_index[:10]]))
print('Largest tfidf: \n{}'.format(feature_names[sorted_tfidf_index[:-11:-1]]))
Smallest tfidf:
['commenter' 'pthalo' 'warmness' 'storageso' 'aggregration' '1300'
 '625nits' 'a10' 'submarket' 'brawns']

Largest tfidf: 
['defective' 'batteries' 'gooood' 'epic' 'luis' 'goood' 'basico'
 'aceptable' 'problems' 'excellant']
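
These lists rank features by their maximum tf-idf value across the training documents. Terms at the small end appear only rarely and only inside long reviews, so their weight never gets large; terms at the large end either dominate a short review or are repeated many times within a single one.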
In [19]:
sorted_coef_index = model.coef_[0].argsort()

print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))
Smallest Coefs:
['not' 'worst' 'useless' 'disappointed' 'terrible' 'return' 'waste' 'poor'
 'horrible' 'doesn']

Largest Coefs: 
['love' 'great' 'excellent' 'perfect' 'amazing' 'awesome' 'perfectly'
 'easy' 'best' 'loves']
In [20]:
# These reviews are treated the same by our current model
print(model.predict(vect.transform(['not an issue, phone is working',
                                    'an issue, phone is not working'])))
[0 0]

n-grams¶

In [21]:
# Fit the CountVectorizer to the training data, specifying a minimum
# document frequency of 5 and extracting 1-grams and 2-grams
vect = CountVectorizer(min_df=5, ngram_range=(1,2)).fit(X_train)

X_train_vectorized = vect.transform(X_train)

len(vect.get_feature_names())
Out[21]:
198917
In [22]:
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

predictions = model.predict(vect.transform(X_test))

print('AUC: ', roc_auc_score(y_test, predictions))
AUC:  0.967143758101
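
Adding bigrams lifts the AUC from about 0.927 to 0.967. Both the n-gram range and the logistic regression's regularization strength C could in principle be tuned rather than left at their defaults; a sketch of how that might look with a Pipeline and GridSearchCV (the grid values are illustrative, and this search is expensive on the full data):

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Illustrative search over n-gram range and regularization strength
pipe = Pipeline([('vect', CountVectorizer(min_df=5)),
                 ('clf', LogisticRegression())])
param_grid = {'vect__ngram_range': [(1, 1), (1, 2)],
              'clf__C': [0.1, 1, 10]}
grid = GridSearchCV(pipe, param_grid, scoring='roc_auc', cv=3)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
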
In [23]:
feature_names = np.array(vect.get_feature_names())

sorted_coef_index = model.coef_[0].argsort()

print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))
Smallest Coefs:
['no good' 'worst' 'junk' 'not good' 'not happy' 'horrible' 'garbage'
 'terrible' 'looks ok' 'nope']

Largest Coefs: 
['not bad' 'excelent' 'excelente' 'excellent' 'perfect' 'no problems'
 'exelente' 'awesome' 'no issues' 'great']
In [24]:
# These reviews are now correctly identified
print(model.predict(vect.transform(['not an issue, phone is working',
                                    'an issue, phone is not working'])))
[1 0]
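
To score new raw text without carrying the vectorizer and model around separately, the two can be bundled into a single Pipeline and saved to disk. A minimal sketch (the filename is just an example):

import joblib
from sklearn.pipeline import make_pipeline

# Bundle the vectorizer and classifier so raw strings can be scored directly
sentiment_clf = make_pipeline(CountVectorizer(min_df=5, ngram_range=(1, 2)),
                              LogisticRegression())
sentiment_clf.fit(X_train, y_train)
print(sentiment_clf.predict(['screen cracked after a week']))  # score a new, unseen review
joblib.dump(sentiment_clf, 'sentiment_model.joblib')  # example filename
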

This notebook and all analysis were created by Mohammad Sayem Chowdhury as a personal data science showcase.

Thank you for exploring my approach to sentiment analysis! If you have any feedback or suggestions, feel free to reach out.