Sentiment Analysis

Python · scikit-learn · BERT

Overview

A sentiment analysis pipeline built from scratch to classify IMDB movie reviews as positive or negative. The project involved a full comparative study of traditional NLP feature engineering methods against a fine-tuned BERT model, submitted as coursework and graded 82%.

Dataset

The IMDB dataset (aclImdb) contains 50,000 movie reviews split evenly between positive and negative sentiment. Reviews rated 7–10 were labelled positive and those rated 1–4 negative. The dataset was split into training (75%), test (12.5%), and evaluation (12.5%) sets with stratified sampling.
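The 75/12.5/12.5 stratified split described above can be sketched with scikit-learn (toy data stands in for the actual aclImdb file loading, which is omitted):

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the 50,000 IMDB reviews and their labels.
reviews = [f"review {i}" for i in range(1000)]
labels = [i % 2 for i in range(1000)]  # 0 = negative, 1 = positive

# First carve off 75% for training, stratified on the label...
X_train, X_rest, y_train, y_rest = train_test_split(
    reviews, labels, train_size=0.75, stratify=labels, random_state=0)
# ...then split the remaining 25% evenly into test and evaluation sets.
X_test, X_eval, y_test, y_eval = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)

print(len(X_train), len(X_test), len(X_eval))  # 750 125 125
```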

Feature Engineering

A comprehensive preprocessing and feature extraction pipeline was implemented and evaluated across more than 1,400 hyperparameter combinations:

Preprocessing

  • Tokenisation — whitespace splitting with optional lowercasing
  • Stopword removal — using NLTK's English stopword list
  • Punctuation removal — regex-based stripping
  • Stemming — Porter Stemmer
  • Lemmatisation — WordNet Lemmatiser
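A minimal sketch of such a preprocessing function, using NLTK's Porter stemmer (which needs no corpus download); the inline stopword list is a small stand-in for NLTK's full English list, and the option names are illustrative, not the coursework API:

```python
import re
from nltk.stem import PorterStemmer  # pure-algorithm stemmer, no corpus needed

# Small inline stand-in for NLTK's English stopword list.
STOPWORDS = {"the", "a", "an", "is", "was", "and", "of", "to", "it"}

def preprocess(text, lowercase=True, remove_stopwords=True, stem=True):
    """Tokenise by whitespace, then optionally lowercase, strip punctuation
    with a regex, drop stopwords, and Porter-stem each token."""
    tokens = text.split()
    if lowercase:
        tokens = [t.lower() for t in tokens]
    tokens = [re.sub(r"[^\w\s]", "", t) for t in tokens]  # punctuation removal
    tokens = [t for t in tokens if t]
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    if stem:
        stemmer = PorterStemmer()
        tokens = [stemmer.stem(t) for t in tokens]
    return tokens

print(preprocess("The acting was great!"))
```

Swapping the stemmer for NLTK's `WordNetLemmatizer` gives the lemmatisation variant (that one does require the `wordnet` corpus download).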

Feature Extraction

  • N-grams — bigrams and trigrams with frequency-based filtering
  • Constituency Parsing — noun phrase extraction via PoS tagging and RegexpParser
  • TF-IDF — custom implementation with consistent vocabulary matrix
  • L2 TF-IDF — TF-IDF with manual L2 normalisation per document
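A custom TF-IDF with a shared vocabulary and manual per-document L2 normalisation might look like the following. This is an illustrative re-implementation, not the coursework code, and the smoothed IDF form is an assumption:

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Build a TF-IDF matrix over a consistent vocabulary, then L2-normalise
    each document row so its values sum to 1 in squared magnitude."""
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {tok: i for i, tok in enumerate(vocab)}
    n = len(docs)
    # Document frequency: how many documents each term appears in.
    df = Counter(tok for doc in docs for tok in set(doc))
    rows = []
    for doc in docs:
        tf = Counter(doc)
        row = [0.0] * len(vocab)
        for tok, count in tf.items():
            # Smoothed IDF (assumed form) avoids division by zero and log(1)=0.
            row[index[tok]] = (count / len(doc)) * math.log((1 + n) / (1 + df[tok]))
        # Manual L2 normalisation per document.
        norm = math.sqrt(sum(v * v for v in row))
        rows.append([v / norm for v in row] if norm else row)
    return vocab, rows

docs = [["good", "film"], ["bad", "film"], ["good", "good", "plot"]]
vocab, rows = tfidf_matrix(docs)
```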

Hyperparameter Search

Over 1,400 combinations of preprocessing method, stopword removal, punctuation stripping, lowercasing, frequency cut-off, n-gram size, and normalisation strategy were evaluated on a 10% data subset. The top-performing configurations were then re-evaluated on the full dataset.
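Such a search can be organised as a Cartesian product over an option grid. The grid below is a smaller illustrative stand-in for the actual 1,400+ combinations; every dimension name and value here is an assumption:

```python
from itertools import product

# Hypothetical grid mirroring the searched dimensions (names illustrative).
grid = {
    "preprocess": ["stem", "lemmatise", "none"],
    "stopwords": [True, False],
    "punctuation": [True, False],
    "lowercase": [True, False],
    "min_freq": [5, 10, 15],
    "ngram": [1, 2, 3],
    "norm": ["tfidf", "l2_tfidf"],
}

# One dict per configuration, covering the full Cartesian product.
combos = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(combos))  # 3 * 2 * 2 * 2 * 3 * 3 * 2 = 432 in this smaller grid
```

Each configuration would then be scored on the 10% subset (e.g. `max(combos, key=evaluate)` for some `evaluate` function) before re-running the leaders on the full dataset.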

Models

Naïve Bayes

A Naïve Bayes classifier was implemented from scratch using log-probability scoring with Laplace smoothing. The best configuration used:

  • Lemmatisation with stopword and punctuation removal
  • Trigrams filtered to frequency ≥ 5
  • Frequency cut-off of 15
  • L2 TF-IDF normalisation

Model                                   Accuracy
scikit-learn MultinomialNB (baseline)   ~83%
Custom Naïve Bayes (best config)        81%
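A minimal from-scratch multinomial Naïve Bayes with Laplace smoothing and log-probability scoring might look like this. It is a sketch, not the graded implementation, and it ignores unseen tokens at prediction time for brevity:

```python
import math
from collections import Counter

class NaiveBayes:
    """Multinomial Naïve Bayes: log priors plus summed log likelihoods,
    with add-one (Laplace) smoothing over the training vocabulary."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.vocab = {tok for doc in docs for tok in doc}
        self.priors = {c: math.log(labels.count(c) / len(labels))
                       for c in self.classes}
        counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            counts[label].update(doc)
        self.loglik = {}
        for c in self.classes:
            total = sum(counts[c].values()) + len(self.vocab)  # Laplace denominator
            self.loglik[c] = {t: math.log((counts[c][t] + 1) / total)
                              for t in self.vocab}
        return self

    def predict(self, doc):
        scores = {}
        for c in self.classes:
            # Sum log-probabilities; out-of-vocabulary tokens are skipped.
            scores[c] = self.priors[c] + sum(
                self.loglik[c][t] for t in doc if t in self.vocab)
        return max(scores, key=scores.get)

train = [["great", "film"], ["loved", "it"], ["awful", "plot"], ["bad", "film"]]
y = ["pos", "pos", "neg", "neg"]
clf = NaiveBayes().fit(train, y)
print(clf.predict(["great", "film"]))
```

Working in log space avoids the numerical underflow that multiplying many small probabilities would cause on review-length documents.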

BERT

Reviews were exported to JSON, and a pre-trained BERT model was fine-tuned on them using the transformers library. Its deep contextual embeddings significantly outperformed all bag-of-words approaches.
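A sketch of the export-then-fine-tune flow follows. The JSON record format, file names, model choice, and hyperparameters are all assumptions; the `fine_tune` function is defined but not executed, since it requires the transformers and datasets packages plus a model download:

```python
import json

# Assumed export format: one {"text": ..., "label": ...} record per review.
records = [
    {"text": "A wonderful, moving film.", "label": 1},
    {"text": "Two hours I will never get back.", "label": 0},
]
with open("reviews.json", "w") as f:
    json.dump(records, f)

def fine_tune(json_path, model_name="bert-base-uncased", epochs=2):
    """Fine-tune a pre-trained BERT classifier on the exported reviews
    (sketch only; not run here)."""
    from datasets import load_dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    dataset = load_dataset("json", data_files=json_path)["train"]
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Tokenise the raw text into input IDs and attention masks.
    dataset = dataset.map(
        lambda batch: tokenizer(batch["text"], truncation=True,
                                padding="max_length"),
        batched=True)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=2)
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="bert-imdb", num_train_epochs=epochs),
        train_dataset=dataset)
    trainer.train()
    return trainer
```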

Model                Accuracy
Naïve Bayes (best)   81%
BERT (fine-tuned)    89%

Key Findings

  • Lemmatisation consistently outperformed stemming for downstream classification
  • Trigrams with frequency filtering provided better signal than unigrams alone
  • L2 normalisation marginally improved Naïve Bayes accuracy over standard TF-IDF
  • Constituency parsing (noun phrases) underperformed n-gram approaches for sentiment
  • BERT's contextual representations provided an 8-point accuracy gain over the best traditional pipeline, demonstrating the limits of bag-of-words representations for nuanced sentiment

Conclusion

The project demonstrates that careful feature engineering can push traditional methods to competitive performance, but pre-trained transformer models remain substantially superior for sentiment classification. The structured hyperparameter search provided clear evidence for which preprocessing choices matter most in a classical NLP pipeline.
