Sentiment Analysis
Python
scikit-learn
BERT

A sentiment analysis pipeline built from scratch to classify IMDB movie reviews as positive or negative. The project comprises a full comparative study of traditional NLP feature-engineering methods against a fine-tuned BERT model; it was submitted as coursework and graded at 82%.
The IMDB dataset (aclImdb) contains 50,000 movie reviews split evenly between positive and negative sentiment. Reviews rated 7–10 were labelled positive and those rated 1–4 negative. The dataset was split into training (75%), test (12.5%), and evaluation (12.5%) sets with stratified sampling.
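A minimal sketch of one way to produce such a stratified 75/12.5/12.5 split with scikit-learn (the data and variable names below are illustrative stand-ins, not the coursework's actual code):

```python
from sklearn.model_selection import train_test_split

# Illustrative stand-ins for the 50,000 review texts and 0/1 labels.
reviews = [f"review {i}" for i in range(1000)]
labels = [i % 2 for i in range(1000)]

# 75% train, 25% held out, preserving the positive/negative balance.
X_train, X_hold, y_train, y_hold = train_test_split(
    reviews, labels, test_size=0.25, stratify=labels, random_state=42)

# Split the held-out 25% in half: 12.5% test and 12.5% evaluation.
X_test, X_eval, y_test, y_eval = train_test_split(
    X_hold, y_hold, test_size=0.5, stratify=y_hold, random_state=42)
```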
A comprehensive preprocessing and feature extraction pipeline was implemented. Over 1,400 combinations of preprocessing method (including RegexpParser), stopword removal, punctuation stripping, lowercasing, frequency cut-off, n-gram size, and normalisation strategy were evaluated on a 10% data subset. The top-performing configurations were then re-evaluated on the full dataset.
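A minimal sketch of this kind of exhaustive grid search, assuming a scikit-learn bag-of-words baseline; the grid axes and values here are illustrative, not the actual 1,400-combination grid:

```python
from itertools import product

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Tiny illustrative subset; the real search ran on 10% of the IMDB data.
texts = ["great film", "terrible plot", "loved it", "awful acting"] * 50
labels = [1, 0, 1, 0] * 50
tr_txt, te_txt, tr_y, te_y = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0)

# Hypothetical grid axes mirroring the search dimensions named above.
grid = {
    "lowercase": [True, False],
    "stop_words": [None, "english"],   # stopword removal on/off
    "ngram_range": [(1, 1), (1, 2)],   # n-gram size
    "min_df": [1, 2],                  # frequency cut-off
}

results = []
for values in product(*grid.values()):
    config = dict(zip(grid, values))
    # Vectorise under this configuration and score a quick baseline.
    vec = CountVectorizer(**config)
    model = MultinomialNB().fit(vec.fit_transform(tr_txt), tr_y)
    acc = accuracy_score(te_y, model.predict(vec.transform(te_txt)))
    results.append((acc, config))

# Keep the top configurations for re-evaluation on the full dataset.
top = sorted(results, key=lambda r: r[0], reverse=True)[:10]
```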
A Naïve Bayes classifier was implemented from scratch using log-probability scoring with Laplace smoothing. Its best configuration was benchmarked against scikit-learn's MultinomialNB:
| Model | Accuracy |
|---|---|
| scikit-learn MultinomialNB (baseline) | ~83% |
| Custom Naïve Bayes (best config) | 81% |
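A minimal sketch of the scoring rule described above, assuming a multinomial bag-of-words model; all names are illustrative, not the coursework's implementation:

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """docs: list of token lists; labels: parallel list of class names."""
    class_counts = Counter(labels)
    word_counts = {c: Counter() for c in class_counts}
    for tokens, c in zip(docs, labels):
        word_counts[c].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab

def predict_nb(tokens, class_counts, word_counts, vocab):
    total_docs = sum(class_counts.values())
    best_class, best_score = None, -math.inf
    for c, n_c in class_counts.items():
        # Log prior plus a sum of Laplace-smoothed log likelihoods:
        # log P(c) + sum_w log((count(w, c) + 1) / (total(c) + |V|)).
        score = math.log(n_c / total_docs)
        denom = sum(word_counts[c].values()) + len(vocab)
        for w in tokens:
            score += math.log((word_counts[c][w] + 1) / denom)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

docs = [["great", "movie"], ["boring", "plot"], ["great", "acting"]]
model = train_nb(docs, ["pos", "neg", "pos"])
print(predict_nb(["great", "plot"], *model))  # -> pos
```

Working in log space turns the product of per-word probabilities into a sum, avoiding floating-point underflow on long reviews, while the add-one smoothing keeps unseen words from zeroing out a class score.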
Reviews were exported to JSON and used to fine-tune a pre-trained BERT model with the Hugging Face transformers library. The deep contextual embeddings significantly outperformed all bag-of-words approaches:
| Model | Accuracy |
|---|---|
| Naïve Bayes (best) | 81% |
| BERT (fine-tuned) | 89% |
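A minimal sketch of the fine-tuning step with the transformers and datasets libraries, assuming each exported JSON record carries "text" and "label" fields; the file names and hyperparameters below are illustrative, not the coursework's exact settings:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Hypothetical file names; assumes records shaped {"text": ..., "label": ...}.
data = load_dataset("json", data_files={"train": "train.json",
                                        "test": "test.json"})

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    # Truncate long reviews so every example fits BERT's input window.
    return tokenizer(batch["text"], truncation=True, max_length=256,
                     padding="max_length")

data = data.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# Illustrative hyperparameters for a short fine-tuning run.
args = TrainingArguments(output_dir="bert-imdb", num_train_epochs=2,
                         per_device_train_batch_size=16, learning_rate=2e-5)

Trainer(model=model, args=args, train_dataset=data["train"],
        eval_dataset=data["test"]).train()
```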
The project demonstrates that careful feature engineering can push traditional methods to competitive performance, but pre-trained transformer models remain substantially superior for sentiment classification. The structured hyperparameter search provided clear evidence for which preprocessing choices matter most in a classical NLP pipeline.