Sentiment Analysis

Python · scikit-learn · BERT

Overview

A sentiment analysis pipeline built from scratch to classify IMDB movie reviews as positive or negative. The project involved a full comparative study of traditional NLP feature engineering methods against a fine-tuned BERT model, submitted as coursework and graded 82%.

Dataset

The IMDB dataset (aclImdb) contains 50,000 movie reviews split evenly between positive and negative sentiment. Reviews rated 7–10 were labelled positive and those rated 1–4 negative. The dataset was split into training (75%), test (12.5%), and evaluation (12.5%) sets with stratified sampling.
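The 75/12.5/12.5 stratified split described above can be sketched with scikit-learn (toy data stands in for the actual aclImdb file loading, which is omitted):

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the 50,000 IMDB reviews and their labels.
reviews = [f"review {i}" for i in range(1000)]
labels = [i % 2 for i in range(1000)]  # 0 = negative, 1 = positive

# First carve off 75% for training, stratified on the label...
X_train, X_rest, y_train, y_rest = train_test_split(
    reviews, labels, train_size=0.75, stratify=labels, random_state=0)
# ...then split the remaining 25% evenly into test and evaluation sets.
X_test, X_eval, y_test, y_eval = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)

print(len(X_train), len(X_test), len(X_eval))  # 750 125 125
```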

Feature Engineering

A comprehensive preprocessing and feature extraction pipeline was implemented and evaluated across more than 1,400 hyperparameter combinations:

Preprocessing

  • Tokenisation — whitespace splitting with optional lowercasing
  • Stopword removal — using NLTK's English stopword list
  • Punctuation removal — regex-based stripping
  • Stemming — Porter Stemmer
  • Lemmatisation — WordNet Lemmatiser
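A minimal sketch of such a preprocessing function, using NLTK's Porter stemmer (which needs no corpus download); the inline stopword list is a small stand-in for NLTK's full English list, and the option names are illustrative, not the coursework API:

```python
import re
from nltk.stem import PorterStemmer  # pure-algorithm stemmer, no corpus needed

# Small inline stand-in for NLTK's English stopword list.
STOPWORDS = {"the", "a", "an", "is", "was", "and", "of", "to", "it"}

def preprocess(text, lowercase=True, remove_stopwords=True, stem=True):
    """Tokenise by whitespace, then optionally lowercase, strip punctuation
    with a regex, drop stopwords, and Porter-stem each token."""
    tokens = text.split()
    if lowercase:
        tokens = [t.lower() for t in tokens]
    tokens = [re.sub(r"[^\w\s]", "", t) for t in tokens]  # punctuation removal
    tokens = [t for t in tokens if t]
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    if stem:
        stemmer = PorterStemmer()
        tokens = [stemmer.stem(t) for t in tokens]
    return tokens

print(preprocess("The acting was great!"))
```

Swapping the stemmer for NLTK's `WordNetLemmatizer` gives the lemmatisation variant (that one does require the `wordnet` corpus download).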

Feature Extraction

  • N-grams — bigrams and trigrams with frequency-based filtering
  • Constituency Parsing — noun phrase extraction via PoS tagging and RegexpParser
  • TF-IDF — custom implementation with consistent vocabulary matrix
  • L2 TF-IDF — TF-IDF with manual L2 normalisation per document
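A custom TF-IDF with a shared vocabulary and manual per-document L2 normalisation might look like the following. This is an illustrative re-implementation, not the coursework code, and the smoothed IDF form is an assumption:

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Build a TF-IDF matrix over a consistent vocabulary, then L2-normalise
    each document row so its values sum to 1 in squared magnitude."""
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {tok: i for i, tok in enumerate(vocab)}
    n = len(docs)
    # Document frequency: how many documents each term appears in.
    df = Counter(tok for doc in docs for tok in set(doc))
    rows = []
    for doc in docs:
        tf = Counter(doc)
        row = [0.0] * len(vocab)
        for tok, count in tf.items():
            # Smoothed IDF (assumed form) avoids division by zero and log(1)=0.
            row[index[tok]] = (count / len(doc)) * math.log((1 + n) / (1 + df[tok]))
        # Manual L2 normalisation per document.
        norm = math.sqrt(sum(v * v for v in row))
        rows.append([v / norm for v in row] if norm else row)
    return vocab, rows

docs = [["good", "film"], ["bad", "film"], ["good", "good", "plot"]]
vocab, rows = tfidf_matrix(docs)
```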

Hyperparameter Search

Over 1,400 combinations of preprocessing method, stopword removal, punctuation stripping, lowercasing, frequency cut-off, n-gram size, and normalisation strategy were evaluated on a 10% data subset. The top-performing configurations were then re-evaluated on the full dataset.
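Such a search can be organised as a Cartesian product over an option grid. The grid below is a smaller illustrative stand-in for the actual 1,400+ combinations; every dimension name and value here is an assumption:

```python
from itertools import product

# Hypothetical grid mirroring the searched dimensions (names illustrative).
grid = {
    "preprocess": ["stem", "lemmatise", "none"],
    "stopwords": [True, False],
    "punctuation": [True, False],
    "lowercase": [True, False],
    "min_freq": [5, 10, 15],
    "ngram": [1, 2, 3],
    "norm": ["tfidf", "l2_tfidf"],
}

# One dict per configuration, covering the full Cartesian product.
combos = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(combos))  # 3 * 2 * 2 * 2 * 3 * 3 * 2 = 432 in this smaller grid
```

Each configuration would then be scored on the 10% subset (e.g. `max(combos, key=evaluate)` for some `evaluate` function) before re-running the leaders on the full dataset.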

Models

Naïve Bayes

A Naïve Bayes classifier was implemented from scratch using log-probability scoring with Laplace smoothing. The best configuration used:

  • Lemmatisation with stopword and punctuation removal
  • Trigrams filtered to frequency ≥ 5
  • Frequency cut-off of 15
  • L2 TF-IDF normalisation

Model                                   Accuracy
scikit-learn MultinomialNB (baseline)   ~83%
Custom Naïve Bayes (best config)        81%
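A minimal from-scratch multinomial Naïve Bayes with Laplace smoothing and log-probability scoring might look like this. It is a sketch, not the graded implementation, and it ignores unseen tokens at prediction time for brevity:

```python
import math
from collections import Counter

class NaiveBayes:
    """Multinomial Naïve Bayes: log priors plus summed log likelihoods,
    with add-one (Laplace) smoothing over the training vocabulary."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.vocab = {tok for doc in docs for tok in doc}
        self.priors = {c: math.log(labels.count(c) / len(labels))
                       for c in self.classes}
        counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            counts[label].update(doc)
        self.loglik = {}
        for c in self.classes:
            total = sum(counts[c].values()) + len(self.vocab)  # Laplace denominator
            self.loglik[c] = {t: math.log((counts[c][t] + 1) / total)
                              for t in self.vocab}
        return self

    def predict(self, doc):
        scores = {}
        for c in self.classes:
            # Sum log-probabilities; out-of-vocabulary tokens are skipped.
            scores[c] = self.priors[c] + sum(
                self.loglik[c][t] for t in doc if t in self.vocab)
        return max(scores, key=scores.get)

train = [["great", "film"], ["loved", "it"], ["awful", "plot"], ["bad", "film"]]
y = ["pos", "pos", "neg", "neg"]
clf = NaiveBayes().fit(train, y)
print(clf.predict(["great", "film"]))
```

Working in log space avoids the numerical underflow that multiplying many small probabilities would cause on review-length documents.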

BERT

Reviews were exported to JSON, and a pre-trained BERT model was fine-tuned on them using the transformers library. Its deep contextual embeddings significantly outperformed all bag-of-words approaches.
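A sketch of the export-then-fine-tune flow follows. The JSON record format, file names, model choice, and hyperparameters are all assumptions; the `fine_tune` function is defined but not executed, since it requires the transformers and datasets packages plus a model download:

```python
import json

# Assumed export format: one {"text": ..., "label": ...} record per review.
records = [
    {"text": "A wonderful, moving film.", "label": 1},
    {"text": "Two hours I will never get back.", "label": 0},
]
with open("reviews.json", "w") as f:
    json.dump(records, f)

def fine_tune(json_path, model_name="bert-base-uncased", epochs=2):
    """Fine-tune a pre-trained BERT classifier on the exported reviews
    (sketch only; not run here)."""
    from datasets import load_dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    dataset = load_dataset("json", data_files=json_path)["train"]
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Tokenise the raw text into input IDs and attention masks.
    dataset = dataset.map(
        lambda batch: tokenizer(batch["text"], truncation=True,
                                padding="max_length"),
        batched=True)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=2)
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="bert-imdb", num_train_epochs=epochs),
        train_dataset=dataset)
    trainer.train()
    return trainer
```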

Model                Accuracy
Naïve Bayes (best)   81%
BERT (fine-tuned)    89%

Key Findings

  • Lemmatisation consistently outperformed stemming for downstream classification
  • Trigrams with frequency filtering provided better signal than unigrams alone
  • L2 normalisation marginally improved Naïve Bayes accuracy over standard TF-IDF
  • Constituency parsing (noun phrases) underperformed n-gram approaches for sentiment
  • BERT's contextual representations provided an 8-point accuracy gain over the best traditional pipeline, demonstrating the limits of bag-of-words representations for nuanced sentiment

Conclusion

The project demonstrates that careful feature engineering can push traditional methods to competitive performance, but pre-trained transformer models remain substantially superior for sentiment classification. The structured hyperparameter search provided clear evidence for which preprocessing choices matter most in a classical NLP pipeline.
