Vehicle Recognition

Python · PyTorch

Overview

This project was my Bachelor's dissertation. I set out to tackle Vehicle Make and Model Recognition (VMMR) — essentially teaching a model to identify the exact make and model of a car from an image. The real-world challenge I focused on was: what happens when you only have a handful of training images for a new vehicle? That's the few-shot learning problem, and it's a very real constraint in traffic monitoring, law enforcement, and autonomous systems.

Research Questions

The dissertation revolved around four questions:

  1. How do established few-shot methods hold up on a vehicle recognition task?
  2. Does the quality of a part detector actually affect the graph representations built from it?
  3. What makes it hard to turn 2D vehicle images into useful graph structures?
  4. What would a better approach look like going forward?

System Architecture

[Figure: architecture diagram showing the two-pipeline system, with off-the-shelf few-shot classification (top) and the novel PartGraph graph representation approach (bottom)]

The system has two tracks. The first benchmarks well-known few-shot methods against cropped Stanford Cars images. The second is my own contribution — PartGraph — a pipeline that detects vehicle parts, builds a graph from them, and runs a GNN classifier on top.
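
To make the PartGraph track concrete, here is a minimal sketch of the kind of graph classifier that sits at the end of the pipeline. The dissertation's actual GNN architecture isn't reproduced here; the two-round message-passing layout, layer sizes, and mean-pooling readout below are illustrative assumptions, not the exact model.

```python
import torch
import torch.nn as nn

class SimpleGNNClassifier(nn.Module):
    """Minimal message-passing classifier over a single vehicle part graph.

    x:   (num_parts, feat_dim) node features, one row per detected part
    adj: (num_parts, num_parts) row-normalised adjacency matrix
    """
    def __init__(self, feat_dim: int, hidden_dim: int, num_classes: int):
        super().__init__()
        self.lin1 = nn.Linear(feat_dim, hidden_dim)
        self.lin2 = nn.Linear(hidden_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # Two rounds of "average my neighbours, then transform".
        h = torch.relu(self.lin1(adj @ x))
        h = torch.relu(self.lin2(adj @ h))
        # Mean-pool the node embeddings into one vehicle embedding, then classify.
        return self.head(h.mean(dim=0))
```

Each detected vehicle becomes one graph: node features come from the part detector (a graph-construction sketch appears later, in the detection results section), and the pooled embedding is classified directly.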

Off-the-Shelf Few-Shot Methods

Traditional Meta-Learning

I tested a range of standard meta-learning approaches on Stanford Cars in a 5-way setup:

| Model | 5-way 1-shot | 5-way 5-shot |
|---|---|---|
| Meta-Baseline | — | 38.28 ± 1.91 |
| Baseline++ | 23.92 ± 1.17 | 31.97 ± 1.24 |
| ProtoNet | 27.64 ± 1.33 | 27.51 ± 1.13 |
| ProtoNet + Random Crop | 35.35 ± 1.21 | — |
| RelationNet | 22.03 ± 0.93 | 25.32 ± 1.02 |
| Negative Margin | 23.63 ± 1.11 | 27.97 ± 1.19 |

Adding random crops to ProtoNet made a big difference — jumping from 27.64% to 35.35% on 1-shot. That said, both versions overfit; validation accuracy started dropping past the halfway point of training. Meta-Baseline was skipped for 1-shot due to the training time involved.
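
For context, ProtoNet's core computation is small enough to sketch in a few lines: embed the support images, average each class into a prototype, and classify queries by distance to those prototypes. The random-crop variant only changes the input transform; the crop scale below is an illustrative guess, not the setting used in the dissertation.

```python
import torch
from torchvision import transforms

def protonet_logits(support: torch.Tensor,
                    support_labels: torch.Tensor,
                    query: torch.Tensor,
                    n_way: int) -> torch.Tensor:
    """Prototypical Networks: classify queries by distance to class prototypes.

    support: (n_way * k_shot, dim) embedded support images
    query:   (num_query, dim) embedded query images
    """
    # Prototype = mean embedding of each class's support examples.
    prototypes = torch.stack(
        [support[support_labels == c].mean(dim=0) for c in range(n_way)]
    )
    # Negative squared Euclidean distance acts as the logits.
    return -torch.cdist(query, prototypes).pow(2)

# The random-crop variant augments each image before the backbone embeds it;
# the scale range here is illustrative, not the dissertation's exact value.
random_crop = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.5, 1.0)),
    transforms.ToTensor(),
])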

CLIP-Based Methods

CLIP-based adapters blew the traditional methods out of the water:

| Shots | Zero-Shot CLIP | Tip-Adapter | Tip-Adapter-F |
|---|---|---|---|
| 5 | 55.64% | 61.98% | 66.35% |
| 8 | 55.64% | 62.93% | 68.93% |
| 12 | 55.64% | 64.87% | 72.94% |
| 16 | 55.64% | 66.75% | 74.97% |

Tip-Adapter-F hit 74.97% at 16-shot — nearly double what ProtoNet managed. The gap between CLIP-based and traditional meta-learning was pretty striking.
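
The Tip-Adapter family works by caching CLIP features of the few-shot images as a key-value store and blending the cache's predictions with zero-shot CLIP. A minimal sketch of that blending follows; `alpha` and `beta` are the usual Tip-Adapter hyperparameters, and the values shown are illustrative defaults rather than the ones tuned for this project. Tip-Adapter-F additionally turns `cache_keys` into a learnable parameter and fine-tunes it, which is where the extra accuracy comes from.

```python
import torch

def tip_adapter_logits(q, cache_keys, cache_values, clip_weights,
                       alpha=1.0, beta=5.5):
    """Tip-Adapter: training-free few-shot adaptation of CLIP.

    q:            (B, D)  L2-normalised CLIP features of test images
    cache_keys:   (NK, D) L2-normalised features of the few-shot images
    cache_values: (NK, C) one-hot labels of the few-shot images
    clip_weights: (D, C)  CLIP text embeddings of the class prompts
    """
    # Zero-shot CLIP logits from image-text similarity.
    zero_shot = 100.0 * q @ clip_weights
    # Affinity between test features and the cached support features.
    affinity = torch.exp(-beta * (1.0 - q @ cache_keys.t()))
    # Blend cache-based predictions with the zero-shot prior.
    return zero_shot + alpha * (affinity @ cache_values)
```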

PartGraph: My Novel Approach

Car Parts Dataset

For training the part detectors, I used a dataset with 30,772 annotated instances across 21 vehicle part categories:

| Category | Instances | Category | Instances | Category | Instances |
|---|---|---|---|---|---|
| Back Bumper | 909 | Back Door | 1,425 | Back Wheel | 1,677 |
| Back Window | 2,394 | Back Windshield | 574 | Fender | 1,820 |
| Front Bumper | 1,358 | Front Door | 1,779 | Front Wheel | 1,723 |
| Front Window | 1,859 | Grille | 1,079 | Headlight | 1,742 |
| Hood | 1,375 | License Plate | 743 | Mirror | 1,892 |
| Quarter Panel | 1,659 | Rocker Panel | 1,677 | Roof | 1,531 |
| Tail Light | 1,504 | Trunk | 852 | Windshield | 1,200 |

Instance Segmentation Results (Mask mAP)

| Model | mAP | mAP50 | mAP75 | mAPs | mAPm | mAPl |
|---|---|---|---|---|---|---|
| Mask R-CNN | 31.10% | 46.00% | 36.00% | 12.50% | 24.60% | 36.60% |
| RTMDet | 37.90% | 49.50% | 41.00% | 4.60% | 26.20% | 44.80% |
| YOLOv5 | 32.40% | 48.00% | 33.50% | 3.40% | 24.10% | 38.70% |

Segmentation quality was poor across all three models, so I switched to bounding boxes for graph construction instead, as using noisy masks would have just polluted the node features.
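
Concretely, switching to boxes means each detected part becomes a node whose features mix geometry with appearance, and edges link spatially close parts. Here is a minimal sketch, assuming normalised xyxy boxes and per-part appearance features; the distance threshold and feature layout are my illustrative choices, not the dissertation's exact construction.

```python
import torch

def boxes_to_graph(boxes: torch.Tensor, part_feats: torch.Tensor,
                   dist_thresh: float = 0.25):
    """Turn part detections into a graph: one node per box, with edges
    between parts whose centres are close in the image.

    boxes:      (N, 4) xyxy boxes, normalised to [0, 1] by image size
    part_feats: (N, D) appearance features for each detected part
    Returns node features and a dense, row-normalised adjacency matrix
    (self-loops are included, since each box is distance 0 from itself).
    """
    centres = (boxes[:, :2] + boxes[:, 2:]) / 2          # (N, 2)
    dists = torch.cdist(centres, centres)                # (N, N)
    adj = (dists < dist_thresh).float()                  # spatial adjacency
    # Concatenate geometry (centre + size) with appearance features.
    sizes = boxes[:, 2:] - boxes[:, :2]
    x = torch.cat([centres, sizes, part_feats], dim=1)
    # Row-normalise so each node averages over its neighbours.
    adj = adj / adj.sum(dim=1, keepdim=True)
    return x, adj
```

The `x, adj` pair produced here plugs straight into the classifier sketch from the System Architecture section.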

Bounding Box Detection Results (mAP)

| Model | mAP | mAP50 | mAP75 | mAPs | mAPm | mAPl |
|---|---|---|---|---|---|---|
| Mask R-CNN | 69.80% | 91.30% | 77.70% | 28.20% | 59.00% | 75.30% |
| RTMDet | 63.70% | 87.20% | 71.20% | 17.20% | 51.30% | 70.00% |
| YOLOv5 | 59.50% | 83.90% | 66.20% | 6.80% | 44.70% | 68.40% |
| Rank-DETR | 76.14% | 94.08% | 83.76% | 45.83% | 64.52% | 83.87% |

Rank-DETR came out on top with 76.14% mAP, making it the strongest candidate for feeding into PartGraph.

Full Results Comparison

| Method | 1-shot | 5-shot | 8-shot | 12-shot | 16-shot |
|---|---|---|---|---|---|
| **Traditional Few-Shot (5-way)** | | | | | |
| Meta-Baseline | — | 38.28 | — | — | — |
| Baseline++ | 23.92 | 31.97 | — | — | — |
| ProtoNet | 27.64 | 27.51 | — | — | — |
| ProtoNet + Random Crop | 35.35 | — | — | — | — |
| RelationNet | 22.03 | 25.32 | — | — | — |
| Negative Margin | 23.63 | 27.97 | — | — | — |
| **CLIP-Based** | | | | | |
| Zero-Shot CLIP | 55.64 | 55.64 | 55.64 | 55.64 | 55.64 |
| Tip-Adapter | — | 61.98 | 62.93 | 64.87 | 66.75 |
| Tip-Adapter-F | — | 66.35 | 68.93 | 72.94 | 74.97 |
| **PartGraph (5-way)** | | | | | |
| CLIP + RTMDet | — | 20.11 | — | — | — |
| CLIP + Rank-DETR | — | 19.89 | — | — | — |
| CLIP + PCA + Rank-DETR | — | 19.77 | — | — | — |
| CLIP + Global-Local + Rank-DETR | — | 19.61 | — | — | — |
| **State-of-the-Art (Literature)** | | | | | |
| Liu et al. (2024)* | 91.37 | 98.63 | — | — | — |
| Chen et al. (2020a)* | 73.15 | 91.89 | — | — | — |
| Li et al. (2019)* | 61.51 | 89.60 | — | — | — |
| Li et al. (2023)* | 76.81 | 88.21 | — | — | — |

Values marked * are taken from the literature, not reproduced here.

Key Findings

  • CLIP-based adapters are the clear winner for low-data vehicle recognition — Tip-Adapter-F at 74.97% (16-shot) was the standout result
  • Graph-based approaches face real, fundamental obstacles: sparse detections, inconsistent part localisation, and vehicle graphs that simply don't look that different from one class to another
  • The part detector matters a lot — Rank-DETR's better mAP directly produced cleaner graph structures
  • PartGraph underperformed, but learning from its failure offers insight into how to tackle this problem in future work

Conclusion

The dissertation established a strong few-shot baseline for VMMR and introduced a graph-based approach that, while not competitive, demonstrates a new direction for vehicle recognition.

Links

Paper