Engineering Case Study · Deep Learning · Computer Vision

Multi-Task CNN for Age & Gender Prediction

Comparing ResNet18, EfficientNet-B0, and ConvNeXt-Tiny for simultaneous age regression and gender classification. Kaggle harmonic score: 0.821.

February 2025
3 min read
Devkumar Patel
Multi-Task Learning · PyTorch · ConvNeXt · Kaggle

Problem

Given a facial image, predict two targets simultaneously:

  • Age — continuous regression (range 0–100)
  • Gender — binary classification

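The core challenge: regression and classification losses can have conflicting gradient dynamics when they share a backbone. The gradient of MSE with respect to the prediction grows linearly with the error and is unbounded, while the gradient of binary cross-entropy with respect to the logit never exceeds 1 in magnitude, so an unscaled age loss can dominate the updates to the shared features.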
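To make the mismatch concrete, here is a small sketch using the closed-form per-sample gradients: 2(pred − target) for squared error and sigmoid(logit) − label for binary cross-entropy. The sample values are illustrative assumptions, not numbers from the experiments.

```python
import math

def mse_grad(pred: float, target: float) -> float:
    """Derivative of (pred - target)**2 with respect to pred."""
    return 2.0 * (pred - target)

def bce_logit_grad(logit: float, label: float) -> float:
    """Derivative of sigmoid binary cross-entropy with respect to the logit."""
    return 1.0 / (1.0 + math.exp(-logit)) - label

# Illustrative case: an age prediction off by 20 years versus
# a confidently wrong gender logit.
age_grad = mse_grad(pred=50.0, target=30.0)
gender_grad = bce_logit_grad(logit=4.0, label=0.0)

print(f"MSE gradient on age:    {age_grad:.2f}")
print(f"BCE gradient on gender: {gender_grad:.2f}")
```

Even a badly wrong gender logit yields a gradient below 1, while the age gradient scales with the size of the error in years.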


Architecture

Shared backbone → dual MLP heads. Three CNN families tested under identical conditions:

Model | Params | Kaggle Score
ResNet18 | 11.5M | 0.763
EfficientNet-B0 | 4.8M | 0.802
ConvNeXt-Tiny 🥇 | 28.3M | 0.821

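ConvNeXt-Tiny won with 0.821, ahead of EfficientNet-B0 (0.802) and ResNet18 (0.763). The likely reason is its larger effective receptive field, which captures subtle spatial patterns (skin texture, jaw geometry) better than the older architectures, though it is also the largest of the three models at 28.3M parameters.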
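The shared-backbone, dual-head layout can be sketched roughly as follows. This is a minimal illustration, not the actual model: the tiny convolutional backbone is a stand-in for ResNet18, EfficientNet-B0, or ConvNeXt-Tiny, and the head depth and hidden width (256) are assumptions.

```python
import torch
from torch import nn

class MultiTaskNet(nn.Module):
    """Shared feature extractor feeding an age head and a gender head."""

    def __init__(self, backbone: nn.Module, feat_dim: int, hidden: int = 256):
        super().__init__()
        self.backbone = backbone
        # Two small MLP heads; depth and width are illustrative choices.
        self.age_head = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )
        self.gender_head = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, x: torch.Tensor):
        feats = self.backbone(x)                           # (N, feat_dim)
        age = self.age_head(feats).squeeze(1)              # (N,) regression output
        gender_logit = self.gender_head(feats).squeeze(1)  # (N,) raw logit
        return age, gender_logit

# Toy stand-in backbone so the sketch runs without pretrained weights.
toy_backbone = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
model = MultiTaskNet(toy_backbone, feat_dim=16)
age, gender_logit = model(torch.randn(4, 3, 64, 64))
print(age.shape, gender_logit.shape)
```

Swapping architectures then only means changing `backbone` and `feat_dim`, which is what keeps a comparison across CNN families on equal footing.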


Key Engineering Decisions

Sampling: Age distributions are heavily skewed toward the 20s and 30s, so naive training biases predictions toward the mean. Used WeightedRandomSampler with inverse bin-frequency weights to flatten the age distribution seen in each epoch.
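A minimal sketch of inverse bin-frequency sampling, assuming 10-year age bins and a small synthetic age array (neither is stated in the write-up):

```python
import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

def inverse_bin_weights(ages: np.ndarray, bin_width: int = 10) -> np.ndarray:
    """Per-sample weight = 1 / (number of samples sharing the same age bin)."""
    bins = (ages // bin_width).astype(int)
    counts = np.bincount(bins)
    return 1.0 / counts[bins]

# Synthetic, skewed ages: most samples fall in the 20s.
ages = np.array([22, 24, 25, 27, 28, 29, 31, 33, 5, 67])
weights = inverse_bin_weights(ages)

# Each occupied bin now carries the same total sampling mass.
sampler = WeightedRandomSampler(
    weights=torch.as_tensor(weights, dtype=torch.double),
    num_samples=len(ages),
    replacement=True,
)
sampled_indices = list(sampler)
print(weights.round(3))
```

The sampler would be passed to a `DataLoader` through its `sampler` argument, in place of `shuffle=True`. Because sampling is with replacement, rare ages are repeated within an epoch, so this pairs naturally with augmentation.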

Loss: SmoothL1Loss for age (robust to mislabeled outliers, whose errors MSE would square) plus BCEWithLogitsLoss(pos_weight=1.1) for gender. Summing the two losses with a static 1:1 weighting worked; dynamic weighting added instability without any gain.
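The combined objective might look like the sketch below. The function name `multitask_loss` and the sample tensors are hypothetical, and the age targets are assumed to be scaled to [0, 1], which the write-up hints at but does not specify.

```python
import torch
from torch import nn

age_criterion = nn.SmoothL1Loss()
# pos_weight must be a tensor; 1.1 slightly up-weights the positive class.
gender_criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([1.1]))

def multitask_loss(age_pred, age_true, gender_logit, gender_true):
    """Static 1:1 sum of the age and gender losses."""
    return (
        age_criterion(age_pred, age_true)
        + gender_criterion(gender_logit, gender_true)
    )

# Hypothetical batch of two samples (ages assumed scaled to [0, 1]).
age_pred = torch.tensor([0.30, 0.55])
age_true = torch.tensor([0.25, 0.60])
gender_logit = torch.tensor([2.0, -1.0])
gender_true = torch.tensor([1.0, 0.0])

loss = multitask_loss(age_pred, age_true, gender_logit, gender_true)
print(loss.item())
```

Note that `BCEWithLogitsLoss` expects raw logits and float targets, so the gender head should not apply a sigmoid during training.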

Resize strategy: LongestMaxSize → PadIfNeeded → CenterCrop instead of a naive resize to a fixed rectangle. Preserving aspect ratio is critical: stretching or squashing facial geometry degrades the model's ability to pick up fine age and gender cues.
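To show what the three-step pipeline does geometrically, here is a NumPy-only sketch that mimics the effect without depending on the augmentation library. The function name `letterbox`, the 224-pixel target, and the black padding are assumptions; a real pipeline would also interpolate rather than use nearest-neighbour lookup.

```python
import numpy as np

def letterbox(image: np.ndarray, size: int) -> np.ndarray:
    """Scale the longest side to `size`, then pad to a size x size square."""
    h, w = image.shape[:2]
    scale = size / max(h, w)
    new_h, new_w = max(1, round(h * scale)), max(1, round(w * scale))

    # Nearest-neighbour resize via index lookup (aspect ratio preserved).
    rows = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    resized = image[rows][:, cols]

    # Centre the resized image on a black square canvas.
    canvas = np.zeros((size, size) + image.shape[2:], dtype=image.dtype)
    top, left = (size - new_h) // 2, (size - new_w) // 2
    canvas[top:top + new_h, left:left + new_w] = resized
    return canvas

# A tall 400x200 white image becomes 224x112, padded left and right.
tall = np.full((400, 200, 3), 255, dtype=np.uint8)
out = letterbox(tall, size=224)
print(out.shape)
```

Here the padding already yields an exact square, so the final centre crop is a no-op; it matters only when the padded image ends up larger than the target.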

TTA: Two inference passes (original + horizontal flip), averaged. A cheap, reliable score gain of 0.01–0.02.
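A sketch of two-pass flip TTA, assuming the model returns an (age, gender logit) pair for an NCHW batch. `ToyModel` is a hypothetical stand-in, and averaging gender probabilities rather than logits is an assumption, since the write-up does not say which was used.

```python
import torch
from torch import nn

@torch.no_grad()
def predict_with_flip_tta(model: nn.Module, images: torch.Tensor):
    """Average predictions over the original batch and its horizontal mirror."""
    model.eval()
    age_a, logit_a = model(images)
    age_b, logit_b = model(torch.flip(images, dims=[3]))  # flip the width axis
    age = (age_a + age_b) / 2
    gender_prob = (torch.sigmoid(logit_a) + torch.sigmoid(logit_b)) / 2
    return age, gender_prob

class ToyModel(nn.Module):
    """Hypothetical stand-in returning (age, gender_logit) per image."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(3 * 8 * 8, 2)

    def forward(self, x):
        out = self.fc(x.flatten(1))
        return out[:, 0], out[:, 1]

age, gender_prob = predict_with_flip_tta(ToyModel(), torch.randn(4, 3, 8, 8))
print(age.shape, gender_prob.shape)
```

Horizontal flipping is a safe choice for faces because a mirrored face is still a plausible face; vertical flips or large rotations would not be.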


What Broke

Bug | Root Cause | Fix
Age predictions collapsed to ~30 | Class imbalance ignored | WeightedRandomSampler
Harmonic score oscillated wildly early | Dual tasks fighting for gradient share | Hard-coded patience bump for early stopping
Inference predictions wrong | Age scale not reverted post-prediction | Fixed inside loss function scope
Checkpoint wouldn't load at inference | PyTorch Lightning adds a "model." key prefix | Custom state_dict key parser
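For the checkpoint issue in the last row, a key parser can be as small as the sketch below. The helper name `strip_prefix` and the example keys are hypothetical. The prefix typically comes from the attribute name under which the network is stored inside the Lightning module.

```python
from collections import OrderedDict

def strip_prefix(state_dict, prefix="model."):
    """Return a copy of state_dict with a leading prefix removed from each key."""
    return OrderedDict(
        (key[len(prefix):] if key.startswith(prefix) else key, value)
        for key, value in state_dict.items()
    )

# Hypothetical keys as they might appear in a Lightning checkpoint.
lightning_style = OrderedDict(
    [("model.backbone.weight", 1), ("model.age_head.0.bias", 2)]
)
cleaned = strip_prefix(lightning_style)
print(list(cleaned))  # ['backbone.weight', 'age_head.0.bias']
```

At inference time the cleaned dictionary would be handed to the plain network's `load_state_dict`. Lightning checkpoints usually keep the weights under a top-level `"state_dict"` entry, so that entry is what gets passed to the helper.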

Results

Metric | ConvNeXt-Tiny
Age RMSE | 8.12
Age MAE | 5.66
Gender Macro F1 | 0.940
Kaggle Harmonic Score | 0.821
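For reference, the three component metrics can be computed as in this sketch on toy data. The arrays are made up for illustration, and the formula that combines the metrics into the Kaggle harmonic score is not given in this write-up, so it is not reproduced here.

```python
import numpy as np

def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.mean(np.abs(y_true - y_pred)))

def macro_f1(y_true: np.ndarray, y_pred: np.ndarray, labels=(0, 1)) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in labels:
        tp = np.sum((y_pred == label) & (y_true == label))
        fp = np.sum((y_pred == label) & (y_true != label))
        fn = np.sum((y_pred != label) & (y_true == label))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))

# Toy data, not taken from the competition.
age_true = np.array([20.0, 30.0, 40.0, 50.0])
age_pred = np.array([22.0, 27.0, 45.0, 50.0])
gender_true = np.array([0, 0, 1, 1])
gender_pred = np.array([0, 1, 1, 1])

age_rmse = rmse(age_true, age_pred)
age_mae = mae(age_true, age_pred)
gender_f1 = macro_f1(gender_true, gender_pred)
print(f"RMSE={age_rmse:.3f}  MAE={age_mae:.3f}  macro-F1={gender_f1:.3f}")
```

RMSE sits above MAE whenever errors are uneven, which matches the gap between 8.12 and 5.66 in the table: a minority of large age errors pulls RMSE up.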

Lessons

  1. Multi-task training is data engineering first, architecture second
  2. Sampling strategy can matter more than loss function design
  3. Geometric augmentations must respect the domain — facial structure is sensitive
  4. TTA is almost always worth the 2× inference cost

Have a Better Approach?

This is my take — not the definitive solution. If you've worked on multi-task learning, know better gradient balancing strategies, or found a smarter sampling technique, I'd genuinely like to hear from you.
