Multi-Task CNN for Age & Gender Prediction

Problem

Given a facial image, predict two targets simultaneously:

Age — continuous regression (range 0–100)
Gender — binary classification

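The core challenge: regression and classification losses produce gradients of different scale and shape, and they compete when they share a backbone. With MSE, for example, the age gradient grows linearly with the prediction error, while the binary cross-entropy gradient with respect to the logit is bounded in magnitude by 1.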

Architecture

Shared backbone → dual MLP heads. Three CNN families tested under identical conditions:

Model	Params	Kaggle Harmonic Score
ResNet18	11.5M	0.763
EfficientNet-B0	4.8M	0.802
ConvNeXt-Tiny 🥇	28.3M	0.821

ConvNeXt-Tiny scored highest, ahead of EfficientNet-B0 by 0.019 and ResNet18 by 0.058. A likely explanation is that its larger effective receptive field captures subtle spatial patterns (skin texture, jaw geometry) better than the two older architectures.
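
The shared-backbone, dual-head layout can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the class name `MultiTaskNet`, the hidden width of 128, and the toy convolutional backbone are all assumptions made so the example runs without pretrained weights. In practice the backbone would be one of the three CNNs above with its classifier removed.

```python
import torch
from torch import nn


class MultiTaskNet(nn.Module):
    """Shared feature extractor feeding two small MLP heads (hypothetical sketch)."""

    def __init__(self, backbone: nn.Module, feat_dim: int, hidden: int = 128):
        super().__init__()
        self.backbone = backbone  # must map (N, 3, H, W) -> (N, feat_dim)
        self.age_head = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )
        self.gender_head = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, x: torch.Tensor):
        feats = self.backbone(x)                   # features shared by both tasks
        age = self.age_head(feats).squeeze(-1)     # regression output, shape (N,)
        gender_logit = self.gender_head(feats).squeeze(-1)  # raw logit, shape (N,)
        return age, gender_logit


# Toy stand-in backbone (an assumption) so the sketch runs without downloads.
toy_backbone = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
model = MultiTaskNet(toy_backbone, feat_dim=16)
age, gender_logit = model(torch.randn(4, 3, 64, 64))
```

Both heads read the same feature vector, so gradients from the age loss and the gender loss meet in the backbone. That is where the competition between the two tasks happens.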

Key Engineering Decisions

Sampling — Age distributions are heavily skewed toward 20–30s. Naive training biases predictions toward the mean. Used WeightedRandomSampler with inverse bin-frequency weights to flatten the distribution per epoch.
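
As a rough sketch of inverse bin-frequency weighting, the helper below assigns each example a weight of one over the size of its age bin. The function name `inverse_bin_weights` and the 10-year bin width are assumptions for illustration; the write-up does not state the bin width that was used.

```python
from collections import Counter


def inverse_bin_weights(ages, bin_width=10):
    """Return one sampling weight per example: 1 / (number of examples in its age bin)."""
    bins = [int(age) // bin_width for age in ages]
    counts = Counter(bins)
    return [1.0 / counts[b] for b in bins]


# Toy data: five people in their 20s, two in their 60s, one child.
ages = [22, 25, 27, 29, 24, 61, 63, 5]
weights = inverse_bin_weights(ages)

# The weights inside each bin sum to 1, so every populated bin is equally
# likely to be drawn. With PyTorch they could be passed to
# torch.utils.data.WeightedRandomSampler(weights, num_samples=len(weights), replacement=True).
```

Sampling with replacement means rare ages are repeated within an epoch, so this flattens the distribution the model sees without adding new information about the rare bins.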

Loss — SmoothL1Loss for age (robust to mislabeled outliers that MSE would square-penalize) + BCEWithLogitsLoss(pos_weight=1.1) for gender. Static 1:1 loss addition worked — dynamic weighting added instability without gain.
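
To make the loss behaviour concrete, here is a per-sample sketch in plain Python that mirrors the formulas behind `SmoothL1Loss` (with its default `beta=1.0`) and `BCEWithLogitsLoss` with `pos_weight`. The function names are hypothetical, and the real training code would use the PyTorch modules on batches rather than these scalar helpers.

```python
import math


def smooth_l1(pred, target, beta=1.0):
    """Quadratic for errors below beta, linear above it."""
    diff = abs(pred - target)
    if diff < beta:
        return 0.5 * diff * diff / beta
    return diff - 0.5 * beta


def _softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def bce_with_logits(logit, label, pos_weight=1.0):
    """Binary cross-entropy on a raw logit; pos_weight scales the positive-class term."""
    # -log(sigmoid(z)) = softplus(-z) and -log(1 - sigmoid(z)) = softplus(z)
    return pos_weight * label * _softplus(-logit) + (1 - label) * _softplus(logit)


def multitask_loss(age_pred, age_true, gender_logit, gender_true):
    """Static 1:1 sum of the two task losses."""
    age_term = smooth_l1(age_pred, age_true)
    gender_term = bce_with_logits(gender_logit, gender_true, pos_weight=1.1)
    return age_term + gender_term
```

For an age error of 3, `smooth_l1` returns 2.5 where a squared error would return 9, which is why a few badly mislabeled samples pull the model less.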

Resize strategy — LongestMaxSize → PadIfNeeded → CenterCrop instead of naive rectangular resize. Preserving aspect ratio is critical — distorting facial geometry kills the model's ability to distinguish fine age/gender features.
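
The transform names above match those in the Albumentations library. The sketch below does not call that library. It only works out the geometry of the first two steps (scale the longest side, then pad to a square) to show that the aspect ratio survives. The helper name `letterbox_geometry`, the symmetric padding, and the 224-pixel target are assumptions. The final `CenterCrop` is left out because the write-up does not give its size.

```python
def letterbox_geometry(height, width, size):
    """Scale the longest side to `size`, then pad the short side to reach size x size.

    Returns ((new_height, new_width), (top, bottom, left, right)).
    """
    scale = size / max(height, width)
    new_h, new_w = round(height * scale), round(width * scale)
    pad_h, pad_w = size - new_h, size - new_w
    # Split the padding evenly; any odd pixel goes to the bottom or right.
    padding = (pad_h // 2, pad_h - pad_h // 2, pad_w // 2, pad_w - pad_w // 2)
    return (new_h, new_w), padding


# A tall 400x200 portrait mapped into a 224x224 input.
resized, padding = letterbox_geometry(400, 200, 224)
preserved_ratio = resized[1] / resized[0]  # width / height after the resize
naive_ratio = 224 / 224                    # width / height after a plain square resize
```

Here the face keeps its original 1:2 width-to-height ratio, with the remaining width filled by padding. A plain resize to 224×224 would stretch it to 1:1.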

TTA — Two inference passes (original + horizontal flip), averaged. Cheap, reliable +0.01–0.02 score gain.
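
A minimal sketch of the two-pass averaging, using plain Python lists in place of tensors. The names `hflip`, `predict_tta`, and `toy_model` are hypothetical. Averaging the gender logits (rather than the probabilities) is also an assumption, since the write-up only says the passes were averaged.

```python
def hflip(image):
    """Horizontally flip an image stored as a list of pixel rows."""
    return [row[::-1] for row in image]


def predict_tta(model, image):
    """Average (age, gender_logit) predictions over the original and flipped image."""
    age_a, logit_a = model(image)
    age_b, logit_b = model(hflip(image))
    return (age_a + age_b) / 2, (logit_a + logit_b) / 2


def toy_model(image):
    """Stand-in model whose output depends on the left-most column, so flipping changes it."""
    left = sum(row[0] for row in image)
    return 30.0 + left, 0.1 * left


image = [[1, 0, 3], [1, 0, 3]]
age, gender_logit = predict_tta(toy_model, image)
```

The averaged prediction is the same whether the input or its mirror image is passed in, which removes one source of variance at the cost of a second forward pass.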

What Broke

Bug	Root Cause	Fix
Age predictions collapsed to ~30	Age-bin imbalance ignored	`WeightedRandomSampler`
Harmonic score oscillated wildly early	Dual tasks fighting for gradient share	Hard-coded patience bump for early stopping
Inference predictions wrong	Age scale not reverted post-prediction	Fixed inside loss function scope
Checkpoint wouldn't load at inference	PyTorch Lightning adds `model.` key prefix	Custom `state_dict` key parser
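
The checkpoint fix in the last row can be sketched as a small key-renaming helper. When the network is stored as an attribute named `model` on a LightningModule, its parameters are saved under keys that start with `model.`, and a bare `nn.Module` will not accept them. The function name `strip_prefix` and the example keys are hypothetical, and the project's actual parser may differ.

```python
def strip_prefix(state_dict, prefix="model."):
    """Return a copy of state_dict with `prefix` removed from keys that start with it."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


# Hypothetical checkpoint keys; real values would be tensors.
lightning_state = {"model.backbone.weight": 1, "model.age_head.0.bias": 2}
plain_state = strip_prefix(lightning_state)

# plain_state could then be loaded into the bare network with
# network.load_state_dict(plain_state).
```

Keys that do not start with the prefix pass through unchanged, so the helper is safe to apply to checkpoints that were saved without Lightning.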

Results

Metric	ConvNeXt-Tiny
Age RMSE	8.12
Age MAE	5.66
Gender Macro F1	0.940
Kaggle Harmonic Score	0.821

Lessons

Multi-task training is data engineering first, architecture second
Sampling strategy can matter more than loss function design
Geometric augmentations must respect the domain — facial structure is sensitive
TTA is almost always worth the 2× inference cost

Have a Better Approach?

This is my take — not the definitive solution. If you've worked on multi-task learning, know better gradient balancing strategies, or found a smarter sampling technique, I'd genuinely like to hear from you.
