Multi-Task CNN for Age & Gender Prediction
Comparing ResNet18, EfficientNet-B0, and ConvNeXt-Tiny for simultaneous age regression and gender classification. Kaggle harmonic score: 0.821.

Problem
Given a facial image, predict two targets simultaneously:
- Age — continuous regression (range 0–100)
- Gender — binary classification
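The core challenge: regression and classification losses have conflicting gradient dynamics when sharing a backbone. With ages in raw years, the MSE gradient grows linearly with the prediction error, while the binary cross-entropy gradient with respect to the logit never exceeds 1 in magnitude, so the age task can dominate updates to the shared weights.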
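To put rough numbers on that mismatch, here is a small sketch of the two per-sample gradients. It assumes ages are kept in raw years and the gender head outputs a single logit; the function names are made up for illustration.

```python
import math


def mse_grad(pred, target):
    """Derivative of (pred - target)**2 with respect to pred."""
    return 2.0 * (pred - target)


def bce_logit_grad(logit, label):
    """Derivative of binary cross-entropy on sigmoid(logit) with respect to logit."""
    return 1.0 / (1.0 + math.exp(-logit)) - label


# A 20-year miss on age versus a confidently wrong gender guess.
age_grad = mse_grad(pred=30.0, target=50.0)
gender_grad = bce_logit_grad(logit=-2.0, label=1.0)

print(age_grad)                # -40.0
print(round(gender_grad, 3))   # -0.881
```

The age gradient here is roughly 45 times larger than the gender gradient, which is one reason to normalize the age target or use a loss whose gradient is bounded.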
Architecture
Shared backbone → dual MLP heads. Three CNN families tested under identical conditions:
| Model | Params | Kaggle Score |
|---|---|---|
| ResNet18 | 11.5M | 0.763 |
| EfficientNet-B0 | 4.8M | 0.802 |
| ConvNeXt-Tiny 🥇 | 28.3M | 0.821 |
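ConvNeXt-Tiny won, scoring 0.821 against 0.802 for EfficientNet-B0 and 0.763 for ResNet18. The likely reason is its larger effective receptive field, which captures subtle spatial patterns (skin texture, jaw geometry) better than the older architectures. It is also the largest of the three models, so extra capacity may account for part of the gap.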
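The shared-backbone, dual-head layout could look roughly like the sketch below. This is not the exact code from the project: the class name MultiTaskNet, the hidden size of 256, and the toy backbone are assumptions chosen so the example runs without pretrained weights. In practice the backbone would be a ResNet18, EfficientNet-B0, or ConvNeXt-Tiny with its classifier removed.

```python
import torch
from torch import nn


class MultiTaskNet(nn.Module):
    """Shared feature extractor feeding an age head and a gender head."""

    def __init__(self, backbone: nn.Module, feat_dim: int, hidden: int = 256):
        super().__init__()
        self.backbone = backbone  # must return a (batch, feat_dim) tensor
        self.age_head = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )
        self.gender_head = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, x):
        feats = self.backbone(x)
        age = self.age_head(feats).squeeze(-1)              # regression output
        gender_logit = self.gender_head(feats).squeeze(-1)  # raw logit, no sigmoid
        return age, gender_logit


# Tiny stand-in backbone (assumption) so the sketch is self-contained.
toy_backbone = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)

model = MultiTaskNet(toy_backbone, feat_dim=16)
age, gender_logit = model(torch.randn(4, 3, 224, 224))
print(age.shape, gender_logit.shape)
```

Both heads return one value per image. The gender head stays a raw logit so it can feed BCEWithLogitsLoss directly.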
Key Engineering Decisions
Sampling: Age labels are heavily skewed toward the 20s and 30s, so naive training pulls predictions toward the mean. WeightedRandomSampler with inverse bin-frequency weights flattens the age distribution seen in each epoch.
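A minimal sketch of how inverse bin-frequency weights can be computed. The 10-year bin width and the function name are assumptions, not values taken from the project.

```python
from collections import Counter


def inverse_bin_weights(ages, bin_width=10):
    """Per-sample weight = 1 / (number of samples in the same age bin)."""
    bins = [int(age) // bin_width for age in ages]
    counts = Counter(bins)
    return [1.0 / counts[b] for b in bins]


# Toy labels: four people in their 20s, two in their 30s, one child, one senior.
ages = [22, 25, 27, 29, 31, 34, 8, 67]
weights = inverse_bin_weights(ages)
print(weights)  # [0.25, 0.25, 0.25, 0.25, 0.5, 0.5, 1.0, 1.0]
```

Each bin's weights sum to 1, so every age bin is equally likely to be drawn. The list can then be passed to torch.utils.data.WeightedRandomSampler(weights, num_samples=len(weights), replacement=True).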
Loss: SmoothL1Loss for age (robust to mislabeled outliers, which MSE would penalize quadratically) plus BCEWithLogitsLoss for gender with pos_weight set to 1.1 (PyTorch expects a tensor here, e.g. pos_weight=torch.tensor(1.1)). A static 1:1 sum of the two losses worked; dynamic weighting added instability without any gain.
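To illustrate the outlier argument, the sketch below re-implements the per-sample Smooth L1 formula (with the PyTorch default beta of 1.0) in plain Python and compares it to squared error. It is a scalar illustration only, not a replacement for torch.nn.SmoothL1Loss.

```python
def smooth_l1(pred, target, beta=1.0):
    """Quadratic near zero, linear beyond beta."""
    diff = abs(pred - target)
    if diff < beta:
        return 0.5 * diff * diff / beta
    return diff - 0.5 * beta


def squared_error(pred, target):
    return (pred - target) ** 2


# A clean label two years off, and a mislabeled sample forty years off.
clean = (smooth_l1(30.0, 32.0), squared_error(30.0, 32.0))
outlier = (smooth_l1(30.0, 70.0), squared_error(30.0, 70.0))

print(clean)    # (1.5, 4.0)
print(outlier)  # (39.5, 1600.0)
```

Under squared error the single mislabeled sample outweighs the clean one by 400 to 1. Under Smooth L1 the ratio is about 26 to 1.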
Resize strategy: LongestMaxSize → PadIfNeeded → CenterCrop instead of a naive resize to a fixed rectangle. Preserving aspect ratio matters because stretching a face distorts the geometry the model relies on for fine age and gender cues.
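The geometry of that pipeline can be sketched with plain arithmetic. The 224-pixel target and both function names are assumptions, and the rounding here may differ slightly from what Albumentations does internally.

```python
def letterbox_shape(height, width, target=224):
    """Scale the longest side to `target`, then report padding needed to reach a square."""
    scale = target / max(height, width)
    new_h, new_w = round(height * scale), round(width * scale)
    pad_h, pad_w = max(0, target - new_h), max(0, target - new_w)
    return (new_h, new_w), (pad_h, pad_w)


def naive_resize_stretch(height, width):
    """Factor by which a direct resize to a square widens the image relative to its height."""
    return height / width


# A 400 x 300 (height x width) portrait crop.
resized, padding = letterbox_shape(400, 300)
print(resized, padding)  # (224, 168) (0, 56)
print(round(naive_resize_stretch(400, 300), 3))  # 1.333
```

With the letterbox approach the face keeps its proportions and 56 columns of padding fill the square. A direct resize would instead widen the face by about a third.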
TTA: Two inference passes (original + horizontal flip), averaged. Cheap, and it reliably added 0.01–0.02 to the score.
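A minimal sketch of flip-based TTA, using nested lists as images so it runs without any framework. The toy_predict function is a hypothetical stand-in for the trained model, and averaging probabilities (rather than logits) is an assumption.

```python
def hflip(image):
    """Mirror a row-major image (a list of rows) left to right."""
    return [row[::-1] for row in image]


def predict_with_tta(predict, image):
    """Average (age, gender_probability) over the original and the flipped image."""
    age_a, prob_a = predict(image)
    age_b, prob_b = predict(hflip(image))
    return (age_a + age_b) / 2, (prob_a + prob_b) / 2


def toy_predict(image):
    """Stand-in model that is deliberately sensitive to left/right layout."""
    left = sum(row[0] for row in image)
    right = sum(row[-1] for row in image)
    return 30.0 + (left - right), 0.5 + 0.125 * (left - right)


image = [[1, 0], [1, 0]]
print(toy_predict(image))                    # (32.0, 0.75)
print(predict_with_tta(toy_predict, image))  # (30.0, 0.5)
```

The left/right bias of the toy model cancels out in the average, which is the effect TTA relies on.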
What Broke
| Bug | Root Cause | Fix |
|---|---|---|
| Age predictions collapsed to ~30 | Age-bin imbalance ignored | WeightedRandomSampler |
| Harmonic score oscillated wildly early | Dual tasks fighting for gradient share | Hard-coded patience bump for early stopping |
| Age predictions wrong at inference | Age scale not reverted post-prediction | Fixed inside loss function scope |
| Checkpoint wouldn't load at inference | PyTorch Lightning checkpoint keys carry a "model." prefix (the attribute name of the wrapped network) | Custom state_dict key parser that strips the prefix |
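Two of these fixes are small enough to sketch. The key names, the helper names, and the scale factor of 100 (suggested by the 0–100 age range) are assumptions for illustration, not the project's actual code.

```python
def strip_prefix(state_dict, prefix="model."):
    """Return a copy of state_dict with `prefix` removed from keys that start with it."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


AGE_SCALE = 100.0  # assumption: ages were divided by 100 during training


def to_years(normalized_age):
    """Map a normalized model output back to an age in years."""
    return normalized_age * AGE_SCALE


# Hypothetical keys shaped like a Lightning checkpoint's state_dict.
lightning_style = {"model.backbone.weight": 1, "model.age_head.bias": 2}
plain = strip_prefix(lightning_style)

print(sorted(plain))   # ['age_head.bias', 'backbone.weight']
print(to_years(0.25))  # 25.0
```

With a real checkpoint, the dictionary would typically come from torch.load(path)["state_dict"], and the stripped result would go to load_state_dict on the bare network.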
Results
| Metric | ConvNeXt-Tiny |
|---|---|
| Age RMSE (years) | 8.12 |
| Age MAE (years) | 5.66 |
| Gender Macro F1 | 0.940 |
| Kaggle Harmonic Score | 0.821 |
Lessons
- Multi-task training is data engineering first, architecture second
- Sampling strategy can matter more than loss function design
- Geometric augmentations must respect the domain — facial structure is sensitive
- TTA is almost always worth the 2× inference cost
Have a Better Approach?
This is my take — not the definitive solution. If you've worked on multi-task learning, know better gradient balancing strategies, or found a smarter sampling technique, I'd genuinely like to hear from you.