Multi-Task Deep Learning for Age and Gender Prediction: From Baseline to 0.82 Kaggle Score
A research-style deep dive into building, comparing, and optimizing multi-task CNN architectures (ResNet18, EfficientNet-B0, ConvNeXt-Tiny) for simultaneous age regression and gender classification.

1. Abstract & Introduction
This research presents the design, development, and comparative evaluation of deep learning models for simultaneous age regression and gender classification from facial images.
The core focus of this work lies in systematically building a robust Multi-Task Learning (MTL) architecture. By jointly optimizing for regression and classification, we explore the challenges of shared feature learning, balancing competing objectives, and metric-driven architecture tuning. We tested three distinct CNN backbone families (ResNet18, EfficientNet-B0, and ConvNeXt-Tiny) to determine which yields the best representations for facial data.
The final production-ready inference system, utilizing ConvNeXt-Tiny with Test Time Augmentation (TTA), achieved a highly competitive Kaggle harmonic score of 0.821. This case study details the complete engineering and research pipeline, moving beyond theory into the practical intricacies of data sampling, mixed-precision training, and systematic debugging.
2. Problem Statement
Given a facial image, the objective is to predict two distinct targets simultaneously:
- Age: A continuous regression value in the range [0, 100].
- Gender: A binary classification label (0 = Female, 1 = Male).
This presents a classic multi-task learning challenge:
- Regression Loss vs. Classification Loss: Gradients from the age regression loss differ in scale and behavior from binary cross-entropy gradients, so one task can dominate the updates to the shared backbone.
- Shared Feature Learning: The early convolutional layers must learn to extract edge and shape features that are useful for both detecting masculine/feminine jawlines and identifying fine-grained skin wrinkles.
3. Dataset Description
The models were trained and evaluated on a specialized dataset provided via a Kaggle competition.
- Dataset Source: Kaggle Competition
- Total Training Samples: 34,708 images
- Total Test Samples: 8,677 images
Dataset Profile
Columns: id, full_path, gender, age.
The images in this dataset vary immensely in quality and context, including:
- Arbitrary facial poses and head rotations.
- Extreme variations in lighting and contrast.
- Variable image resolutions.
- Broad diversity in ethnicity and age groups.
4. Data Preprocessing Pipeline
To ensure the model learns robust facial features rather than dataset artifacts, building a high-quality preprocessing pipeline was paramount. We utilized OpenCV, Albumentations, and PyTorch.
Training Augmentations
We applied a stochastic sequence of augmentations to generalize the model against real-world variations:
- HorizontalFlip(p=0.5)
- ShiftScaleRotate to simulate different camera angles and subject distances.
- RandomBrightnessContrast & HueSaturationValue to combat diverse lighting conditions.
- GaussNoise & Blur to prevent overfitting to high-resolution noise patterns.
Normalization
We normalized all inputs using standard ImageNet statistics to leverage pre-trained weights effectively:
- mean = (0.485, 0.456, 0.406)
- std = (0.229, 0.224, 0.225)
Resize Strategy
A critical early design choice was how to resize images to the target 224×224 resolution. Instead of a naive rectangular resize—which stretches and squashes faces, distorting key geometric features used to determine gender and age—we implemented:
LongestMaxSize -> PadIfNeeded -> CenterCrop
Reasoning: This preserves the true aspect ratio of the face, avoiding fatal geometric distortions, while still safely standardizing the tensor shape to $224 \times 224 \times 3$ for the GPU.
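To make the geometry concrete, here is a minimal pure-Python sketch of what the three steps do to an image's dimensions. It is not the Albumentations implementation; the function name letterbox_geometry and the symmetric split of the padding are assumptions made for illustration.

```python
def letterbox_geometry(height, width, target=224):
    """Return the resized (h, w) and the (top, bottom, left, right) padding."""
    # Step 1 (LongestMaxSize): scale so the longer side equals `target`.
    scale = target / max(height, width)
    new_h = max(1, round(height * scale))
    new_w = max(1, round(width * scale))

    # Step 2 (PadIfNeeded): pad the shorter side up to `target`.
    # Splitting the padding evenly on both sides is an assumption.
    pad_h = target - new_h
    pad_w = target - new_w
    top, left = pad_h // 2, pad_w // 2
    bottom, right = pad_h - top, pad_w - left

    # Step 3 (CenterCrop) then acts as a safety net: the padded image
    # is already target x target, so nothing is cut away.
    return (new_h, new_w), (top, bottom, left, right)


# A 400x300 (height x width) portrait keeps its 4:3 aspect ratio
# and receives padding on the left and right only.
size, pads = letterbox_geometry(400, 300)
print(size, pads)  # (224, 168) (0, 0, 28, 28)
```

The face is never stretched: only uniform scaling and padding are applied, which is the property the resize strategy relies on.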
5. Addressing Class Imbalance: The Sampling Strategy
The Problem: Age distributions in facial datasets are notoriously imbalanced. Babies and the elderly are drastically underrepresented compared to adults in their 20s and 30s. Training naively causes the model to strongly bias its predictions toward the mean age (~30 years old), destroying its RMSE on edge cases.
The Solution: WeightedRandomSampler
We used a weighted sampling strategy to flatten the age distribution the model sees during training:
- Placed ages into histogram bins.
- Computed the inverse frequency of each bin.
- Assigned sampling weights dynamically to every dataset index.
The Effect: This artificially boosted the exposure frequency of rare ages per epoch, drastically reducing regression bias and improving the global RMSE stability.
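The three steps above can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the project's code: the helper name age_sampling_weights and the 10-year bin width are hypothetical, since the article does not state the bin size.

```python
from collections import Counter


def age_sampling_weights(ages, bin_width=10):
    """Weight each sample by the inverse frequency of its age bin."""
    # bin_width=10 is an assumed value, chosen only for this example.
    bins = [int(age) // bin_width for age in ages]
    counts = Counter(bins)
    return [1.0 / counts[b] for b in bins]


ages = [25, 27, 29, 31, 2, 85]  # toy data: adults in their 20s dominate
weights = age_sampling_weights(ages)
print(weights)

# With PyTorch, these weights would typically be handed to the sampler:
#   sampler = torch.utils.data.WeightedRandomSampler(
#       weights, num_samples=len(weights), replacement=True)
#   loader = torch.utils.data.DataLoader(dataset, sampler=sampler)
```

Because each bin's weights sum to 1, every age bin is drawn with roughly equal probability, regardless of how many images it contains.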
6. Model Architecture Design
All multi-task models followed a universal design paradigm: a shared convolutional backbone feeding into dual, specialized Multi-Layer Perceptron (MLP) heads.
Input Image (224x224x3)
↓
CNN Backbone (timm)
↓
Shared Feature Vector
↓
Shared MLP Head
↓
┌───────────────┬───────────────┐
│ Age Head │ Gender Head │
│ Regression │ Classification│
└───────────────┴───────────────┘
Multi-Task Head Topology
Instead of mapping the backbone directly to a Linear(features, 1) layer, we introduced a shared transitional MLP to provide additional capacity for the network to disentangle the age and gender features before splitting:
Linear -> BatchNorm -> GELU -> Dropout
Finally, the paths diverge into two separate linear outputs:
- Age: Linear(256 -> 1) (Continuous Output)
- Gender: Linear(256 -> 1) (Logit Output)
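A minimal PyTorch sketch of this head topology follows. The class name MultiTaskHead, the dropout rate of 0.3, and the 768-dimensional input (the pooled feature width of ConvNeXt-Tiny) are assumptions for illustration; the backbone itself is replaced by random features so the example runs without timm.

```python
import torch
from torch import nn


class MultiTaskHead(nn.Module):
    """Shared MLP followed by separate age and gender outputs."""

    def __init__(self, in_features, hidden=256, dropout=0.3):
        super().__init__()
        # Shared block: Linear -> BatchNorm -> GELU -> Dropout
        # (the dropout rate is an assumed value).
        self.shared = nn.Sequential(
            nn.Linear(in_features, hidden),
            nn.BatchNorm1d(hidden),
            nn.GELU(),
            nn.Dropout(dropout),
        )
        self.age = nn.Linear(hidden, 1)     # continuous output
        self.gender = nn.Linear(hidden, 1)  # logit output

    def forward(self, features):
        z = self.shared(features)
        return self.age(z).squeeze(1), self.gender(z).squeeze(1)


# Random tensors stand in for pooled backbone features.
head = MultiTaskHead(in_features=768).eval()
with torch.no_grad():
    age_pred, gender_logit = head(torch.randn(4, 768))
print(age_pred.shape, gender_logit.shape)
```

With timm, pooled features of this kind are typically obtained by creating the backbone with num_classes=0, so the same head can sit on top of any of the compared backbones by changing in_features.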
7. Backbone Models Compared
We benchmarked three distinct architectural families loaded via the timm library:
| Model | Params | Architecture Type | Characteristics |
|---|---|---|---|
| ResNet18 | 11.5M | Residual CNN | Deep, standard skip connections, proven reliability. |
| EfficientNet-B0 | 4.8M | Compound scaling | Highly parameter-efficient, optimized for FLOPs. |
| ConvNeXt-Tiny | 28.3M | Modern ConvNet | Re-imagined standard CNN incorporating Transformer-like design principles. |
8. Loss Functions & Joint Optimization
Optimizing two completely different tasks simultaneously requires careful loss scaling.
Age Loss: SmoothL1Loss
- Why?: Age regression suffers heavily from dataset mislabeling (outliers). Mean Squared Error (MSE) squares the error of these outliers, so a few bad labels can dominate the gradient. Smooth L1 behaves like L1 loss for large errors (robustness) and like L2 loss near zero (stable gradients).
Gender Loss: BCEWithLogitsLoss(pos_weight=1.1)
- Why?: We used binary cross-entropy on raw (un-normalized) logits for numerical stability. We also set pos_weight=1.1, which scales the loss on positive (label 1) samples by 1.1 and so slightly increases the penalty for misclassifying the minority gender, compensating for a mild class imbalance.
Final Joint Objective: Loss = AgeLoss + GenderLoss
(Note: We experimented with dynamic loss weighting, but static 1:1 addition proved simple, stable, and highly effective.)
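The joint objective can be sketched as follows. This is a minimal illustration, not the project's training code: the function name joint_loss and the toy tensors are hypothetical, and ages are kept in raw years here for readability.

```python
import torch
from torch import nn

age_criterion = nn.SmoothL1Loss()
# pos_weight scales the loss on positive (label 1) targets by 1.1.
gender_criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([1.1]))


def joint_loss(age_pred, age_true, gender_logit, gender_true):
    """Static 1:1 sum of the age and gender losses."""
    age_term = age_criterion(age_pred, age_true)
    gender_term = gender_criterion(gender_logit, gender_true)
    return age_term + gender_term


# Toy batch: the third age label is off by 40 years (an outlier).
age_pred = torch.tensor([30.0, 45.0, 20.0])
age_true = torch.tensor([31.0, 44.0, 60.0])
gender_logit = torch.tensor([2.0, -1.0, 0.5])
gender_true = torch.tensor([1.0, 0.0, 1.0])

loss = joint_loss(age_pred, age_true, gender_logit, gender_true)
print(float(loss))

# The outlier grows the loss linearly under Smooth L1, quadratically under MSE.
smooth_l1 = float(nn.SmoothL1Loss()(age_pred, age_true))
mse = float(nn.MSELoss()(age_pred, age_true))
print(smooth_l1, mse)  # 13.5 534.0
```

On this toy batch the single outlier inflates MSE to 534 while Smooth L1 stays at 13.5, which is the robustness argument made above.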
9. Optimizer, Scheduler & Mixed Precision Training
The optimization pipeline was tuned for fast training and stable convergence.
Optimizer Setup
- Algorithm: AdamW (chosen over Adam for superior weight decay integration and generalization).
- Learning Rates: Backbone-specific. 1.5e-4 for ConvNeXt & EfficientNet; 2e-4 for ResNet.
- Weight Decay: 5e-5
- Gradient Clipping: 1.0 (to prevent exploding gradients from erratic outlier batches).
Scheduler
We utilized CosineAnnealingLR. By smoothly and continuously decaying the learning rate without sharp step discontinuities, we allowed the dual-task loss landscape to settle into wider, more generalized local minima.
Mixed Precision Training
We ran the forward pass under torch.autocast and used torch.cuda.amp.GradScaler to scale the loss before the backward pass.
- Advantages: Computing activations and gradients in FP16 (while maintaining FP32 weights) sped up training and reduced the VRAM footprint, allowing larger batch sizes, while loss scaling kept small FP16 gradients from underflowing.
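The optimizer, scheduler, clipping, and mixed-precision pieces described in this section fit together roughly as in the sketch below. It is a simplified illustration: a small nn.Linear stands in for the real network, random tensors replace the data loader, T_max=30 is an assumed epoch count, and AMP is switched off automatically on CPU so the snippet still runs.

```python
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"  # mixed precision only when a GPU is present

model = nn.Linear(16, 1).to(device)  # stand-in for the multi-task network
optimizer = torch.optim.AdamW(model.parameters(), lr=1.5e-4, weight_decay=5e-5)
# T_max=30 is an assumption; the article does not state the epoch count.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=30)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)
criterion = nn.SmoothL1Loss()

for epoch in range(2):
    # One random batch per "epoch" keeps the example short.
    x = torch.randn(8, 16, device=device)
    y = torch.randn(8, 1, device=device)

    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type=device, enabled=use_amp):
        loss = criterion(model(x), y)  # forward pass in reduced precision

    scaler.scale(loss).backward()  # scale the loss to avoid FP16 underflow
    scaler.unscale_(optimizer)     # unscale so clipping sees true gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()               # cosine decay, stepped once per epoch

final_lr = optimizer.param_groups[0]["lr"]
print(final_lr)
```

Calling unscale_ before clip_grad_norm_ matters: without it, the clipping threshold of 1.0 would be applied to gradients that are still multiplied by the loss scale.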
10. Evaluation Metrics & Kaggle Scoring Alignment
Because we were optimizing two tasks, evaluating the model required observing multiple dimensions simultaneously.
Age Metrics Tracked:
- Root Mean Squared Error (RMSE)
- Mean Absolute Error (MAE)
- R²
- Granular Accuracy: ±3 years, ±5 years, ±10 years.
Gender Metric Tracked:
- Macro F1 Score
Composite Metric Tracked:
- Harmonic Mean (Kaggle Score)
The Kaggle Harmonic Score Definition
The competition defined success via a highly punitive composite formula:
- Age Score = 1 - min(RMSE, 30) / 30
- Final Harmonic Score = 2 * (Age Score * F1) / (Age Score + F1)
A harmonic mean aggressively penalizes imbalance. Even if a model achieved a near-perfect F1 score on gender, a poor RMSE on age would collapse the total score. This forced us to train strictly balanced models.
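The formula is easy to verify numerically. The sketch below implements it directly; the function names age_score and harmonic_score are hypothetical, and the inputs are the RMSE and Macro F1 values reported in this article's results table.

```python
def age_score(rmse, cap=30.0):
    """Map RMSE to [0, 1]; any RMSE at or above the cap scores 0."""
    return 1.0 - min(rmse, cap) / cap


def harmonic_score(rmse, f1):
    """Harmonic mean of the age score and the gender Macro F1."""
    a = age_score(rmse)
    if a + f1 == 0:
        return 0.0  # avoid division by zero when both components are 0
    return 2 * a * f1 / (a + f1)


print(round(harmonic_score(8.12, 0.940), 3))  # ConvNeXt-Tiny   -> 0.821
print(round(harmonic_score(8.64, 0.918), 3))  # EfficientNet-B0 -> 0.802
print(round(harmonic_score(9.79, 0.879), 3))  # ResNet18        -> 0.763

# A perfect F1 cannot rescue a poor age model.
print(round(harmonic_score(27.0, 1.0), 3))    # -> 0.182
```

The last line shows the penalty at work: with an RMSE of 27 the age score is only 0.1, and even a perfect gender F1 lifts the composite to just 0.182.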
11. Model Comparison Results & Analysis
After implementing identical training protocols, hyperparameters, and sampling strategies, the backbones yielded the following definitive results:
| Model | RMSE | MAE | Macro F1 | Harmonic Mean (Kaggle Score) |
|---|---|---|---|---|
| ConvNeXt-Tiny 🥇 | 8.12 | 5.66 | 0.940 | 0.821 |
| EfficientNet-B0 🥈 | 8.64 | 6.12 | 0.918 | 0.802 |
| ResNet18 🥉 | 9.79 | 7.10 | 0.879 | 0.763 |
Architectural Analysis: Why ConvNeXt Won
ConvNeXt-Tiny dominated the metrics. Its larger effective receptive field and modern convolution blocks allowed it to capture subtle spatial geometries (like fine skin wrinkles and texture) vastly better than its predecessors. It provided the strongest gender feature discrimination (0.940 F1) alongside the most stable regression (8.12 RMSE).
EfficientNet-B0 performed admirably, proving its strength as a compound-scaled architecture with unmatched parameter efficiency (only 4.8M parameters).
ResNet18, with its older and less expressive feature extractor, struggled to learn shared representations under the conflicting multi-task gradients, resulting in the lowest performance.
Confusion Matrix Insights
Validating the ConvNeXt gender predictions revealed exceptionally high true positive rates, minimal class confusion, and deeply balanced F1 performance irrespective of the underlying age bracket.
12. Training Behavior & Convergence Dynamics
By tracking the epoch-over-epoch telemetry, clear training patterns emerged:
- RMSE vs Epochs: Age regression took significantly longer to learn. RMSE reliably stabilized only after epoch ~12.
- F1 vs Epochs: Gender classification is the easier task to learn and plateaued very early in the training cycle.
- Harmonic Mean vs Epochs: Because F1 saturated quickly, the overall Harmonic Mean (Kaggle Score) essentially tracked the RMSE improvement curve.
13. Pushing Limits: Test Time Augmentation (TTA)
To squeeze maximal performance out of the final inference pipeline, we deployed Test Time Augmentation (TTA).
For every image in the 8,677-image test set, the pipeline executed two forward passes:
- The Original Image
- A Horizontal Flip
Aggregation Logic:
- Final Age = mean(prediction_1, prediction_2)
- Final Gender = mean(probability_1, probability_2) > 0.5
Benefit: This computationally cheap ensemble alternative resulted in a measurable +0.01 to +0.02 score gain on the final leaderboard.
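A framework-free sketch of this aggregation logic is shown below. The names predict_with_tta and toy_model are hypothetical; toy_model is a stand-in that returns an (age, gender logit) pair, and the image is a nested list rather than a tensor so the example runs without any dependencies.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hflip(image):
    """Flip an image stored as rows of pixels left-to-right."""
    return [row[::-1] for row in image]


def predict_with_tta(predict_fn, image):
    """Average predictions over the original image and its horizontal flip."""
    outputs = [predict_fn(view) for view in (image, hflip(image))]
    age = sum(a for a, _ in outputs) / len(outputs)
    # Average probabilities (not logits), then threshold at 0.5.
    prob = sum(sigmoid(logit) for _, logit in outputs) / len(outputs)
    age = min(max(age, 0.0), 100.0)  # clamp to the valid age range
    return age, int(prob > 0.5)


# Hypothetical stand-in model: its outputs depend on the top-left pixel,
# so the two views disagree and the averaging becomes visible.
def toy_model(image):
    first = image[0][0]
    return 30.0 + first, first - 0.5  # (age, gender logit)


image = [[0.0, 4.0], [1.0, 1.0]]
result = predict_with_tta(toy_model, image)
print(result)  # (32.0, 1)
```

Here the original view predicts age 30 and the flipped view age 34, so the aggregated age is 32; the averaged gender probability is about 0.67, which rounds to class 1.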
Final Submission Format
The final pipeline generated a CSV mapping each id to a continuous age (clamped to [0, 100]) and a binary gender label.
14. Real-World Engineering Challenges Solved
Behind the elegant ML math lay brutal engineering challenges. These debugging learnings defined the project's success:
- Age Scaling Mismatches: A silent killer. If training ages are scaled to [0, 1] but inference does not invert exactly that scaling, every predicted age comes out wrong. We fixed this by isolating the scaling inside the loss function.
- Early Stopping Instability: The harmonic metric oscillates wildly in early training as the dual tasks fight for gradient share. Hard-coded patience bumps were necessary.
- Backbone Freezing Issues: Experimenting with freezing early layers proved disastrous for multi-task models; shared superficial edge detectors must adapt equally.
- Checkpoint Prefix Errors: Distributed orchestration frameworks (like PyTorch Lightning) unpredictably inject a "model." prefix into state_dict keys. A custom parser was required to load them into raw Python inference scripts.
- Gender Collapse in Inference: Relying purely on raw BCE logits without carefully tuned threshold bounds can skew predictions toward the majority class silently.
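As an illustration of the checkpoint prefix fix, a key-renaming helper along the following lines is usually enough. This is a sketch rather than the project's parser: the function name strip_prefix and the toy keys are hypothetical, and strings stand in for the weight tensors.

```python
def strip_prefix(state_dict, prefix="model."):
    """Return a copy of state_dict with `prefix` removed from matching keys."""
    cleaned = {}
    for key, value in state_dict.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        cleaned[new_key] = value
    return cleaned


# Toy checkpoint; real values would be tensors rather than strings.
checkpoint = {
    "model.backbone.stem.weight": "w0",
    "model.age.bias": "b0",
    "gender.weight": "w1",  # already unprefixed, left untouched
}
cleaned = strip_prefix(checkpoint)
print(sorted(cleaned))
# The cleaned dict could then be passed to a plain nn.Module
# through its load_state_dict method.
```

Only a leading prefix is removed, so a key that merely contains "model." somewhere in the middle keeps its name.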
15. Summary: Final Best Model Configuration
| Component | Setting |
|---|---|
| Backbone | ConvNeXt-Tiny (28.3M params) |
| Input Size | 224x224 |
| Optimizer & LR | AdamW @ 1.5e-4 |
| Scheduler | CosineAnnealingLR |
| Loss | SmoothL1 + BCE |
| Augmentations | Albumentations (Heavy) |
| Sampling | WeightedRandomSampler |
| TTA | Horizontal Flip Aggregation |
16. Future Improvements & Conclusion
Future Work
While the system is robust, future architectural upgrades could include:
- Ordinal Age Regression: Shifting from continuous regression to treating age as ordered, quantifiable categories.
- Attention Modules: Forcing the network to map facial landmarks explicitly.
- Label Distribution Learning: Smoothing hard age ground-truths into Gaussian distributions to reflect the visual ambiguity of human aging.
- Face Alignment Preprocessing: Enforcing geometric eye-alignment crops before inference.
Research Contributions & Conclusion
This deep dive confirms that constructing a production-grade multi-task system is as much an exercise in systematic engineering as it is in dataset mathematics.
Key achievements included engineering a robust, non-distorting preprocessing pipeline, enforcing balanced loss optimization over disparate distributions, and executing a rigorous multi-backbone comparison. Ultimately, ConvNeXt-Tiny emerged as the optimal architecture for this dataset, pulling a final Kaggle harmonic score of 0.82.
This case study demonstrates that modern ConvNet architectures, paired with smart sampling and logical inference pipelines, remain exceptionally powerful tools in the contemporary AI landscape.
Connect
If you're interested in multi-task learning, model comparison research, or Kaggle optimization strategies—feel free to connect.
- GitHub: devp1866
- LinkedIn: devp1866
- Email: devp1866@gmail.com
Author: Devkumar Patel
Domain: Deep Learning, Computer Vision, Multi-Task Learning