Multi-Task Deep Learning for Age and Gender Prediction: From Baseline to 0.82 Kaggle Score

1. Abstract & Introduction

This research presents the design, development, and comparative evaluation of deep learning models for simultaneous age regression and gender classification from facial images.

The core focus of this work is building a robust Multi-Task Learning (MTL) architecture systematically. By jointly optimizing for regression and classification, we explore the challenges of shared feature learning, balancing competing objectives, and metric-driven architecture tuning. We tested three CNN backbones (ResNet18, EfficientNet-B0, and ConvNeXt-Tiny) to find the representation best suited to facial data.

The final inference system, using ConvNeXt-Tiny with Test Time Augmentation (TTA), achieved a Kaggle harmonic score of 0.821. This case study details the complete engineering and research pipeline, moving beyond theory into the practical details of data sampling, mixed-precision training, and systematic debugging.

2. Problem Statement

Given a facial image, the objective is to predict two distinct targets simultaneously:

Age: A continuous regression value in the range [0, 100].
Gender: A binary classification label (0 = Female, 1 = Male).

This presents a classic multi-task learning challenge:

Regression Loss vs. Classification Loss: Gradients from the age regression loss differ in scale and behavior from binary cross-entropy gradients, so one task can dominate the shared weights if the two are not balanced.
Shared Feature Learning: The early convolutional layers must learn to extract edge and shape features that are useful for both detecting masculine/feminine jawlines and identifying fine-grained skin wrinkles.

3. Dataset Description

The models were trained and evaluated on a specialized dataset provided via a Kaggle competition.

Dataset Source: Kaggle Competition
Total Training Samples: 34,708 images
Total Test Samples: 8,677 images

Dataset Profile

Columns: id, full_path, gender, age.

The images in this dataset vary immensely in quality and context, including:

Arbitrary facial poses and head rotations.
Extreme variations in lighting and contrast.
Variable image resolutions.
Broad diversity in ethnicity and age groups.

4. Data Preprocessing Pipeline

To ensure the model learns robust facial features rather than dataset artifacts, building a high-quality preprocessing pipeline was paramount. We utilized OpenCV, Albumentations, and PyTorch.

Training Augmentations

We applied a stochastic sequence of augmentations to generalize the model against real-world variations:

HorizontalFlip(p=0.5)
ShiftScaleRotate to simulate different camera angles and subject distances.
RandomBrightnessContrast & HueSaturationValue to combat diverse lighting conditions.
GaussNoise & Blur to prevent overfitting to high-resolution noise patterns.

Normalization

We normalized all inputs using standard ImageNet statistics to leverage pre-trained weights effectively:

mean = (0.485, 0.456, 0.406)
std = (0.229, 0.224, 0.225)

Resize Strategy

A critical early design choice was how to resize images to the target 224×224 resolution. Instead of a naive rectangular resize—which stretches and squashes faces, distorting key geometric features used to determine gender and age—we implemented:

LongestMaxSize -> PadIfNeeded -> CenterCrop

Reasoning: This preserves the true aspect ratio of the face, avoiding geometric distortion, while still standardizing every input to a 224×224×3 tensor for the GPU.
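To make the geometry concrete, the sketch below computes what this three-step chain does to an image's dimensions. It is plain Python arithmetic, not the Albumentations implementation: the function name letterbox_geometry is hypothetical, and the rounding and symmetric padding are assumptions (the library may differ by a pixel).

```python
def letterbox_geometry(height, width, target=224):
    """Return resized (h, w) and padding (top, bottom, left, right) for a square target."""
    # LongestMaxSize: scale so the longer side equals `target`, keeping the aspect ratio.
    scale = target / max(height, width)
    new_h = max(1, round(height * scale))
    new_w = max(1, round(width * scale))

    # PadIfNeeded: pad the shorter side symmetrically up to `target`.
    pad_h = target - new_h
    pad_w = target - new_w
    top, left = pad_h // 2, pad_w // 2
    bottom, right = pad_h - top, pad_w - left

    # CenterCrop to target x target is then a no-op safeguard,
    # because the padded image already has exactly that size.
    return (new_h, new_w), (top, bottom, left, right)


resized, padding = letterbox_geometry(600, 400)
print(resized, padding)  # (224, 149) (0, 0, 37, 38)
```

A 600×400 portrait keeps its 3:2 proportions and gains 75 columns of padding, where a direct resize to 224×224 would have widened the face by roughly 50%.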

5. Addressing Class Imbalance: The Sampling Strategy

The Problem: Age distributions in facial datasets are notoriously imbalanced. Babies and the elderly are drastically underrepresented compared to adults in their 20s and 30s. Training naively causes the model to strongly bias its predictions toward the mean age (~30 years old), destroying its RMSE on edge cases.

The Solution: WeightedRandomSampler. We used a weighted sampling strategy to flatten the age distribution seen during training:

Placed ages into histogram bins.
Computed the inverse frequency of each bin.
Assigned sampling weights dynamically to every dataset index.

The Effect: This artificially boosted the exposure frequency of rare ages per epoch, drastically reducing regression bias and improving the global RMSE stability.
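The three steps above can be sketched in a few lines of plain Python. The bin width of 10 years and the helper name age_sampling_weights are assumptions for illustration; the original bin edges are not stated.

```python
from collections import Counter


def age_sampling_weights(ages, bin_width=10):
    """One sampling weight per sample: the inverse frequency of its age bin."""
    bins = [int(age) // bin_width for age in ages]  # histogram bin index per sample
    counts = Counter(bins)                          # number of samples in each bin
    return [1.0 / counts[b] for b in bins]          # rare bins get larger weights


ages = [25, 27, 29, 31, 33, 2, 85]
weights = age_sampling_weights(ages)
print(weights)
```

Here the single toddler and the single 85-year-old each receive weight 1.0, while the three samples in the 20s bin share it at 1/3 each, so every non-empty bin is drawn with equal probability. In a PyTorch pipeline, a list like this would typically be passed to torch.utils.data.WeightedRandomSampler(weights, num_samples=len(weights), replacement=True).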

6. Model Architecture Design

All multi-task models followed the same design: a shared convolutional backbone feeding a shared Multi-Layer Perceptron (MLP), which then splits into two task-specific output heads.

       Input Image (224x224x3)
                 ↓
       CNN Backbone (timm)
                 ↓
      Shared Feature Vector
                 ↓
        Shared MLP Head
                 ↓
  ┌───────────────┬───────────────┐
  │ Age Head      │ Gender Head   │
  │ Regression    │ Classification│
  └───────────────┴───────────────┘

Multi-Task Head Topology

Instead of mapping the backbone directly to a Linear(features, 1) layer, we introduced a shared transitional MLP to provide additional capacity for the network to disentangle the age and gender features before splitting:

Linear -> BatchNorm -> GELU -> Dropout

Finally, the paths diverge into two separate linear outputs:

Age: Linear(256 -> 1) (Continuous Output)
Gender: Linear(256 -> 1) (Logit Output)
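A minimal PyTorch sketch of this head topology follows. The class name MultiTaskHead, the dropout rate of 0.3, and the 768-wide input (the pooled feature width of ConvNeXt-Tiny) are assumptions for illustration. Only the layer order and the 256-unit hidden width come from the description above.

```python
import torch
from torch import nn


class MultiTaskHead(nn.Module):
    """Shared MLP followed by separate age and gender outputs."""

    def __init__(self, in_features, hidden=256, dropout=0.3):
        super().__init__()
        # Shared block: Linear -> BatchNorm -> GELU -> Dropout
        self.shared = nn.Sequential(
            nn.Linear(in_features, hidden),
            nn.BatchNorm1d(hidden),
            nn.GELU(),
            nn.Dropout(dropout),
        )
        self.age = nn.Linear(hidden, 1)     # continuous output
        self.gender = nn.Linear(hidden, 1)  # raw logit output

    def forward(self, features):
        z = self.shared(features)
        # Squeeze the trailing dimension so both outputs have shape (batch,)
        return self.age(z).squeeze(1), self.gender(z).squeeze(1)


head = MultiTaskHead(in_features=768)
head.eval()  # eval mode so BatchNorm and Dropout behave deterministically
with torch.no_grad():
    age_pred, gender_logit = head(torch.randn(4, 768))
print(age_pred.shape, gender_logit.shape)
```

With timm, the feature vector fed to such a head would usually come from a backbone created with num_classes=0, which removes the classifier and returns pooled features.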

7. Backbone Models Compared

We benchmarked three distinct architectural families loaded via the timm library:

Model	Params	Architecture Type	Characteristics
ResNet18	11.5M	Residual CNN	Deep, standard skip connections, proven reliability.
EfficientNet-B0	4.8M	Compound scaling	Highly parameter-efficient, optimized for FLOPs.
ConvNeXt-Tiny	28.3M	Modern ConvNet	Re-imagined standard CNN incorporating Transformer-like design principles.

8. Loss Functions & Joint Optimization

Optimizing two completely different tasks simultaneously requires careful loss scaling.

Age Loss: SmoothL1Loss

Why?: Age labels contain mislabeled samples (outliers). Mean Squared Error (MSE) squares every error, so a handful of outliers can dominate the gradient. Smooth L1 behaves like L1 loss for large errors (robustness) and like L2 loss near zero (stable gradients).

Gender Loss: BCEWithLogitsLoss(pos_weight=1.1)

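Why?: BCEWithLogitsLoss takes raw (un-normalized) logits and fuses the sigmoid with the cross-entropy, which is more numerically stable than applying them separately. Setting pos_weight=1.1 scales the loss on positive-class samples (label 1) by 1.1, slightly increasing the penalty for misclassifying the minority gender and compensating for the mild class imbalance.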

Final Joint Objective: Loss = AgeLoss + GenderLoss

(Note: We experimented with dynamic loss weighting, but static 1:1 addition proved simple, stable, and highly effective.)
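To show how the two terms behave, here is a per-sample sketch of both losses in plain Python. It mirrors the formulas behind SmoothL1Loss (with beta=1.0, an assumed setting) and BCEWithLogitsLoss, and is meant as a worked illustration, not the training code.

```python
import math


def smooth_l1(pred, target, beta=1.0):
    """Quadratic for |error| < beta, linear beyond it."""
    err = abs(pred - target)
    return 0.5 * err * err / beta if err < beta else err - 0.5 * beta


def bce_with_logits(logit, label, pos_weight=1.0):
    """Binary cross-entropy on a raw logit; pos_weight scales the positive-class term."""
    # log(sigmoid(x)) in a numerically stable form
    if logit >= 0:
        log_sig = -math.log1p(math.exp(-logit))
    else:
        log_sig = logit - math.log1p(math.exp(logit))
    log_one_minus_sig = log_sig - logit  # log(1 - sigmoid(x)) = log(sigmoid(x)) - x
    return -(pos_weight * label * log_sig + (1 - label) * log_one_minus_sig)


# A 40-year labeling outlier: Smooth L1 grows linearly, squared error does not.
print(smooth_l1(70.0, 30.0))   # 39.5
print((70.0 - 30.0) ** 2)      # 1600.0

# Static 1:1 joint objective for one sample
joint = smooth_l1(30.5, 30.0) + bce_with_logits(0.0, 1, pos_weight=1.1)
print(joint)
```

The outlier contributes 39.5 under Smooth L1 against 1600 under squared error, which is the robustness argument in numbers.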

9. Optimizer, Scheduler & Mixed Precision Training

To achieve industry-level training speeds and convergence, the optimization pipeline was highly refined.

Optimizer Setup

Algorithm: AdamW (chosen over Adam because its decoupled weight decay tends to generalize better).
Learning Rates: Backbone-specific. 1.5e-4 for ConvNeXt & EfficientNet; 2e-4 for ResNet.
Weight Decay: 5e-5
Gradient Clipping: 1.0 (to prevent exploding gradients from erratic outlier batches).

Scheduler

We utilized CosineAnnealingLR. By smoothly and continuously decaying the learning rate without sharp step discontinuities, we allowed the dual-task loss landscape to settle into wider, more generalized local minima.
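The shape of this schedule is easy to see from its closed form. The sketch below evaluates it for the ConvNeXt learning rate of 1.5e-4. The 20-epoch horizon and a minimum learning rate of 0 are assumptions, since the original values are not stated.

```python
import math


def cosine_annealing_lr(step, total_steps, base_lr, eta_min=0.0):
    """Closed-form cosine annealing: base_lr at step 0, eta_min at total_steps."""
    cosine = math.cos(math.pi * step / total_steps)
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + cosine)


schedule = [cosine_annealing_lr(epoch, 20, 1.5e-4) for epoch in range(21)]
for epoch in (0, 5, 10, 15, 20):
    print(epoch, f"{schedule[epoch]:.2e}")
```

The rate falls slowly at first, fastest around the midpoint (where it is exactly half the base rate), and flattens out again near the end.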

Mixed Precision Training

We trained with PyTorch automatic mixed precision (AMP), using torch.cuda.amp.GradScaler to scale the loss so that FP16 gradients do not underflow.

Advantages: Computing activations and gradients in FP16 (while maintaining FP32 weights) resulted in faster training, a substantially smaller VRAM footprint that allowed larger batch sizes, and, thanks to loss scaling, numerically stable gradients.
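A single training step combining loss scaling with gradient clipping might look like the sketch below. The toy linear model and the train_step helper are stand-ins, and clipping by global norm at 1.0 is an assumption (the setup above only lists the value 1.0). On a machine without CUDA the sketch falls back to full precision.

```python
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"  # mixed precision only when a GPU is present

model = nn.Linear(8, 1).to(device)  # stand-in for the real network
optimizer = torch.optim.AdamW(model.parameters(), lr=1.5e-4, weight_decay=5e-5)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)


def train_step(x, y):
    x, y = x.to(device), y.to(device)
    optimizer.zero_grad(set_to_none=True)
    amp_dtype = torch.float16 if use_amp else torch.bfloat16
    with torch.autocast(device_type=device, dtype=amp_dtype, enabled=use_amp):
        loss = nn.functional.smooth_l1_loss(model(x).squeeze(1), y)
    scaler.scale(loss).backward()  # backward pass on the scaled loss
    scaler.unscale_(optimizer)     # restore true gradient magnitudes before clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)         # skips the update if gradients contain inf/NaN
    scaler.update()
    return loss.item()


loss_value = train_step(torch.randn(4, 8), torch.randn(4))
print(loss_value)
```

The call to unscale_ before clipping matters: without it, the clip threshold would be applied to gradients that are still multiplied by the loss scale.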

10. Evaluation Metrics & Kaggle Scoring Alignment

Because we were optimizing two tasks, evaluating the model required observing multiple dimensions simultaneously.

Age Metrics Tracked:

Root Mean Squared Error (RMSE)
Mean Absolute Error (MAE)
R²
Granular Accuracy: ±3 years, ±5 years, ±10 years.

Gender Metric Tracked:

Macro F1 Score

Composite Metric Tracked:

Harmonic Mean (Kaggle Score)
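For reference, the age metrics listed above can be computed as in this plain-Python sketch. The helper name age_metrics is hypothetical, and treating the ±k tolerance as inclusive is an assumption.

```python
import math


def age_metrics(preds, targets, tolerances=(3, 5, 10)):
    """RMSE, MAE, R^2 and within-k-years accuracy for age predictions."""
    errors = [abs(p - t) for p, t in zip(preds, targets)]
    n = len(errors)
    mean_target = sum(targets) / n
    ss_res = sum(e * e for e in errors)                    # residual sum of squares
    ss_tot = sum((t - mean_target) ** 2 for t in targets)  # total sum of squares
    metrics = {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(errors) / n,
        "r2": 1.0 - ss_res / ss_tot,
    }
    for k in tolerances:
        # Fraction of predictions within k years of the label
        metrics[f"acc_within_{k}"] = sum(e <= k for e in errors) / n
    return metrics


m = age_metrics([24, 31, 58, 7], [22, 35, 70, 6])
print(m)
```

In this toy batch the single 12-year miss pulls RMSE (about 6.42) well above MAE (4.75), which is why both are worth tracking side by side.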

The Kaggle Harmonic Score Definition

The competition defined success via a highly punitive composite formula:

Age Score = 1 - min(RMSE, 30) / 30
Final Harmonic Score = 2 * (Age Score * F1) / (Age Score + F1)

A harmonic mean aggressively penalizes imbalance. Even if a model achieved a near-perfect F1 score on gender, a poor RMSE on age would collapse the total score. This forced us to train strictly balanced models.
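The formula translates directly into a few lines of Python. The function name kaggle_score is hypothetical; the arithmetic follows the two definitions above.

```python
def kaggle_score(rmse, macro_f1, rmse_cap=30.0):
    """Harmonic mean of the capped age score and the gender macro F1."""
    age_score = 1.0 - min(rmse, rmse_cap) / rmse_cap
    if age_score + macro_f1 == 0:
        return 0.0  # avoid division by zero when both components are 0
    return 2.0 * age_score * macro_f1 / (age_score + macro_f1)


# RMSE 8.12 with macro F1 0.940
print(round(kaggle_score(8.12, 0.940), 3))  # 0.821

# Near-perfect gender F1 cannot rescue a poor age RMSE
print(kaggle_score(25.0, 0.99))
```

With an RMSE of 25 the age score is only about 0.17, and the harmonic mean stays below 0.3 even at an F1 of 0.99.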

11. Model Comparison Results & Analysis

With the same training protocol and sampling strategy for every backbone (only the learning rate differed, as listed in the optimizer setup), the backbones yielded the following results:

Model	RMSE	MAE	Macro F1	Harmonic Mean (Kaggle Score)
ConvNeXt-Tiny 🥇	8.12	5.66	0.940	0.821
EfficientNet-B0 🥈	8.64	6.12	0.918	0.802
ResNet18 🥉	9.79	7.10	0.879	0.763

Architectural Analysis: Why ConvNeXt Won

ConvNeXt-Tiny led on every metric. Its larger effective receptive field and modern convolution blocks likely helped it capture subtle spatial detail (such as fine skin wrinkles and texture) better than the older architectures. It provided the strongest gender discrimination (0.940 F1) alongside the lowest regression error (8.12 RMSE).

EfficientNet-B0 performed admirably, proving its strength as a compound-scaled architecture with unmatched parameter efficiency (only 4.8M parameters).

ResNet18, with its older and less expressive feature extractor, struggled to learn representations that served both tasks under conflicting gradients, and produced the lowest scores.

Confusion Matrix Insights

Validating the ConvNeXt gender predictions showed high true positive rates for both classes, little class confusion, and balanced F1 performance across age brackets.

12. Training Behavior & Convergence Dynamics

By tracking the epoch-over-epoch telemetry, clear training patterns emerged:

RMSE vs Epochs: Age regression took significantly longer to learn. RMSE reliably stabilized only after epoch ~12.
F1 vs Epochs: Gender classification is the easier task to learn and plateaued very early in the training cycle.
Harmonic Mean vs Epochs: Because F1 saturated quickly, the overall Harmonic Mean (Kaggle Score) essentially tracked the RMSE improvement curve.

13. Pushing Limits: Test Time Augmentation (TTA)

To squeeze maximal performance out of the final inference pipeline, we deployed Test Time Augmentation (TTA).

For every image in the 8,677-image test set, the pipeline executed two forward passes:

The Original Image
A Horizontal Flip

Aggregation Logic:

Final Age = mean(prediction_1, prediction_2)
Final Gender = mean(probability_1, probability_2) > 0.5

Benefit: This computationally cheap ensemble alternative resulted in a measurable +0.01 to +0.02 score gain on the final leaderboard.
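The aggregation logic can be sketched per image as follows. The helper names are hypothetical, and the inputs are assumed to be the raw age output and gender logit from each of the two passes. The clamp to [0, 100] reflects the submission format.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tta_aggregate(age_orig, age_flip, logit_orig, logit_flip, threshold=0.5):
    """Average the original and flipped predictions for one image."""
    age = (age_orig + age_flip) / 2.0
    age = min(max(age, 0.0), 100.0)  # clamp to the valid age range
    # Average probabilities (not logits), then threshold
    prob = (sigmoid(logit_orig) + sigmoid(logit_flip)) / 2.0
    gender = 1 if prob > threshold else 0
    return age, gender


print(tta_aggregate(34.0, 36.0, 0.4, -0.1))  # (35.0, 1)
```

In this example the flipped view alone would have predicted class 0 (its logit is negative), but the averaged probability of about 0.54 keeps the final label at 1.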

Final Submission Format

The final pipeline wrote a CSV mapping each id to a continuous age (clamped to [0, 100]) and a binary gender label.

14. Real-World Engineering Challenges Solved

Behind the elegant ML math lay brutal engineering challenges. These debugging learnings defined the project's success:

Age Scaling Mismatches: A silent killer. If training ages are scaled to [0, 1] but inference does not apply the exact inverse mapping, every prediction lands on the wrong scale and the age score collapses. Fixed inside the isolated loss function.
Early Stopping Instability: The harmonic metric oscillates wildly in early training as the dual tasks fight for gradient share. Hard-coded patience bumps were necessary.
Backbone Freezing Issues: Experimenting with freezing early layers proved disastrous for multi-task models; shared superficial edge detectors must adapt equally.
Checkpoint Prefix Errors: Training frameworks such as PyTorch Lightning save state_dict keys relative to the wrapping module, so a network stored under an attribute named model ends up with a model. prefix on every key. A custom parser was required to strip the prefix before loading the weights in plain PyTorch inference scripts.
Gender Collapse in Inference: Relying purely on raw BCE logits without carefully tuned threshold bounds can skew predictions toward the majority class silently.
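For the checkpoint prefix issue, a small key-rewriting helper is usually enough. The sketch below is an illustration and not the project's actual parser. The key names are made up, and plain integers stand in for the tensors a real state_dict would hold.

```python
def strip_prefix(state_dict, prefix="model."):
    """Return a copy of state_dict with `prefix` removed from keys that start with it."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


# Hypothetical checkpoint keys; values would be tensors in practice
checkpoint = {"model.backbone.stem.weight": 1, "model.age.bias": 2, "gender.bias": 3}
cleaned = strip_prefix(checkpoint)
print(sorted(cleaned))  # ['age.bias', 'backbone.stem.weight', 'gender.bias']
```

The cleaned dictionary can then be passed to the plain model's load_state_dict, which will report any keys that still fail to match.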

15. Summary: Final Best Model Configuration

Component	Setting
Backbone	ConvNeXt-Tiny (28.3M params)
Input Size	224x224
Optimizer & LR	AdamW @ 1.5e-4
Scheduler	CosineAnnealingLR
Loss	SmoothL1 + BCE
Augmentations	Albumentations (Heavy)
Sampling	WeightedRandomSampler
TTA	Horizontal Flip Aggregation

16. Future Improvements & Conclusion

Future Work

While the system is robust, future architectural upgrades could include:

Ordinal Age Regression: Shifting from continuous regression to treating age as a sequence of ordered categories.
Attention Modules: Encouraging the network to attend explicitly to informative facial regions and landmarks.
Label Distribution Learning: Smoothing hard age ground-truths into Gaussian distributions to reflect the visual ambiguity of human aging.
Face Alignment Preprocessing: Enforcing geometric eye-alignment crops before inference.

Research Contributions & Conclusion

This deep dive confirms that constructing a production-grade multi-task system is as much an exercise in systematic engineering as it is in dataset mathematics.

Key achievements included engineering a robust, non-distorting preprocessing pipeline, enforcing balanced loss optimization over disparate distributions, and executing a rigorous multi-backbone comparison. Ultimately, ConvNeXt-Tiny emerged as the optimal architecture for this dataset, pulling a final Kaggle harmonic score of 0.82.

This case study demonstrates that modern ConvNet architectures, paired with smart sampling and logical inference pipelines, remain exceptionally powerful tools in the contemporary AI landscape.

Connect

If you're interested in multi-task learning, model comparison research, or Kaggle optimization strategies—feel free to connect.

GitHub: devp1866
LinkedIn: devp1866
Email: devp1866@gmail.com

Author: Devkumar Patel
Domain: Deep Learning, Computer Vision, Multi-Task Learning