Engineering Case Study · Machine Learning · Deep Learning · Bioinformatics

Protein Secondary Structure Prediction (BiRNN-CRF)

Building and debugging a BiRNN-CRF model for Q8/Q3 protein secondary structure prediction under Kaggle compute and inference constraints.

December 2025
Devkumar Patel
Sequence Modeling · PyTorch Lightning · CRF · Kaggle

Problem

Predict secondary structure labels for each amino acid in a protein sequence:

  • Q8 — 8 fine-grained classes: H, G, I, E, B, C, S, T
  • Q3 — 3 coarse classes: Helix (H), Sheet (E), Coil (C)

Platform: Kaggle. Constraints: limited GPU, inference-only scoring, no test labels, strict submission format.


Architecture Decision

A BiLSTM → Linear baseline scored below 0.30 on Q8 F1. It predicted each residue's label independently, so nothing enforced structural continuity, even though real secondary structure does not switch at random from residue to residue.

Fix: Added a CRF layer to enforce legal label transitions.

Embedding → BiLSTM → Projection (LN + Dropout) → CRF (Q8) + Linear (Q3)

| Model | Q8 F1 | Q3 F1 |
|---|---|---|
| BiLSTM baseline | ~0.28 | ~0.55 |
| BiRNN + CRF | ~0.34 | ~0.69 |
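
The Embedding → BiLSTM → Projection pipeline listed above can be sketched in PyTorch roughly as follows. This is a minimal illustration, not the author's code: the class name `BiLSTMEmissions` and every size (vocabulary, embedding width, hidden width, dropout rate) are assumptions. The sketch stops at the per-residue Q8 emission scores that a CRF layer would consume; the CRF itself and the Q3 linear head are left out.

```python
import torch
import torch.nn as nn


class BiLSTMEmissions(nn.Module):
    """Embedding -> BiLSTM -> LayerNorm + Dropout -> per-residue Q8 emission scores.

    All default sizes below are hypothetical placeholders, not values from the write-up.
    """

    def __init__(self, vocab_size=26, embed_dim=64, hidden_dim=128,
                 num_q8=8, dropout=0.3, pad_idx=0):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.norm = nn.LayerNorm(2 * hidden_dim)   # forward + backward states concatenated
        self.drop = nn.Dropout(dropout)
        self.to_q8 = nn.Linear(2 * hidden_dim, num_q8)

    def forward(self, tokens):
        x = self.embed(tokens)            # (batch, length, embed_dim)
        x, _ = self.lstm(x)               # (batch, length, 2 * hidden_dim)
        x = self.drop(self.norm(x))
        return self.to_q8(x)              # (batch, length, num_q8)


model = BiLSTMEmissions()
tokens = torch.randint(1, 26, (2, 50))    # 2 random sequences of 50 residue ids
emissions = model(tokens)                 # one score per Q8 class per residue
```

With a plain Linear head, these emissions would be turned into labels by a per-residue argmax; with a CRF on top, they are scored jointly with transition scores between neighbouring labels.
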

Tried ESM-2 (protein language model) — underperformed BiRNN-CRF on this dataset. Dataset was too small for the fine-tuning depth ESM-2 needed, and the sequence trimming caused mask misalignment with the CRF. Reverted.


What Broke

| Problem | Root Cause | Fix |
|---|---|---|
| CRF crashes + NaN gradients at start | CRF cold start instability | Warm-up with CrossEntropy (epochs 0–4), switch to CRF at epoch 5 |
| IndexError: mask shape mismatch | BOS token added to labels, mask not trimmed consistently | Unified slicing across tokens, labels, and masks |
| Local score 0.44, Kaggle 0.40 | Q3 weighted more in the leaderboard formula | Pivoted inference to prioritize Q3 accuracy |
| Checkpoint wouldn't load in inference notebook | Lightning saves a "model." prefix in state_dict keys | Dynamic key replacement on load |
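
Two of the fixes in the table are small enough to sketch. The helpers below are illustrative assumptions rather than the author's code: `loss_for_epoch` and `strip_prefix` are hypothetical names, and the example checkpoint keys are made up. The first picks the loss by epoch, matching the warm-up schedule described (cross-entropy for epochs 0–4, CRF from epoch 5). The second removes a leading "model." prefix from checkpoint keys so they match a bare module.

```python
def loss_for_epoch(epoch, warmup_epochs=5):
    """Name the loss to use: cross-entropy during warm-up, CRF afterwards."""
    return "cross_entropy" if epoch < warmup_epochs else "crf"


def strip_prefix(state_dict, prefix="model."):
    """Return a copy of state_dict with a leading prefix removed from each key.

    Keys that do not start with the prefix are kept unchanged.
    """
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


# Hypothetical checkpoint keys, with integers standing in for tensors.
saved = {"model.lstm.weight": 1, "model.to_q8.bias": 2}
cleaned = strip_prefix(saved)   # keys become "lstm.weight" and "to_q8.bias"
```

In an inference notebook, the cleaned dictionary would then be passed to the bare module's `load_state_dict`. A Lightning checkpoint file stores the weights under its "state_dict" entry.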

Inference-Only Improvements

Post-training, only inference changes were allowed. These gained real leaderboard points:

CRF decoding, not argmax — Easy to forget, but required for the CRF to actually enforce valid transitions.
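
To make the difference concrete, here is a minimal pure-Python Viterbi decoder, the standard way to find the best label path under a linear-chain CRF. It is a generic sketch, not the CRF library used in this project (which is not named here): the function name, the two-label setup, and all scores are invented for illustration, and start/end transition scores are omitted.

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring label path for one sequence.

    emissions[t][k]   : score of label k at position t
    transitions[j][k] : score of moving from label j to label k
    """
    num_labels = len(emissions[0])
    score = list(emissions[0])          # best score of any path ending in each label
    backpointers = []
    for emit in emissions[1:]:
        next_score, pointers = [], []
        for k in range(num_labels):
            best_prev = max(range(num_labels), key=lambda j: score[j] + transitions[j][k])
            next_score.append(score[best_prev] + transitions[best_prev][k] + emit[k])
            pointers.append(best_prev)
        score = next_score
        backpointers.append(pointers)
    # Walk back from the best final label to recover the full path.
    path = [max(range(num_labels), key=lambda k: score[k])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy example with two labels (0 and 1); switching labels costs 2.
emissions = [[2.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
transitions = [[0.0, -2.0], [-2.0, 0.0]]
argmax_path = [max(range(len(row)), key=row.__getitem__) for row in emissions]  # [0, 1, 0]
crf_path = viterbi_decode(emissions, transitions)                               # [0, 0, 0]
```

Per-residue argmax follows the middle emission and flips to label 1 for a single position. Viterbi weighs that small gain against two switching penalties and keeps the segment intact.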

Q8 smoothing — Fill isolated single-residue spikes:

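# pred is a mutable list of Q8 labels; only interior residues are checked
for i in range(1, len(pred) - 1):
    if pred[i-1] == pred[i+1] != pred[i]:
        pred[i] = pred[i-1]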

Gained +0.02 on the leaderboard.

Deterministic Q8 → Q3 mapping: H, G, I → H; E, B → E; C, S, T → C. Avoided a separate fragile Q3 head entirely.
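
The mapping above fits in a lookup table. A minimal sketch, assuming predictions are held as strings of one-letter Q8 codes; the names `Q8_TO_Q3` and `q8_to_q3` are hypothetical:

```python
# Q8 -> Q3 reduction exactly as stated above: H,G,I -> H; E,B -> E; C,S,T -> C.
Q8_TO_Q3 = {
    "H": "H", "G": "H", "I": "H",
    "E": "E", "B": "E",
    "C": "C", "S": "C", "T": "C",
}


def q8_to_q3(q8_labels):
    """Map a string (or iterable) of Q8 labels to a Q3 label string."""
    return "".join(Q8_TO_Q3[label] for label in q8_labels)


q3 = q8_to_q3("HGIEBCST")   # "HHHEECCC"
```

Because the Q3 labels are derived from the decoded Q8 path, the two outputs can never disagree with each other.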


Results

| Metric | Score |
|---|---|
| Q8 Macro F1 | ~0.34 |
| Q3 Micro F1 | ~0.69 |
| Kaggle Score | ~0.43 |
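
The two F1 variants in the table are averaged differently, which helps explain why they can sit far apart. Macro F1 averages the per-class F1 scores, so rare classes count as much as common ones. Micro F1 pools all residues, and for single-label predictions it equals plain accuracy. The sketch below is a generic pure-Python illustration with made-up labels, not the competition's scoring code; it assumes non-empty inputs of equal length.

```python
def f1_scores(true_labels, pred_labels):
    """Return (macro_f1, micro_f1) for single-label per-residue predictions."""
    classes = sorted(set(true_labels) | set(pred_labels))
    per_class = []
    total_tp = total_fp = total_fn = 0
    for c in classes:
        pairs = list(zip(true_labels, pred_labels))
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
        total_tp, total_fp, total_fn = total_tp + tp, total_fp + fp, total_fn + fn
    macro = sum(per_class) / len(per_class)
    micro = 2 * total_tp / (2 * total_tp + total_fp + total_fn)
    return macro, micro


# Toy case: the single E residue is missed, the three H residues are correct.
macro, micro = f1_scores("HHHE", "HHHH")   # macro = 3/7, micro = 0.75
```

Here the missed E class drags macro F1 down to 3/7, while micro F1 stays at 0.75 because three of the four residues are right.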

Lessons

  1. Bigger models don't always win under data and compute constraints
  2. Inference engineering mattered as much as training — +0.04 came post-training
  3. Align your training metric exactly to what the leaderboard scores
  4. CRF layers need careful initialization — don't throw them in cold

Have a Better Approach?

Protein structure prediction is a deep research field. My solution is a constrained engineering attempt, not state-of-the-art. If you work in bioinformatics, know better sequence modeling approaches, or have ideas on improving under Kaggle-style constraints, I'd genuinely like to hear from you.

Get in touch →