Protein Secondary Structure Prediction (BiRNN-CRF)
Building and debugging a BiRNN-CRF model for Q8/Q3 protein secondary structure prediction under Kaggle compute and inference constraints.

Problem
Predict secondary structure labels for each amino acid in a protein sequence:
- Q8 — 8 fine-grained classes: H, G, I, E, B, C, S, T
- Q3 — 3 coarse classes: Helix (H), Sheet (E), Coil (C)
Platform: Kaggle. Constraints: limited GPU, inference-only scoring, no test labels, strict submission format.
Architecture Decision
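The baseline (BiLSTM → Linear) scored below 0.30 on Q8. Each residue was classified independently, so nothing encouraged structural continuity, even though real secondary structure elements span runs of residues rather than switching at every position.
Fix: Added a CRF layer, which scores label transitions jointly with the per-residue emissions and penalizes implausible ones.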
Embedding → BiLSTM → Projection (LayerNorm + Dropout) → CRF (Q8) + Linear (Q3)
| Model | Q8 F1 | Q3 F1 |
|---|---|---|
| BiLSTM baseline | ~0.28 | ~0.55 |
| BiRNN + CRF | ~0.34 | ~0.69 |
I also tried ESM-2 (a protein language model), but it underperformed BiRNN-CRF on this dataset. The dataset was too small for the fine-tuning depth ESM-2 needed, and sequence trimming left the masks misaligned with the CRF inputs, so I reverted.
What Broke
| Problem | Root Cause | Fix |
|---|---|---|
| CRF crashes + NaN gradients at start | CRF cold start instability | Warm-up with CrossEntropy (epochs 0–4), switch to CRF at epoch 5 |
| IndexError: mask shape mismatch | BOS token added to labels, mask not trimmed consistently | Unified slicing across tokens, labels, and masks |
| Local score 0.44, Kaggle 0.40 | Q3 weighted more in the leaderboard formula | Pivoted inference to prioritize Q3 accuracy |
| Checkpoint wouldn't load in inference notebook | Lightning prefixes state_dict keys with "model." | Dynamic key replacement on load |
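Two of the fixes in the table can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the helper names `strip_key_prefix` and `drop_leading_special`, the example key names, and the assumption that exactly one BOS position is prepended are all hypothetical.

```python
from typing import Dict, Sequence, Tuple


def strip_key_prefix(state_dict: Dict[str, object], prefix: str = "model.") -> Dict[str, object]:
    """Return a copy of state_dict with a leading prefix removed from each key.

    Keys that do not start with the prefix are kept unchanged.
    """
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


def drop_leading_special(
    tokens: Sequence, labels: Sequence, mask: Sequence, n_special: int = 1
) -> Tuple[Sequence, Sequence, Sequence]:
    """Apply the same slice to tokens, labels, and mask so they stay aligned.

    Assumes n_special special positions (e.g. one BOS) sit at the front of
    all three sequences; that layout is an assumption of this sketch.
    """
    if not (len(tokens) == len(labels) == len(mask)):
        raise ValueError("tokens, labels, and mask must have equal length before slicing")
    return tokens[n_special:], labels[n_special:], mask[n_special:]


# Example with made-up key names.
saved = {"model.embedding.weight": 1, "model.crf.transitions": 2}
print(strip_key_prefix(saved))
# {'embedding.weight': 1, 'crf.transitions': 2}
```

Slicing all three sequences in one helper makes it harder for a later change to trim one of them and forget the others, which is the failure the IndexError row describes.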
Inference-Only Improvements
Post-training, only inference changes were allowed. These gained real leaderboard points:
CRF decoding, not argmax — Easy to forget, but required for the CRF to actually enforce valid transitions.
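To show why the decoding choice matters, here is a small pure-Python Viterbi sketch compared with per-position argmax. It is a simplified stand-in for a CRF library's decode routine: it uses only emission and transition scores (no start/end scores or masking), and the toy scores below are invented for illustration.

```python
from typing import List, Sequence


def viterbi_decode(
    emissions: Sequence[Sequence[float]],
    transitions: Sequence[Sequence[float]],
) -> List[int]:
    """Return the highest-scoring label path for one sequence.

    emissions[t][k]   : score of label k at position t
    transitions[j][k] : score of moving from label j to label k
    """
    if not emissions:
        return []
    n_labels = len(transitions)
    score = list(emissions[0])  # best path score ending in each label
    backpointers = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for k in range(n_labels):
            best_j = max(range(n_labels), key=lambda j: score[j] + transitions[j][k])
            pointers.append(best_j)
            new_score.append(score[best_j] + transitions[best_j][k] + emissions[t][k])
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final label.
    best = max(range(n_labels), key=lambda k: score[k])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return path


def argmax_decode(emissions: Sequence[Sequence[float]]) -> List[int]:
    """Pick the best label at each position independently."""
    return [max(range(len(row)), key=lambda k: row[k]) for row in emissions]


# Toy example: two labels, switching labels costs 2.
toy_emissions = [[2.0, 0.0], [2.0, 0.0], [0.0, 0.5], [2.0, 0.0]]
toy_transitions = [[0.0, -2.0], [-2.0, 0.0]]
print(argmax_decode(toy_emissions))                     # [0, 0, 1, 0]
print(viterbi_decode(toy_emissions, toy_transitions))   # [0, 0, 0, 0]
```

In the toy case argmax flips to label 1 for a single position because of a slightly higher emission score, while Viterbi keeps the run intact because the two switches cost more than that position gains.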
Q8 smoothing — Fill isolated single-residue spikes:

    if pred[i-1] == pred[i+1] != pred[i]:
        pred[i] = pred[i-1]

Gained +0.02 on the leaderboard.
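The rule above can be wrapped in a small function. This is a sketch under assumptions the source does not spell out: a single left-to-right sweep, the first and last residues left untouched, and edits applied to a copy so earlier fixes are visible to later positions. The name `smooth_isolated` is hypothetical.

```python
from typing import List, Sequence


def smooth_isolated(pred: Sequence[str]) -> List[str]:
    """Replace single-residue spikes with the label of their two neighbours.

    A position is a spike when both neighbours agree with each other
    but differ from the position itself.
    """
    out = list(pred)
    for i in range(1, len(out) - 1):
        if out[i - 1] == out[i + 1] != out[i]:
            out[i] = out[i - 1]
    return out


print("".join(smooth_isolated("HHHEHHHCC")))  # HHHHHHHCC
```

Runs of two or more residues are left alone, so a short but genuine segment such as `EE` inside a helix survives the pass.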
Deterministic Q8 → Q3 mapping — H,G,I → H, E,B → E, C,S,T → C. Avoided a separate fragile Q3 head entirely.
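The mapping is small enough to write as a lookup table. A minimal sketch, using the label letters listed in the Problem section; the names `Q8_TO_Q3` and `q8_to_q3` are illustrative.

```python
from typing import Iterable, List

# Q8 -> Q3 reduction as described above: H,G,I -> H; E,B -> E; C,S,T -> C.
Q8_TO_Q3 = {
    "H": "H", "G": "H", "I": "H",
    "E": "E", "B": "E",
    "C": "C", "S": "C", "T": "C",
}


def q8_to_q3(q8_labels: Iterable[str]) -> List[str]:
    """Map a sequence of Q8 labels to Q3 labels; raises KeyError on unknown labels."""
    return [Q8_TO_Q3[label] for label in q8_labels]


print("".join(q8_to_q3("HGIEBCST")))  # HHHEECCC
```

Failing loudly on an unknown label is a deliberate choice here: a silent default would hide vocabulary mismatches between training and inference.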
Results
| Metric | Score |
|---|---|
| Q8 Macro F1 | ~0.34 |
| Q3 Micro F1 | ~0.69 |
| Kaggle Score | ~0.43 |
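For readers unfamiliar with the two averaging modes in the table, here is a dependency-free sketch of macro and micro F1 over per-residue labels. It does not reproduce the Kaggle scoring formula, which is not given here; it only illustrates how the two averages differ. The function name `f1_scores` is hypothetical.

```python
from collections import Counter
from typing import Sequence, Tuple


def f1_scores(y_true: Sequence[str], y_pred: Sequence[str]) -> Tuple[float, float]:
    """Return (macro_f1, micro_f1) for single-label, per-residue predictions."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[t] += 1  # true label t was missed

    def f1(tp_: int, fp_: int, fn_: int) -> float:
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    labels = sorted(set(y_true) | set(y_pred))
    # Macro: average the per-class F1, so rare classes count as much as common ones.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts first; for single-label data this equals accuracy.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


macro, micro = f1_scores("HHEC", "HHCC")
print(round(macro, 3), round(micro, 3))  # 0.556 0.75
```

Macro F1 punishes a model that ignores rare classes, which is one plausible reason the Q8 macro score sits well below the Q3 micro score.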
Lessons
- Bigger models don't always win under data and compute constraints
- Inference engineering mattered as much as training — +0.04 came post-training
- Align your training metric exactly to what the leaderboard scores
- CRF layers need careful initialization — don't throw them in cold
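The warm-up described in the What Broke table (cross-entropy for epochs 0–4, CRF from epoch 5) amounts to a small schedule function. This is a framework-agnostic sketch; the name `loss_mode` and the returned strings are hypothetical, and the actual loss computation is left to the training loop.

```python
def loss_mode(epoch: int, warmup_epochs: int = 5) -> str:
    """Return which loss to use for a zero-indexed epoch.

    Epochs 0 .. warmup_epochs-1 use token-level cross-entropy;
    later epochs use the CRF negative log-likelihood.
    """
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    return "cross_entropy" if epoch < warmup_epochs else "crf"


for epoch in range(7):
    print(epoch, loss_mode(epoch))
```

Training the emission layers with cross-entropy first means the CRF starts from reasonable per-residue scores instead of near-random ones, which is the cold-start problem this lesson refers to.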
Have a Better Approach?
Protein structure prediction is a deep research field. My solution is a constrained engineering attempt, not state-of-the-art. If you work in bioinformatics, know better sequence modeling approaches, or have ideas on improving under Kaggle-style constraints, I'd genuinely like to hear from you.