Protein Secondary Structure Prediction (BiRNN-CRF)

Problem

Predict a secondary structure label for each amino acid (residue) in a protein sequence, at two granularities:

Q8 — 8 fine-grained classes: H, G, I, E, B, C, S, T
Q3 — 3 coarse classes: Helix (H), Sheet (E), Coil (C)

Platform: Kaggle. Constraints: limited GPU, inference-only scoring, no test labels, strict submission format.

Architecture Decision

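The baseline (BiLSTM → Linear) scored below 0.30 on Q8. It predicted each residue's label independently, so nothing encouraged structural continuity. Real secondary structure forms contiguous segments; it doesn't switch at random from one residue to the next.

Fix: added a CRF layer on top, which learns label-to-label transition scores and penalizes implausible transitions.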

Embedding → BiLSTM → Projection (LN + Dropout) → CRF (Q8) + Linear (Q3)

Model	Q8 F1	Q3 F1
BiLSTM baseline	~0.28	~0.55
BiRNN + CRF	~0.34	~0.69

Also tried ESM-2 (a protein language model), but it underperformed the BiRNN-CRF on this dataset. The dataset was too small for the depth of fine-tuning ESM-2 needed, and the sequence trimming misaligned the masks passed to the CRF. Reverted.

What Broke

Problem	Root Cause	Fix
CRF crashes + NaN gradients at start	CRF cold start instability	Warm-up with CrossEntropy (epochs 0–4), switch to CRF at epoch 5
`IndexError: mask shape mismatch`	BOS token added to labels, mask not trimmed consistently	Unified slicing across tokens, labels, and masks
Local score 0.44, Kaggle 0.40	Q3 weighted more in the leaderboard formula	Pivoted inference to prioritize Q3 accuracy
Checkpoint wouldn't load in inference notebook	Lightning saves `model.` prefix in `state_dict` keys	Dynamic key replacement on load
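
The checkpoint fix in the last row can be sketched as a key rewrite applied before loading. This is a minimal illustration, not the original notebook code: the helper name `strip_prefix` and the example key names are hypothetical, and plain integers stand in for the tensors a real `state_dict` would hold.

```python
def strip_prefix(state_dict, prefix="model."):
    """Return a copy of state_dict with `prefix` removed from keys that start with it."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


# Stand-in for the state_dict stored in a checkpoint (hypothetical key names;
# real values would be tensors, not integers).
lightning_keys = {
    "model.embedding.weight": 1,
    "model.crf.transitions": 2,
    "head.bias": 3,  # keys without the prefix pass through unchanged
}

cleaned = strip_prefix(lightning_keys)
print(sorted(cleaned))
```

The cleaned dictionary is what would then be handed to `load_state_dict` on the bare model in the inference notebook.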

Inference-Only Improvements

Post-training, only inference changes were allowed. These gained real leaderboard points:

CRF decoding, not argmax — Easy to forget, but required for the CRF to actually enforce valid transitions.
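
To illustrate why the decoding step matters, here is a small pure-Python sketch of Viterbi decoding next to per-token argmax. It assumes a simplified linear-chain CRF with only emission and transition scores (no start/end scores, no masking), and the toy numbers are invented for the example rather than taken from the trained model.

```python
def argmax_decode(emissions):
    """Pick the highest-scoring label at each position independently."""
    return [max(range(len(row)), key=row.__getitem__) for row in emissions]


def viterbi_decode(emissions, transitions):
    """Return the label path maximizing emission + transition scores.

    emissions[t][k]   : score of label k at position t
    transitions[i][j] : score of moving from label i to label j
    """
    n_labels = len(emissions[0])
    score = list(emissions[0])
    backpointers = []
    for row in emissions[1:]:
        new_score, pointers = [], []
        for j in range(n_labels):
            # Best previous label for ending at label j on this step.
            best_prev = max(range(n_labels), key=lambda i: score[i] + transitions[i][j])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][j] + row[j])
        score = new_score
        backpointers.append(pointers)

    # Walk the backpointers from the best final label to recover the path.
    best = max(range(n_labels), key=score.__getitem__)
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return path


# Toy case: the middle position slightly prefers label 1,
# but the transition scores reward staying in the same label.
emissions = [[2.0, 0.0], [0.9, 1.0], [2.0, 0.0]]
transitions = [[0.5, -1.0], [-1.0, 0.5]]

print(argmax_decode(emissions))                # [0, 1, 0]
print(viterbi_decode(emissions, transitions))  # [0, 0, 0]
```

Argmax ignores the transition scores entirely, so the single-residue flip survives. Viterbi weighs it against the cost of switching labels twice.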

Q8 smoothing — Fill isolated single-residue spikes:

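for i in range(1, len(pred) - 1):
    # Overwrite a lone label when both neighbours agree with each other.
    if pred[i - 1] == pred[i + 1] != pred[i]:
        pred[i] = pred[i - 1]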

Gained +0.02 on the leaderboard.

Deterministic Q8 → Q3 mapping — H,G,I → H, E,B → E, C,S,T → C. Avoided a separate fragile Q3 head entirely.
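
The mapping above fits in a lookup table. The sketch below assumes predictions are handled as strings of one-letter Q8 codes; the names `Q8_TO_Q3` and `q8_to_q3` are illustrative.

```python
# Q8 -> Q3 reduction as stated above: H,G,I -> H; E,B -> E; C,S,T -> C.
Q8_TO_Q3 = {
    "H": "H", "G": "H", "I": "H",
    "E": "E", "B": "E",
    "C": "C", "S": "C", "T": "C",
}


def q8_to_q3(q8_labels):
    """Map a string of Q8 labels to the corresponding Q3 string."""
    return "".join(Q8_TO_Q3[label] for label in q8_labels)


print(q8_to_q3("CHHGTEEBSC"))  # CHHHCEEECC
```

Because Q3 is derived from Q8, the two outputs can never contradict each other, which a separately trained Q3 head does not guarantee.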

Results

Metric	Score
Q8 Macro F1	~0.34
Q3 Micro F1	~0.69
Kaggle Score	~0.43

Lessons

Bigger models don't always win under data and compute constraints
Inference engineering mattered as much as training — +0.04 came post-training
Align your training metric exactly to what the leaderboard scores
CRF layers need careful initialization — don't throw them in cold
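
As a sketch of the warm-up behind that last lesson: the loss is chosen per epoch, with cross-entropy for epochs 0–4 and the CRF loss from epoch 5 onward. The function name `loss_mode` and the string labels are illustrative; in a real training loop the returned value would select which loss function to call.

```python
def loss_mode(epoch, warmup_epochs=5):
    """Return which loss to use for a 0-indexed epoch."""
    return "cross_entropy" if epoch < warmup_epochs else "crf"


# Epochs 0-4 warm up with cross-entropy; epoch 5 onward uses the CRF loss.
schedule = [loss_mode(epoch) for epoch in range(7)]
print(schedule)
```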

Have a Better Approach?

Protein structure prediction is a deep research field. My solution is a constrained engineering attempt, not state-of-the-art. If you work in bioinformatics, know better sequence modeling approaches, or have ideas on improving under Kaggle-style constraints, I'd genuinely like to hear from you.
