Protein Secondary Structure Prediction with BiRNN-CRF: A Practical Engineering Case Study

Abstract

This post documents an end-to-end engineering effort to build a protein secondary structure prediction system under real-world constraints. The task was to predict Q8 (8-class) and Q3 (3-class) secondary structure labels for amino-acid sequences.

Rather than presenting a polished success story, this article is written as a research case study and engineering postmortem. It includes:

baseline failures
architectural tradeoffs
training instabilities
inference mismatches
leaderboard vs local metric gaps
and the exact fixes that led to a stable production submission

This is written for ML engineers, not beginners.

1. Background: Protein Structure 101

Before diving into the architecture, it's helpful to understand what we're predicting.

Proteins are long chains of amino acids (primary structure). But they don't stay as simple chains; they fold into complex 3D shapes to perform functions.

Secondary Structure serves as the intermediate step between the raw sequence and the final 3D shape. It describes the local folding patterns of the amino acid chain:

H (Helix): A spiral shape (Alpha Helix). Common in structural supports.
E (Sheet): A flat, pleated sheet (Beta Sheet). Forms rigid cores.
C (Coil): Flexible loops connecting helices and sheets.

Why predict this? Predicting the full 3D structure (as AlphaFold does) is computationally expensive. Secondary structure prediction is a much cheaper, coarser task whose output constrains the search space of possible folds.

2. Problem Definition

Given a protein sequence:

FFKGSYQKVSNQLLYQANQIQDQTGTITII...

Predict:

Q8: fine-grained secondary structure labels `{H, G, I, E, B, C, S, T}`
Q3: coarse-grained labels `{H, E, C}`

Output Format (Kaggle)

id	sst8	sst3
0	HGHHIIG...	HHHHHHH...

The Kaggle evaluation score was a weighted combination of Q8 and Q3 F1, making Q3 deceptively important.

3. Constraints & Platform

Platform: Kaggle

Constraints:

GPU time limited
inference-only scoring
no test labels
strict submission format
models must be reloadable for inference notebooks

4. Metrics: Why F1, Not Accuracy

The label distribution in secondary structure prediction is highly imbalanced (e.g., Coil dominates), so plain accuracy rewards a model that over-predicts the majority class.

We used:

Macro F1 (Q8) → forces performance across rare states
Micro F1 (Q3) → reflects leaderboard sensitivity

Key insight: A model with lower Q8 but stronger Q3 can score higher overall.

This directly shaped training decisions.
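To make the difference between the two averages concrete, here is a minimal pure-Python sketch. It is not the competition's scoring code (which is not shown in this post); the function name `f1_scores` and the toy sequences are illustrative assumptions.

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for two equal-length label sequences."""
    assert len(y_true) == len(y_pred)
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # label p was predicted but wrong
            fn[t] += 1  # true label t was missed

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    labels = sorted(set(y_true) | set(y_pred))
    # Macro: average the per-class F1, so rare classes count as much as Coil.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all classes, so frequent classes dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro

# Toy example (made up): a predictor that outputs Coil everywhere.
true = "CCCCCCCCHE"
pred = "CCCCCCCCCC"
macro, micro = f1_scores(true, pred)
print(f"macro={macro:.3f} micro={micro:.3f}")
```

On this toy input the micro average stays high because Coil dominates, while the macro average collapses because the H and E classes are never predicted.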

5. Baseline Models

5.1 Baseline 1: Simple Embedding + BiLSTM

Architecture

Embedding -> BiLSTM -> Linear (Q8 / Q3)

Problems

Q8 F1 < 0.30
Over-predicted Coil
No sequence-level consistency

Root Cause

Token-level independence
No structural constraints

5.2 Baseline 2: BiRNN + CRF (Major Breakthrough)

Architecture

Embedding
   |
BiLSTM
   |
Projection (LN + Dropout)
   |
+-----------+
|    CRF    | -> Q8
+-----------+
     |
Linear -> Q3

Why CRF?

Learns transition scores that discourage implausible label sequences
Smooths predictions across residues
Matches biological structure continuity
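The smoothing effect can be illustrated with a minimal pure-Python Viterbi decoder, the standard algorithm for finding the best label path in a linear-chain CRF. This is a sketch, not the project's CRF layer: the two-label alphabet, the emission scores, and the transition scores below are made up for illustration.

```python
def viterbi_decode(emissions, transitions):
    """Most likely label path given per-position and transition scores.

    emissions:   list of length T, each a list of K per-label scores
    transitions: K x K matrix, transitions[i][j] = score of moving i -> j
    """
    n_labels = len(emissions[0])
    score = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for j in range(n_labels):
            # Best previous label for ending in label j at this position.
            best_i = max(range(n_labels), key=lambda i: score[i] + transitions[i][j])
            pointers.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final label.
    best = max(range(n_labels), key=lambda j: score[j])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    return path[::-1]

LABELS = "HC"
# Hypothetical scores: position 2 weakly prefers C inside a helix run.
emissions = [[2.0, 0.0], [2.0, 0.0], [0.9, 1.0], [2.0, 0.0], [2.0, 0.0]]
# Staying in the same state is rewarded, switching is penalised.
transitions = [[0.5, -0.5], [-0.5, 0.5]]

argmax_path = "".join(LABELS[max(range(2), key=lambda k: e[k])] for e in emissions)
crf_path = "".join(LABELS[k] for k in viterbi_decode(emissions, transitions))
print(argmax_path, crf_path)
```

Per-residue argmax keeps the isolated C, whereas the path-level decode pays two switching penalties for it and therefore prefers an unbroken helix.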

Results

Model	Best Q8 F1	Best Q3 F1
BiRNN	~0.28	~0.55
BiRNN + CRF	~0.34	~0.69

This became the production model.

6. Attempted Upgrade: ESM-based Model

Hypothesis

Protein language models (ESM-2) should improve representations.

Architecture

ESM-2 (partial fine-tune)
          |
       BiLSTM
          |
CRF (Q8) + Linear (Q3)

Reality Check

Model	Q8	Q3
BiRNN-CRF	0.34	0.69
ESM-BiLSTM-CRF	0.31	0.68

Why It Failed

Dataset size insufficient for deep fine-tuning
ESM sequence trimming caused mask misalignment
CRF instability early in training
Over-regularization from frozen layers

Decision:

❌ Rejected for production
✅ Retained BiRNN-CRF

7. Training Strategy (Hard Lessons)

7.1 CRF Cold Start Instability

Symptom

Loss spikes
CUDA assertions
NaN gradients

Fix

Warm-up with CrossEntropy
Enable CRF loss from epoch 5 onward (0-indexed)

Epochs 0–4 -> CE loss
Epochs 5+  -> CRF loss
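The schedule boils down to a single epoch check. The sketch below uses stand-in loss functions so it runs without a model; `select_loss`, `ce_loss`, and `crf_loss` are hypothetical names, and a real training step would compute cross-entropy on the emissions or the CRF negative log-likelihood instead.

```python
WARMUP_EPOCHS = 5  # epochs 0-4 use cross-entropy, epoch 5 onward uses the CRF

def select_loss(epoch, ce_loss_fn, crf_loss_fn, warmup_epochs=WARMUP_EPOCHS):
    """Return the loss callable to use for a given 0-indexed epoch."""
    return ce_loss_fn if epoch < warmup_epochs else crf_loss_fn

# Stand-ins so the schedule can be checked in isolation.
def ce_loss(batch):
    return "ce"

def crf_loss(batch):
    return "crf"

schedule = [select_loss(epoch, ce_loss, crf_loss)(None) for epoch in range(8)]
print(schedule)
```

Keeping the switch in one function makes the off-by-one explicit: the first CRF update happens in the sixth epoch.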

7.2 Padding & Mask Mismatch Bugs

Failure Mode

IndexError: mask shape does not match emissions

Root Cause

BOS token added to labels
Mask not trimmed consistently

Fix

Enforced identical trimming of tokens, labels, and masks.
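One way to enforce this is to route all three sequences through a single helper, so they can never be sliced differently. The sketch below works on plain Python lists; the name `trim_aligned`, the `<bos>`/`<pad>` placeholders, and the example data are assumptions for illustration, and batched tensors would apply the same slice along the sequence dimension.

```python
def trim_aligned(tokens, labels, mask, n_prefix=1, n_suffix=0):
    """Drop the same leading/trailing special positions from all three sequences."""
    if not (len(tokens) == len(labels) == len(mask)):
        raise ValueError(
            f"length mismatch: tokens={len(tokens)}, "
            f"labels={len(labels)}, mask={len(mask)}"
        )
    # One slice object, applied three times, is the whole point.
    keep = slice(n_prefix, len(tokens) - n_suffix)
    return tokens[keep], labels[keep], mask[keep]

# Hypothetical example: one BOS position in front, two pad positions at the end.
tokens = ["<bos>", "F", "F", "K", "<pad>", "<pad>"]
labels = ["<bos>", "C", "E", "E", "<pad>", "<pad>"]
mask = [True, True, True, True, False, False]

tokens, labels, mask = trim_aligned(tokens, labels, mask)
print(tokens, labels, mask)
```

The up-front length check turns a silent misalignment into an immediate, readable error instead of a failure deep inside the CRF.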

7.3 Leaderboard ≠ Local Metrics

Observation

Local score: ~0.44
Kaggle score: ~0.40

Root Causes

Q3 weighted more heavily
Test distribution skewed
Boundary noise in Q8 predictions

8. Code-Level Engineering Decisions

Analyzing the production codebase reveals critical engineering choices that ensured stability and reproducibility.

8.1 Determinism is Non-Negotiable

import torch
import pytorch_lightning as pl  # lightning.pytorch on newer installs

torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
pl.seed_everything(42)

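In Kaggle competitions, reproducibility is critical. With benchmarking enabled, cuDNN times several algorithms for each new input shape and picks the fastest, so the choice can differ between runs. Disabling it, together with `deterministic = True`, removes that source of run-to-run variation at some cost in speed.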

8.2 Safe Checkpoint Loading

When the network is stored as an attribute named `model` on the LightningModule, the keys in the checkpoint's `state_dict` carry a `model.` prefix. A bare PyTorch module rebuilt for inference expects keys without it, so we strip the prefix while loading:

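prefix = "model."
state_dict = {
    # Strip only a leading "model."; str.replace would also hit inner matches.
    (k[len(prefix):] if k.startswith(prefix) else k): v
    for k, v in checkpoint["state_dict"].items()
}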

This prevents the dreaded RuntimeError: Unexpected key(s) in state_dict: "model.embedding...".

8.3 Shared Bottleneck Architecture

The code implements a shared projection layer before the task-specific heads:

self.proj = nn.Sequential(
    nn.Linear(hidden_out, hidden_out),
    nn.LayerNorm(hidden_out),
    nn.Dropout(0.3)
)

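Because both heads read from this one regularized representation, the BiLSTM is pushed to learn features that serve Q8 and Q3 alike before the heads branch out. Note that the layer keeps the width at hidden_out, so it is a shared projection rather than a dimensionality-reducing bottleneck.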
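For context, a minimal sketch of how such a projection could sit between the BiLSTM and the two heads is shown below. It is not the production module: the class name `SharedTrunk`, the head names, and all sizes (`vocab_size`, `embed_dim`, `hidden`) are assumptions, and the CRF layer that would consume the Q8 emissions is omitted.

```python
import torch
import torch.nn as nn

class SharedTrunk(nn.Module):
    """Embedding -> BiLSTM -> shared projection -> Q8 emissions and Q3 logits."""

    def __init__(self, vocab_size=25, embed_dim=64, hidden=128, n_q8=8, n_q3=3):
        super().__init__()
        hidden_out = 2 * hidden  # forward + backward LSTM states
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Sequential(
            nn.Linear(hidden_out, hidden_out),
            nn.LayerNorm(hidden_out),
            nn.Dropout(0.3),
        )
        self.q8_head = nn.Linear(hidden_out, n_q8)  # emissions for the CRF
        self.q3_head = nn.Linear(hidden_out, n_q3)

    def forward(self, tokens):
        x = self.embedding(tokens)
        x, _ = self.lstm(x)
        x = self.proj(x)  # both heads read the same projected features
        return self.q8_head(x), self.q3_head(x)

model = SharedTrunk().eval()
tokens = torch.randint(1, 25, (2, 10))  # batch of 2 sequences, length 10
with torch.no_grad():
    q8_emissions, q3_logits = model(tokens)
print(q8_emissions.shape, q3_logits.shape)
```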

9. Inference-Only Improvements (No Retraining)

Once training was frozen, only inference changes were allowed.

9.1 CRF Decoding Only (No Argmax)

Correct:

CRF.decode(emissions)

Incorrect:

argmax(logits)

9.2 Post-processing: Q8 Smoothing

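def smooth_q8(pred):
    # Work on a list copy so string inputs work and the caller's data is untouched.
    out = list(pred)
    for i in range(1, len(out) - 1):
        # A residue whose two neighbours agree with each other but not with it
        # is treated as a single-residue spike and overwritten.
        if out[i - 1] == out[i + 1] != out[i]:
            out[i] = out[i - 1]
    return "".join(out) if isinstance(pred, str) else out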

Effect

Reduced single-residue spikes
Improved Q8 consistency
+0.02 leaderboard gain

9.3 Deterministic Q8 → Q3 Mapping

H,G,I -> H
E,B   -> E
C,S,T -> C

This avoided relying on the error-prone separate Q3 head during inference.
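The mapping above translates directly into a lookup table. A minimal sketch follows; the names `Q8_TO_Q3` and `q8_to_q3` are illustrative and not taken from the project code.

```python
Q8_TO_Q3 = {
    "H": "H", "G": "H", "I": "H",
    "E": "E", "B": "E",
    "C": "C", "S": "C", "T": "C",
}

def q8_to_q3(sst8):
    """Collapse a Q8 label string to Q3; raises KeyError on an unknown label."""
    return "".join(Q8_TO_Q3[label] for label in sst8)

q3 = q8_to_q3("HGHHIIGEBCST")
print(q3)
```

Because the Q3 string is derived from the decoded Q8 string, the two outputs always have the same length and can never disagree about which coarse class a residue belongs to.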

10. Final Production Model

Selected Model

BiRNN-CRF

Final Results (Approx.)

Metric	Score
Q8 F1	~0.34
Q3 F1	~0.69
Kaggle Score	~0.43

Deployment & Model Storage

Model checkpoint uploaded via kagglehub

Inference notebook:

downloads model
rebuilds architecture
loads state_dict
runs CRF decoding
generates submission CSV

The inference notebook contains no training code.
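The last step of that pipeline, writing the submission file, can be sketched with the standard library alone. The column names come from the output format shown earlier; the function name `write_submission`, the comma delimiter, and the sample predictions are assumptions, so the delimiter should be checked against the competition's sample submission.

```python
import csv
import io

def write_submission(rows, handle, delimiter=","):
    """Write (id, sst8, sst3) rows under an id/sst8/sst3 header."""
    writer = csv.writer(handle, delimiter=delimiter, lineterminator="\n")
    writer.writerow(["id", "sst8", "sst3"])
    for seq_id, sst8, sst3 in rows:
        # Both label strings must cover every residue of the sequence.
        if len(sst8) != len(sst3):
            raise ValueError(f"id {seq_id}: sst8 and sst3 lengths differ")
        writer.writerow([seq_id, sst8, sst3])

# In-memory demo with made-up predictions; a real run would pass an open file.
buffer = io.StringIO()
write_submission([(0, "HHHGGC", "HHHHHC"), (1, "EEBC", "EEEC")], buffer)
submission_text = buffer.getvalue()
print(submission_text)
```

The length check is a cheap guard against submitting rows whose two label strings have drifted apart.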

11. Failure Summary & Fixes

Problem	Root Cause	Fix
CRF crash	Cold start	CE warm-up
Mask errors	BOS trimming	Unified slicing
Score mismatch	Metric weighting	Q3-focused inference
ESM underperforming	Data + constraints	Reverted model

12. Key Engineering Takeaways

Bigger models are not always better
CRFs require careful initialization
Inference logic can materially affect leaderboard score
Q3 mattered more than Q8 for final ranking
Debugging consumed more time than modeling

Future Work

If unconstrained:

Train longer with curriculum learning
Distill ESM representations into BiRNN
Add explicit (hard) transition constraints to the CRF
Explore ensemble decoding

Closing Thoughts

This project reinforced a core ML truth: Performance comes from systems thinking, not architectures alone.

The final gains came not from new layers, but from:

understanding metrics
respecting constraints
fixing silent bugs
and engineering the inference pipeline carefully

This is the kind of work that never shows up in papers — but decides real-world outcomes.

References

Kaggle Competition: Sep 25 DL Gen AI NPPE 2
PyTorch Lightning: Documentation

Connect

Email: devp1866@gmail.com

Author: Devkumar Patel
Domain: Deep Learning · Bioinformatics · Sequence Modeling