Machine Learning · Deep Learning · Bioinformatics

Protein Secondary Structure Prediction with BiRNN-CRF: A Practical Engineering Case Study

A research-style engineering postmortem on building, debugging, and deploying a protein secondary structure prediction system under Kaggle constraints.

December 16, 2025
Devkumar Patel
Sequence Modeling · PyTorch Lightning · CRF · Kaggle

Abstract

This post documents an end-to-end engineering effort to build a protein secondary structure prediction system under real-world constraints. The task was to predict Q8 (8-class) and Q3 (3-class) secondary structure labels for amino-acid sequences.

Rather than presenting a polished success story, this article is written as a research case study + engineering postmortem. It includes:

  • baseline failures
  • architectural tradeoffs
  • training instabilities
  • inference mismatches
  • leaderboard vs local metric gaps
  • and the exact fixes that led to a stable production submission

This is written for ML engineers, not beginners.


1. Background: Protein Structure 101

Before diving into the architecture, it's helpful to understand what we're predicting.

Proteins are long chains of amino acids (primary structure). But they don't stay as simple chains; they fold into complex 3D shapes to perform functions.

Secondary Structure serves as the intermediate step between the raw sequence and the final 3D shape. It describes the local folding patterns of the amino acid chain:

  • H (Helix): A spiral shape (Alpha Helix). Common in structural supports.
  • E (Sheet): A flat, pleated sheet (Beta Sheet). Forms rigid cores.
  • C (Coil): Flexible loops connecting helices and sheets.

Why predict this? Predicting the full 3D structure (like AlphaFold does) is computationally expensive. Secondary structure prediction is a faster, essential first step that narrows down the immense search space of protein folding.


2. Problem Definition

Given a protein sequence:

FFKGSYQKVSNQLLYQANQIQDQTGTITII...

Predict:

  • Q8: fine-grained secondary structure labels `{H, G, I, E, B, C, S, T}`
  • Q3: coarse-grained labels `{H, E, C}`

Output Format (Kaggle)

| id | sst8       | sst3       |
|----|------------|------------|
| 0  | HGHHIIG... | HHHHCCC... |

The Kaggle evaluation score was a weighted combination of Q8 and Q3 F1, making Q3 deceptively important.


3. Constraints & Platform

Platform: Kaggle

Constraints:

  • GPU time limited
  • inference-only scoring
  • no test labels
  • strict submission format
  • models must be reloadable for inference notebooks

4. Metrics: Why F1, Not Accuracy

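The label distribution in secondary structure prediction is highly imbalanced (e.g., Coil dominates), so plain accuracy can look healthy while rare states are almost never predicted.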

We used:

  • Macro F1 (Q8) → forces performance across rare states
  • Micro F1 (Q3) → reflects leaderboard sensitivity

Key insight: A model with lower Q8 but stronger Q3 can score higher overall.

This directly shaped training decisions.
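
To make the macro/micro distinction concrete, here is a minimal pure-Python sketch that computes both averages per residue. The function name `f1_scores` and the toy labels are illustrative assumptions, not the competition's official scorer, and it assumes one label per residue.

```python
from collections import Counter


def f1_scores(true_labels, pred_labels):
    """Return (macro_f1, micro_f1) for two equal-length label sequences."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label sequences must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(true_labels, pred_labels):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    classes = sorted(set(true_labels) | set(pred_labels))
    if not classes:
        return 0.0, 0.0
    # Macro: every class counts equally, however rare it is.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    # Micro: pool the counts first, so frequent classes dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Toy case: a predictor that outputs Coil for every residue.
macro, micro = f1_scores("CCCCCCCCHE", "CCCCCCCCCC")
```

In this toy case the all-Coil predictor reaches a micro F1 of 0.80 but a macro F1 of roughly 0.30, because Helix and Sheet each contribute an F1 of zero.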


5. Baseline Models

5.1 Baseline 1: Simple Embedding + BiLSTM

Architecture

Embedding -> BiLSTM -> Linear (Q8 / Q3)

Problems

  • Q8 < 0.30
  • Over-predicted Coil
  • No sequence-level consistency

Root Cause

  • Token-level independence
  • No structural constraints

5.2 Baseline 2: BiRNN + CRF (Major Breakthrough)

Architecture

Embedding
   |
BiLSTM
   |
Projection (LN + Dropout)
   |
+-----------+
|    CRF    | -> Q8
+-----------+
     |
Linear -> Q3

Why CRF?

  • Enforces legal label transitions
  • Smooths predictions across residues
  • Matches biological structure continuity
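
The smoothing effect can be illustrated with a small Viterbi decoder. This is a pure-Python sketch under simplifying assumptions (no start/end transition scores, hand-picked toy scores); it is not the CRF implementation used in the project, and the name `viterbi_decode` is hypothetical.

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring label path.

    emissions[t][j]   : score of label j at position t
    transitions[i][j] : score of moving from label i to label j
    """
    n_labels = len(emissions[0])
    score = list(emissions[0])  # best score of any path ending in each label
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for j in range(n_labels):
            best_i = max(range(n_labels), key=lambda i: score[i] + transitions[i][j])
            pointers.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
        score = new_score
        backpointers.append(pointers)
    # Walk the backpointers from the best final label to recover the path.
    path = [max(range(n_labels), key=lambda j: score[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


emissions = [[2.0, 0.0], [0.9, 1.0], [2.0, 0.0]]  # labels: 0 = H, 1 = C
transitions = [[1.0, -1.0], [-1.0, 1.0]]          # staying rewarded, switching penalized
argmax_path = [max(range(2), key=lambda j: e[j]) for e in emissions]
crf_path = viterbi_decode(emissions, transitions)
```

With these toy scores, per-residue argmax yields H C H (a single-residue Coil spike), while the transition-aware path is H H H. The middle residue's slight preference for Coil is outweighed by the cost of switching twice.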

Results

| Model       | Best Q8 F1 | Best Q3 F1 |
|-------------|------------|------------|
| BiRNN       | ~0.28      | ~0.55      |
| BiRNN + CRF | ~0.34      | ~0.69      |

This became the production model.


6. Attempted Upgrade: ESM-based Model

Hypothesis

Protein language models (ESM-2) should improve representations.

Architecture

ESM-2 (partial fine-tune)
          |
       BiLSTM
          |
CRF (Q8) + Linear (Q3)

Reality Check

| Model          | Q8   | Q3   |
|----------------|------|------|
| BiRNN-CRF      | 0.34 | 0.69 |
| ESM-BiLSTM-CRF | 0.31 | 0.68 |

Why It Failed

  • Dataset size insufficient for deep fine-tuning
  • ESM sequence trimming caused mask misalignment
  • CRF instability early in training
  • Over-regularization from frozen layers

Decision:

  • ❌ Rejected for production
  • ✅ Retained BiRNN-CRF

7. Training Strategy (Hard Lessons)

7.1 CRF Cold Start Instability

Symptom

  • Loss spikes
  • CUDA assertions
  • NaN gradients

Fix

  • Warm-up with CrossEntropy
  • Enable the CRF loss from epoch 5 onward (epochs are zero-indexed)

Epochs 0–4 -> CE loss
Epochs 5+  -> CRF loss
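
The schedule above can be written as a one-line switch. This is only a sketch: the function name `loss_for_epoch` and the `warmup_epochs` parameter are assumptions for illustration, and the real training step would call the matching loss function instead of returning a string.

```python
def loss_for_epoch(epoch, warmup_epochs=5):
    """Name of the loss to use for a zero-indexed epoch."""
    return "cross_entropy" if epoch < warmup_epochs else "crf"


# Epochs 0-4 use cross-entropy; the CRF loss takes over at epoch 5.
schedule = [loss_for_epoch(epoch) for epoch in range(7)]
```

Keeping the boundary in one place avoids off-by-one disagreements between the training loop and any logging or checkpointing code that needs to know which loss was active.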

7.2 Padding & Mask Mismatch Bugs

Failure Mode

IndexError: mask shape does not match emissions

Root Cause

  • BOS token added to labels
  • Mask not trimmed consistently

Fix

Enforced identical trimming of tokens, labels, and masks.
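
One way to enforce this is to route all three sequences through a single helper so they cannot drift apart. The sketch below uses plain lists; the helper name `trim_special_tokens`, its parameters, and the `<bos>` placeholder are hypothetical, not taken from the project code.

```python
def trim_special_tokens(tokens, labels, mask, n_prefix=1, n_suffix=0):
    """Drop the same leading/trailing positions from all three sequences."""
    if not (len(tokens) == len(labels) == len(mask)):
        raise ValueError(
            f"length mismatch: tokens={len(tokens)}, "
            f"labels={len(labels)}, mask={len(mask)}"
        )
    end = len(tokens) - n_suffix
    return tokens[n_prefix:end], labels[n_prefix:end], mask[n_prefix:end]


# One BOS position at the front of every sequence is removed in lockstep.
trimmed_tokens, trimmed_labels, trimmed_mask = trim_special_tokens(
    ["<bos>", "F", "F", "K"],
    ["<bos>", "C", "C", "H"],
    [1, 1, 1, 1],
)
```

Failing loudly on a length mismatch surfaces the bug at the point where the data is prepared, rather than later inside the CRF.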

7.3 Leaderboard ≠ Local Metrics

Observation

  • Local score: ~0.44
  • Kaggle score: ~0.40

Root Causes

  • Q3 weighted more heavily
  • Test distribution skewed
  • Boundary noise in Q8 predictions

8. Code-Level Engineering Decisions

Analyzing the production codebase reveals critical engineering choices that ensured stability and reproducibility.

8.1 Determinism is Non-Negotiable

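import pytorch_lightning as pl
import torch

pl.seed_everything(42)                      # seeds Python, NumPy, and PyTorch RNGs
torch.backends.cudnn.deterministic = True   # request deterministic cuDNN kernels
torch.backends.cudnn.benchmark = False      # disable per-shape algorithm autotuning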

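In Kaggle competitions, reproducibility is critical. With cuDNN benchmarking enabled, the backend autotunes and may select different algorithms depending on input shape; disabling it and requesting deterministic kernels removes one source of run-to-run variation. Some CUDA operations can still be non-deterministic, so these flags reduce variance rather than guarantee bit-identical results.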

8.2 Safe Checkpoint Loading

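Lightning checkpoints store the LightningModule's state_dict, so when the network is held in an attribute named model, every key carries a model. prefix. The bare PyTorch module rebuilt for inference expects keys without this prefix. We solved this dynamically during loading: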

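prefix = "model."
state_dict = {
    # Strip only a leading prefix; str.replace would also rewrite inner matches.
    (k[len(prefix):] if k.startswith(prefix) else k): v
    for k, v in checkpoint["state_dict"].items()
}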

This prevents the RuntimeError that load_state_dict raises for unexpected keys such as "model.embedding...".

8.3 Shared Bottleneck Architecture

The code implements a shared projection layer before the task-specific heads:

self.proj = nn.Sequential(
    nn.Linear(hidden_out, hidden_out),
    nn.LayerNorm(hidden_out),
    nn.Dropout(0.3)
)

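Although the layer keeps the feature width unchanged, it acts as a shared, regularized bottleneck: both heads read the same projected features, which pushes the BiLSTM toward representations that serve Q8 and Q3 alike before the heads branch out.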


9. Inference-Only Improvements (No Retraining)

Once training was frozen, only inference changes were allowed.

9.1 CRF Decoding Only (No Argmax)

Correct:

CRF.decode(emissions)

Incorrect:

argmax(logits)

9.2 Post-processing: Q8 Smoothing

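def smooth_q8(pred):
    """Replace isolated single-residue labels with the label of both neighbours."""
    pred = list(pred)  # copy: accepts strings and leaves the caller's input untouched
    for i in range(1, len(pred) - 1):
        # A lone label flanked by two identical neighbours is treated as noise.
        if pred[i - 1] == pred[i + 1] != pred[i]:
            pred[i] = pred[i - 1]
    return pred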

Effect

  • Reduced single-residue spikes
  • Improved Q8 consistency
  • +0.02 leaderboard gain

9.3 Deterministic Q8 → Q3 Mapping

H,G,I -> H
E,B   -> E
C,S,T -> C

Deriving Q3 from the decoded Q8 labels avoided relying on the error-prone separate Q3 head during inference.
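
The mapping above is a fixed character substitution, so it can be sketched with a translation table. The names `Q8_TO_Q3` and `q8_to_q3` are illustrative, and the sketch assumes predictions are held as strings over the eight Q8 letters.

```python
# Each Q8 letter in the first string maps to the Q3 letter at the same position.
Q8_TO_Q3 = str.maketrans("HGIEBCST", "HHHEECCC")


def q8_to_q3(q8):
    """Collapse a Q8 label string to Q3 (H,G,I -> H; E,B -> E; C,S,T -> C)."""
    return q8.translate(Q8_TO_Q3)


q3 = q8_to_q3("HHGTTEEC")
```

Because the output has one character per input character, the Q3 string is guaranteed to have the same length as the Q8 string.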


10. Final Production Model

Selected Model

BiRNN-CRF

Final Results (Approx.)

| Metric       | Score |
|--------------|-------|
| Q8 F1        | ~0.34 |
| Q3 F1        | ~0.69 |
| Kaggle Score | ~0.43 |

Deployment & Model Storage

  • Model checkpoint uploaded via kagglehub

Inference notebook:

  1. downloads model
  2. rebuilds architecture
  3. loads state_dict
  4. runs CRF decoding
  5. generates submission CSV

The inference notebook contains no training code.
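
The last step, writing the submission file, might look like the sketch below. The column names `id`, `sst8`, and `sst3` follow the output format shown earlier; the comma delimiter, the function name `write_submission`, and the length check are assumptions, and the example row is made up.

```python
import csv
import io


def write_submission(rows, fileobj):
    """Write (id, sst8, sst3) rows as CSV with the header id,sst8,sst3."""
    writer = csv.writer(fileobj, lineterminator="\n")
    writer.writerow(["id", "sst8", "sst3"])
    for seq_id, sst8, sst3 in rows:
        # Both label strings describe the same residues, so lengths must agree.
        if len(sst8) != len(sst3):
            raise ValueError(f"id {seq_id}: sst8 and sst3 lengths differ")
        writer.writerow([seq_id, sst8, sst3])


# Write to an in-memory buffer here; a real notebook would open a file instead.
buffer = io.StringIO()
write_submission([(0, "HGHHIIG", "HHHHHHH")], buffer)
submission_text = buffer.getvalue()
```

Validating lengths at write time is a cheap guard against the strict submission format rejecting a file after a long inference run.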


11. Failure Summary & Fixes

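| Problem             | Root Cause         | Fix                  |
|---------------------|--------------------|----------------------|
| CRF crash           | Cold start         | CE warm-up           |
| Mask errors         | BOS trimming       | Unified slicing      |
| Score mismatch      | Metric weighting   | Q3-focused inference |
| ESM underperforming | Data + constraints | Reverted model       |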

12. Key Engineering Takeaways

  1. Bigger models are not always better
  2. CRFs require careful initialization
  3. Inference logic can materially affect leaderboard score
  4. Q3 mattered more than Q8 for final ranking
  5. Debugging consumed more time than modeling

Future Work

If unconstrained:

  • Train longer with curriculum learning
  • Distill ESM representations into BiRNN
  • Add transition constraints to CRF
  • Explore ensemble decoding

Closing Thoughts

This project reinforced a core ML truth: Performance comes from systems thinking, not architectures alone.

The final gains came not from new layers, but from:

  • understanding metrics
  • respecting constraints
  • fixing silent bugs
  • and engineering the inference pipeline carefully

This is the kind of work that never shows up in papers but decides real-world outcomes.


Author: Devkumar Patel
Domain: Deep Learning · Bioinformatics · Sequence Modeling