Protein Secondary Structure Prediction with BiRNN-CRF: A Practical Engineering Case Study
A research-style engineering postmortem on building, debugging, and deploying a protein secondary structure prediction system under Kaggle constraints.

Abstract
This post documents an end-to-end engineering effort to build a protein secondary structure prediction system under real-world constraints. The task was to predict Q8 (8-class) and Q3 (3-class) secondary structure labels for amino-acid sequences.
Rather than presenting a polished success story, this article is written as a research case study + engineering postmortem. It includes:
- baseline failures
- architectural tradeoffs
- training instabilities
- inference mismatches
- leaderboard vs local metric gaps
- and the exact fixes that led to a stable production submission
This is written for ML engineers, not beginners.
1. Background: Protein Structure 101
Before diving into the architecture, it's helpful to understand what we're predicting.
Proteins are long chains of amino acids (primary structure). But they don't stay as simple chains; they fold into complex 3D shapes to perform functions.
Secondary Structure serves as the intermediate step between the raw sequence and the final 3D shape. It describes the local folding patterns of the amino acid chain:
- H (Helix): A spiral shape (Alpha Helix). Common in structural supports.
- E (Sheet): A flat, pleated sheet (Beta Sheet). Forms rigid cores.
- C (Coil): Flexible loops connecting helices and sheets.
Why predict this? Predicting the full 3D structure (like AlphaFold does) is computationally expensive. Secondary structure prediction is a faster, essential first step that narrows down the immense search space of protein folding.
2. Problem Definition
Given a protein sequence:
FFKGSYQKVSNQLLYQANQIQDQTGTITII...
Predict:
- Q8: fine-grained secondary structure labels `{H, G, I, E, B, C, S, T}`
- Q3: coarse-grained labels `{H, E, C}`
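Before a sequence can reach an embedding layer, residues and labels have to be mapped to integer ids. The sketch below shows one way to do that. The vocabulary order, the `PAD_ID` and `UNK_ID` conventions, and the function names are assumptions for illustration; the post does not list the mappings it actually used.

```python
# Hypothetical vocabularies: the original post does not specify its mappings.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
Q8_LABELS = "HGIEBCST"                # label order is arbitrary here

PAD_ID = 0  # assumed id reserved for padding
UNK_ID = 1  # assumed id for non-standard residues such as X

AA_TO_ID = {aa: i + 2 for i, aa in enumerate(AMINO_ACIDS)}
Q8_TO_ID = {label: i for i, label in enumerate(Q8_LABELS)}


def encode_sequence(seq):
    """Map each residue to an integer id, falling back to UNK_ID."""
    return [AA_TO_ID.get(aa, UNK_ID) for aa in seq]


def encode_labels(labels):
    """Map each Q8 label to an integer id; unknown labels raise KeyError."""
    return [Q8_TO_ID[label] for label in labels]


ids = encode_sequence("FFKGSYQKVS")  # first ten residues of the example above
```

Raising on an unknown label while tolerating an unknown residue is a deliberate asymmetry in this sketch: a bad label usually signals a data bug, whereas unusual residues do occur in real sequences.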
Output Format (Kaggle)
| id | sst8 | sst3 |
|---|---|---|
| 0 | HGHHIIG... | HHHHHHH... |
The Kaggle evaluation score was a weighted combination of Q8 and Q3 F1, making Q3 deceptively important.
3. Constraints & Platform
Platform: Kaggle
Constraints:
- GPU time limited
- inference-only scoring
- no test labels
- strict submission format
- models must be reloadable for inference notebooks
4. Metrics: Why F1, Not Accuracy
Secondary structure labels are highly imbalanced (e.g., Coil dominates), so plain accuracy rewards a model that mostly predicts the majority class.
We used:
- Macro F1 (Q8) → forces performance across rare states
- Micro F1 (Q3) → reflects leaderboard sensitivity
Key insight: A model with lower Q8 but stronger Q3 can score higher overall.
This directly shaped training decisions.
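To make the macro/micro distinction concrete, here is a small pure-Python sketch that computes both from per-residue labels. The function name and the toy data are made up for illustration, and this is not the competition's scoring code.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for two equal-length label sequences."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p where it was wrong
            fn[t] += 1  # missed the true label t

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro: average the per-class F1 values, so rare classes count equally.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    # Micro: pool the counts over all classes before computing F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# A model that predicts Coil everywhere on a Coil-heavy toy sequence.
y_true = list("CCCCCCCCHE")
y_pred = list("CCCCCCCCCC")
macro, micro = f1_scores(y_true, y_pred)  # macro is about 0.30, micro is 0.80
```

The all-Coil predictor looks respectable under micro F1 and poor under macro F1, which is why the macro variant is the stricter check on rare states.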
5. Baseline Models
5.1 Baseline 1: Simple Embedding + BiLSTM
Architecture
Embedding -> BiLSTM -> Linear (Q8 / Q3)
Problems
- Q8 < 0.30
- Over-predicted Coil
- No sequence-level consistency
Root Cause
- Token-level independence
- No structural constraints
5.2 Baseline 2: BiRNN + CRF (Major Breakthrough)
Architecture
Embedding
|
BiLSTM
|
Projection (LN + Dropout)
|
+-----------+
| CRF | -> Q8
+-----------+
|
Linear -> Q3
Why CRF?
- Learns transition scores that discourage implausible label transitions
- Smooths predictions across residues
- Matches biological structure continuity
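For reference, the usual linear-chain CRF formulation is sketched below. The notation is an assumption, since the post does not state its exact parameterization: $e_t(y_t)$ is the emission score produced by the network for label $y_t$ at position $t$, $T$ is a learned transition matrix, and $L$ is the sequence length.

```latex
s(x, y) = \sum_{t=1}^{L} e_t(y_t) + \sum_{t=2}^{L} T_{y_{t-1},\, y_t},
\qquad
p(y \mid x) = \frac{\exp s(x, y)}{\sum_{y'} \exp s(x, y')},
\qquad
\mathcal{L} = -\log p(y \mid x)
```

The transition term is what couples neighbouring residues: a label sequence is scored as a whole rather than position by position.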
Results
| Model | Best Q8 F1 | Best Q3 F1 |
|---|---|---|
| BiRNN | ~0.28 | ~0.55 |
| BiRNN + CRF | ~0.34 | ~0.69 |
This became the production model.
6. Attempted Upgrade: ESM-based Model
Hypothesis
Protein language models (ESM-2) should improve representations.
Architecture
ESM-2 (partial fine-tune)
|
BiLSTM
|
CRF (Q8) + Linear (Q3)
Reality Check
| Model | Q8 | Q3 |
|---|---|---|
| BiRNN-CRF | 0.34 | 0.69 |
| ESM-BiLSTM-CRF | 0.31 | 0.68 |
Why It Failed
- Dataset size insufficient for deep fine-tuning
- ESM sequence trimming caused mask misalignment
- CRF instability early in training
- Over-regularization from frozen layers
Decision:
- ❌ Rejected for production
- ✅ Retained BiRNN-CRF
7. Training Strategy (Hard Lessons)
7.1 CRF Cold Start Instability
Symptom
- Loss spikes
- CUDA assertions
- NaN gradients
Fix
- Warm-up with CrossEntropy
- Enable the CRF loss from epoch 5 onward (epochs are zero-indexed)
Epochs 0–4 -> CE loss
Epochs 5+ -> CRF loss
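The schedule above can be expressed as a small helper that a training loop consults each epoch. This is a minimal sketch: the function name and the string return values are hypothetical, and only the five-epoch boundary comes from the post.

```python
WARMUP_EPOCHS = 5  # from the schedule above: epochs 0-4 use CE, 5+ use the CRF


def loss_mode(epoch, warmup_epochs=WARMUP_EPOCHS):
    """Return which loss to use for a zero-indexed epoch."""
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    return "ce" if epoch < warmup_epochs else "crf"


schedule = [loss_mode(e) for e in range(7)]
# ['ce', 'ce', 'ce', 'ce', 'ce', 'crf', 'crf']
```

In a training step this would typically gate a branch between a token-level cross-entropy on the emissions and the CRF negative log-likelihood.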
7.2 Padding & Mask Mismatch Bugs
Failure Mode
IndexError: mask shape does not match emissions
Root Cause
- BOS token added to labels
- Mask not trimmed consistently
Fix
Enforced identical trimming: tokens, labels, masks.
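One way to enforce this is to route every trim through a single helper that applies the same slice to all three sequences and refuses mismatched lengths. The sketch below assumes the tokens, labels, and mask all carry the extra BOS position at index 0; the helper name and the placeholder strings are hypothetical.

```python
def trim_aligned(tokens, labels, mask, start=0, end=None):
    """Apply one slice to tokens, labels, and mask after checking they align."""
    if not (len(tokens) == len(labels) == len(mask)):
        raise ValueError(
            "length mismatch before trimming: "
            f"{len(tokens)} tokens, {len(labels)} labels, {len(mask)} mask"
        )
    window = slice(start, end)
    return tokens[window], labels[window], mask[window]


# Assumed layout: a BOS position at index 0 and one padded position at the end.
tokens = ["<bos>", "F", "F", "K", "<pad>"]
labels = ["<bos>", "C", "C", "H", "<pad>"]
mask = [1, 1, 1, 1, 0]

trimmed_tokens, trimmed_labels, trimmed_mask = trim_aligned(
    tokens, labels, mask, start=1  # drop the BOS position everywhere at once
)
```

Failing loudly before slicing turns a shape error deep inside the CRF into an immediate, readable exception at the data boundary.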
7.3 Leaderboard ≠ Local Metrics
Observation
- Local score: ~0.44
- Kaggle score: ~0.40
Root Causes
- Q3 weighted more heavily
- Test distribution skewed
- Boundary noise in Q8 predictions
8. Code-Level Engineering Decisions
Analyzing the production codebase reveals critical engineering choices that ensured stability and reproducibility.
8.1 Determinism is Non-Negotiable
import torch
import pytorch_lightning as pl

torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
pl.seed_everything(42)
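In Kaggle competitions, reproducibility is critical. With benchmarking enabled, cuDNN auto-tunes and may select different kernels depending on input shape; disabling it and requesting deterministic algorithms removes that source of run-to-run variation, although some GPU operations can still be non-deterministic.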
8.2 Safe Checkpoint Loading
When the network is stored as an attribute named `model` on the LightningModule, PyTorch Lightning checkpoints save its weights under keys prefixed with `model.`. A bare PyTorch module expects the same keys without that prefix.
We solved this dynamically during loading:
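# Strip only a leading prefix; str.replace would also remove "model." inside a key.
prefix = "model."
state_dict = {
    (k[len(prefix):] if k.startswith(prefix) else k): v
    for k, v in checkpoint["state_dict"].items()
}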
This prevents the dreaded RuntimeError: Unexpected key(s) in state_dict: "model.embedding...".
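A cheap sanity check before loading is to compare the key sets directly, which reports the same two categories that a strict load complains about. The sketch below is illustrative only: the helper names and the key names are hypothetical, and it works on plain lists of strings so it can run without a checkpoint.

```python
def strip_leading(key, prefix="model."):
    """Remove the prefix only when the key actually starts with it."""
    return key[len(prefix):] if key.startswith(prefix) else key


def compare_keys(model_keys, checkpoint_keys):
    """Return (missing, unexpected) relative to what the model expects."""
    expected, provided = set(model_keys), set(checkpoint_keys)
    return sorted(expected - provided), sorted(provided - expected)


# Hypothetical key names for illustration.
model_keys = ["embedding.weight", "proj.0.weight", "proj.0.bias"]
checkpoint_keys = [
    "model.embedding.weight",
    "model.proj.0.weight",
    "model.proj.0.bias",
]

before = compare_keys(model_keys, checkpoint_keys)
after = compare_keys(model_keys, [strip_leading(k) for k in checkpoint_keys])
# before lists three missing and three unexpected keys; after is ([], [])
```

With a real model, the two inputs would be the keys of the module's own state dict and the keys found in the checkpoint.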
8.3 Shared Bottleneck Architecture
The code implements a shared projection layer before the task-specific heads:
self.proj = nn.Sequential(
    nn.Linear(hidden_out, hidden_out),
    nn.LayerNorm(hidden_out),
    nn.Dropout(0.3),
)
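This gives both heads a shared, regularized representation. The projection keeps the width at hidden_out, so the constraint comes from LayerNorm and dropout rather than from a reduced dimension, and it pushes the BiLSTM toward features that serve both Q8 and Q3 before the heads branch.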
9. Inference-Only Improvements (No Retraining)
Once training was frozen, only inference changes were allowed.
9.1 CRF Decoding Only (No Argmax)
Correct:
CRF.decode(emissions)
Incorrect:
argmax(logits)
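The difference is easiest to see on a toy example. Below is a minimal pure-Python Viterbi decoder next to a per-position argmax. It is a simplified sketch, not the CRF implementation used in the project: it has no start/end transitions and no padding mask, and the scores are made up.

```python
def viterbi_decode(emissions, transitions):
    """Best label path given emissions[t][k] and transitions[i][j] (i -> j)."""
    num_labels = len(transitions)
    scores = list(emissions[0])  # best score of any path ending in each label
    backpointers = []
    for emit in emissions[1:]:
        new_scores, pointers = [], []
        for j in range(num_labels):
            # Best previous label for arriving at label j.
            best_i = max(
                range(num_labels), key=lambda i: scores[i] + transitions[i][j]
            )
            pointers.append(best_i)
            new_scores.append(scores[best_i] + transitions[best_i][j] + emit[j])
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final label.
    path = [max(range(num_labels), key=lambda j: scores[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


def argmax_decode(emissions):
    """Pick the highest-scoring label at each position independently."""
    return [max(range(len(e)), key=lambda j: e[j]) for e in emissions]


# Two labels (0 and 1); transitions reward staying in the same label.
transitions = [[1.0, -1.0], [-1.0, 1.0]]
emissions = [[2.0, 0.0], [0.9, 1.0], [2.0, 0.0]]

viterbi_path = viterbi_decode(emissions, transitions)  # [0, 0, 0]
argmax_path = argmax_decode(emissions)                 # [0, 1, 0]
```

The middle position only narrowly prefers label 1, so the argmax path contains a single-residue flip, while the Viterbi path keeps the segment intact because the transition scores outweigh the small emission margin.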
9.2 Post-processing: Q8 Smoothing
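def smooth_q8(pred):
    # pred: mutable list of Q8 labels for one sequence; modified in place.
    for i in range(1, len(pred) - 1):
        # A label that differs from two agreeing neighbours is treated as a spike.
        if pred[i - 1] == pred[i + 1] != pred[i]:
            pred[i] = pred[i - 1]
    return pred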
Effect
- Reduced single-residue spikes
- Improved Q8 consistency
- +0.02 leaderboard gain
9.3 Deterministic Q8 → Q3 Mapping
H,G,I -> H
E,B -> E
C,S,T -> C
Deriving Q3 from the decoded Q8 path avoided relying on the separate, error-prone Q3 head at inference time.
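The mapping above translates directly into a lookup table. A minimal sketch follows; the names `Q8_TO_Q3` and `q8_to_q3` are illustrative rather than taken from the project code.

```python
# Q8 -> Q3 reduction exactly as listed above.
Q8_TO_Q3 = {
    "H": "H", "G": "H", "I": "H",
    "E": "E", "B": "E",
    "C": "C", "S": "C", "T": "C",
}


def q8_to_q3(q8):
    """Convert a Q8 label string to Q3; unknown labels raise KeyError."""
    return "".join(Q8_TO_Q3[label] for label in q8)


q3 = q8_to_q3("HGIEBCST")  # "HHHEECCC"
```

Because the conversion is per-residue, the Q3 string always has the same length as the Q8 string, which keeps the two submission columns aligned.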
10. Final Production Model
Selected Model
BiRNN-CRF
Final Results (Approx.)
| Metric | Score |
|---|---|
| Q8 F1 | ~0.34 |
| Q3 F1 | ~0.69 |
| Kaggle Score | ~0.43 |
Deployment & Model Storage
- Model checkpoint uploaded via kagglehub
Inference notebook:
- downloads model
- rebuilds architecture
- loads state_dict
- runs CRF decoding
- generates submission CSV
No training code present in inference notebook.
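The final step, writing the submission CSV with the id, sst8, and sst3 columns, might look like the sketch below. The function name and the length check are assumptions, and the example writes to an in-memory buffer so it runs anywhere.

```python
import csv
import io


def write_submission(rows, handle):
    """Write (id, sst8, sst3) rows to a writable text handle as CSV."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["id", "sst8", "sst3"])
    for row_id, sst8, sst3 in rows:
        # Both label strings describe the same residues, so lengths must match.
        if len(sst8) != len(sst3):
            raise ValueError(f"length mismatch for id {row_id}")
        writer.writerow([row_id, sst8, sst3])


buffer = io.StringIO()
write_submission([(0, "HHHHEEEC", "HHHHEEEC")], buffer)
csv_text = buffer.getvalue()
```

In a notebook the same function can be given a real file handle, for example one opened with `open("submission.csv", "w", newline="")`, where the filename is an assumption.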
11. Failure Summary & Fixes
| Problem | Root Cause | Fix |
|---|---|---|
| CRF crash | Cold start | CE warm-up |
| Mask errors | BOS trimming | Unified slicing |
| Score mismatch | Metric weighting | Q3-focused inference |
| ESM underperforming | Data + constraints | Reverted model |
12. Key Engineering Takeaways
- Bigger models are not always better
- CRFs require careful initialization
- Inference logic can materially affect leaderboard score
- Q3 mattered more than Q8 for final ranking
- Debugging consumed more time than modeling
Future Work
If unconstrained:
- Train longer with curriculum learning
- Distill ESM representations into BiRNN
- Add transition constraints to CRF
- Explore ensemble decoding
Closing Thoughts
This project reinforced a core ML truth: Performance comes from systems thinking, not architectures alone.
The final gains came not from new layers, but from:
- understanding metrics
- respecting constraints
- fixing silent bugs
- and engineering the inference pipeline carefully
This is the kind of work that never shows up in papers — but decides real-world outcomes.
References
- Kaggle Competition: Sep 25 DL Gen AI NPPE 2
- PyTorch Lightning: Documentation
Connect
- Email: devp1866@gmail.com
Author: Devkumar Patel
Domain: Deep Learning · Bioinformatics · Sequence Modeling