We began with a proprietary protein language model and performed an initial round of fine-tuning on approximately 100 curated sequences characterized with in-silico metrics. This stage produced a model capable of generating diverse nanobodies against Nipah virus, yielding hundreds of novel candidates with the target properties. However, while the model generated variants flexibly, its sampling distribution was not concentrated on the desired functional qualities. To address this limitation, we applied reinforcement learning to steer the model toward high-quality regions of the sequence space.
To refine the model so that it preferentially samples variants from regions with desirable properties, we employed Group Relative Policy Optimization (GRPO). This post-training stage consisted of three sequential steps: (1) generation of a group of candidate variants, (2) scoring of the generated candidates, and (3) a policy-gradient update, weighted by each candidate's group-relative advantage, that shifts the model toward high-reward sequences. Iterating this process produced a model that samples from a narrowed antibody sequence space, concentrating on variants with the desired functional characteristics.
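The three-step loop above can be sketched in miniature. This is a toy illustration only, not the actual pipeline: the real policy is a proprietary protein language model and the real scores come from in-silico metrics, whereas here the policy is a single softmax over a hypothetical reduced amino-acid alphabet and the reward (counting 'A' residues) is a stand-in scorer. The group-relative advantage normalization and the REINFORCE-style gradient on the sampled tokens are the GRPO-flavored parts.

```python
import numpy as np

rng = np.random.default_rng(0)
ALPHABET = "ACDE"              # hypothetical reduced alphabet, for illustration
SEQ_LEN, GROUP_SIZE, LR = 8, 16, 0.1

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reward(seq):
    # Hypothetical scorer standing in for the in-silico metrics:
    # favors sequences rich in 'A'.
    return seq.count("A")

logits = np.zeros(len(ALPHABET))       # uniform initial policy
p_initial = softmax(logits)[0]         # initial probability of 'A'

for _ in range(200):
    probs = softmax(logits)
    # (1) generate a group of candidate variants from the current policy
    groups = [rng.choice(len(ALPHABET), size=SEQ_LEN, p=probs)
              for _ in range(GROUP_SIZE)]
    seqs = ["".join(ALPHABET[i] for i in g) for g in groups]
    # (2) score the candidates and form group-relative advantages
    r = np.array([reward(s) for s in seqs], dtype=float)
    adv = (r - r.mean()) / (r.std() + 1e-8)
    # (3) REINFORCE-style policy-gradient ascent on the logits,
    # weighted by each candidate's advantage
    grad = np.zeros_like(logits)
    for g, a in zip(groups, adv):
        for tok in g:
            one_hot = np.eye(len(ALPHABET))[tok]
            grad += a * (one_hot - probs)   # grad of log-softmax
    logits += LR * grad / (GROUP_SIZE * SEQ_LEN)

p_final = softmax(logits)[0]           # probability of 'A' after training
```

After the loop, the policy's probability mass shifts toward high-reward sequences (here, toward 'A'), mirroring how GRPO narrows the sampling distribution onto the desired region of sequence space.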
Work by Gerard Boxó and collaborators. Centre for Genomic Regulation, Ferruz Lab