LLM-Guided Multi-Stage Pipeline


Description

The Core Reasoning

The fundamental challenge in computational protein design is that generative models can produce structurally convincing proteins that are biologically irrelevant to the specific target. This agent-based workflow is meant to bridge the gap between qualitative biological literature and quantitative generative constraints.

Running an LLM agent inside an environment such as Claude Code turns the pipeline from a simple "generate and score" script into a traceable decision-making engine. The agent makes explicit trade-offs between biological reliability and structural diversity, including exploratory positions such as W101 and E100. This keeps the generative models from converging prematurely on a narrow, potentially flawed binding hypothesis.

The 5-Stage Pipeline

This method operates as a computational funnel, starting with broad biological rules and narrowing down to validated structural candidates through five distinct stages, labeled Phase 1 through Phase 5:

Phase 1: Knowledge Translation (Target Definition)

The LLM agent reviews existing literature to identify the most critical, experimentally supported interaction points on the RBX1 surface. It maps these biological findings to a specific structural interface, establishing the mandatory target constraints for all downstream generation.

Phase 2: Constrained Generation (Expanding the Search Space)

The predefined interface constraints are fed into RFdiffusion to construct custom binder backbones specifically aimed at that spatial region. To thoroughly explore the sequence space, ProteinMPNN then generates dozens of amino acid sequence variations for each of those backbones, resulting in a structurally diverse library of 480,000 raw binder candidates.

Phase 3: Agent-Driven Quality Control (Filtering)

To manage this scale, the agent shifts its focus to reliability, applying strict, automated filtering criteria based on sequence recovery and structural scores from ProteinMPNN. The library is reduced to 515 high-confidence candidates (roughly 0.1% of the raw library), with all metadata preserved for later auditing.
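The filtering step can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the `Candidate` fields, the threshold values, and the `filter_candidates` helper are all hypothetical, and it assumes a lower ProteinMPNN score is better. The point it shows is that every candidate, kept or rejected, leaves an audit record with its metadata intact.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    design_id: str
    sequence: str
    mpnn_score: float      # assumed: lower is better
    seq_recovery: float    # fraction in [0, 1]
    metadata: dict = field(default_factory=dict)


# Hypothetical thresholds; the source does not state the real cutoffs.
MAX_MPNN_SCORE = 1.0
MIN_SEQ_RECOVERY = 0.35


def filter_candidates(candidates, max_score=MAX_MPNN_SCORE,
                      min_recovery=MIN_SEQ_RECOVERY):
    """Return (kept, audit_log); the log has one record per input candidate."""
    kept, audit_log = [], []
    for c in candidates:
        passed = c.mpnn_score <= max_score and c.seq_recovery >= min_recovery
        # Record the decision and the values behind it for later auditing.
        audit_log.append({
            "design_id": c.design_id,
            "passed": passed,
            "mpnn_score": c.mpnn_score,
            "seq_recovery": c.seq_recovery,
            "metadata": dict(c.metadata),
        })
        if passed:
            kept.append(c)
    return kept, audit_log
```

Logging rejected designs alongside accepted ones is what makes the reduction traceable: a later reviewer can see which criterion removed a given design.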

Phase 4: Orthogonal Validation (Boltz-2 Rescoring)

The agent independently validates the generated complexes using Boltz-2 to predict their binding affinity and geometric poses. It measures heavy-atom distances in the predicted complexes, using a cutoff of < 5.0 Å, to confirm that the generated binders make physical contact with the exact residues defined in Phase 1.
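A minimal sketch of the contact check, assuming heavy-atom coordinates have already been extracted from a predicted complex. The input layout (a dict from residue label to a list of `(x, y, z)` tuples) and the function names are illustrative assumptions; only the 5.0 Å cutoff comes from the description above.

```python
import math

CONTACT_CUTOFF = 5.0  # Å, the cutoff stated in the description


def min_heavy_atom_distance(atoms_a, atoms_b):
    """Smallest Euclidean distance between any atom pair from the two sets."""
    return min(math.dist(a, b) for a in atoms_a for b in atoms_b)


def contacted_residues(target_residues, binder_atoms, cutoff=CONTACT_CUTOFF):
    """Return labels of target residues within `cutoff` of any binder atom.

    target_residues: dict mapping a residue label (e.g. "W101") to a list
                     of (x, y, z) heavy-atom coordinates (hypothetical layout).
    binder_atoms:    list of (x, y, z) heavy-atom coordinates of the binder.
    """
    if not binder_atoms:
        return set()
    return {
        label
        for label, atoms in target_residues.items()
        if atoms and min_heavy_atom_distance(atoms, binder_atoms) < cutoff
    }
```

For example, with `{"W101": [(0, 0, 0)], "E100": [(10, 0, 0)]}` as the target and a single binder atom at `(3, 0, 0)`, only `W101` is reported as contacted, since `E100` is 7 Å away.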

Phase 5: Asymmetric Prioritization (Closing the Loop)

The agent ranks the final designs using an "asymmetric scoring rule" tied directly back to the initial literature. It heavily rewards binders that hit the primary anchors and gives secondary credit for hitting the exploratory positions, resulting in a finalized, prioritized list of candidates that balance targeted biological engagement with high structural confidence.
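One way such an asymmetric rule could look is sketched below. The weights, the multiplication by a confidence value, and the placeholder anchor labels `ANCHOR_1` and `ANCHOR_2` are assumptions made for illustration; the description names only the exploratory positions W101 and E100 and does not give the actual formula.

```python
# Hypothetical weights: primary anchors count four times as much as
# exploratory positions. The real values are not given in the source.
PRIMARY_WEIGHT = 1.0
EXPLORATORY_WEIGHT = 0.25


def asymmetric_score(contacts, primary, exploratory, confidence):
    """Weighted contact count scaled by a structural confidence in [0, 1]."""
    primary_hits = len(contacts & primary)
    exploratory_hits = len(contacts & exploratory)
    engagement = (PRIMARY_WEIGHT * primary_hits
                  + EXPLORATORY_WEIGHT * exploratory_hits)
    # Assumed combination rule: engagement is discounted by confidence.
    return engagement * confidence


def rank_designs(designs, primary, exploratory):
    """designs: dict of design_id -> (set of contacted labels, confidence)."""
    return sorted(
        designs,
        key=lambda d: asymmetric_score(designs[d][0], primary,
                                       exploratory, designs[d][1]),
        reverse=True,
    )


# Usage with placeholder primary anchors and the exploratory positions.
primary = {"ANCHOR_1", "ANCHOR_2"}
exploratory = {"W101", "E100"}
designs = {
    "design_A": ({"ANCHOR_1", "W101"}, 0.8),
    "design_B": ({"W101", "E100"}, 0.9),
}
ranking = rank_designs(designs, primary, exploratory)
```

Under these assumed weights, a design that touches one primary anchor outranks a design that touches both exploratory positions, even when the latter has slightly higher confidence.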

Proteins (100)
