Binders generated with forge

youngsuko/Binders generated with forge

Description

Most recent AI binder design methods are structure-based and trained on the Protein Data Bank (PDB). This limits them to learning from a small, biased subset of proteins that are stable and crystallizable. Many therapeutically important targets, especially disordered, flexible, or viral proteins, are poorly represented in the PDB, meaning structure-based models often struggle or require heavy filtering to find a viable binder.

Here, we take a different approach: sequence-based binder design. Instead of training on structures of protein complexes, we train on ~10M interactions from the STRING database, which captures a far broader and more diverse distribution of protein-protein interactions, including those that do not appear in the PDB. Our method, forge, uses latent flow matching, a state-of-the-art generative modeling technique, to perform target-to-binder generation directly from sequence. Given a target sequence, forge generates a corresponding binder sequence with no structural input required. (See our design method description for additional details.)
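For readers unfamiliar with flow matching, here is a toy sketch of the sampling step only: start from Gaussian noise in a latent space and integrate a velocity field from t=0 to t=1. This is not forge's implementation. The function names, the latent dimensions, and the closed-form velocity field are all assumptions made for illustration; in a real model the velocity field would be a neural network conditioned on the target sequence, and a decoder would map the final latent back to amino acids.

```python
import random


def toy_velocity(x, t, binder_latent):
    """Hypothetical stand-in for a learned velocity field.

    For straight-line paths from noise to a single point `binder_latent`,
    the exact velocity at time t is (binder_latent - x) / (1 - t).
    A trained model would replace this with a network that also takes
    the target sequence as conditioning input.
    """
    return [(m - xi) / (1.0 - t) for xi, m in zip(x, binder_latent)]


def sample_latent(binder_latent, steps=50, seed=0):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 with Euler steps."""
    rng = random.Random(seed)
    # Start from a standard Gaussian sample with the same dimensionality.
    x = [rng.gauss(0.0, 1.0) for _ in binder_latent]
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # the largest t used is 1 - dt, so (1 - t) is never zero
        v = toy_velocity(x, t, binder_latent)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# The toy field transports any noise sample onto the chosen latent point.
latent = sample_latent([0.5, -1.0, 2.0])
```

Because the toy field is exact for a single endpoint, the Euler integration lands on that point regardless of the starting noise. A learned field instead maps different noise samples to different plausible binders.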

For this competition, we conditioned forge on the C-terminal region of the Nipah virus G protein and generated candidate binders of 100–250 amino acids. Now for the obvious question: why is the average ipSAE score so low for this submission?

One of the challenges of sequence-based binder design is that the current ecosystem for evaluating binder designs is structure-based. Metrics such as ipTM and ipSAE all require a predicted binder-target structure to compute. But since forge is trained on STRING and learns from proteins never seen in the PDB, it can generate binders that are very un-PDB-like (many of our designs have <20% sequence identity to any PDB structure). This means the structure predictor (here, we use Boltz) may struggle to accurately predict the binder structure, leading to poor in-silico metrics.
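To make the "<20% sequence identity" figure concrete, here is a minimal sketch of how percent identity can be computed from two sequences that have already been aligned. The text does not say which alignment tool or identity definition was used, so the function name, the gap character, and the choice of denominator (aligned, non-gap columns) are assumptions; other conventions divide by the shorter sequence length or by the full alignment length and give different numbers.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over columns where neither sequence has a gap.

    Both inputs must come from the same pairwise alignment, so they
    have equal length and may contain gap characters.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip gapped columns (an assumed convention)
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Made-up aligned fragments, for illustration only.
identity = percent_identity("MKT-AYLL", "MRTGA-LV")
```

In practice, a design would be aligned against every PDB sequence (or searched with a tool built for that) and the maximum identity reported.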

So, how did we choose the 10 sequences in our submission? Without a well-established evaluation infrastructure for sequence-based binders, this was challenging for us! While we initially went for the highest ipSAE scores, we realized we didn't want to rely too heavily on structure-based metrics. In the end, we selected five sequences with the highest ipSAE and five additional sequences with lower scores but predicted complex structures that place the binder at the C-terminus. While additional filters could be useful, they add a confounding variable (does our model generate well, or does the filter select well?). As we ultimately want to know how well our model generates true binders from the start, we decided to use only ipSAE for the competition.
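The selection described above can be sketched roughly as follows. The `Candidate` class, its field names, and the `select_submission` function are hypothetical. How C-terminal placement was judged is not stated, so it appears here as a precomputed flag, and ranking the second group by ipSAE is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    ipsae: float          # ipSAE from the predicted complex
    at_c_terminus: bool   # hypothetical flag: binder placed at the C-terminus


def select_submission(candidates, n_top=5, n_extra=5):
    """Pick the n_top highest-ipSAE designs, then n_extra lower-scoring
    designs whose predicted complex places the binder at the C-terminus."""
    ranked = sorted(candidates, key=lambda c: c.ipsae, reverse=True)
    top = ranked[:n_top]
    # Assumption: the remaining C-terminal designs are also taken in ipSAE order.
    extra = [c for c in ranked[n_top:] if c.at_c_terminus][:n_extra]
    return top + extra
```

If fewer than `n_extra` of the remaining candidates carry the flag, the function simply returns a shorter list rather than back-filling by score.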

In summary, deep learning methods are shaped by the data they learn from. forge embraces this idea by learning from a vastly richer and more natural interaction space than the PDB. Although forge is still a work in progress, we believe sequence-based binder design has the potential to address the shortcomings of structure-based methods. Vote for forge if you want to see how far sequence-based design can go!

Proteins (10)

youngsuko

id: hollow-panther-rose; Binder; Other; forge; Target: Nipah Virus Glycoprotein G; None; 31.80; False; 17.6 kDa; 155

id: crimson-hawk-cedar; Binder; Other; forge; Target: Nipah Virus Glycoprotein G; None; 32.64; True; 15.6 kDa; 135

id: calm-ram-sand; Binder; Other; forge; Target: Nipah Virus Glycoprotein G; None; 30.47; True; 15.6 kDa; 135

id: lunar-moth-wave; Binder; Other; forge; Target: Nipah Virus Glycoprotein G; None; 35.89; False; 20.6 kDa; 179

id: soft-crane-frost; Binder; Other; forge; Target: Nipah Virus Glycoprotein G; None; 48.48; False; 23.3 kDa; 207

id: misty-moth-birch; Binder; Other; forge; Target: Nipah Virus Glycoprotein G; None; 33.69; False; 18.1 kDa; 155

id: quiet-mole-ruby; Binder; Other; forge; Target: Nipah Virus Glycoprotein G; None; 32.40; False; 14.5 kDa; 130

id: rapid-otter-crystal; Binder; Other; forge; Target: Nipah Virus Glycoprotein G; None; 31.22; False; 23.0 kDa; 200

id: lunar-eagle-leaf; Binder; Other; forge; Target: Nipah Virus Glycoprotein G; None; 34.31; True; 20.5 kDa; 180

id: brisk-cat-orchid; Binder; Other; forge; Target: Nipah Virus Glycoprotein G; 0.76; 25.68; 15.9 kDa; 146