Submission 2

Description

We are a team of two first-year undergraduate students from IIT Madras, new to the field of computational biology. Throughout this competition we worked on ordinary laptops and free-tier cloud instances (Google Colab and Kaggle).

Methodology: We planned to generate a baseline sequence with a diffusion model and then optimize the binder by experimenting on it with various algorithms.

Phase 1: RFdiffusion

Timeline: 25th-31st October

For our initial baseline sequence, we used RFdiffusion on Google Colab to generate de novo protein backbones against the NiV-G target. We generated sets of protein sequences. We improved on this by using LLMs to check the available literature and suggest hotspot positions to target. We then refined the RFdiffusion parameters, tuning sequence length, contact constraints, and specific "hotspot" targeting.
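To make the Phase 1 parameters concrete, here is a minimal sketch of how binder length and hotspot residues are usually passed to RFdiffusion. It assumes the command-line interface documented in the RFdiffusion repository (`scripts/run_inference.py` with `contigmap.contigs`, `ppi.hotspot_res`, and `inference.*` overrides), not necessarily the Colab notebook we ran. The file names, chain range, hotspot residues, and design count below are hypothetical placeholders, not our actual settings.

```python
def rfdiffusion_args(target_pdb, target_contig, binder_len, hotspots,
                     num_designs, out_prefix):
    """Build the argument list for RFdiffusion's scripts/run_inference.py.

    target_contig: target residues to keep, e.g. "A1-100" (placeholder).
    binder_len:    (min, max) length of the binder to be diffused.
    hotspots:      target residues the binder should contact, e.g. ["A10"].
    """
    lo, hi = binder_len
    # "/0 " marks a chain break between the fixed target and the new binder.
    contigs = f"contigmap.contigs=[{target_contig}/0 {lo}-{hi}]"
    hotspot = "ppi.hotspot_res=[" + ",".join(hotspots) + "]"
    return [
        "scripts/run_inference.py",
        f"inference.input_pdb={target_pdb}",
        f"inference.output_prefix={out_prefix}",
        f"inference.num_designs={num_designs}",
        contigs,
        hotspot,
    ]


# All values here are illustrative placeholders.
args = rfdiffusion_args(
    target_pdb="nivg_target.pdb",
    target_contig="A1-100",
    binder_len=(30, 40),
    hotspots=["A10", "A25"],
    num_designs=8,
    out_prefix="out/binder",
)
print(" ".join(args))
```

Varying `binder_len` and `hotspots` in a loop is one simple way to sweep the parameters described above.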

Outcome: Our first breakthrough was a design with an ipTM of 0.74 and an iPAE of 10.46.
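For readers new to these metrics: ipTM ranges from 0 to 1 and higher is better, while iPAE is an error in ångströms and lower is better. The sketch below shows one common way to compute an interface PAE, as the mean of the inter-chain blocks of the predicted aligned error matrix. It assumes the binder residues come first in the matrix; the exact definition used by the competition's scoring may differ, and the toy matrix is made up.

```python
def interface_pae(pae, binder_len):
    """Mean PAE over both inter-chain blocks of a square PAE matrix.

    Assumes rows/columns 0..binder_len-1 belong to the binder and the
    remaining rows/columns belong to the target.
    """
    n = len(pae)
    if not 0 < binder_len < n:
        raise ValueError("binder_len must split the matrix into two chains")
    binder = range(binder_len)
    target = range(binder_len, n)
    # Binder-aligned error on target residues, then the reverse direction.
    values = [pae[i][j] for i in binder for j in target]
    values += [pae[j][i] for i in binder for j in target]
    return sum(values) / len(values)


# Toy 4x4 matrix: 2 binder residues followed by 2 target residues.
toy_pae = [
    [0.5, 1.0, 9.0, 11.0],
    [1.0, 0.5, 10.0, 12.0],
    [8.0, 9.0, 0.5, 1.0],
    [10.0, 13.0, 1.0, 0.5],
]
print(interface_pae(toy_pae, binder_len=2))
```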

Phase 2: Genetic Algorithm

Timeline: 1st-10th November (coding and debugging); 11th-23rd November (end-semester examinations, very slow work); 24th-28th November (algorithm testing)

We developed a genetic algorithm to evolve our best baseline design over many generations. While developing it, we ran into dependency conflicts among Python, ColabFold, TensorFlow, JAX, Biopython, NumPy, Matplotlib, dm-haiku, Flax, Orbax, protobuf, and others. Resolving these was one of the most difficult parts, especially alongside our end-semester examinations.

After resolving all these errors, our code ran only partially on a Google Colab CPU instance, which had significant memory constraints. We then ran it locally on a Linux machine with a basic 4 GB GPU, but after some generations the GPU could not process the binder and target together and crashed. We also identified and fixed critical logic errors in our selection algorithm that were causing premature convergence.
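The kind of loop described above can be sketched as follows. This is a minimal illustration, not our actual script: it uses tournament selection, elitism, and rejection of duplicate children as common guards against premature convergence, and `toy_fitness` is a stand-in for a structure-prediction score. The seed sequence and all hyperparameters are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mutate(seq, rate, rng):
    """Replace each residue with a random amino acid with probability `rate`."""
    return "".join(rng.choice(AMINO_ACIDS) if rng.random() < rate else aa
                   for aa in seq)


def crossover(a, b, rng):
    """Single-point crossover of two equal-length sequences."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def tournament(scored, k, rng):
    """Return the fittest of k randomly drawn (fitness, sequence) pairs."""
    k = min(k, len(scored))
    return max(rng.sample(scored, k), key=lambda pair: pair[0])[1]


def evolve(seed, fitness, generations=20, pop_size=30, rate=0.05,
           elite=2, k=3, rng=None):
    rng = rng or random.Random(0)
    population = [seed] + [mutate(seed, rate, rng) for _ in range(pop_size - 1)]
    for _ in range(generations):
        # Score each unique sequence once; sorting keeps the run reproducible.
        scored = sorted(((fitness(s), s) for s in set(population)), reverse=True)
        next_gen = [s for _, s in scored[:elite]]  # elitism: keep the best
        seen = set(next_gen)
        attempts = 0
        while len(next_gen) < pop_size:
            parent_a = tournament(scored, k, rng)
            parent_b = tournament(scored, k, rng)
            child = mutate(crossover(parent_a, parent_b, rng), rate, rng)
            attempts += 1
            # Skip duplicates to preserve diversity, with a cap on retries.
            if child not in seen or attempts > 10 * pop_size:
                seen.add(child)
                next_gen.append(child)
        population = next_gen
    return max(population, key=fitness)


def toy_fitness(seq):
    """Placeholder score: number of hydrophobic residues (not a real metric)."""
    return sum(aa in "AILMFVW" for aa in seq)


SEED = "GSGSGSGSGSGSGSGSGSGS"  # hypothetical starting sequence
best = evolve(SEED, toy_fitness)
print(best, toy_fitness(best))
```

Because the elite sequences are copied unchanged into each new generation, the best score never decreases, while duplicate rejection keeps the population from collapsing onto a single sequence too early.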

We also integrated ESMFold into our pipeline to rapidly screen a larger diversity of sequences, leveraging its high inference speed.
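A fast screen of this kind can be sketched as a filter on mean pLDDT. The example assumes each candidate has already been folded and saved as PDB text, and that per-residue pLDDT is stored in the B-factor column, which is how ESMFold PDB output is usually written (the scale may be 0-1 or 0-100 depending on the release, so the threshold is an assumption). The helper `make_atom_line` and the two toy candidates exist only to build test input.

```python
def mean_plddt(pdb_text):
    """Average the B-factor column (PDB columns 61-66) over C-alpha atoms."""
    values = [
        float(line[60:66])
        for line in pdb_text.splitlines()
        if line.startswith("ATOM") and line[12:16].strip() == "CA"
    ]
    if not values:
        raise ValueError("no C-alpha atoms found")
    return sum(values) / len(values)


def screen(candidates, threshold):
    """Return candidate names with mean pLDDT >= threshold, best first."""
    scored = [(mean_plddt(pdb), name) for name, pdb in candidates.items()]
    return [name for score, name in sorted(scored, reverse=True)
            if score >= threshold]


def make_atom_line(serial, atom, resseq, bfactor):
    """Hypothetical helper: format a fixed-column PDB ATOM record for a toy input."""
    return (f"ATOM  {serial:>5} {atom:<4} ALA A{resseq:>4}    "
            f"{0.0:>8.3f}{0.0:>8.3f}{0.0:>8.3f}{1.0:>6.2f}{bfactor:>6.2f}")


# Two made-up candidates on an assumed 0-100 pLDDT scale.
toy_candidates = {
    "design_a": "\n".join([
        make_atom_line(1, "N", 1, 10.0),   # non-CA atom, ignored
        make_atom_line(2, "CA", 1, 80.0),
        make_atom_line(3, "CA", 2, 90.0),
    ]),
    "design_b": "\n".join([
        make_atom_line(1, "CA", 1, 40.0),
        make_atom_line(2, "CA", 2, 50.0),
    ]),
}
print(screen(toy_candidates, threshold=70.0))
```

Only candidates that pass this cheap filter would then go on to the slower complex prediction step.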

Phase 3: Refinement (28th and 29th November)

We ran our genetic algorithm on Kaggle's P100 GPU, starting from our best design, and had a breakthrough. We then refined the resulting sequence to obtain our current best sequence.

As our first project in the world of protein folding, this was a great practical learning experience. We believe this journey showcases how rapid, resource-constrained protein design can be carried out. Now that we have more time available, we hope to pursue this further and optimize our sequence using various algorithms and techniques. By open-sourcing our evolutionary scripts and sharing our journey through hardware and software constraints, we hope to help the community build more resilient protein design pipelines that can operate effectively even with limited compute resources.

Proteins (1)

Arsh & Prakhar

id: mellow-yak-cedar

Binder

Peptide

RFDiffusion and Evolutionary Algorithm (Using ESMFold)

Target

Nipah Virus Glycoprotein G

0.61

43.62

4.0 kDa