DISCLAIMER: THE CONTENT BELOW IS NOT AI GENERATED AND REPRESENTS THE AUTHOR'S VIEWS AND INTERPRETATIONS. FOR QUERIES, USE THE COMMENT BOX BELOW.

The protein folding problem refers to the following challenge: can we predict a protein's structure completely from its sequence? While protein structures can be determined experimentally using X-ray crystallography and Nuclear Magnetic Resonance (NMR) spectroscopy, these techniques face challenges such as:

  1. low success rates in obtaining high-quality protein crystals (for example, intrinsically disordered proteins fall in this category. More details on how to grow crystals here),
  2. limitations on the protein sizes that can be crystallized (for example, while individual proteins in the Nuclear Pore Complex (NPC) can be crystallized and examined, the overall assembly remains elusive to study via experimental techniques),
  3. a tedious process with low throughput.

An alternative approach would be to simulate the protein folding process using computers. However, current compute capabilities only allow access to simulation times on the scale of tens of nanoseconds, which restricts folding simulations to small proteins.[1] Google DeepMind's first attempt at this problem, called AlphaFold1 (AF1), uses Convolutional Neural Networks (CNNs) to predict torsion and distance distributions (the latter called distograms) from Multiple Sequence Alignment (MSA) features. Potentials were constructed from these distributions, and an initial structure built from the predicted torsion and distance distributions was then minimized against these potentials using gradient descent.[2] Using this strategy, the DeepMind team could predict accurate protein backbone structures.[2]

The following year the DeepMind team unveiled a more powerful model, based on the transformer architecture, that could predict accurate protein side-chain conformations as well. This model, called AlphaFold2 (AF2), could predict structures with a median backbone accuracy of 0.96 Å r.m.s.d.95 (Cα root-mean-square deviation at 95% residue coverage) on the CASP14 test domains. In addition, most predicted structures had Template Modelling (TM) scores greater than 0.9.[3] AlphaFold2 marks the start of a series of computational models that predict protein structures at close to experimental accuracy.

AF1 and AF2 architectures
Figure 1. AlphaFold1 and AlphaFold2 model diagrams

AlphaFold3 (AF3) was released by the DeepMind team in 2024 and is their latest publicly available model as of this writing. It goes beyond predicting protein structures and also covers RNA, DNA, docking of ligands and ions, biomolecular complexes and multimers.[4] In this technical blog I will go over some of the details of the model architecture, its results on benchmarks and some of its limitations.

The journey from token to structure

The figure below is a snapshot of the screen the user encounters when using the AlphaFold Server. It provides a text box where the user can paste a protein, DNA or RNA sequence, and a dropdown menu to choose from the different types of ions and ligands that AF3 supports.

AF3 input screen
Figure 2. Overview of AF3 input screen on the AlphaFold Server. Selection of entity type.

AF3 also supports structure prediction of multimeric chains, of post-translational modifications of amino acid residues, and of chemical modifications of DNA/RNA nucleotides.

AF3 input screen
Figure 3. Overview of AF3 input screen on the AlphaFold Server. Selection of post-translational modifications.

Once the user submits the job, AF3 runs the input data through a data processing pipeline, generates features to feed the model, and then predicts the 3D atomic coordinates together with per-atom confidence (pLDDT) and alignment metrics.

AF3 input screen
Figure 4. Overview of AF3 input screen on the AlphaFold Server. Structure Prediction and confidence metrics

When the user submits the job on the AlphaFold Server, AF3 runs a data processing pipeline in which:

  1. input sequences are tokenized and features are extracted from the tokenized sequences (Section 1.1),
  2. reference conformers are generated for each residue/nucleotide (Section 1.2),
  3. an MSA search is run on the input sequence and the search results are featurized (Section 1.3),
  4. templates are searched for individual entities based on the retrieved MSA results (Section 1.4).
1.1. Tokenization of input sequences

The amino acids, nucleotides, ligands and ions are represented numerically as tokens. Each standard amino acid or nucleotide is represented by a single token, while modified amino acids, ligands and ions are tokenized per atom. For example, serine, which is a standard amino acid, is represented by 1 token, while ibuprofen, which contains 15 heavy atoms, is represented by 15 tokens. There are 32 classes of molecules in total: 20 standard amino acids + 1 unknown, 4 standard DNA nucleotides + 1 unknown, 4 standard RNA nucleotides + 1 unknown, and a gap (from the MSA). Ligands and ions are treated as unknown.

Two examples are shown below: in one case a protein chain is composed of standard amino acids, and in the other the input consists of multiple chains. Overall, the token features attempt to distinguish the amino acids and nucleotides within a chain from those present in different chains, as in the case of multimers. The token_bonds feature is a 2D matrix that indicates whether a bond exists between tokens i and j. It is restricted to ligand-ligand bonds and to bonds between ligand and polymer that are shorter than 2.4 Å.

token features
Figure 5. Features constructed from tokenized sequences.
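
The counting rule described above (one token per standard residue, one token per heavy atom for ligands) can be sketched in a few lines of Python. This is only an illustration: the class ordering, the dictionary layout and the helper names (tokenize_protein, tokenize_ligand, build_token_bonds) are my own assumptions and not AF3's actual code. In particular, mapping ligand atoms to the unknown amino acid class is an assumed choice of which "unknown" class to use.

```python
# Illustrative sketch only: class ordering and helper names are assumptions,
# not AF3's actual implementation.
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")  # 20 standard amino acids

# 20 + 1 unknown, 4 + 1 unknown, 4 + 1 unknown, and a gap = 32 classes
CLASSES = (
    AMINO_ACIDS + ["UNK_AA"]
    + ["DNA_" + n for n in "ACGT"] + ["UNK_DNA"]
    + ["RNA_" + n for n in "ACGU"] + ["UNK_RNA"]
    + ["GAP"]
)


def tokenize_protein(sequence, chain_id):
    """One token per residue; non-standard letters fall back to UNK_AA."""
    tokens = []
    for position, residue in enumerate(sequence, start=1):
        name = residue if residue in AMINO_ACIDS else "UNK_AA"
        tokens.append({"chain": chain_id,
                       "residue_index": position,
                       "restype": CLASSES.index(name)})
    return tokens


def tokenize_ligand(heavy_atoms, chain_id):
    """One token per heavy atom; all atoms share the same residue index."""
    unknown = CLASSES.index("UNK_AA")  # assumed choice of 'unknown' class
    return [{"chain": chain_id, "residue_index": 1,
             "restype": unknown, "element": element}
            for element in heavy_atoms]


def build_token_bonds(num_tokens, bonded_pairs):
    """Symmetric 0/1 matrix marking which token pairs are bonded."""
    matrix = [[0] * num_tokens for _ in range(num_tokens)]
    for i, j in bonded_pairs:
        matrix[i][j] = matrix[j][i] = 1
    return matrix


# Ibuprofen (C13H18O2) has 13 carbons + 2 oxygens = 15 heavy atoms.
IBUPROFEN_HEAVY_ATOMS = ["C"] * 13 + ["O"] * 2

protein_tokens = tokenize_protein("S", "A")                  # a single serine
ligand_tokens = tokenize_ligand(IBUPROFEN_HEAVY_ATOMS, "B")
print(len(CLASSES), len(protein_tokens), len(ligand_tokens))  # 32 1 15
```

The chain and residue_index fields are what let downstream features tell apart tokens in the same chain from tokens in different chains.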
1.2. Generation of reference conformers (Training only)

Reference conformers for each monomer in the chains are created using RDKit's ETKDGv3 conformer generation algorithm. Data from the mmCIF file is used to create the set of features shown in Figure 6. Conformer generation is done only during training. At inference time, a dummy CIF with all atom coordinates zeroed is used.

conformer features
Figure 6. Features constructed from generated conformers
1.3. Multiple Sequence Alignment (MSA) searches

The process of aligning two or more protein, DNA or RNA sequences to maximize regions of sequence similarity is called Multiple Sequence Alignment (MSA). More details on MSA can be found in this blogpost. MSA is useful in structure prediction because correlated mutations are evolutionary signals that AF3 can use to infer whether a pair of residues are in close proximity to each other. AF3 uses Hidden Markov Models (HMMs) to build the MSA for the query sequence because traditional sequence alignment algorithms do not provide site-specific substitution probabilities. Unlike general HMMs, which can take the form of cyclic graphs, HMMs built from an MSA, also called profiles, have a directional information flow from left to right. An example HMM is shown in Figure 7 where, starting from the left end of the sequence, the arrows indicate the most probable state to enter next. The states are indicated as M, D and I, which represent a match (an aligned amino acid), a deletion and an insertion state respectively.

MSA features
Figure 7. Features constructed from MSA search
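
To make the two ideas in this section concrete (site-specific probabilities and correlated mutations), here is a small Python sketch on a made-up alignment. It is not a profile HMM and not what AF3 runs: it only computes per-column residue frequencies and the mutual information between two columns, which is one simple, classical way to quantify covariation. The toy alignment and the function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy alignment: columns 1 and 3 mutate together (K<->E, R<->D),
# while column 2 varies independently of column 1.
TOY_MSA = [
    "MKAEL",
    "MKAEL",
    "MRADL",
    "MRADL",
    "MKGEL",
    "MRGDL",
]


def column_profile(msa, col):
    """Site-specific residue frequencies for one alignment column."""
    counts = Counter(seq[col] for seq in msa)
    return {residue: n / len(msa) for residue, n in counts.items()}


def mutual_information(msa, col_i, col_j):
    """Mutual information (in bits) between two alignment columns."""
    p_i = column_profile(msa, col_i)
    p_j = column_profile(msa, col_j)
    pair_counts = Counter((seq[col_i], seq[col_j]) for seq in msa)
    total = 0.0
    for (a, b), n in pair_counts.items():
        p_ab = n / len(msa)
        total += p_ab * math.log2(p_ab / (p_i[a] * p_j[b]))
    return total


print(column_profile(TOY_MSA, 1))         # frequencies of K and R at column 1
print(mutual_information(TOY_MSA, 1, 3))  # covarying pair: high value
print(mutual_information(TOY_MSA, 1, 2))  # independent pair: close to zero
```

A high value for a pair of columns hints that the two positions constrain each other, which is the kind of signal that suggests spatial proximity.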
1.4. Creating structural priors by searching for templates

Using the MSA profile constructed in the previous step, AF3 next searches a database of known structures to find structural priors (templates) for the input sequence. This is done only for individual chains, so the templates do not tell the model how different chains are positioned relative to each other. AF3 uses up to 4 templates during training and inference. The template features can be divided into sequence-based and structure-based features. While AF1 predicts distograms, AF3 takes the distogram of each template as an input.

template features
Figure 8. Features constructed from template search
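
As a rough illustration of what a template distogram is, the Python sketch below bins pairwise distances between one representative atom per residue and one-hot encodes the result. The bin layout (38 equal-width bins between 3.25 Å and 50.75 Å plus one overflow bin) is an assumption made for this example, as are the function names and the toy coordinates.

```python
import math

# Assumed bin layout for illustration: 38 equal-width bins between
# 3.25 and 50.75 Angstrom, plus one final bin for anything larger.
NUM_BINS = 39
MIN_DIST = 3.25
MAX_DIST = 50.75
BIN_WIDTH = (MAX_DIST - MIN_DIST) / (NUM_BINS - 1)  # 1.25 Angstrom


def distance_bin(distance):
    """Map a distance in Angstrom to a bin index in [0, NUM_BINS - 1]."""
    if distance >= MAX_DIST:
        return NUM_BINS - 1          # overflow bin
    if distance <= MIN_DIST:
        return 0                     # everything very close lands in bin 0
    return int((distance - MIN_DIST) // BIN_WIDTH)


def distogram(coords):
    """N x N matrix of bin indices from one 3D point per residue."""
    return [[distance_bin(math.dist(a, b)) for b in coords] for a in coords]


def one_hot(index, size=NUM_BINS):
    """One-hot encode a bin index."""
    return [1 if k == index else 0 for k in range(size)]


# Hypothetical coordinates, one representative atom per residue.
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (10.0, 0.0, 0.0), (60.0, 0.0, 0.0)]
bins = distogram(toy_coords)
for row in bins:
    print(row)
```

The matrix is symmetric, and each entry would be expanded with one_hot before being fed to a model as a pairwise feature.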
2. High Level Overview of AF3 Architecture

Okay, so far we have collected the data and featurized it. Now we are ready to pass this data through the AF3 model and predict 3D structures and confidence metrics. AF3 primarily operates on two internal representations, called the pair and single representations. These representations operate on a fine-grained and a coarse-grained scale. On the fine-grained scale they encode relationships at the atomic level. The representations are continuously updated as they are passed through multiple module layers until they reach the Diffusion Module where, starting from Gaussian-distributed 3D atomic coordinates, the module iteratively denoises the coordinates conditioned on the refined single and pair representations.
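
The idea of iterative denoising can be illustrated with a deliberately simplified Python sketch. Everything here is a stand-in: toy_denoiser is a hypothetical placeholder for the learned network (it simply returns the coordinates it is conditioned on, whereas the real module has to infer them from the single and pair representations), the linear noise schedule is arbitrary, and the update is a plain deterministic Euler step rather than the sampler AF3 actually uses.

```python
import random


def toy_denoiser(noisy_coords, conditioning):
    """Hypothetical stand-in for the learned network.

    A real denoiser predicts clean coordinates from the noisy ones plus the
    conditioning. Here the 'conditioning' already is the answer, so the
    function just returns it.
    """
    return conditioning


def sample(conditioning, num_steps=20, sigma_max=10.0, seed=0):
    """Start from Gaussian noise and denoise step by step."""
    rng = random.Random(seed)
    coords = [[rng.gauss(0.0, sigma_max) for _ in range(3)] for _ in conditioning]
    # Arbitrary linear schedule from sigma_max down to 0.
    noise_levels = [sigma_max * (1 - t / num_steps) for t in range(num_steps + 1)]
    trajectory = [coords]
    for sigma, sigma_next in zip(noise_levels, noise_levels[1:]):
        predicted = toy_denoiser(coords, conditioning)
        ratio = sigma_next / sigma
        # Euler step: keep a shrinking fraction of the remaining noise.
        coords = [[p + ratio * (x - p) for x, p in zip(atom, pred)]
                  for atom, pred in zip(coords, predicted)]
        trajectory.append(coords)
    return trajectory


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally sized coordinate sets."""
    squared = sum((x - y) ** 2
                  for a, b in zip(coords_a, coords_b)
                  for x, y in zip(a, b))
    return (squared / len(coords_a)) ** 0.5


# Hypothetical 3-atom target used as the conditioning signal.
target = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [1.5, 1.5, 0.0]]
trajectory = sample(target)
print(rmsd(trajectory[0], target), rmsd(trajectory[-1], target))
```

The printed pair shows the deviation from the target before and after denoising; in this toy setup it shrinks at every step because the stand-in denoiser is perfect.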

References

[1] Scheraga, H. A.; Khalili, M.; Liwo, A. Protein-Folding Dynamics: Overview of Molecular Simulation Techniques. Annu. Rev. Phys. Chem. 2007, 58 (1), 57–83. https://doi.org/10.1146/annurev.physchem.58.032806.104614.

[2] Senior, A. W.; Evans, R.; Jumper, J.; Kirkpatrick, J.; Sifre, L.; Green, T.; Qin, C.; Žídek, A.; Nelson, A. W. R.; Bridgland, A.; Penedones, H.; Petersen, S.; Simonyan, K.; Crossan, S.; Kohli, P.; Jones, D. T.; Silver, D.; Kavukcuoglu, K.; Hassabis, D. Improved Protein Structure Prediction Using Potentials from Deep Learning. Nature 2020, 577 (7792), 706–710. https://doi.org/10.1038/s41586-019-1923-7.

[3] Jumper, J.; Evans, R.; Pritzel, A.; Green, T.; Figurnov, M.; Ronneberger, O.; Tunyasuvunakool, K.; Bates, R.; Žídek, A.; Potapenko, A.; Bridgland, A.; Meyer, C.; Kohl, S. A. A.; Ballard, A. J.; Cowie, A.; Romera-Paredes, B.; Nikolov, S.; Jain, R.; Adler, J.; Back, T.; Petersen, S.; Reiman, D.; Clancy, E.; Zielinski, M.; Steinegger, M.; Pacholska, M.; Berghammer, T.; Bodenstein, S.; Silver, D.; Vinyals, O.; Senior, A. W.; Kavukcuoglu, K.; Kohli, P.; Hassabis, D. Highly Accurate Protein Structure Prediction with AlphaFold. Nature 2021, 596 (7873), 583–589. https://doi.org/10.1038/s41586-021-03819-2.

[4] Abramson, J.; Adler, J.; Dunger, J.; Evans, R.; Green, T.; Pritzel, A.; Ronneberger, O.; Willmore, L.; Ballard, A. J.; Bambrick, J.; Bodenstein, S. W.; Evans, D. A.; Hung, C.-C.; O’Neill, M.; Reiman, D.; Tunyasuvunakool, K.; Wu, Z.; Žemgulytė, A.; Arvaniti, E.; Beattie, C.; Bertolli, O.; Bridgland, A.; Cherepanov, A.; Congreve, M.; Cowen-Rivers, A. I.; Cowie, A.; Figurnov, M.; Fuchs, F. B.; Gladman, H.; Jain, R.; Khan, Y. A.; Low, C. M. R.; Perlin, K.; Potapenko, A.; Savy, P.; Singh, S.; Stecula, A.; Thillaisundaram, A.; Tong, C.; Yakneen, S.; Zhong, E. D.; Zielinski, M.; Žídek, A.; Bapst, V.; Kohli, P.; Jaderberg, M.; Hassabis, D.; Jumper, J. M. Accurate Structure Prediction of Biomolecular Interactions with AlphaFold 3. Nature 2024, 630 (8016), 493–500. https://doi.org/10.1038/s41586-024-07487-w.