Protein Structure Prediction with AlphaFold2
BY ROBIN YU
The Protein Folding Problem
AlphaFold2 may be the solution to a 60-year-old problem. Around 1960, scientists began to resolve the first atomic-resolution protein structures, and with that came the “protein folding problem.”1 The goal was to predict the structure of a protein purely from its amino acid sequence, and with the advent of computation and increasingly clever computational techniques, such as artificial neural networks, scientists are coming ever closer to this goal. Now, more than ever, the ability to predict protein structures is critical in biotechnology and medicine, aiding the development of drug therapies and improving our quality of life.
Within this 60-year period, several important breakthroughs allowed scientists to better rationalize protein folding.1 The first was the thermodynamic hypothesis of Christian Anfinsen, which states that the native protein structure is dictated only by the primary amino acid sequence and the solution conditions. While there were some kinetically trapped exceptions, this hypothesis guided future studies of folding equilibrium and kinetics from the physical chemistry perspective. Afterwards, in the mid-1980s, a new statistical mechanics model challenged the conventional view, in which the primary sequence was thought to be the dominant folding code for the secondary structure and, subsequently, the tertiary structure. The new model instead proposed that both local and nonlocal interactions drive folding; essentially, the tertiary structure affects the secondary structure just as much as the reverse.
Initial Attempts at Protein Folding Software
Compared to its denatured state, the native protein structure is only 5-10 kcal/mol more stable,1 and an important component of this stability is the tendency of a protein to maximize hydrogen bonding between backbone amide and carbonyl groups. Yet the dominant driving force for folding must lie within the side chains of the protein residues, because they are what make each protein unique. A wealth of evidence indicates that water-mediated hydrophobic interactions, now termed the hydrophobic effect, are key to protein stability.1 The first computational predictions of protein structure incorporated this physics in an atomic force field with Monte Carlo sampling; essentially, an energy function contains terms such as a Lennard-Jones potential for van der Waals interactions, electrostatic interactions, steric repulsion, hydrogen bonding, torsional angles of the peptide backbone, and so on. The program then tries to find the protein conformation with the lowest energy, and a statistical term is included to allow for the possibility of escaping local energy minima. This method of predicting the lowest-energy protein conformation is still used today in software such as Rosetta.2
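The "statistical term" that lets a search escape local minima is commonly implemented as a Metropolis acceptance test. The sketch below illustrates the idea on a made-up one-dimensional energy function; `toy_energy`, the step size, and the temperature `kT` are all illustrative assumptions, and nothing here reflects the actual energy function or sampling moves used by Rosetta.

```python
import math
import random

def toy_energy(x):
    """Hypothetical 1D 'energy landscape' with a shallow minimum near
    x = 1.2 and a deeper one near x = -1.3 (not a real force field)."""
    return x**4 - 3.0 * x**2 + 0.5 * x

def metropolis_accept(delta_e, kT, rng):
    """Metropolis criterion: always accept downhill moves; accept uphill
    moves with probability exp(-delta_e / kT)."""
    if delta_e <= 0.0:
        return True
    return rng.random() < math.exp(-delta_e / kT)

def monte_carlo_minimize(x0, steps=20000, step_size=0.3, kT=0.5, seed=0):
    """Random-walk Monte Carlo search that tracks the lowest energy seen."""
    rng = random.Random(seed)
    x, e = x0, toy_energy(x0)
    best_x, best_e = x, e
    for _ in range(steps):
        trial = x + rng.uniform(-step_size, step_size)  # propose a small move
        trial_e = toy_energy(trial)
        if metropolis_accept(trial_e - e, kT, rng):
            x, e = trial, trial_e
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e

# Start inside the shallow minimum; occasional accepted uphill moves give
# the walk a chance to cross the barrier toward the deeper minimum.
best_x, best_e = monte_carlo_minimize(x0=1.2)
print(f"lowest energy found: {best_e:.3f} at x = {best_x:.3f}")
```

With `kT` set close to zero, uphill moves are almost never accepted and the search behaves like plain downhill minimization, which is exactly the situation the statistical term is meant to avoid.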
However, there are issues with this ab initio approach. For example, it assumes that the energy function accurately captures the physical interactions in proteins. Moreover, as the protein primary sequence lengthens, the sampling space of protein conformations grows exponentially. Calculations are thus immensely resource intensive, although scientists have found some ways of circumventing this issue with bioinformatics. With sequence alignments, evolutionarily homologous structures may be used as templates for an unknown protein, which greatly shortens the calculation time.1 Nonetheless, even this combined method has shortcomings: the prediction requires a homologous structure, and there are no methods to further refine the predicted structure.
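To get a feel for the exponential growth of the sampling space mentioned above, here is a back-of-the-envelope calculation. It assumes, purely for illustration, that each residue can adopt a fixed number of discrete backbone states (three by default); real conformational space is continuous, so the numbers only convey the scaling.

```python
def conformation_count(n_residues, states_per_residue=3):
    """Number of conformations if every residue independently adopts one of
    `states_per_residue` discrete states (an illustrative assumption)."""
    return states_per_residue ** n_residues

for n in (10, 50, 100):
    print(f"{n:>3} residues: {conformation_count(n):.2e} conformations")

# Ten residues already give 3**10 = 59049 conformations under this assumption.
```

Each additional residue multiplies the count by the number of states per residue, which is why exhaustive enumeration quickly becomes impractical and sampling or template-based shortcuts are needed.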
Protein Folding with AlphaFold2
AlphaFold2 is the latest neural network-based model for protein structure prediction from DeepMind, Google’s sister company under Alphabet. This software combines physical and biological knowledge of proteins with an improved machine learning algorithm to accurately predict protein structures, even when no similar structure is known. The astonishing success of AlphaFold2 is described in this open access Nature article.3 In the 14th biennial Critical Assessment of protein Structure Prediction (CASP14), the gold-standard assessment for structure prediction, AlphaFold2 vastly outperformed other methods. On the blind test structures, AlphaFold2 had a median backbone accuracy of 0.96 Å r.m.s.d. (95% confidence interval = 0.85–1.16 Å), compared to 2.8 Å r.m.s.d. (95% confidence interval = 2.7–4.0 Å) for the next best performing method. For reference, 0.96 Å is a little under two-thirds the length of a typical carbon-carbon single bond of 1.54 Å. Thus, AlphaFold2’s predictions approach atomic accuracy. The all-atom r.m.s.d. is similarly far better than that of other methods, and the approach scales to a long 2,180-residue protein without structural homologs. Furthermore, AlphaFold2 provides a per-residue reliability score for its predictions, since some protein domains may be predicted more confidently than others.
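For readers unfamiliar with the r.m.s.d. metric quoted above, the following minimal sketch computes the root-mean-square deviation between two sets of matched atomic coordinates. It assumes the two structures have already been superposed and lists the same atoms in the same order; the coordinates are made up, and the CASP14 figures additionally use a 95% residue coverage criterion that is not reproduced here.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z)
    coordinates, assumed to be pre-superposed and in matching atom order."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = 0.0
    for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b):
        total += (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
    return math.sqrt(total / len(coords_a))

# Made-up coordinates in Å: every atom is displaced by exactly 1 Å along z.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
experimental = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (2.0, 0.0, 1.0)]
print(rmsd(predicted, experimental))  # every atom is off by 1 Å, so the RMSD is 1.0

# The bond-length comparison from the text: 0.96 Å relative to a 1.54 Å C-C bond.
print(round(0.96 / 1.54, 2))
```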
So, what is the reasoning behind these excellent benchmarks? First, the input sequence is searched against sequence databases to generate a multiple sequence alignment (MSA), which reveals patterns of mutation in similar sequences. The idea is that a mutation of one amino acid tends to be accompanied by mutations of amino acids in close physical proximity in order to preserve the same structure, and the MSA may help identify these interacting amino acids.4 Additionally, AlphaFold2 tries to generate a starting model, or pair representation, with template fragments from similar structures. The principle behind this procedure is that protein structures remain relatively stable despite the accumulation of mutations.4 Therefore, AlphaFold2 doesn’t rely on sequence similarity as much as structural similarity. For example, during AlphaFold2’s testing, structures were excluded if they had a template in the training dataset with more than 40% sequence identity covering more than 1% of the chain.3
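The covariation idea described above can be illustrated with a classic, much simpler statistic than anything AlphaFold2 uses: the mutual information between two alignment columns. In the sketch below, the four-sequence `toy_msa` is invented for illustration, and `mutual_information` is a hypothetical helper; AlphaFold2 itself learns such relationships from the MSA rather than computing this score.

```python
import math
from collections import Counter

def mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j.
    High values mean the residues at the two positions tend to change together."""
    n = len(msa)
    col_i = [seq[i] for seq in msa]
    col_j = [seq[j] for seq in msa]
    count_i = Counter(col_i)
    count_j = Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), count in count_ij.items():
        p_ab = count / n
        p_a = count_i[a] / n
        p_b = count_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi

# Invented alignment: positions 1 and 3 always change together (K with E,
# R with D), while positions 0 and 2 never change at all.
toy_msa = [
    "AKLE",
    "AKLE",
    "ARLD",
    "ARLD",
]

print(mutual_information(toy_msa, 1, 3))  # perfectly covarying columns
print(mutual_information(toy_msa, 0, 1))  # a conserved column carries no signal
```

In this toy alignment, the covarying pair scores one bit while the conserved column scores zero, which mirrors the intuition that compensating mutations hint at residues in physical contact.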
Yet, the idea of MSA and pair representation is not new, which raises the question: what makes AlphaFold2 that much more accurate? The answer lies in the Evoformer, the central building block of AlphaFold2’s neural network. Previously, the geometric proximity deduced from the MSA served only as the end product of the structural prediction. In Evoformer, the information garnered from MSA analysis is still used to guide its pair representation.3,4 However, the pair representation is not only the product but also an intermediate layer, which is analyzed further to refine AlphaFold2’s structural hypothesis. This revised hypothesis can then be applied back to the MSA and pair representation through iterative refinement. There are 48 cycles of iterative refinement in Evoformer before the output is recycled into the whole network, and this process repeats three times.4 The AlphaFold2 authors provided evidence of hypothesis formation in Evoformer, citing the gradual improvement of its intermediate structures until no further refinement was made.3 Lastly, a key component of Evoformer is the Transformer architecture developed by Google Brain. The core idea of the Transformer is attention: the network learns which parts of its input to focus on.3 Harnessing the power of neural networks, the research team has developed software that is capable of reliably predicting protein structures to atomic accuracy, even without knowledge of homologous structures. In fact, the utility of AlphaFold2 has been demonstrated in a companion open access Nature paper, where AlphaFold2 was used to predict structures in the human proteome.5 Accompanying the rapid development of genomics, AlphaFold2 represents a significant advancement in the biophysics field.
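The attention idea mentioned above can be made concrete with generic scaled dot-product attention, the basic operation of the Transformer. This is only a plain-Python sketch with made-up numbers; the Evoformer uses specialized attention variants over the MSA and pair representations that are not reproduced here, and the function names are illustrative.

```python
import math

def softmax(scores):
    """Convert raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: each query produces a weighted average
    of the values, weighted by how well the query matches each key."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out = [sum(w * v[c] for w, v in zip(weights, values))
               for c in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Made-up example: the query matches the first key far better than the
# second, so the output lies much closer to the first value.
queries = [[4.0, 0.0]]
keys = [[4.0, 0.0], [0.0, 4.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
print(attention(queries, keys, values))
```

The weights are computed from the data itself, so the network can decide, input by input, which positions deserve attention.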
Made freely available, AlphaFold2 may be the start of a new proteomic revolution in structural biology and may accelerate developments in fields such as drug discovery, an increasingly important aspect of modern society.
References
1. Dill KA, Ozkan SB, Shell MS, Weikl TR. The protein folding problem. Annu Rev Biophys. 2008;37:289-316. doi: 10.1146/annurev.biophys.37.092707.153558.
2. “Scoring Tutorial.” Rosetta Commons. 14 Mar. 2022, https://new.rosettacommons.org/demos/latest/tutorials/scoring/scoring#scoring-in-rosetta
3. Jumper, J., Evans, R., Pritzel, A. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021). https://doi.org/10.1038/s41586-021-03819-2
4. “AlphaFold 2 is here: what’s behind the structure prediction miracle.” Oxford Protein Informatics Group. 14 Mar. 2022, https://www.blopig.com/blog/2021/07/alphafold-2-is-here-whats-behind-the-structure-prediction-miracle/#comment-532884
5. Tunyasuvunakool, K., Adler, J., Wu, Z. et al. Highly accurate protein structure prediction for the human proteome. Nature 596, 590–596 (2021). https://doi.org/10.1038/s41586-021-03828-1