
Protein language models with structural alphabet

19.12.2023

When trained on millions of protein sequences, large language models have been shown to develop emergent capabilities and to generalize across a range of applications, from the prediction of mutational effects to long-range contact prediction.

Similarly, unsupervised machine learning tools are able to cluster protein sequences, identifying homologous domains and improving functional and evolutionary analyses.

To encode structural information, van Kempen et al. (2023) have recently introduced a discretized 20-letter structural alphabet (3Di) describing the tertiary interactions between residues.

The 3Di alphabet enables density peak clustering of structures based on local Foldseek alignments, thus refining protein family classification in the twilight zone.
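The density peak clustering step can be illustrated with a minimal sketch of the Rodriguez–Laio algorithm it is based on. This is not the actual pipeline: the distance matrix below would, in practice, come from Foldseek alignment scores, and all function names and parameters here are illustrative.

```python
import numpy as np

def density_peak_clustering(dist, dc, n_clusters):
    """Minimal density peak clustering (Rodriguez & Laio, 2014).

    dist: symmetric pairwise distance matrix (in the real pipeline this
    would be derived from Foldseek alignments; toy data is used here).
    dc: cutoff distance for the local density estimate.
    """
    n = dist.shape[0]
    # local density: number of neighbours closer than dc (excluding self)
    rho = (dist < dc).sum(axis=1) - 1
    # delta: distance to the nearest point of higher density
    delta = np.zeros(n)
    nearest = np.zeros(n, dtype=int)
    order = np.argsort(-rho)  # indices in decreasing density
    delta[order[0]] = dist[order[0]].max()
    nearest[order[0]] = order[0]
    for i, idx in enumerate(order[1:], start=1):
        higher = order[:i]
        j = higher[np.argmin(dist[idx, higher])]
        delta[idx] = dist[idx, j]
        nearest[idx] = j
    # cluster centres: points with both high density and high delta
    centres = np.argsort(-(rho * delta))[:n_clusters]
    labels = -np.ones(n, dtype=int)
    labels[centres] = np.arange(n_clusters)
    # assign the rest to the cluster of the nearest higher-density
    # neighbour, sweeping in decreasing density order
    for idx in order:
        if labels[idx] == -1:
            labels[idx] = labels[nearest[idx]]
    return labels

# toy usage: two well-separated blobs of points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])
dist = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
labels = density_peak_clustering(dist, dc=1.0, n_clusters=2)
```

Cluster centres are the points that combine high local density with a large distance to any denser point, which is what lets the method pick out family representatives without fixing cluster shapes in advance.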

At the same time, new protein language models can be trained using this structure-aware vocabulary, finding applications, for instance, in predicting changes in protein stability upon point mutations.
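One way such a structure-aware vocabulary could be built is to pair each amino acid with a 3Di state, giving a joint token per residue. This is a hypothetical sketch, not the tokenizer of any specific model; the lower-case 3Di letters and the `encode` helper are illustrative conventions.

```python
# Hypothetical sketch of a joint sequence + structure vocabulary.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"          # the 20 standard residues
THREE_DI = "acdefghiklmnpqrstvwy"             # 3Di states, lower-cased here
                                              # to distinguish them visually

# one token per (residue, 3Di state) pair -> 20 x 20 = 400 tokens
vocab = {aa + s: i
         for i, (aa, s) in enumerate((aa, s)
                                     for aa in AMINO_ACIDS
                                     for s in THREE_DI)}

def encode(sequence, structure_states):
    """Map an amino-acid sequence and its per-residue 3Di string to ids."""
    assert len(sequence) == len(structure_states)
    return [vocab[aa + s] for aa, s in zip(sequence, structure_states)]

# toy usage: a 3-residue fragment with made-up 3Di states
ids = encode("MKV", "dpv")
```

Because each token carries both the residue identity and its local structural context, a language model trained on such tokens can, in principle, relate sequence changes to structural effects directly.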

Speaker: Marco Celoria, RIT (Area Science Park)