The Embedded Alphabet (TEA)

TEA is a novel approach that uses contrastive learning to convert protein language model embeddings into a new 20-letter alphabet, enabling highly sensitive and efficient large-scale protein homology searches without requiring structural information.

This website provides access to downloadable datasets of protein sequences converted with TEA. It will also shortly provide an interactive search service that lets you convert your protein of interest and search it against popular protein datasets.

The TEA sequence conversion command, model code, training scripts, and documentation are available on GitHub.

The Embedded Alphabet (TEA) on Hugging Face

For more detailed information, please refer to our preprint on bioRxiv.