Domain BERT
Masters research project in which I trained an encoder on domain-level sequences to obtain an embedding space covering 77 million proteins.
Abstract
Protein LLMs are now widely used for tasks such as structure and function prediction, and their utility depends on the quality of the learned embedding space. To build such spaces, transformer-based masked-token models are typically trained on amino-acid tokenisations. This study asks whether biological structure can be similarly captured when the tokens are protein domains instead. We train an encoder on domain-level sequences and obtain an embedding space covering 77 million proteins. The resulting space shows some organisation: embeddings cluster by similarity, embedding proximity correlates with multi-domain architecture order, and neighbours display Gene Ontology consistency, indicating that the model captures both domain composition and ordering. Despite limits in training scale and embedding dimensionality, these results provide a proof of concept that domain-based tokenisation is biologically informative. We conclude that protein LLMs trained on domain tokens can complement residue-level models and may support downstream applications such as protein function prediction and design.
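
To make the idea concrete, below is a minimal PyTorch sketch of masked-token training over domain tokens. This is not the project's actual code: the toy Pfam-style vocabulary, model hyperparameters, masking choice, and mean-pooling for protein embeddings are all illustrative assumptions.

```python
# Minimal sketch (assumed, not the project's code): masked-token training on
# domain-level sequences, where domain identifiers play the role that amino
# acids play in residue-level protein LLMs.
import torch
import torch.nn as nn

# Toy vocabulary of domain tokens plus special tokens (illustrative only).
domain_vocab = ["<pad>", "<mask>", "PF00001", "PF00002", "PF00069", "PF07714"]
tok2id = {t: i for i, t in enumerate(domain_vocab)}
PAD, MASK = tok2id["<pad>"], tok2id["<mask>"]

class DomainEncoder(nn.Module):
    """Small transformer encoder over domain-token sequences."""
    def __init__(self, vocab_size, d_model=64, n_heads=4, n_layers=2, max_len=32):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model, padding_idx=PAD)
        self.pos_emb = nn.Embedding(max_len, d_model)  # learned positions preserve domain order
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=128,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)  # predicts masked domain tokens

    def forward(self, ids):
        pos = torch.arange(ids.size(1), device=ids.device).unsqueeze(0)
        h = self.encoder(self.tok_emb(ids) + self.pos_emb(pos),
                         src_key_padding_mask=(ids == PAD))
        return self.lm_head(h), h  # token logits and per-token embeddings

# One toy multi-domain protein represented as an ordered sequence of domain accessions.
protein = ["PF00069", "PF07714", "PF00001"]
ids = torch.tensor([[tok2id[t] for t in protein]])

# Mask one position and train the model to recover the original domain token.
labels = ids.clone()
masked = ids.clone()
masked[0, 1] = MASK
labels[masked != MASK] = -100  # unmasked positions are ignored by the loss

model = DomainEncoder(len(domain_vocab))
logits, token_embeddings = model(masked)
loss = nn.functional.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1),
                                   ignore_index=-100)
loss.backward()  # one illustrative optimisation step

# A protein-level embedding can then be obtained by mean-pooling the token embeddings.
protein_embedding = token_embeddings.mean(dim=1)
```

Because positions are embedded alongside the domain tokens, representations of multi-domain proteins depend on domain order as well as composition, which is the property probed in the abstract.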