A toy DNA BERT based on ModernBERT

This is a small (5.3M parameter) DNA language model trained on coding sequences (the parts of the DNA that are transcribed to RNA, which is then translated into protein) for 13 vertebrate species. The tokenizer works at the single-base level and has 20 tokens: G, C, T and A, obviously, but also tokens for missing or uncertain bases. The tokenizer follows the full FASTA file format rules.
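The exact vocabulary isn't spelled out above, but the arithmetic works out neatly under one plausible reading, sketched below: the four bases, the eleven IUPAC ambiguity codes allowed in FASTA files, the gap character, and four assumed special tokens sum to 20. Treat the token names and special tokens as assumptions, not the model's actual vocabulary.

```python
# A minimal sketch of a single-base FASTA tokenizer. The vocabulary is an
# assumption: 4 bases + 11 IUPAC ambiguity codes + the gap character give
# 16 sequence tokens, and 4 assumed special tokens bring the total to 20.
SPECIAL = ["[PAD]", "[CLS]", "[SEP]", "[MASK]"]  # assumed special tokens
BASES = list("ACGT")                             # unambiguous bases
AMBIGUITY = list("RYSWKMBDHVN")                  # IUPAC codes for uncertain bases
GAP = ["-"]                                      # gap / missing base

VOCAB = {tok: i for i, tok in enumerate(SPECIAL + BASES + AMBIGUITY + GAP)}
assert len(VOCAB) == 20

def encode(seq: str) -> list[int]:
    """Map a FASTA sequence to token ids, one id per base."""
    return [VOCAB["[CLS]"]] + [VOCAB[ch] for ch in seq.upper()] + [VOCAB["[SEP]"]]

print(encode("ATGGCN-"))  # e.g. [1, 4, 7, 6, 6, 5, 18, 19, 2]
```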

In initial training on a MacBook for ±50 million tokens it reached a loss of 1.12, which corresponds to a mean token probability of exp(-1.12) ≈ 32.6%. The DNA we train on is absolutely dominated by the G, C, T and A bases, with only a very small proportion of unknown bases (N) and gaps (-). Assuming near-complete G, C, T, A content, random guessing would yield ±25% probability, so 32.6% is significant progress.

That said, simple biological rules for the structure of vertebrate coding sequences already guarantee that the first 3 bases of any coding sequence in this dataset form the start codon (ATG), that none of the in-frame codons after it form a stop codon until the end of the sequence, and that each species has specific affinities for synonymous codons. In other words, you could write a simple script to predict bases in a coding sequence that would get > 25% accuracy.
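As a concrete illustration of both points, the hypothetical sketch below converts the loss to a per-token probability and implements the kind of crude rule-based baseline described: always predict ATG for the first three bases, then fall back to the most frequent base. The helper name, sequences, and frequencies are made up for illustration; a real baseline would use per-species codon usage tables.

```python
import math
from collections import Counter

# Loss -> mean per-token probability: exp(-1.12) ~ 0.326, i.e. 32.6%.
print(f"model: {math.exp(-1.12):.1%} vs. uniform A/C/G/T guessing: 25.0%")

def rule_based_predict(position: int, base_freqs: dict[str, float]) -> str:
    """Crude baseline: every CDS starts with ATG; elsewhere, guess the
    most frequent base overall (hypothetical helper, illustration only)."""
    if position < 3:
        return "ATG"[position]
    return max(base_freqs, key=base_freqs.get)

# Estimate base frequencies from a made-up toy 'training set'.
train = ["ATGGCTGCTTAA", "ATGGGCACGTGA"]
counts = Counter("".join(train))
total = sum(counts.values())
freqs = {b: counts[b] / total for b in "ACGT"}

seq = "ATGGCC"
preds = [rule_based_predict(i, freqs) for i in range(len(seq))]
accuracy = sum(p == s for p, s in zip(preds, seq)) / len(seq)
print(preds, f"-> {accuracy:.0%} on this toy sequence")
```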
