A machine learning model that predicts ATG and near-cognate codon translation initiation sites (TIS) in nucleotide sequences
TIS Predictor is a machine learning tool developed to predict translation
initiation sites in a given nucleotide sequence [1]. It was trained on
instances of ATG and near-cognate translation initiation. It outperforms
TITER [2], a recently developed model built for the same purpose, by more
than 18% in accuracy.
Please cite this publication and/or this publication if predictions are used.
ATG translation initiation codons are predicted with about 87.79% accuracy,
and the near-cognate codons CTG, GTG, and TTG with 85.00% accuracy.
Predictions for less common near-cognate codons (AAG, AGG, ACG, ATC, ATT,
and ATA) are extrapolated from patterns learned from CTG, GTG, and TTG
instances. Predictions for codons that the software was trained on are shown
in bold. Extrapolated codons are shown without bold formatting and should be
treated with less confidence.
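The distinction between trained and extrapolated codons can be illustrated with a short sketch. This is not the TIS Predictor model itself: it only enumerates candidate codons in a sequence and labels each one by the group it belongs to, using the codon lists given above. The function names `find_candidate_codons` and `render`, the 0-based positions, and the `**...**` stand-in for bold output are assumptions made for illustration.

```python
# Codon groups as listed in the text; everything else here is illustrative.
TRAINED_CODONS = {"ATG", "CTG", "GTG", "TTG"}
EXTRAPOLATED_CODONS = {"AAG", "AGG", "ACG", "ATC", "ATT", "ATA"}


def find_candidate_codons(sequence):
    """Return (position, codon, category) for every candidate start codon.

    Positions are 0-based offsets of the first base of the codon. The window
    advances one base at a time, so all three reading frames are covered.
    """
    seq = sequence.upper().replace("U", "T")  # accept RNA input as well
    candidates = []
    for i in range(len(seq) - 2):
        codon = seq[i:i + 3]
        if codon in TRAINED_CODONS:
            candidates.append((i, codon, "trained"))
        elif codon in EXTRAPOLATED_CODONS:
            candidates.append((i, codon, "extrapolated"))
    return candidates


def render(candidates):
    """Hypothetical display helper: trained codons marked as bold, others plain."""
    return [
        f"**{codon}**@{pos}" if category == "trained" else f"{codon}@{pos}"
        for pos, codon, category in candidates
    ]


if __name__ == "__main__":
    hits = find_candidate_codons("CCATGACGCC")
    print(render(hits))
```

A real predictor would additionally score each candidate from its sequence context. The sketch only shows which candidates would fall into the higher-confidence (trained) and lower-confidence (extrapolated) groups.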
Previously identified translation initiation sites upstream of repeat
expansions in the C9orf72, FMR1, DM1, and HDL2 genes, which are associated
with neurologic disease, were compiled and used as a reference for
evaluating software performance.
Table 1. Previously identified translation initiation sites from publications.

| Gene | Codon | Number of Bases Upstream of Repeat | Peptide Repeat Translated | Kozak Similarity Score |
|---|---|---|---|---|
| C9orf72 (Sense) [3] | AGG | 1 | Poly-GR | 0.66 |
| C9orf72 (Sense) [3] | CTG | 24 | Poly-GA | 0.69 |
| C9orf72 (Antisense) [3] | ATG | 194 | Poly-PG | 0.61 |
| FMR1 | GTG | 11 | Poly-G | 0.70 |
| FMR1 | ACG | 35 | Poly-G | 0.80 |
| FMR1 | ACG | 60 | Poly-G | 0.71 |
| DM1 (Antisense) with slightly modified sequence [6] | ATC | 7 | Poly-A | 0.61 |
| DM1 (Antisense) with slightly modified sequence [6] | ATG | 17 | Poly-S | 0.66 |
| DM1 (Antisense) with slightly modified sequence [6] | ATT | 23 | Poly-S | 0.74 |
| HDL2 (Antisense) [6] | ATC | 6 | Poly-Q | 0.74 |
All translation initiation sites were correctly identified, with one
exception: the ATC codon seven bases upstream of the repeat in the DM1
antisense sequence [6].
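As a quick arithmetic check of the statement above, the reference sites from Table 1 can be tallied against the single reported miss. The sketch below is only bookkeeping over the table's rows. Attributing each row to a gene assumes the table's merged gene cells apply to the rows beneath them, and the name `recovery_rate` is hypothetical.

```python
# Reference sites from Table 1: (gene, codon, bases upstream of repeat).
# Gene labels for rows under a merged cell are inferred from the table layout.
REFERENCE_SITES = [
    ("C9orf72 (Sense)", "AGG", 1),
    ("C9orf72 (Sense)", "CTG", 24),
    ("C9orf72 (Antisense)", "ATG", 194),
    ("FMR1", "GTG", 11),
    ("FMR1", "ACG", 35),
    ("FMR1", "ACG", 60),
    ("DM1 (Antisense)", "ATC", 7),
    ("DM1 (Antisense)", "ATG", 17),
    ("DM1 (Antisense)", "ATT", 23),
    ("HDL2 (Antisense)", "ATC", 6),
]

# The one site the text reports as not identified.
MISSED = {("DM1 (Antisense)", "ATC", 7)}


def recovery_rate(reference, missed):
    """Return (sites found, total sites, fraction found)."""
    found = [site for site in reference if site not in missed]
    return len(found), len(reference), len(found) / len(reference)


found, total, rate = recovery_rate(REFERENCE_SITES, MISSED)
print(f"{found}/{total} reference sites recovered ({rate:.0%})")
```

Under these assumptions, 9 of the 10 reference sites are recovered.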