AraClean is an error-corrected and redundancy-reduced database of
Arabidopsis thaliana sequences extracted from GenBank rel. 87.
Version 1.1 holds nucleotide sequences and annotation for 142 GenBank entries,
as described in:
Cleaning the GenBank Arabidopsis thaliana data set,
P.G. Korning, S.M. Hebsgaard, P. Rouze and S. Brunak,
Nucl. Acids Res., 24, 316-320, 1996.
Due to the discovery of what seem to be additional errors, we have
excluded two more sequences; see below.
Format of AraClean.
This data set contains 142 entries comprising 144 genes with 745 donor
sites and 746 acceptor sites. The sequence format in the files is as follows:
One line gives the sequence length, the GenBank locus, the accession
number, the NCBI number, and the organism indication. Subsequent lines
give the nucleotide sequence with 80 characters per line. After the
nucleotide sequence follows the annotation in a non-numeric symbolic format:
. (period) represents intergenic or unannotated sequence
B represents transcription start
M represents transcribed untranslated exon sequence
E represents translated exon sequence
D represents the first nucleotide in intron sequence (donor site)
A represents the last nucleotide in intron sequence (acceptor site)
I represents intron sequence
S represents transcription stop
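
The layout described above can be read with a short script. The following
is only a minimal sketch, not part of the AraClean distribution. It assumes
that the header fields are separated by whitespace, that the annotation is
wrapped at 80 characters per line like the sequence, and that the annotation
has the same length as the sequence. The names Entry, parse_entries and
count_sites, and the toy record, are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Tuple


@dataclass
class Entry:
    length: int
    locus: str
    accession: str
    ncbi: str
    organism: str
    sequence: str
    annotation: str


def parse_entries(lines: Iterable[str], width: int = 80) -> Iterator[Entry]:
    """Yield entries from AraClean-style lines (assumed layout, see lead-in)."""
    # Drop blank lines; annotation lines made only of periods are kept.
    rows = [ln.strip() for ln in lines if ln.strip()]
    i = 0
    while i < len(rows):
        # Assumption: header fields are whitespace-separated, in this order.
        fields = rows[i].split()
        length = int(fields[0])
        locus, accession, ncbi = fields[1:4]
        organism = " ".join(fields[4:])
        i += 1
        n_rows = -(-length // width)  # ceiling division
        sequence = "".join(rows[i:i + n_rows])
        i += n_rows
        annotation = "".join(rows[i:i + n_rows])
        i += n_rows
        if len(sequence) != length or len(annotation) != length:
            raise ValueError(f"{locus}: sequence/annotation length mismatch")
        yield Entry(length, locus, accession, ncbi, organism,
                    sequence, annotation)


def count_sites(annotation: str) -> Tuple[int, int]:
    """Return (donor sites, acceptor sites) as counts of D and A symbols."""
    return annotation.count("D"), annotation.count("A")


# Toy record, invented for illustration only (not a real AraClean entry).
toy = [
    "12 TOYLOCUS X00000 000000 A.thaliana",
    "ATGGTTTAGGCT",
    "EEEDIIIIAEEE",
]
for entry in parse_entries(toy):
    print(entry.locus, count_sites(entry.annotation))
```

Summing the two counts over all entries of the real file should reproduce
the donor and acceptor totals quoted above, provided the assumptions about
the layout hold.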
Comments on what seem to be additional errors in two sequences
NOT included in the present file (ara87.1.1.seq):
We have further reduced the set by two sequences because of an
apparently mispositioned exon in ATSUCSYN, and one wrong and one very
suspicious acceptor site in ATU08315.
ATU08315: homology with z18242 indicates that the third intron is
misplaced and should lie six positions upstream (2043/2044 instead of
2049/2050). Homology with U20502 and z35108 confirms this.
There is no close cDNA homolog for the last acceptor site, which is
highly suspicious; its position remains largely uncertain, and we
have consequently discarded the entry.
ATSUCSYN: when compared to other sucrose synthases, the N-terminal
region shows no homology up to the intron position, although these
sequences are otherwise quite well conserved. When one looks for
these conserved sequence elements, they can be found, with the first ATG
beginning at position 585 (instead of 464) and a possible donor site at
671/672. Here is the encoded protein sequence, compared to some others.
The maize sequence is of particular interest, since it is a well-known
gene with no doubt about the intron position (/). The others come from
cDNA sequences.
The problem is that this exon ... is NOT in phase with the following
one (and, given only the available sequence, there is no good
alternative solution). This suggests that there are likely sequencing
errors in that area (maybe two T's are lacking between positions 664 and
666, which would give the canonical ending of the exon, SLFSR, and
the correct frame for the splice site). This could be the cause, since
such a correction would give a polyT tract, which in turn may cause
problems in sequencing for the precise determination of the number
of T's. On the basis of this we have also discarded the entry from the