AraClean V1.1

Release notes

AraClean is an error-corrected, redundancy-reduced database of Arabidopsis thaliana sequences extracted from GenBank release 87.
Version 1.1 holds nucleotide sequences and annotation for 142 GenBank entries as described in:

Cleaning the GenBank Arabidopsis thaliana data set,
P.G. Korning, S.M. Hebsgaard, P. Rouze and S. Brunak,
Nucl. Acids Res., 24, 316-320, 1996.

Because we have found what appear to be additional errors, we have excluded two more sequences; see below.

Format of AraClean.

This data set contains 142 entries comprising 144 genes with 745 donor sites and 746 acceptor sites. The sequence format in the files is as follows. One header line gives the sequence length, the GenBank locus, the accession number, the NCBI number, and the organism. Subsequent lines give the nucleotide sequence, 80 characters per line. The annotation follows the nucleotide sequence in a non-numeric symbolic notation, where

.  (period) represents intergenic or unannotated sequence
B  represents transcription start
M  represents transcribed untranslated exon sequence
E  represents translated exon sequence
D  represents the first nucleotide in intron sequence (donor site)
A  represents the last nucleotide in intron sequence (acceptor site)
I  represents intron sequence
S  represents transcription stop
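
The layout described above can be read with a few lines of Python. This is only a minimal sketch under stated assumptions, not the tool used to build AraClean: it assumes the header fields are separated by whitespace with the sequence length first, that the annotation is wrapped like the sequence and has exactly the same length, and that blank lines between entries can be skipped. The function names and the toy entry are hypothetical.

```python
from collections import Counter


def _read_block(lines, length):
    """Concatenate wrapped lines until `length` characters are collected."""
    parts, got = [], 0
    while got < length:
        line = next(lines, None)
        if line is None:
            raise ValueError("file ended inside an entry")
        chunk = line.strip()
        parts.append(chunk)
        got += len(chunk)
    block = "".join(parts)
    if len(block) != length:
        raise ValueError("block length does not match header")
    return block


def parse_araclean(lines):
    """Yield (header_fields, sequence, annotation) for each entry."""
    lines = iter(lines)
    for header in lines:
        if not header.strip():
            continue  # assumption: blank lines between entries are harmless
        fields = header.split()
        length = int(fields[0])  # assumption: length is the first header field
        sequence = _read_block(lines, length)
        annotation = _read_block(lines, length)  # one symbol per nucleotide
        yield fields, sequence, annotation


# Hypothetical toy entry, not taken from AraClean.
toy = """12 TOYLOCUS X00000 0000000 A.thaliana
ATGGTAAAGTAA
EEEDIIIIAEEE
"""

for fields, sequence, annotation in parse_araclean(toy.splitlines()):
    counts = Counter(annotation)
    # Print locus, sequence length, number of donor and acceptor sites.
    print(fields[1], len(sequence), counts["D"], counts["A"])
```

Because every annotation symbol corresponds to one nucleotide, counting the D and A symbols over all entries is one way to check the donor and acceptor totals quoted above. Only the first header field is interpreted, so an organism name containing spaces does not affect the parsing.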


Comments on apparent additional errors in two sequences NOT included in the present file (ara87.1.1.seq):

We have further reduced the set by two sequences because of a seemingly mispositioned exon in ATSUCSYN, and one wrong and one very suspicious acceptor site in ATU08315.

ATU08315: homology with z18242 indicates that the third intron is misplaced and should lie six positions earlier (2043/2044 instead of 2049/2050). Homology with U20502 and z35108 confirms this. The last acceptor site has no close cDNA homolog and is highly suspicious; because its position remains largely uncertain, we have discarded the entry.

ATSUCSYN: when compared to the other sucrose synthases, the N-terminus shows no homology up to the intron position, although these sequences are otherwise well conserved. When one looks for the conserved sequence elements, they can be found, with the first ATG beginning at position 585 (instead of 464) and a possible donor site at 671/672. Here is the encoded protein sequence, compared to some others. The maize sequence is of particular interest, since it comes from a well-known gene with no doubt about the intron position (/). The others come from cDNA sequences.


The problem is that this exon ... is NOT in phase with the following one (and, given only the available sequence, there is no good alternative solution). This suggests that there are likely sequencing errors in that area: perhaps two T's are missing between positions 664 and 666, which would give the canonical ending of the exon, SLFSR, and the correct frame for the splice site. This could well be the cause, since such a correction would create a poly-T tract, and in such tracts the exact number of T's is hard to determine during sequencing. On this basis we have also discarded the entry from the data set.
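
The phase argument can be illustrated on the symbolic annotation itself. The sketch below is not the procedure used for ATSUCSYN; it assumes a single gene annotated on the forward strand and defines the phase of an intron as the number of translated (E) positions before its donor site (D), modulo 3. The function name and both toy annotations are hypothetical. It shows that adding two nucleotides to an exon shifts the phase of the following intron by two, which is the kind of change the proposed correction would produce.

```python
def intron_phases(annotation):
    """Return the phase (0, 1 or 2) of each intron in an annotation string.

    The phase is the number of translated (E) positions seen before the
    donor site (D), modulo 3. Assumes one gene on the forward strand.
    """
    phases = []
    coding = 0
    for symbol in annotation:
        if symbol == "E":
            coding += 1
        elif symbol == "D":
            phases.append(coding % 3)
    return phases


# Hypothetical toy annotations, not taken from ATSUCSYN.
as_annotated = "EEEEEEE" + "DIIA" + "EEEEE"       # 7 coding positions before the intron
two_bases_added = "EEEEEEEEE" + "DIIA" + "EEEEE"  # same exon with 2 extra positions

print(intron_phases(as_annotated))     # phase 1
print(intron_phases(two_bases_added))  # phase 0
```

Comparing the phase obtained this way with the phase expected from homologous genes is one way to flag an exon that cannot be joined in frame to the next one.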