Generating artificial datasets to test the accuracy of motif search applications is an important part of research on pattern discovery. The ABS database provides a program, Constructor, that lets users create and customize their own datasets for benchmarking tests.

We implemented the generation in two steps: first, real binding sites are chosen from the ABS collections; second, these sites are embedded in artificial sequences generated according to the composition chosen by the user.
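The two-step procedure described above can be sketched roughly as follows. This is only an illustration, not the Constructor source code: the function name `embed_sites`, the uniform background, and the rule that sites must not overlap are all assumptions made for the example.

```python
import random


def embed_sites(sites, length, rng):
    """Embed binding sites in a random background sequence.

    Returns the sequence (background in lowercase, sites in uppercase)
    and a list of (site, start, end) tuples with 1-based inclusive
    coordinates.
    """
    # Assumption: a uniform background; the real tool lets the user
    # choose the composition.
    seq = [rng.choice("acgt") for _ in range(length)]
    occupied = [False] * length
    placements = []
    for site in sites:
        if len(site) > length:
            raise ValueError("site is longer than the sequence")
        # Assumption: sites may not overlap, so retry a bounded number
        # of times until a free window is found.
        for _ in range(1000):
            start = rng.randrange(length - len(site) + 1)
            end = start + len(site)
            if not any(occupied[start:end]):
                break
        else:
            raise RuntimeError("could not place site without overlap")
        seq[start:end] = site.upper()
        occupied[start:end] = [True] * len(site)
        placements.append((site.upper(), start + 1, end))
    return "".join(seq), placements
```

Calling `embed_sites(["TATAAA", "GGGCGG"], 500, random.Random(0))` with two placeholder sites (not taken from ABS) would return one 500-character sequence together with the coordinates of both embedded sites.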

The benchmark consists of N sequences, each of which can be drawn as a horizontal line on which boxes represent the locations of the motifs. For instance, the following series of examples shows the effect of two parameters (number of motifs and number of sequences):

The length of the sequences can be changed by selecting one of the standard sizes (500, 1000, 2000, 5000):

Moreover, users can set the probability (between 0 and 1) that a motif is present in a given sequence, which makes it possible to produce more complex datasets like this:
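One plausible reading of this parameter is an independent random draw for every motif in every sequence. The sketch below assumes exactly that; the function name `choose_motifs` is hypothetical, and the actual Constructor program may apply the probability differently.

```python
import random


def choose_motifs(motifs, presence_probability, rng):
    """Return the subset of motifs to embed in one sequence.

    Assumption: each motif is kept independently with the given
    probability (0 means never present, 1 means always present).
    """
    if not 0.0 <= presence_probability <= 1.0:
        raise ValueError("presence_probability must be between 0 and 1")
    return [m for m in motifs if rng.random() < presence_probability]
```

With a probability of 1 every sequence carries every motif; lower values leave some sequences without some (or all) of the motifs, which is what makes the resulting dataset harder for a motif finder.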

Finally, it is possible to choose the composition of the background sequences and the species from which the binding sites are taken.
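As a minimal sketch of what a user-defined background composition could mean, the example below draws each position independently from given nucleotide frequencies. Both the function name `random_background` and the independent-position model are assumptions; the composition options actually offered by Constructor are not detailed here.

```python
import random


def random_background(length, composition, rng):
    """Generate a lowercase background sequence.

    `composition` maps each nucleotide to its relative frequency.
    Assumption: positions are drawn independently of one another.
    """
    bases = sorted(composition)
    weights = [composition[b] for b in bases]
    return "".join(rng.choices(bases, weights=weights, k=length)).lower()
```

For instance, `random_background(500, {"a": 0.3, "c": 0.2, "g": 0.2, "t": 0.3}, random.Random(0))` would produce an AT-rich background of 500 positions.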

The output of Constructor is divided into four parts:

I. The artificial sequences

-> The embedded motifs are highlighted in uppercase.

II. The positions of the real motifs in the artificial sequences

-> The annotations are shown in GFF format.

III. The original annotations of these motifs in the ABS database

-> The annotations are shown in GFF format.

IV. The graphical plot

The complete example output can be examined here
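To give an idea of how an annotation from part II could be written, the sketch below formats one motif position as a tab-separated GFF line with the nine standard columns (sequence name, source, feature, start, end, score, strand, frame, attributes). The function name `gff_line` and the default values for the source and feature columns are placeholders chosen for this illustration, not the values Constructor actually prints.

```python
def gff_line(seqname, start, end, strand="+",
             source="Constructor", feature="motif", attributes="."):
    """Format one motif position as a tab-separated GFF line.

    Coordinates are 1-based and inclusive, as in GFF. The score and
    frame columns are left undefined (".").
    """
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    fields = [seqname, source, feature, str(start), str(end),
              ".", strand, ".", attributes]
    return "\t".join(fields)
```

A call such as `gff_line("seq1", 120, 125)` yields a single line whose fourth and fifth columns hold the start and end of the embedded motif.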

