********************************************************

NAME
        matscan  - a program to search putative TF binding sites in genomic sequences
SYNOPSIS
        matscan
                [-sWClmjx]
                [-T threshold]
                [-c context]
                [-v] [-h]
                <locus_seq_in_fasta_format> <matrix_transfac_format>
OPTIONS
        -T: Threshold for the predictions (0..1)
        -W: Only Forward sense prediction (Watson)
        -C: Only Reverse sense prediction (Crick)
        -x: Sorting the predictions (output)
        -c: Print the context of the site (requires a number)
        -m: Using log-likelihood matrices
        -s: Display for each site, the value of the sum
        -j: Using JASPAR matrices
        -l: Using MEME matrices
        -v: Verbose. Display info messages
        -h: Show this help

********************************************************

USING THE PROGRAM:


(1) Regulatory sequences must be in FASTA format:

Multi-FASTA files are accepted, with no limit on sequence size

(2) Matrices are accepted in several formats:

Multi-matrix files are possible (use "//" as the separator between matrices)

(3) Output is written in GFF format

(4) Examples:

--Position frequency matrices (pfm):

TBP
1   61   145   152   31
2   16   46   18   309
3   352   0   2   35
4   3   10   2   374
5   354   0   5   30
6   268   0   0   121
7   360   3   10   6
8   222   2   44   121
9   155   44   157   33
10   56   135   150   48
11   83   147   128   31
12   82   127   128   52
13   82   118   128   61
14   68   107   139   75
15   77   101   140   71
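
As a rough illustration of the layout above, the following Python sketch reads a matrix file made of a name line, one row per position, and "//" lines between matrices. The function name parse_matrices is hypothetical, and the A, C, G, T column order is an assumption (it is not stated here, although it is consistent with the score sums shown in the -s example further down). This is not the parser that matscan itself uses.

```python
def parse_matrices(text):
    """Parse 'name + position rows' blocks separated by '//' lines."""
    matrices = {}
    name, rows = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == "//":
            # Separator: store the matrix read so far and start a new one
            if name is not None:
                matrices[name] = rows
            name, rows = None, []
        elif name is None:
            name = line
        else:
            fields = line.split()
            # fields[0] is the position index; the next four are the counts
            # (assumed order: A, C, G, T)
            rows.append([float(x) for x in fields[1:5]])
    if name is not None:
        matrices[name] = rows
    return matrices


# First two positions of the TBP matrix only, to keep the example short
example = """TBP
1   61   145   152   31
2   16   46   18   309
//
"""
tbp = parse_matrices(example)["TBP"]
print(len(tbp), tbp[0])
```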

----------------

%bin/matscan -T 0.85 samples/rhodopsines.fa samples/matrices.pfm | grep MatScan
rh3     MatScan TBP             968     982      0.95   +       .       # GTATAAAAGCCCCAA
rh3     MatScan V$TATA_01       968     982      0.95   +       .       # GTATAAAAGCCCCAA
rh5     MatScan TBP             779     793      0.85   +       .       # ATATAAATAAAAGCG
rh5     MatScan TBP             972     986      0.88   +       .       # CTATAAAAGCATTTT
rh5     MatScan V$TATA_01       779     793      0.85   +       .       # ATATAAATAAAAGCG
rh5     MatScan V$TATA_01       972     986      0.88   +       .       # CTATAAAAGCATTTT
rh6     MatScan TBP             971     985      0.85   +       .       # ATATAAAGACCAGAC
rh6     MatScan V$TATA_01       971     985      0.85   +       .       # ATATAAAGACCAGAC

* 0.85 is the cut-off: only sites scoring at least 85% of the maximum score of the matrix are reported (default: 0.85)
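
The hit lines above can be post-processed with a few lines of Python. The sketch below assumes the column order seen in this example (locus, program, matrix name, start, end, score, strand, frame, then the site after "#"). The names Hit and parse_hit are hypothetical helpers, not part of matscan.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    locus: str
    matrix: str
    start: int
    end: int
    score: float
    strand: str
    site: str


def parse_hit(line):
    """Split one MatScan output line into its whitespace-separated fields."""
    columns, _, comment = line.partition("#")
    locus, source, matrix, start, end, score, strand, frame = columns.split()
    return Hit(locus, matrix, int(start), int(end), float(score),
               strand, comment.strip())


line = "rh3     MatScan TBP             968     982      0.95   +       .       # GTATAAAAGCCCCAA"
hit = parse_hit(line)
print(hit.locus, hit.matrix, hit.start, hit.end, hit.score, hit.site)
```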

----------------

%bin/matscan -sT 0.85 samples/rhodopsines.fa samples/matrices.pfm | grep MatScan

rh3     MatScan TBP             968     982     3220.00 +       .       # GTATAAAAGCCCCAA
rh3     MatScan V$TATA_01       968     982     3220.00 +       .       # GTATAAAAGCCCCAA
rh5     MatScan TBP             779     793     2950.00 +       .       # ATATAAATAAAAGCG
rh5     MatScan TBP             972     986     3018.00 +       .       # CTATAAAAGCATTTT
rh5     MatScan V$TATA_01       779     793     2950.00 +       .       # ATATAAATAAAAGCG
rh5     MatScan V$TATA_01       972     986     3018.00 +       .       # CTATAAAAGCATTTT
rh6     MatScan TBP             971     985     2938.00 +       .       # ATATAAAGACCAGAC
rh6     MatScan V$TATA_01       971     985     2938.00 +       .       # ATATAAAGACCAGAC

* the option -s reports the total score (sum) of each hit instead of the percentage score (the default)
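
The sums and percentage scores in the examples above can be reproduced with the short Python sketch below, under assumptions inferred from the sample output rather than taken from the matscan source: the matrix columns are A, C, G, T; the sum adds the count of the observed nucleotide at each position; and the percentage is (sum - minimum) / (maximum - minimum), where minimum and maximum are the lowest and highest sums the matrix can produce. A plain sum / maximum ratio would give 0.87 for the first rh5 site, whereas the rescaled form gives the reported 0.85.

```python
# TBP position frequency matrix from the example (assumed column order: A, C, G, T)
TBP = [
    [61, 145, 152, 31], [16, 46, 18, 309], [352, 0, 2, 35],
    [3, 10, 2, 374], [354, 0, 5, 30], [268, 0, 0, 121],
    [360, 3, 10, 6], [222, 2, 44, 121], [155, 44, 157, 33],
    [56, 135, 150, 48], [83, 147, 128, 31], [82, 127, 128, 52],
    [82, 118, 128, 61], [68, 107, 139, 75], [77, 101, 140, 71],
]
INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}


def raw_score(matrix, site):
    """Sum the matrix entry of the observed nucleotide at each position."""
    return sum(row[INDEX[base]] for row, base in zip(matrix, site))


def relative_score(matrix, site):
    """Rescale the raw sum between the lowest and highest possible sums."""
    lowest = sum(min(row) for row in matrix)
    highest = sum(max(row) for row in matrix)
    return (raw_score(matrix, site) - lowest) / (highest - lowest)


site = "GTATAAAAGCCCCAA"
print(raw_score(TBP, site))                 # 3220
print(round(relative_score(TBP, site), 2))  # 0.95
```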

----------------

%bin/matscan -T 0.85 -c 5 samples/rhodopsines.fa samples/matrices.pfm | grep MatScan

rh3     MatScan TBP             968     982      0.95   +       .       # GGGCC GTATAAAAGCCCCAA GCTGG
rh3     MatScan V$TATA_01       968     982      0.95   +       .       # GGGCC GTATAAAAGCCCCAA GCTGG
rh5     MatScan TBP             779     793      0.85   +       .       # ATTAT ATATAAATAAAAGCG TAGAT
rh5     MatScan TBP             972     986      0.88   +       .       # GCGGG CTATAAAAGCATTTT GACGG
rh5     MatScan V$TATA_01       779     793      0.85   +       .       # ATTAT ATATAAATAAAAGCG TAGAT
rh5     MatScan V$TATA_01       972     986      0.88   +       .       # GCGGG CTATAAAAGCATTTT GACGG
rh6     MatScan TBP             971     985      0.85   +       .       # TGGCC ATATAAAGACCAGAC GACCA
rh6     MatScan V$TATA_01       971     985      0.85   +       .       # TGGCC ATATAAAGACCAGAC GACCA

* the option -c N prints the context of each site (N nucleotides on each side of the site)
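
To make the coordinate arithmetic behind the context explicit, here is a minimal Python sketch. It assumes 1-based inclusive start/end positions, as in the output above, and uses a made-up toy sequence rather than the real rh3 locus. The function name site_with_context is hypothetical.

```python
def site_with_context(sequence, start, end, n):
    """Return (left flank, site, right flank) for 1-based inclusive coordinates."""
    # Flanks are truncated when the site lies near either end of the sequence
    left = sequence[max(0, start - 1 - n):start - 1]
    site = sequence[start - 1:end]
    right = sequence[end:end + n]
    return left, site, right


# Toy sequence (not the real rh3 locus): the site occupies positions 10..24
toy = "TTTT" + "GGGCC" + "GTATAAAAGCCCCAA" + "GCTGG" + "AAAA"
print(" ".join(site_with_context(toy, 10, 24, 5)))
```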

----------------

* the options -W and -C restrict the prediction to one strand of the sequence (-W: forward, -C: reverse)
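
Scanning the reverse (Crick) strand is conceptually the same as scanning the reverse complement of the input sequence. The Python sketch below shows only that transformation. How matscan implements it internally, and how it reports coordinates for reverse-strand hits, is not described here, so this is an illustration and not the program's method.

```python
# Translation table for DNA complements (upper and lower case)
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


forward_site = "GTATAAAAGCCCCAA"
print(reverse_complement(forward_site))
```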

----------------

%bin/matscan -T 0.85 -x samples/rhodopsines.fa samples/matrices.pfm | grep MatScan

rh3     MatScan TBP             968     982      0.95   +       .       # GTATAAAAGCCCCAA
rh3     MatScan V$TATA_01       968     982      0.95   +       .       # GTATAAAAGCCCCAA
rh5     MatScan TBP             779     793      0.85   +       .       # ATATAAATAAAAGCG
rh5     MatScan TBP             972     986      0.88   +       .       # CTATAAAAGCATTTT
rh5     MatScan V$TATA_01       779     793      0.85   +       .       # ATATAAATAAAAGCG
rh5     MatScan V$TATA_01       972     986      0.88   +       .       # CTATAAAAGCATTTT
rh6     MatScan TBP             971     985      0.85   +       .       # ATATAAAGACCAGAC
rh6     MatScan V$TATA_01       971     985      0.85   +       .       # ATATAAAGACCAGAC

* the option -x sorts the hits by position (default: by locus name)

----------------

--Position log-likelihood weight matrices (lwm):

- the original PFM has been compared against a uniform random background model
to produce the log-likelihood matrix

TBP
1   -0.466   0.399   0.447   -1.143
2   -1.805   -0.749   -1.687   1.156
3   1.286   -5.270   -3.884   -1.022
4   -3.479   -2.275   -3.884   1.347
5   1.292   -5.270   -2.968   -1.176
6   1.014   -5.270   -5.270   0.219
7   1.335   -3.453   -2.249   -2.759
8   0.825   -3.884   -0.793   0.219
9   0.466   -0.793   0.479   -1.081
10   -0.552   0.328   0.433   -0.706
11   -0.158   0.413   0.275   -1.143
12   -0.171   0.267   0.275   -0.626
13   -0.171   0.193   0.275   -0.466
14   -0.358   0.096   0.357   -0.260
15   -0.233   0.038   0.364   -0.315
//
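
The values above can be reproduced from the TBP counts with the Python sketch below, under assumptions inferred from the numbers and not confirmed by the program source: each entry is the natural logarithm of (count / row total) / 0.25, and a zero count is replaced by 0.5 before taking the logarithm (which yields the -5.270 entries). The function name log_likelihood_row and its parameters are hypothetical.

```python
import math


def log_likelihood_row(counts, background=0.25, zero_count=0.5):
    """Convert one row of counts into log-likelihood ratios (natural log)."""
    total = sum(counts)
    row = []
    for count in counts:
        # Assumption: a zero count is replaced by a small pseudo-count
        adjusted = count if count > 0 else zero_count
        row.append(math.log((adjusted / total) / background))
    return row


# Positions 1 and 3 of the TBP position frequency matrix
print([round(v, 3) for v in log_likelihood_row([61, 145, 152, 31])])
print([round(v, 3) for v in log_likelihood_row([352, 0, 2, 35])])
```

Rounded to three decimals, both rows agree with positions 1 and 3 of the matrix shown above.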

----------------

%bin/matscan -mT 0.65 samples/rhodopsines.fa samples/matrices.lwm | grep MatScan

rh3     MatScan TBP             968     982      0.87   +       .       # GTATAAAAGCCCCAA
rh3     MatScan V$TATA_01       968     982      0.87   +       .       # GTATAAAAGCCCCAA
rh5     MatScan TBP             779     793      0.66   +       .       # ATATAAATAAAAGCG
rh5     MatScan TBP             972     986      0.68   +       .       # CTATAAAAGCATTTT
rh5     MatScan V$TATA_01       779     793      0.66   +       .       # ATATAAATAAAAGCG
rh5     MatScan V$TATA_01       972     986      0.68   +       .       # CTATAAAAGCATTTT

----------------

Other matrix formats supported by the program:

JASPAR format (option -j):
 3  1  4  2  4  2 18 18  0  5  2  1
 0  1  1  9  2 15  0  0  6  6  5  6
 0  4  6  2 10  0  0  0  2  5  3  5
15 12  7  5  2  1  0  0 10  2  8  6

MEME format (option -l):
log-odds matrix: alength= 4 w= 16 n= 1620 bayes= 7.76765 E= 2.8e+002 
  -897   -897    226   -897 
   -60     13    152   -897 
    40    113   -897    -61 
  -897   -897    226   -897 
   172   -897   -897   -897 
    99   -897     93   -897 
   172   -897   -897   -897 
   172   -897   -897   -897 
  -897     13     -6     97 
    99   -897     -6    -61 
   -60    171     -6   -897 
  -897    171     93   -897 
    40   -897    152   -897 
  -897    245   -897   -897 
    40   -897     93    -61 
  -897   -897   -897    171 
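
For the MEME layout, the header line carries the alphabet length (alength) and the motif width (w), followed by one row per position. The Python sketch below reads such a block. The function name parse_meme_log_odds is hypothetical, and the example is shortened to two positions (with w adjusted to match), so it is a toy input and not a real MEME motif.

```python
import re


def parse_meme_log_odds(text):
    """Read a 'log-odds matrix' header and the rows that follow it."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    header = lines[0]
    alength = int(re.search(r"alength=\s*(\d+)", header).group(1))
    width = int(re.search(r"\bw=\s*(\d+)", header).group(1))
    rows = [[float(x) for x in line.split()] for line in lines[1:1 + width]]
    if len(rows) != width or any(len(row) != alength for row in rows):
        raise ValueError("matrix does not match the header dimensions")
    return rows


# Shortened toy block: only the first two positions, header width set to 2
meme_text = """log-odds matrix: alength= 4 w= 2 n= 1620 bayes= 7.76765 E= 2.8e+002
  -897   -897    226   -897
   -60     13    152   -897
"""
matrix = parse_meme_log_odds(meme_text)
print(matrix)
```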

----------------

- option -v: verbose, providing more information about the process

********************************************************

The SS program extracts fragments of a FASTA sequence:

(1) Build it with: make SS

(2) %bin/matscan -T 0.85 samples/rhodopsines.fa samples/matrices.pfm | grep MatScan
rh3     MatScan TBP     968     982      0.95   +       .       # GTATAAAAGCCCCAA
(...)

(3) %bin/SS samples/rhodopsines.fa 968 982 (SS only works on the first sequence in the file)

%% ScanSequence by Enrique Blanco (2006)
%% Allocating memory
%% Reading sequence
%% Locus: rh3
%% Extraction [968,982: 15 bp]
>rh3
GTATAAAAGCCCCAA
%% Finished
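
The extraction shown above can be mimicked with the Python sketch below, which reads only the first record of a FASTA text and returns the fragment between two 1-based inclusive positions. The function name extract_fragment and the toy input are hypothetical. This illustrates the behaviour described here and is not the SS source code.

```python
def extract_fragment(fasta_text, start, end):
    """Return (name, fragment) from the first FASTA record only."""
    name, chunks = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                break  # a second header: stop, only the first record is used
            name = line[1:].strip()
        elif name is not None:
            chunks.append(line)
    sequence = "".join(chunks)
    if name is None or not 1 <= start <= end <= len(sequence):
        raise ValueError("invalid FASTA input or coordinates")
    return name, sequence[start - 1:end]


# Toy input (not the real rhodopsin sequences)
toy_fasta = ">rh3\nACGTAC\nGTTTGA\n>rh5\nCCCC\n"
name, fragment = extract_fragment(toy_fasta, 5, 8)
print(">" + name)
print(fragment)
```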

********************************************************

References:

If you use this program, please cite:

E. Blanco, X. Messeguer, T.F. Smith and R. Guigo.
Transcription Factor Map Alignment of Promoter Regions.
PLoS Computational Biology 2(5): e49 (2006) 

For further information see:
/datasets/meta2005/index.html#mapping


Bibliography:

JASPAR, the open access database of transcription factor-binding
profiles: new content and tools in the 2008 update.
Nucleic Acids Res. 36: D102-D106 (2008)
http://jaspar.genereg.net/

TRANSFAC® and its module TRANSCompel®: transcriptional gene regulation
in eukaryotes.
Nucleic Acids Res. 34: D108-D110 (2006)
http://www.gene-regulation.com/pub/databases.html

********************************************************

For questions and bug reports, please contact
Enrique Blanco at: eblanco@imim.es

********************************************************