********************************************************
NAME
        matscan - a program to search putative TF binding sites in genomic sequences

SYNOPSIS
        matscan [-sWClmjx] [-T threshold] [-c context] [-v] [-h]

OPTIONS
        -T: Threshold for the predictions (0..1)
        -W: Only Forward sense prediction (Watson)
        -C: Only Reverse sense prediction (Crick)
        -x: Sorting the predictions (output)
        -c: Print the context of the site (requires a number)
        -m: Using log-likelihood matrices
        -s: Display for each site, the value of the sum
        -j: Using JASPAR matrices
        -l: Using MEME matrices
        -v: Verbose. Display info messages
        -h: Show this help
********************************************************
USING THE PROGRAM:

(1) Regulatory sequences must be in FASTA format.
    Multi-FASTA files are accepted, with no limit on size.

(2) Matrices are accepted in several formats.
    A file may hold several matrices (use "//" as the separator).

(3) Output is written in GFF format.

(4) Examples:

--Position frequency matrices (pfm):

TBP
1   61  145 152 31
2   16  46  18  309
3   352 0   2   35
4   3   10  2   374
5   354 0   5   30
6   268 0   0   121
7   360 3   10  6
8   222 2   44  121
9   155 44  157 33
10  56  135 150 48
11  83  147 128 31
12  82  127 128 52
13  82  118 128 61
14  68  107 139 75
15  77  101 140 71

----------------

%bin/matscan -T 0.85 samples/rhodopsines.fa samples/matrices.pfm | grep MatScan

rh3 MatScan TBP       968 982 0.95 + . # GTATAAAAGCCCCAA
rh3 MatScan V$TATA_01 968 982 0.95 + . # GTATAAAAGCCCCAA
rh5 MatScan TBP       779 793 0.85 + . # ATATAAATAAAAGCG
rh5 MatScan TBP       972 986 0.88 + . # CTATAAAAGCATTTT
rh5 MatScan V$TATA_01 779 793 0.85 + . # ATATAAATAAAAGCG
rh5 MatScan V$TATA_01 972 986 0.88 + . # CTATAAAAGCATTTT
rh6 MatScan TBP       971 985 0.85 + . # ATATAAAGACCAGAC
rh6 MatScan V$TATA_01 971 985 0.85 + . # ATATAAAGACCAGAC

* 0.85 is the cut-off (85% of the maximum score of the matrix, default: 85%)

----------------

%bin/matscan -sT 0.85 samples/rhodopsines.fa samples/matrices.pfm | grep MatScan

rh3 MatScan TBP       968 982 3220.00 + . # GTATAAAAGCCCCAA
rh3 MatScan V$TATA_01 968 982 3220.00 + . # GTATAAAAGCCCCAA
rh5 MatScan TBP       779 793 2950.00 + . # ATATAAATAAAAGCG
rh5 MatScan TBP       972 986 3018.00 + . # CTATAAAAGCATTTT
rh5 MatScan V$TATA_01 779 793 2950.00 + . # ATATAAATAAAAGCG
rh5 MatScan V$TATA_01 972 986 3018.00 + . # CTATAAAAGCATTTT
rh6 MatScan TBP       971 985 2938.00 + . # ATATAAAGACCAGAC
rh6 MatScan V$TATA_01 971 985 2938.00 + . # ATATAAAGACCAGAC

* the option -s reports the total (summed) score of each hit (default: percentage score)

----------------

%bin/matscan -T 0.85 -c 5 samples/rhodopsines.fa samples/matrices.pfm | grep MatScan

rh3 MatScan TBP       968 982 0.95 + . # GGGCC GTATAAAAGCCCCAA GCTGG
rh3 MatScan V$TATA_01 968 982 0.95 + . # GGGCC GTATAAAAGCCCCAA GCTGG
rh5 MatScan TBP       779 793 0.85 + . # ATTAT ATATAAATAAAAGCG TAGAT
rh5 MatScan TBP       972 986 0.88 + . # GCGGG CTATAAAAGCATTTT GACGG
rh5 MatScan V$TATA_01 779 793 0.85 + . # ATTAT ATATAAATAAAAGCG TAGAT
rh5 MatScan V$TATA_01 972 986 0.88 + . # GCGGG CTATAAAAGCATTTT GACGG
rh6 MatScan TBP       971 985 0.85 + . # TGGCC ATATAAAGACCAGAC GACCA
rh6 MatScan V$TATA_01 971 985 0.85 + . # TGGCC ATATAAAGACCAGAC GACCA

* the option -c N prints the context of each site (N nucleotides on either side)

----------------

* the options -W/-C restrict prediction to one strand of the sequence
  (-W: forward only, -C: reverse only)

----------------

%bin/matscan -T 0.85 -x samples/rhodopsines.fa samples/matrices.pfm | grep MatScan

rh3 MatScan TBP       968 982 0.95 + . # GTATAAAAGCCCCAA
rh3 MatScan V$TATA_01 968 982 0.95 + . # GTATAAAAGCCCCAA
rh5 MatScan TBP       779 793 0.85 + . # ATATAAATAAAAGCG
rh5 MatScan TBP       972 986 0.88 + . # CTATAAAAGCATTTT
rh5 MatScan V$TATA_01 779 793 0.85 + . # ATATAAATAAAAGCG
rh5 MatScan V$TATA_01 972 986 0.88 + . # CTATAAAAGCATTTT
rh6 MatScan TBP       971 985 0.85 + . # ATATAAAGACCAGAC
rh6 MatScan V$TATA_01 971 985 0.85 + . # ATATAAAGACCAGAC

* the option -x sorts the hits by position (default: by locus name)

----------------

--Position log-likelihood weight matrices (lwm):

- the original PFM matrix has been compared against a random uniform model
  to produce a log-likelihood model

TBP
1   -0.466  0.399   0.447   -1.143
2   -1.805  -0.749  -1.687  1.156
3   1.286   -5.270  -3.884  -1.022
4   -3.479  -2.275  -3.884  1.347
5   1.292   -5.270  -2.968  -1.176
6   1.014   -5.270  -5.270  0.219
7   1.335   -3.453  -2.249  -2.759
8   0.825   -3.884  -0.793  0.219
9   0.466   -0.793  0.479   -1.081
10  -0.552  0.328   0.433   -0.706
11  -0.158  0.413   0.275   -1.143
12  -0.171  0.267   0.275   -0.626
13  -0.171  0.193   0.275   -0.466
14  -0.358  0.096   0.357   -0.260
15  -0.233  0.038   0.364   -0.315
//

----------------

%bin/matscan -mT 0.65 samples/rhodopsines.fa samples/matrices.lwm | grep MatScan

rh3 MatScan TBP       968 982 0.87 + . # GTATAAAAGCCCCAA
rh3 MatScan V$TATA_01 968 982 0.87 + . # GTATAAAAGCCCCAA
rh5 MatScan TBP       779 793 0.66 + . # ATATAAATAAAAGCG
rh5 MatScan TBP       972 986 0.68 + . # CTATAAAAGCATTTT
rh5 MatScan V$TATA_01 779 793 0.66 + . # ATATAAATAAAAGCG
rh5 MatScan V$TATA_01 972 986 0.68 + . # CTATAAAAGCATTTT

----------------

Other matrix formats supported by the program:

JASPAR format (option -j):

3  1  4  2  4  2  18 18 0  5  2  1
0  1  1  9  2  15 0  0  6  6  5  6
0  4  6  2  10 0  0  0  2  5  3  5
15 12 7  5  2  1  0  0  10 2  8  6

MEME format (option -l):

log-odds matrix: alength= 4 w= 16 n= 1620 bayes= 7.76765 E= 2.8e+002
-897 -897 226  -897
-60  13   152  -897
40   113  -897 -61
-897 -897 226  -897
172  -897 -897 -897
99   -897 93   -897
172  -897 -897 -897
172  -897 -897 -897
-897 13   -6   97
99   -897 -6   -61
-60  171  -6   -897
-897 171  93   -897
40   -897 152  -897
-897 245  -897 -897
40   -897 93   -61
-897 -897 -897 171

----------------

- option -v: verbose mode, which prints more information about the process

********************************************************
The program SS extracts fragments of a FASTA sequence:

(1) make SS

(2) %bin/matscan -T 0.85 samples/rhodopsines.fa samples/matrices.pfm | grep MatScan

    rh3 MatScan TBP 968 982 0.95 + . # GTATAAAAGCCCCAA
    (...)

(3) %bin/SS samples/rhodopsines.fa 968 982
    (it only works on the 1st sequence)

    %% ScanSequence by Enrique Blanco (2006)
    %% Allocating memory
    %% Reading sequence
    %% Locus: rh3
    %% Extraction [968,982: 15 bp]
    >rh3
    GTATAAAAGCCCCAA
    %% Finished

********************************************************
References:

If you use this program, please cite:

E. Blanco, X. Messeguer, T.F. Smith and R. Guigo.
Transcription Factor Map Alignment of Promoter Regions.
PLoS Computational Biology 2(5): e49 (2006)

For further information see:
/datasets/meta2005/index.html#mapping

Bibliography:

JASPAR, the open access database of transcription factor-binding profiles:
new content and tools in the 2008 update.
Nucl. Acids Res. 2008 36: D102-D106.
http://jaspar.genereg.net/

TRANSFAC® and its module TRANSCompel®: transcriptional gene regulation
in eukaryotes.
Nucleic Acids Res., Jan 2006; 34: D108 - D110.
http://www.gene-regulation.com/pub/databases.html

********************************************************
For questions and debugging, please contact
Enrique Blanco at: eblanco@imim.es
********************************************************
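Appendix (unofficial sketch): how the two scores in the PFM examples relate.

The Python sketch below is not part of matscan. It recomputes the -s sum and the default percentage score for the TBP hits listed in the examples, to make the relation between the two outputs concrete. Two things are assumptions inferred from the sample output rather than taken from the matscan source: that the four PFM columns after the position index are in A, C, G, T order, and that the percentage score is (sum - min) / (max - min), where min and max are the lowest and highest sums the matrix can produce. The function names are hypothetical.

```python
# Unofficial sketch: recompute matscan's sample scores for the TBP matrix.
# ASSUMPTIONS (inferred from the README's sample output, not from the source):
#   * PFM columns are ordered A, C, G, T
#   * percentage score = (sum - min) / (max - min)

# TBP position frequency matrix, one tuple per position (A, C, G, T).
TBP_PFM = [
    (61, 145, 152, 31), (16, 46, 18, 309), (352, 0, 2, 35),
    (3, 10, 2, 374), (354, 0, 5, 30), (268, 0, 0, 121),
    (360, 3, 10, 6), (222, 2, 44, 121), (155, 44, 157, 33),
    (56, 135, 150, 48), (83, 147, 128, 31), (82, 127, 128, 52),
    (82, 118, 128, 61), (68, 107, 139, 75), (77, 101, 140, 71),
]

COLUMN = {"A": 0, "C": 1, "G": 2, "T": 3}


def raw_score(pfm, site):
    """Sum the matrix cell selected by each nucleotide of the site."""
    if len(site) != len(pfm):
        raise ValueError("site and matrix must have the same length")
    return sum(row[COLUMN[base]] for row, base in zip(pfm, site))


def relative_score(pfm, site):
    """Rescale the raw sum to 0..1 between the matrix minimum and maximum."""
    lowest = sum(min(row) for row in pfm)
    highest = sum(max(row) for row in pfm)
    return (raw_score(pfm, site) - lowest) / (highest - lowest)


if __name__ == "__main__":
    # Sites taken from the sample output for rh3, rh5 and rh6.
    for site in ("GTATAAAAGCCCCAA", "ATATAAATAAAAGCG",
                 "CTATAAAAGCATTTT", "ATATAAAGACCAGAC"):
        print(site, raw_score(TBP_PFM, site),
              f"{relative_score(TBP_PFM, site):.2f}")
```

Under these assumptions the sketch gives the same sums as the -s example (3220, 2950, 3018, 2938) and the same two-decimal percentage scores as the default output (0.95, 0.85, 0.88, 0.85). Dividing the sum by the matrix maximum alone would give 0.87 for the rh5 site at 779..793, which is why the rescaled form is used here.
********************************************************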