---selenoprofiles.py help/readme/manual Selenoprofiles is a pipeline written in python whose goal is to identify all members of selenoprotein families in a nucleotide database. This is a manual for its use. Before reading this, you should have already read "Selenoprofiles: profile-based scanning of eukaryotic genome sequences for selenoprotein genes (2010) M.Mariotti, R.Guig̣, (in print)." ########################### #### TYPICAL USE ########## ########################### The typical user will not need to use most of selenoprofiles options. So, here's the description of how to run selenoprofiles in its simplest way. selenoprofiles RESULTS_FOLDER -S -target TARGET_GENOME_FILE [-species_name SPECIES] -refilter Option -S serves to perform all steps of the search pipeline, except the last refiltering step (which is run with the option -refilter here above). The target file should have an extension such as .fa or .fasta, and should be formatted with both formatdb and fastaindex (with the commands "formatdb -i TARGET_GENOME_FILE -p F -o T" and "fastaindex TARGET_GENOME_FILE TARGET_GENOME_INDEX", where TARGET_GENOME_INDEX is TARGET_GENOME_FILE with the extension changed into .index). Anyway, selenoprofiles will attempt to create the missing files if any. A summary of the various operations carried out by selenoprofiles will be printed on screen. We recommend to save this output, it could be useful to parse it. If selenoprofiles encounters any error, you will find one or more lines containing the keyword "ERROR", and the description of the error. Otherwise, you will find a number of results created in RESULTS_FOLDER. Here, a subfolder named after the SPECIES has been created (or after the TARGET_GENOME_FILE if the species name has not been defined). Inside this folder, you will find several subfolders, storing the files created in the various steps of the pipeline (and detailed later). Your results are in RESULTS_FOLDER/SPECIES/output/ . For each result, the fundamental file is a .p2g link to the gene prediction, either a genewise or exonerate output. Other files, in which only the extension is changed, are present. You will have a .gff file with Genomic Feature Format output, and some fasta files: .seq with the protein sequence, .cds with the nucleotide coding sequence, .3_prime with the sequence downstream of the prediction, where SECIS are searched. If any SECIS element has been found for this result, it will be named like .3_prime.PATTERN.secis, where PATTERN describe which pattern was used by SECISearch. Image for each SECIS will be found named as the secis, but with extension .png. Selenoprofiles labels each output gene depending on its class. If you're interested only in selenoprotein predictions, than you want only results whose name contains ".selenocysteine.". A summary alignment file for each protein family is produced, containing all hits found in this species and also the sequences of the profile, for comparison. This is named FAMILY.ali or FAMILY.ali.filtered in case a refiltering procedure was executed. GRAPHICAL OUTPUT A standalone perl program (selenoprofiles_drawer.pl) is used to produce graphical output. This program reads direclty the *.ali files produced by selenoprofiles in each output folder. To use it on all results, just cd into the output directory of one organism and run it without options. If you ran selenoprofiles on more than one target and you want to have a summary of results, you should first cd into the RESULTS_FOLDER and run the program join_fam_selenoprofiles.py . This will produce a summary alignment file for each selenoprotein family for which at least one result was found. Then, run selenoprofiles_drawer.pl in the same folder. ########################### #### EXPERT USE ########## ########################### Here we detail the use of selenoprofiles for the expert user. The options for Selenoprofiles are read from a configuration file and from the command line options; the latter are overriding the former. Command line options without an argument are assumed to be 1 (True). The "-config FILENAME" option allows to use a particular configuration file. Default is called "selenoprofiles.config" . The general syntax to run selenoprofiles is: selenoprofiles.py RESULTS_FOLDER [families.list] [-options with arg] [define target] Using the options, you will have to define a target and a list of routines (main operations) to be executed, otherwise selenoprofiles will do nothing. - DEFINING TARGET: You have two ways of defining a target: one is with the -target option, as in the typical use example. The other one is with " -genome GENOME_NAME ". This is assuming that in the "genomes_folder" (which is an option of selenoprofiles) there is a folder named after each genome, and inside it a file called "genome.fa" (and the other files deriving from formatting the database). - MAIN ROUTINES: The following options define if the correspondant routines are run. To activate all main routine except refilter, run the "-S" option. search : Run a psitblastn search of the profiles against target genomes. parse : Parse and filter the blast output. The filtering scripts are defined in the file specified in the "-blast_filters" option. Particular filtering scripts can be chosen for each family. Besides, various filtering classes can be set. These are hierarchly defined in the filters file. p2g : This is equivalent to options "-exonerate" and "-genewise" combined. Each blast hit is treated as a seed for running exonerate through the cyclic_exonerate method. Genewise is then used to compute an alternative gene structure for each gene found. If the option "-genewise_to_be_sure" is active, genewise is also used on every blast hit whose exonerete run produced no output. extract : exonerate/genewise output files are evaluated and filtered. For each hit, the best prediction among exonerate and genewise is chosen, and a decision tree chooses if the hit is output or not. For the hits passing the filtering, output files are created. If option "-secisearch" is active, secisearch is launched on the nucleotide sequence downstream the predicted coding sequence. refilter: Parse one by one each output result and re-evaluate it considering the rules in "selenoprofiles.refiltering" (or in another file specified with option -output_filters). This allows to specify output filters specific for each family, basing on all information available after outputing. If some results are filtered out, the relative files are moved to a "discarded" folder and a new alignment file without these sequences is produced. clean: The exonerate/genewise files written by selenoprofiles are checked and those which are useless are deleted. There are 4 levels of "cleanliness" which can be given as argument: 0 -> all files are kept. 1 -> all exonerate/genewise for which an output or discarded (refiltered) output has been written are kept. 2 -> same as 1 but only for output files. Everything on the discarded folder is deleted. 3 -> same as 2, but additionally only the chosen prediction (exonerate OR genewise) is kept for each output, the other one is deleted. - RESULTS FOLDER: All files produced by selenoprofiles are stored in the results folder, defined by the "-results" option or by the first argument in the command line. In this folder, a subfolder named after the species name is created. This folder contains the following subfolders. blasted : This contains all blast output files. These are named FAM.blastout parsed : This contains all files deriving from the parse routine. These are named FAM.CLASS, where CLASS is the concerned filtering class. exonerate : This contains all exonerate output files. In the beginning of p2g routine, blast hits from all filtering classes are merged (removing duplicates) and one index is assigned to each of them. Exonerate is run on each blast hit, and output files are named FAM.hitINDEX.CLASS , where INDEX is the assigned index and CLASS is the top-hierarchy filtering class that contained the seed blast hit genewise : This contains all genewise output files. These can be produced by the standard routine (exonerate defines the target boundaries for genewise run) or by the genewise_to_be_sure routine (genewise is run using as seed every blast hit that produced no output with exonerate). Files are named FAM.gwINDEX.CLASS . The index is the lowest possible among duplicates exonerate files (since exonerate seeded on exons of the same gene end in the same gene prediction). The files produced by the to_be_sure routine are stored in a subdirectory, whose name is defined by the "-genewise_to_be_sure_folder" option. These files are linked into the genewise folder in case a sufficiently large alignment has been produced. output : This is the actual selenoprofiles output folder. Each output gene is assigned a label, depending on the amino acid aligned to the selenocysteine in the profile. The minimal output consists of a file named FAM.INDEX.LABEL.p2g, which is a link to the chosen p2g file (exonerate or genewise output) and one alignment file for each family, named FAM.ali . Additional files can be produced depending on the output options. If the refiltering procedure leads to filter out some sequences, then another alignment file containing only the sequences passing the re-filtering is created (named FAM.ali.filtered) discarded : in case the refilter routine is run and filters out some results, their output files are moved in this folder. - OUTPUT OPTIONS: It is possible to control which files are produced for each output gene. The files with the following suffixes are produced if the option "-out_SUFFIX" is active. seq : predicted protein sequence, fasta format gff : gff file for the prediction cds : DNA sequence corresponding to the predicted coding sequence, fasta format dna : DNA sequence of the whole predicted gene, including introns 3_prime : DNA sequence downstream the predicted coding sequence, fasta format. Its lenght depends on the "-three_prime_length" option. SECISearch is run on this file, so eventually other files whose name is beginning with this name are present annotation : this is the gff file of the ensembl transcript which is most similar to predicted gene. This is fetched utilizing the Ensembl Perl API, which must be configured for the used species. - ANALYZING RESULTS: Analyzing selenoprofiles predictions require some experience with the selenoprotein families. Nonetheless, we can indicate here some useful hints. Visualize the alignment of results along with the profile (.ali or .ali.filtered file in the /output folder). From here you can check if the positions important for the family function are conserved in your output candidate. Another useful operation is performing a blast of the protein sequence of the candidate (.seq file) against all known proteins (nr). If the most similar proteins to your candidate are not belonging to the input family, this may be a spurious hit. Also, to assess the reliability of a selenoprotein prediction, check for how complete the prediction is in respect to the profile, and check for the presence of a SECIS. - INPUT FAMILIES: PROFILES FOLDER OR TEMPORARY PROFILE Selenoprofiles searches by default for all families in a single run. Anyway, a list of families can be defined in two different ways: from an input file (option "-i" or second argument in command line) or with the "-fam" option (separating family names with a single comma "," without white spaces). The latter overrides the former. The profile alignments are searches in the folder defined by the "-profiles" option. You may want to use a profile alignment of your own. There are two ways to do this. The first way is to use the -profile [alignmentfile.fa] option, which accepts a aligned fasta file with selenocysteines indicated with Us (please, don't use any dot or any strange character in the filename). This option implies that under the hood a temporary profile is built from the alignment and used. The other way is to run the script prepare_alignment_selenoprofiles.py on the profile alignment and store results in a folder that than you will provide to selenoprofiles with the option -profiles. The script prepare_alignment_selenoprofiles.py will create the following files: FAM.profile.aligned.fasta : the actual profile alignment in fasta format. The sequences that are suitable for being chosen as exonerate or genewise queries are tagged with "QUERY" in the title. Selenocysteine in this file must be indicated with U. FAM.profile.fasta : the same sequences, but unaligned. FAM.profile.alignment : the same alignment, but in the format read by blastpgp, necessary to build the PSSM for the blast search. FAM.query.fa : contains one single sequence of the profile, the one that was chosen as the blast query. Selenocysteine here must be indicated with a star "*". Once these files are present, Selenoprofiles can build a PSSM for the psitblastn search. This is done launching blastpgp, and produces these additional files in the profiles folder: FAM.profile.chk : a blastpgp checkpoint file. This is storing the PSSM in the format readby blast. This is the only file in this list which is strictly necessary. FAM.profile.chk.log : this is the output of the blastgp run. This is kept to be checked in case of errors. FAM.profile.chk.pssm : this file is produced by blastpgp and contains a human-readable representation of the PSSM. An important option of this program is -tag_threshold, which define which proportion of sequences in the alignment that should be tagged as queries for exonerate/genewise (sequences are ordered according to their "completeness"). The advantage of this second method on the first is the possibility of using all different options of prepare_alignment_selenoprofiles (see its help page). If the option -profile is used, the option -tag_threshold is given the value 1, which indicates a very good input alignment (all sequences tagged as queries). - OTHER OPTIONS AND PARAMETERS: temp_folder : This folder is used by Selenoprofiles for temporary files. Actually, the program creates a subfolder with a random name inside this folder and uses that. This allows running multiple instances of Selenoprofiles without filenames conflicts. This random name folder is deleted at the end of the execution if Selenoprofiles does not crash. split_folder : Selenoprofiles usually extract the chromosome fasta entries in temporary files that are deleted after execution. In case you utilize often the same species, you may want to specify this option to use always this folder to extract fasta entries to. If an entry has already been fetched, a file named after the fasta title will be present here. Be careful not to use the same folder for different species, or this will raise conflicts. Then, these options are parameters used in the pipeline: exonerate_moreseq : sent to cyclic_exonerate; it's the width of the extension done each cycle from each side, starting with the blast positions; see the cyclic_exonerate help for details. genewise_enlarge_moreseq : applied to the hits already found by exonerate and sent to genewise, it's the extension from both sides starting from the positions of the exonerate prediction genewise_moreseq : applied to the hits that was not exonerated and are sent to genewise, it's the extension from both sides starting from the blast positions; it workds only if the option genewise_to_be_sure is active threshold_percent_aligned : for filtering the exonerate/genewise hits; it's the minimum ( lenght(predicted_seq) / lenght(query_used_length) ) minimum_length_for_seq : for filtering the exonerate/genewise hits; if the predicted seq is more than this parameter, the hit is passing the filter regardless of the percent aligned secisearch_ethreshold : energy threshold for secis elements three_prime_length : length of the sequence downstream the predicted coding sequence that is searched with SECISearch These are the other options not yet described: parsed : subfolder of genome named folder in the results folder. It contains filtered blast output files. Default is /parsed/ exonerated : subfolder of genome named folder in the results folder. It contains exonerate output files. Default is /exonerate/ genewised : subfolder of genome named folder in the results folder. It contains genewise output files. Default is /genewise/ discarded : subfolder of genome named folder in the results folder. It contains discarded output files. Default is /discarded/ genewise_to_be_sure_folder : subfolder of the genewised folder. It contains output files of genewise_to_be_sure routine. /to_be_sure/ Finally, these options control particular running modes: -v verbose, for debugging -o wait for input after each family run -tblastn_mode choose the exonerate query with the deprecated system of tblastn runs.