EMBO Practical Course on Sequence Analysis and Molecular Evolution

Homology Searching Practical

by Toby Gibson, Ewan Birney and Des Higgins, 9/7/97

In this practical we will run some database search tools available through the web. Comparing the outputs may reveal differences between the results, depending on the type of algorithm used in the sequence comparison. We will also modify the query and the search setup to illustrate the importance of a little thought in advance of (or during...) database searching. Rule No. 1 is "Know your sequence!"

WWW DB search Tools

We will use two search tools: BLAST2 and Bic_SW (Smith-Waterman searching on the Bioccelerator).

All the services are installed at both EBI and EMBL, which should be useful in case of server or network problems. The servers are not identical and if you try more than one you may notice some differences.

Step 1 Choosing an snRNP SM protein as query

SM proteins are found in snRNP complexes. There are quite a number in Swiss-Prot and they are fairly divergent, so it is difficult (or impossible) to detect them all in one search. All SM proteins share a small globular domain, but many have a C-terminal non-globular domain too. This two-domain architecture will be used to illustrate the problems of searching with multi-domain proteins.

Step 2 BLAST2 searching with human SM-B protein

BLAST2 is an upgraded version of BLAST, one of the most widely used database search packages. The BLAST programs find the best matching ungapped sections in a sequence comparison. The most important modification for the user to note in BLAST2 is that neighbouring ungapped segments can now be joined by allowing gaps between them. This improves both the sensitivity and the interpretability of the results.
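
To make "best matching ungapped section" concrete, here is a minimal Python sketch. It is not the BLAST algorithm (which uses word seeding and a substitution matrix): it simply scans every diagonal of the comparison and keeps the highest-scoring run. The function name, the toy match/mismatch scores and the example sequences are all assumptions made for illustration.

```python
def best_ungapped_segment(a, b, match=2, mismatch=-1):
    """Return (score, start_in_a, start_in_b, length) of the best ungapped run.

    Toy scoring (assumption): +2 for identical residues, -1 otherwise.
    """
    best = (0, 0, 0, 0)
    # Each diagonal is identified by offset = i - j.
    for offset in range(-(len(b) - 1), len(a)):
        i = max(offset, 0)
        j = max(-offset, 0)
        run_score = 0
        run_start_i, run_start_j = i, j
        while i < len(a) and j < len(b):
            # A run that has dropped to zero or below is abandoned
            # and a new run is started at the current position.
            if run_score <= 0:
                run_score = 0
                run_start_i, run_start_j = i, j
            run_score += match if a[i] == b[j] else mismatch
            if run_score > best[0]:
                best = (run_score, run_start_i, run_start_j, i - run_start_i + 1)
            i += 1
            j += 1
    return best


# Made-up sequences: the second lacks the "E" found in the first.
print(best_ungapped_segment("MKVLAETGHW", "MKVLATGHW"))
```

With these made-up sequences the one-residue deletion splits the match into two ungapped pieces ("MKVLA" and "TGHW") lying on different diagonals, and only the better piece is reported. Joining such neighbouring pieces across a gap is what the gapped extension in BLAST2 adds.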


Step 2B BLAST2 search with SM-B and a filter

Now repeat the search but filter out segments of "reduced sequence complexity".
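
The idea behind filtering "reduced sequence complexity" can be sketched in a few lines of Python. This is a simplified stand-in and not the filter the servers actually run: it slides a window along the sequence, computes the Shannon entropy of the residue composition, and replaces residues in low-entropy windows with "X". The window length and the entropy threshold are arbitrary assumptions chosen for the toy example.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Entropy (in bits) of the residue composition of a segment."""
    counts = Counter(segment)
    n = len(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, threshold=2.2):
    """Replace every residue that lies in a low-entropy window with 'X'.

    window and threshold are illustrative values (assumptions), not the
    parameters of any real filtering program.
    """
    masked = list(seq)
    for start in range(0, len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for k in range(start, start + window):
                masked[k] = "X"
    return "".join(masked)


# Made-up sequence: a diverse stretch, a glycine run, another diverse stretch.
toy = "ACDEFGHIKLMNPQRSTVWY" + "G" * 15 + "ACDEFGHIKLMNPQRSTVWY"
print(mask_low_complexity(toy))
```

Masked residues cannot contribute to a match score, so a biased region of the query (such as a repetitive non-globular tail) can no longer pull unrelated sequences with a similar composition into the hit list.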


Step 3 Bic_SW search with human SM-B protein

The Bioccelerator is fast, dedicated hardware designed exclusively for dynamic programming (i.e. slow but sensitive) sequence comparison. It is built by the company Compugen. It can perform a number of search variants, including basic Smith-Waterman, profile searches and protein versus DNA frame-shifting comparisons. Today we will do the Smith-Waterman search, which finds the best matching segments between any two sequences, allowing gaps to be inserted at any position.

The search will take a couple of minutes (unless the Bic is busy). When it is finished you can look at the high-score list and alignments in the output and compare the results with BLAST2.
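
For readers who want to see what the Smith-Waterman comparison computes, here is a minimal Python sketch with a traceback. It is a textbook-style version under simplifying assumptions, not the Bioccelerator implementation: scoring is a toy match/mismatch scheme rather than a substitution matrix, each gap position costs the same fixed penalty, and the sequences are made up.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment of a and b with a linear gap penalty.

    Returns (score, aligned_a, aligned_b). Scoring values are toy
    assumptions, not those of any real search service.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best_score, best_pos = 0, (0, 0)

    # Fill the matrix; a cell never drops below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,      # gap in b
                          H[i][j - 1] + gap)      # gap in a
            if H[i][j] > best_score:
                best_score, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return best_score, "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = smith_waterman("MKVLAETGHW", "MKVLATGHW")
print(score)
print(top)
print(bottom)
```

Every cell of the matrix is filled for every query/database pair, which is why the method is sensitive but slow in software and why dedicated hardware is attractive.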


Step 3B Bic_SW search with the SM Domain only

Now repeat the search but use the globular N-terminal domain only.


Take Home Lesson

Hopefully the exercises here have illustrated that the way a search is set up is very important. The query used here illustrates the effect of different types of sequence (globular and non-globular) within one protein. Other parameters often influence the search sensitivity too. For example, when a globular domain is longer, the Gonnet PAM250 matrix would be expected to outperform the default BLOSUM62 in the detection of divergent homologues, because it is less stringent and so gives longer optimally matching segments. (Over short matches it is noisier and could perform worse.) Also, gap penalties are critical parameters in dynamic programming and should always be tested by trial and error. In other words, it pays to try several variations of a search rather than simply accepting the results of the first one.
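
The trial-and-error point about gap penalties can be illustrated with a small Python sketch. It computes a local alignment score with separate gap-opening and gap-extension penalties, then repeats the comparison over several settings. The function name, the toy match/mismatch scores, the penalty values and the sequences are all assumptions for illustration. They are not the defaults of any server, and no real substitution matrix is used.

```python
NEG_INF = float("-inf")


def local_score_affine(a, b, gap_open, gap_extend, match=2, mismatch=-1):
    """Best local alignment score with affine gap penalties.

    gap_open is charged for the first position of a gap and gap_extend
    for each further position (both given as positive costs).
    """
    cols = len(b) + 1
    h_prev = [0] * cols        # best scores ending in the previous row
    f_prev = [NEG_INF] * cols  # previous row, alignments ending in a gap in b
    best = 0
    for i in range(1, len(a) + 1):
        h_cur = [0] * cols
        f_cur = [NEG_INF] * cols
        e = NEG_INF            # current row, alignments ending in a gap in a
        for j in range(1, cols):
            e = max(e - gap_extend, h_cur[j - 1] - gap_open)
            f_cur[j] = max(f_prev[j] - gap_extend, h_prev[j] - gap_open)
            sub = match if a[i - 1] == b[j - 1] else mismatch
            h_cur[j] = max(0, h_prev[j - 1] + sub, e, f_cur[j])
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best


# Made-up pair: the second sequence lacks one residue of the first.
query, subject = "MKVLAETGHW", "MKVLATGHW"
for gap_open in (2, 4, 10):
    print(gap_open, local_score_affine(query, subject, gap_open, gap_extend=1))
```

In this toy case a cheap gap lets the two matching blocks be joined into one alignment, whereas an expensive gap makes it better to report only the longer ungapped block, so the reported alignment itself changes with the setting and not just its score.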