Mining the Genome

Written by Eduardo Eyras

Introduction

Databases hold an increasing amount of information that can help unveil important biological facts. One of the problems in Bioinformatics is how to structure and store this data so that it can be readily used to answer biologically relevant questions. We will explore one of the methods used to bring the information closer to the questions posed by biologists.

We now know how to find information about our favourite gene, such as the number of exons, the transcripts, the functional domains, etc. We also know how to view some of the properties of the genome in a region of interest, e.g. sequence conservation with other genomes, synteny, etc. However, can we somehow link all this to ask questions like:

Give me all genes in human with a given Pfam domain and an ortholog gene in mouse and rat, with a sequence similarity greater than 70%.

Ensmart

We will use the Ensmart browser:

www.ensembl.org/EnsMart

Biological databases are usually designed so that the storage and update of information is optimized, and they have structures (tables, files) that are very specific to the data at hand. Complex queries are therefore computationally expensive, as they require a large amount of analysis and computation. They often also require specialized software (e.g. the Ensembl and UCSC browsers).

Ensmart provides a way of organizing the data so that it is optimized for querying. The computations are still necessary, but the results are stored in such a way that we can obtain a very fast answer to queries that involve indirect relationships between attributes. Ensmart is built with pre-computed data, and it is in the way the results are stored that we gain usability.

The Ensmart database structure is adapted from a star schema. There are one or more central tables (gene, transcript, SNP) and these are linked to a number of satellite tables containing the attributes. The central table is the source for the constraints and the satellite is the source for the attributes.

You can read more about it here: http://www.genome.org/cgi/content/full/14/1/160
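
As a rough illustration of the star-schema idea, the following Python sketch builds a tiny in-memory SQLite database with one central table and one satellite table. The table names, column names and rows are all invented for this example and are not the real Ensmart schema; the point is only that the constraint is applied to the central table (including a pre-computed flag) while the attribute is read from the satellite.

```python
import sqlite3

# Toy star schema. Everything here (names, columns, rows) is hypothetical
# and only mimics the central-table / satellite-table layout.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gene_main (
    gene_id TEXT PRIMARY KEY,
    chromosome TEXT,
    has_mouse_ortholog INTEGER   -- pre-computed flag (0 or 1)
);
CREATE TABLE gene_pfam (gene_id TEXT, pfam_id TEXT);
""")
conn.executemany(
    "INSERT INTO gene_main VALUES (?, ?, ?)",
    [("geneA", "10", 1), ("geneB", "10", 0), ("geneC", "2", 1)],
)
conn.executemany(
    "INSERT INTO gene_pfam VALUES (?, ?)",
    [("geneA", "PF00248"), ("geneB", "PF00248"), ("geneC", "PF_OTHER")],
)


def pfam_domains(connection, chromosome):
    """Constrain on the central table, read attributes from the satellite."""
    query = """
        SELECT g.gene_id, p.pfam_id
        FROM gene_main AS g
        JOIN gene_pfam AS p ON p.gene_id = g.gene_id
        WHERE g.chromosome = ? AND g.has_mouse_ortholog = 1
        ORDER BY g.gene_id, p.pfam_id
    """
    return connection.execute(query, (chromosome,)).fetchall()


rows = pfam_domains(conn, "10")
print(rows)
```

Because the ortholog flag is stored in advance, the query does not need to recompute any orthology at query time; it only joins the central table to the satellite it needs.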

We are going to illustrate the use of Ensmart with an example. We have seen previously (see previous section) that the ADH gene contains a domain from the Aldo/Keto reductase family with Pfam ID PF00248.

Let's imagine that we are interested in promoters and we want to study the upstream region of human genes with this domain. Additionally, we would like to concentrate on genes that have an ortholog in mouse. However, we do not want all of them: we only want to look at those that have non-synonymous SNPs in the coding region of the gene.

We would like to obtain two pieces of information for these genes:

The upstream genomic sequence.
The pattern of expression.

Ensmart is based on a focus, which is the entity about which we ask the questions, and a series of satellite sets of data, which hold the attributes and which are used to do two operations:

Filtering: Eliminates or includes entities according to some attribute filter.
Output: Selects a set of attributes from the selected items for output.
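
The two operations can be pictured with a small Python sketch. This is not how Ensmart is implemented; the function name run_query, the record layout and the toy gene records are all assumptions made for illustration. Filtering keeps the entities of the focus that pass every filter, and output picks the requested attributes from what remains.

```python
def run_query(entities, filters, attributes):
    """Keep the entities that pass every filter, then return only the
    requested attributes of each surviving entity."""
    selected = [e for e in entities if all(f(e) for f in filters)]
    return [{name: e[name] for name in attributes} for e in selected]


# Hypothetical focus: a handful of made-up gene records.
genes = [
    {"gene_id": "geneA", "chromosome": "10", "pfam": {"PF00248"}, "mouse_ortholog": "mouseA"},
    {"gene_id": "geneB", "chromosome": "2", "pfam": {"PF00248"}, "mouse_ortholog": None},
    {"gene_id": "geneC", "chromosome": "5", "pfam": set(), "mouse_ortholog": "mouseC"},
]

# Filtering: each filter is a yes/no test on one attribute.
filters = [
    lambda g: "PF00248" in g["pfam"],           # has the Pfam domain
    lambda g: g["mouse_ortholog"] is not None,  # has a mouse ortholog
]

# Output: the attributes we want to see for the selected genes.
result = run_query(genes, filters, ["gene_id", "chromosome"])
print(result)
```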

Data Mining

Open the browser: www.ensembl.org/EnsMart

How many databases can you see?

How many species can we choose from? Is there anything different in 'species' with respect to the species you can see in www.ensembl.org?

Select dataset: human. Note the versions: NCBI35.

Once we have selected the database, we need to specify the filters. A wide range of filters can be applied in any combination.

Question: Explore the possible filters

Region
Gene
Gene Ontology
Expression
Multispecies comparisons
Protein
SNP

Question: Select the filters according to what we want to answer:

Make sure we search all human genes
Select for mouse orthology
How do you specify the Pfam domain PF00248?
How do you specify the type of SNP?

Once we have applied the filter, we can choose which output to generate. Press the 'next' button. Note the update on the right-hand side.

Question: Explore the four possible types of 'output':

Features
SNPs
Structures
Sequences

Select the 'sequence' output. Choose 5' end sequence.

Question: Why can we choose between gene and transcript to select the sequences?

Select the upstream region of the gene. You should obtain an output like this:

>ENSG00000165568.4 assembly=NCBI34|chr=10|strand=forward|bases 4821426 to 4822425|region upstream of gene only
CTCCCCTGATGGCCAGCACTGAAGACCCAGGCAAGGAACCTAGAAACAAAGCCCTCATCTGGGTGTGGGTGTCCTCAAGG
CAGTAGGACTCCCAGGGCTGAGGGGGGCAATGAAGGGGGAGCTGTAAGCTCCAGGAGAGATAAGAGGGGCGTCGGAAGGC
TCCCTTGACCCCTCTTTCCCTCCACTGGCCCTGGGGGAGCCCAGTCCACTCATAAGGGGGGTGTCCAGTCCACCCCATCC
...
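
The header line of this output carries the coordinates of the region, so it can be checked with a few lines of Python. The sketch below assumes that headers always follow exactly the layout of the example above (identifier, then fields separated by "|") and that the base coordinates are inclusive; the function name parse_header is our own.

```python
import re


def parse_header(header):
    """Parse a header line laid out like the example above into a dict."""
    identifier, _, description = header.lstrip(">").partition(" ")
    record = {"id": identifier}
    for field in description.split("|"):
        field = field.strip()
        if "=" in field:
            # key=value fields such as assembly, chr and strand
            key, value = field.split("=", 1)
            record[key] = value
        else:
            match = re.fullmatch(r"bases (\d+) to (\d+)", field)
            if match:
                record["start"] = int(match.group(1))
                record["end"] = int(match.group(2))
            else:
                # free-text field, e.g. the description of the region
                record["note"] = field
    return record


header = (">ENSG00000165568.4 assembly=NCBI34|chr=10|strand=forward|"
          "bases 4821426 to 4822425|region upstream of gene only")
info = parse_header(header)

# Assuming inclusive coordinates, the region length is end - start + 1.
length = info["end"] - info["start"] + 1
print(info["id"], info["chr"], info["strand"], length)
```

Under the inclusive-coordinate assumption, the region in the example header is 1000 bases long.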

We can go back in the browser to select the expression pattern of these genes. In the Features section, select one of the sources of expression.

Question:

Look at how the expression information is presented. It is in a tree structure.
Do you know how this information is obtained?
How are the external links obtained in general?

Exercise

Carry out the following searches:

Rat orthologs of human genes annotated as involved in disease(s) and expressed in brain.
All validated human SNPs on chromosome 2 between 100 and 200 Mb that change one amino acid (non-synonymous).
Genomic location of all mouse and Fugu homologs of all the human genes that have transmembrane domains, are expressed in the cardiovascular system and have non-synonymous SNPs.

Other Examples

Other examples of data you can retrieve with Ensmart:

Coding SNPs for all novel kinases
Genes on chromosome 5 expressed in liver
Sequences for all Ensembl genes mapped to some microarray probe (e.g. U95A).
Disease-related genes between two markers (e.g. D10S255 and D10S259).
Transmembrane proteins with an Ig-MHC domain (IPR003006) on chromosome 2.
Genes with associated coding SNPs on chromosomal band 5q35.3.

Biomart

As the idea behind Ensmart is data-driven rather than data-dependent, the same approach is extensible to other data sources. This is achieved by collecting the raw data from a database and generating the tables in a star schema. Once the satellite tables are defined, the structure and dependencies can be automatically generated. Any additional domain-specific information can be added in the form of external lookup tables.
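
To give a feeling for what "automatically generated" could mean, here is a minimal Python sketch that turns a short description of the satellites into table definitions. It is not the Biomart software: the function star_schema_sql, the naming convention (central name plus satellite name) and the example columns are all assumptions made for illustration.

```python
import sqlite3


def star_schema_sql(central, key, satellites):
    """Generate CREATE TABLE statements for a central table and its satellites.

    `satellites` maps a satellite name to a list of attribute column names.
    Every satellite gets the key column so it can be joined to the central table.
    """
    statements = [f"CREATE TABLE {central} ({key} TEXT PRIMARY KEY)"]
    for name, columns in satellites.items():
        cols = ", ".join(f"{col} TEXT" for col in columns)
        statements.append(
            f"CREATE TABLE {central}_{name} "
            f"({key} TEXT REFERENCES {central}({key}), {cols})"
        )
    return statements


# Hypothetical description of two satellites of a gene-centred schema.
definition = {"pfam": ["pfam_id"], "expression": ["tissue", "source"]}
statements = star_schema_sql("gene", "gene_id", definition)

# Check that the generated statements are valid by running them in SQLite.
conn = sqlite3.connect(":memory:")
for statement in statements:
    conn.execute(statement)

tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)
```

The same function works for any focus (protein, SNP, structure) because it only depends on the description of the data, not on the data itself.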

Go to the Biomart web page.

We can see that so far it has been applied to other databases:

Vega genes (manually annotated human genes)
Uniprot (all known proteins)
Molecular Structure Database (MSD)
ESTGenes (Genes built from ESTs by Ensembl)
dbSNP (Database of single nucleotide polymorphisms)

[Figure: illustration of what ESTGenes are]

Click on the BioMart logo.

Exercise

Get the publication information for those proteins in Arabidopsis thaliana that have an entry in the Protein Data Bank (PDB).

Hint: Obtain first from Uniprot the "Uniprot Accession Identifiers" of those proteins with an associated PDB ID. Then search for the entries in the MSD database corresponding to these accession IDs.
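
The hint describes a two-step search in which the output of the first query becomes the filter of the second. The Python sketch below shows only this chaining pattern. The records, accession IDs and field names are placeholders invented for the example; they are not real Uniprot or MSD entries, and the real searches are done in the browser.

```python
def accessions_with_pdb(uniprot_rows, organism):
    """Step 1: accession IDs of proteins from one organism that have a PDB ID."""
    return sorted({row["accession"] for row in uniprot_rows
                   if row["organism"] == organism and row["pdb_id"]})


def publications_for(msd_rows, accessions):
    """Step 2: use the accession IDs from step 1 as a filter on the second source."""
    wanted = set(accessions)
    return [(row["accession"], row["publication"])
            for row in msd_rows if row["accession"] in wanted]


# Placeholder records standing in for the two data sources.
uniprot_rows = [
    {"accession": "ACC1", "organism": "Arabidopsis thaliana", "pdb_id": "pdb1"},
    {"accession": "ACC2", "organism": "Arabidopsis thaliana", "pdb_id": None},
    {"accession": "ACC3", "organism": "Homo sapiens", "pdb_id": "pdb3"},
]
msd_rows = [
    {"accession": "ACC1", "publication": "publication 1"},
    {"accession": "ACC3", "publication": "publication 3"},
]

accessions = accessions_with_pdb(uniprot_rows, "Arabidopsis thaliana")
publications = publications_for(msd_rows, accessions)
print(publications)
```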

Questions:

Do you see in Uniprot or MSD the same possible outputs as before with Ensembl? Why?
Note that in some cases the Focus is a multispecies database. Can you see any relation to the way the data is originally stored/generated?
Can you give an example of another type of data (biological or not) that could be put into this system?

To Sum Up

Some take-home messages:

How you structure and organize the data is very important for making it useful for querying.
Use formats and terminologies that are as standard as possible. It will make your tool/method more user-friendly.
It is all actually based on pre-computed data. If the data is bad or is wrongly computed, all the rest is useless.
