some hg18 data for our NSMB paper (August 2009; I post here what people ask me for)
exons
here is a link to the human exons we used, which
- are located on chromosomes 1-22 and X
- are between 50 and 250 bp long
- are surrounded by introns that are at least 70 bp long and of the U2 type
- are part of a protein coding transcript
- are flanked by AG-GT dinucleotides
- were classified as constitutive using the AStalavista framework (with RefSeq and GenBank mRNAs as input)
the file format is the following
- column1: the id we use to refer to the exon, basically made up by its location: chrom_start_end_strand
- column2: chrom
- column3: start
- column4: end
- column5: strand
- column6: acceptor score (maxEnt, see Yeo and Burge 2004)
- column7: donor score (maxEnt, see Yeo and Burge 2004)
- column8: acceptor sequence
- column9: donor sequence
- column10: the sequence exon-100 to exon-40
- column11: the exonic sequence
- column12: the sequence exon+10 to exon+70
Should you be interested in the "strong exons" as we defined them in the paper, sum the scores in columns 6 and 7 for each exon and take the 3822 highest-scoring exons (and the 3822 lowest-scoring ones for the "weak exons").
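The selection just described could be sketched in Python roughly as follows. This is only an illustration, not the script used for the paper: it assumes the exon file is tab-separated with no header line, and the function names and field names (load_exons, split_strong_weak, upstream_seq, etc.) are made up here for readability.

```python
import csv

# Hypothetical field names for the 12 columns described above.
FIELDS = [
    "id", "chrom", "start", "end", "strand",
    "acceptor_score", "donor_score",      # columns 6 and 7 (maxEnt)
    "acceptor_seq", "donor_seq",          # columns 8 and 9
    "upstream_seq",                       # column 10: exon-100 to exon-40
    "exon_seq",                           # column 11
    "downstream_seq",                     # column 12: exon+10 to exon+70
]

def load_exons(handle, delimiter="\t"):
    """Read the 12-column exon table (assumed tab-separated, no header)."""
    exons = []
    for row in csv.reader(handle, delimiter=delimiter):
        if len(row) < len(FIELDS):
            continue  # skip blank or truncated lines
        rec = dict(zip(FIELDS, row))
        rec["acceptor_score"] = float(rec["acceptor_score"])
        rec["donor_score"] = float(rec["donor_score"])
        exons.append(rec)
    return exons

def split_strong_weak(exons, n=3822):
    """Rank exons by acceptor + donor score; return (top n, bottom n)."""
    ranked = sorted(
        exons,
        key=lambda e: e["acceptor_score"] + e["donor_score"],
        reverse=True,
    )
    strong = ranked[:n]
    weak = ranked[max(0, len(ranked) - n):] if n > 0 else []
    return strong, weak
```

Usage would be something like opening the downloaded file, passing the handle to load_exons, and calling split_strong_weak on the result. Note that the two sets overlap if the file holds fewer than 2 x 3822 exons.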
pseudoexons
As pseudoexons are not part of the RefSeq annotation, we used a different pipeline to define them. The resulting files were only afterwards forced into the same format as the exon file, so some columns simply contain "." as a placeholder. Below are 2 files: one for all pseudoexons, one for the "strong pseudoexons".
here is a link to human pseudoexons we used
here is a link to the strong human pseudoexons we used
symCurv
symCurv was developed by Christoforos Nikolaou. He and Sonja Althammer produced genome-wide (hg18) symCurv values; below you can download them per chromosome. The symCurv values are in column 3.
For details of the algorithm, please see the symCurv paper by Christoforos and colleagues.
The above values are the ones I used in Figure 1c of our paper. Please note that to get the clear shape shown in this figure, one has to smooth the curve. I did this by replacing the value at every position in the genome with the average of the 61 values centered on it (a window of 30 bp radius). Afterwards the plot was "aggregated" as described in the paper in order to force exons into the same size.
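For those who want to reproduce the smoothing step, here is a minimal Python sketch of a centered moving average with a 30 bp radius (61 values). It is not the original code. The function name smooth_symcurv is made up, and the handling of chromosome ends is an assumption of this sketch: the window is simply truncated there, so fewer than 61 values get averaged near the ends. The input is expected to be the list of column 3 values of one chromosome, in genomic order.

```python
def smooth_symcurv(values, radius=30):
    """Centered moving average over 2 * radius + 1 values.

    Assumption of this sketch: near the ends of the list the window
    is truncated, so the average uses only the values that exist.
    """
    n = len(values)
    # Prefix sums let us get any window sum in constant time.
    prefix = [0.0]
    for v in values:
        prefix.append(prefix[-1] + v)

    smoothed = []
    for i in range(n):
        lo = max(0, i - radius)          # first index in the window
        hi = min(n, i + radius + 1)      # one past the last index
        smoothed.append((prefix[hi] - prefix[lo]) / (hi - lo))
    return smoothed
```

The "aggregation" of exons to a common size is a separate step and is not covered by this sketch.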
Here's the data: