Some hg18 data for our NSMB paper (August 2009; I post here what people ask me for)


exons

  • here is a link to the human exons we used, which
    1. are located on chromosomes 1-22 and X
    2. are between 50 and 250 bp long
    3. are flanked by introns that are at least 70 bp long and of the U2 type
    4. are part of a protein coding transcript
    5. are flanked by AG-GT dinucleotides
    6. were classified as constitutive using the AstaLaVista framework (with RefSeq and GenBank mRNAs as input)
  • the file format is the following
    1. column1: the id we use to refer to the exon, made up of its location: chrom_start_end_strand
    2. column2: chrom
    3. column3: start
    4. column4: end
    5. column5: strand
    6. column6: acceptor score (maxEnt, see Yeo and Burge 2004)
    7. column7: donor score (maxEnt, see Yeo and Burge 2004)
    8. column8: acceptor sequence
    9. column9: donor sequence
    10. column10: the sequence exon-100 to exon-40
    11. column11: the exonic sequence
    12. column12: the sequence exon+10 to exon+70
  • If you are interested in the "strong exons" as we defined them in the paper, sum columns 6 and 7 for each exon and take the 3822 highest-scoring exons (and the 3822 lowest-scoring ones for the "weak exons").
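
The strong/weak selection described above can be sketched in a few lines of Python. This is only an illustration, not the script used for the paper. It assumes the file is tab-delimited (the delimiter is not stated here), and the function names are hypothetical. Rows whose score columns are not numeric are skipped, and ties are broken by file order, which is also an assumption.

```python
import csv

N_PER_CLASS = 3822  # class size given in the text above


def read_table(handle, delimiter="\t"):
    """Read all non-empty rows. Assumption: the file is tab-delimited."""
    return [row for row in csv.reader(handle, delimiter=delimiter) if row]


def split_strong_weak(rows, n=N_PER_CLASS):
    """Rank rows by column 6 + column 7 (acceptor + donor maxEnt score).

    Returns (strong, weak): the n highest-scoring and the n lowest-scoring rows.
    """
    scored = []
    for row in rows:
        try:
            # Columns 6 and 7 are at 0-based indices 5 and 6.
            total = float(row[5]) + float(row[6])
        except (ValueError, IndexError):
            continue  # assumption: skip rows without numeric scores
        scored.append((total, row))
    # Sort on the score only, highest first; the sort is stable for ties.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    strong = [row for _, row in scored[:n]]
    weak = [row for _, row in scored[-n:]] if n > 0 else []
    return strong, weak
```

With a local copy of the exon file (the name `exons.txt` is made up), this would be called as `strong, weak = split_strong_weak(read_table(open("exons.txt")))`.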


pseudoexons

  • As pseudoexons are not part of the RefSeq annotation, we used a different pipeline to define them. The files were only afterwards forced into the same format as the exon file, so some columns simply contain "." values. Below are two files: one with all pseudoexons and one with the "strong pseudoexons".
  • here is a link to human pseudoexons we used
  • here is a link to the strong human pseudoexons we used



symCurv

  • symCurv was developed by Christoforos Nikolaou. He and Sonja Althammer produced genome-wide (hg18) symCurv values, which you can download below, one file per chromosome. The symCurv values are in column 3.
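
As a rough sketch, the third column of one of the gzipped per-chromosome files could be read in Python as follows. The column separator and the meaning of columns 1 and 2 are not described here, so the code assumes whitespace-separated fields and skips any line whose third field is not a number. The function name is hypothetical.

```python
import gzip


def read_symcurv_values(path):
    """Return the column-3 values of a gzipped symCurv file as floats.

    Assumptions: fields are whitespace-separated, and lines with fewer than
    three fields or a non-numeric third field (e.g. a header) can be skipped.
    """
    values = []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            fields = line.split()
            if len(fields) < 3:
                continue
            try:
                values.append(float(fields[2]))
            except ValueError:
                continue
    return values
```

This loads a whole chromosome into memory as a Python list. For the larger chromosomes a streaming or array-based approach may be preferable.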

  • For details of the algorithm, please see the symCurv paper by Christoforos and colleagues.

  • The above values are the ones I used in Figure 1c of our paper. Please note that to get the clear shape shown in this figure, the curve has to be smoothed. I did this by replacing the value at every position in the genome with the average of the 61 values centered on it (a window with a 30 bp radius). Afterwards the plot was "aggregated" as described in the paper in order to force exons into the same size.
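
The smoothing step can be illustrated with a centered moving average over 61 values (the position itself plus 30 on each side). The sketch below is not the original script. How the ends of a chromosome were treated is not stated here, so it simply shrinks the window at the edges, which is an assumption. The function name is hypothetical.

```python
def smooth_centered(values, radius=30):
    """Centered moving average over 2 * radius + 1 values (61 for radius=30).

    Assumption: near the ends of the sequence the window is truncated to the
    positions that exist, so fewer values are averaged there.
    """
    n = len(values)
    # Prefix sums make every window average O(1): sum(values[lo:hi]) is
    # prefix[hi] - prefix[lo].
    prefix = [0.0]
    for v in values:
        prefix.append(prefix[-1] + v)
    smoothed = []
    for i in range(n):
        lo = max(0, i - radius)
        hi = min(n, i + radius + 1)
        smoothed.append((prefix[hi] - prefix[lo]) / (hi - lo))
    return smoothed
```

For example, `smooth_centered([1, 2, 3, 4, 5], radius=1)` averages each value with its direct neighbours and returns `[1.5, 2.0, 3.0, 4.0, 4.5]`. The "aggregation" step described in the paper is not covered by this sketch.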

  • Here's the data:

    chr1_symcurv.dat.gz
    chr2_symcurv.dat.gz
    chr3_symcurv.dat.gz
    chr4_symcurv.dat.gz
    chr5_symcurv.dat.gz
    chr6_symcurv.dat.gz
    chr7_symcurv.dat.gz
    chr8_symcurv.dat.gz
    chr9_symcurv.dat.gz
    chr10_symcurv.dat.gz
    chr11_symcurv.dat.gz
    chr12_symcurv.dat.gz
    chr13_symcurv.dat.gz
    chr14_symcurv.dat.gz
    chr15_symcurv.dat.gz
    chr16_symcurv.dat.gz
    chr17_symcurv.dat.gz
    chr18_symcurv.dat.gz
    chr19_symcurv.dat.gz
    chr20_symcurv.dat.gz
    chr21_symcurv.dat.gz
    chr22_symcurv.dat.gz
    chrX_symcurv.dat.gz
    chrY_symcurv.dat.gz