Expression Values for Gencode v3c Annotated Smaller RNAs (Annotation Dependent):

All annotated exons in Gencode v3c are used for this set of elements. The exon boundaries for the above biotypes are used to define element boundaries. Similar filtering criteria are applied here. Namely, prior to determining the expression values for the above exons we filter out multi-mappers, reads that map to fewer than 16 nucleotides in length, mapped reads with internal A-stretches, and mapped reads that have a genomically encoded A-stretch at their 3’ ends. The exons are then scored independently for each biological replicate according to their mapped read counts. A read is counted only if its 5’ end maps to the exon. Exons with a score of zero in both biological replicates are removed. Non-parametric IDR is then calculated. The results are reported in GFF format.

GFF Format:
Column 1: Chromosome ID
Column 2: Gencode Annotation version (v3c)
Column 3: Gencode Annotation type (exon)
Column 4: Start
Column 5: Stop
Column 6: Exon Expression Score (RPM computed using the summed reads from both bioreplicates. Value capped at 1,000 for display purposes. RPM is the number of reads mapped to the exon per million mapped reads. It is not normalized by exon length.)
Column 7: Strand
Column 8: Frame
Column 9: Description (the un-capped RPM calculated from the pooled reads is shown in this field)

--------------------------------------------------------------------------------
README for CSHL Production pipeline:
--------------------------------------------------------------------------------

CSHL ENCODE Small RNA Production Pipeline
Jan 2011 Data Freeze
Carrie A.
Davis, Wei Lin, Alex Dobin and Thomas Gingeras

Structure of Small RNA Library:

5’ Adapter (this is the RNA ligated onto the 5’ end; “r” = ribose, RNA base):
5’- rArCrArCrUrCrUrUrUrCrCrCrUrArCrArCrGrArCrGrCrUrCrUrUrCrCrGrArUrCrU

Alternate Barcoded 5’ Adapter (this is the RNA ligated onto the 5’ end; “r” = ribose, RNA base):
5’- rArCrArCrUrCrUrUrUrCrCrCrUrArCrArCrGrArCrGrCrUrCrUrUrCrCrGrArUrCrUrNrNrNrCrG

3’ Adapter:
Polyadenines added by Poly-A Polymerase

RT Primer:
5’-TCTCGGCATTCCTGCTGAACCGCTCTTCCGATCTTTTTTTTTTTTVN

PE 5’ PCR (PCR Primer):
5’-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATC

PE 3’ PCR (PCR Primer):
5’-CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTC

Sequencing:

The libraries are sequenced in SE36 format from the 5’ end of the transcript.

Pre-Mapping Filtration of the Reads:

Biochemistry is not perfect. Some residual DNA and small nonfunctional decay products are likely to be surreptitiously, yet reproducibly, cloned in any protocol and will hence be mapped and survive IDR. To avoid misinterpretation of these false positives we apply a variety of filters.

It is important to remove the A-tails and the Illumina 3’ linker added in vitro prior to mapping; otherwise these sequences can cause reads to mismap to sites that are genomically rich in polyadenine stretches or that have homology to the Illumina adapter. Prior to mapping, the reads are scanned and sequences with homology to the adapters are clipped off according to the following rule. The clipper algorithm first aligns the adapter sequence to the read and finds the alignment with the fewest mismatches. If the number of mismatches is less than 20% of the aligned length, it clips off the sequence from the position of the first aligned base to the end of the read. After clipping, the remaining sequences, as well as the reads that were not clipped, are mapped.
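
The clipping rule above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the pipeline's actual clipper: the alignment is ungapped, ties between equally good alignments are broken toward the leftmost start, and the function name clip_adapter and the min_overlap parameter are hypothetical. The adapter string is the adapter3pSeq value listed under Mapping Parameters.

```python
# Adapter = poly-A tail followed by the Illumina 3' linker (adapter3pSeq).
ADAPTER = "AAAAAAAAAAAAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGA"


def clip_adapter(read, adapter=ADAPTER, max_mismatch_frac=0.2, min_overlap=1):
    """Clip a 3' adapter from a read (hypothetical helper, ungapped alignment).

    The adapter start is slid along the read. For each start position the
    aligned length is the overlap between the adapter and the rest of the read.
    The alignment with the fewest mismatches is kept (assumption: leftmost
    start wins ties). If its mismatch count is below max_mismatch_frac of the
    aligned length, the read is cut at the first aligned base.
    """
    best = None  # (mismatches, start, aligned_length)
    for start in range(len(read) - min_overlap + 1):
        aligned_len = min(len(adapter), len(read) - start)
        segment = read[start:start + aligned_len]
        mismatches = sum(1 for r, a in zip(segment, adapter) if r != a)
        # Strict "<" keeps the leftmost alignment among equal mismatch counts.
        if best is None or mismatches < best[0]:
            best = (mismatches, start, aligned_len)
    if best is not None and best[0] < max_mismatch_frac * best[2]:
        return read[:best[1]]
    return read


if __name__ == "__main__":
    # An 8-nt insert followed by the first 16 nt of the adapter.
    print(clip_adapter("ACGTACGT" + "AAAAAAAAAAAAGATC"))
```

Read literally, the rule also trims a single trailing A from a read that carries no adapter, because a one-base alignment with zero mismatches passes the 20% test; raising the hypothetical min_overlap parameter is one way to explore how much that matters.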

Mapping Parameters:

STAR is used to map the reads with the following parameters:

ntTrim5p 0
ntTrim3p 0
QasciiSubtract 64
Qsplit 10
Qtop 10
penGgap 1000
penNoncan 41
minDel 5
penDel 11
penDelBase 10
winFlank 250000
anchorMaxDist 501000
anchorMaxMapN 20
nMatch 16
nMM 10
pMM 0.1
multMapMaxScoreDiff 19
minDonorAcceptor 5
chimL 10000000
minChimericDA 20
maxChimReadGap 0
multMapNmaxOut 10
adapter3pSeq AAAAAAAAAAAAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGA
strandType 1
adapter3pClipMMp 0.2

All mapped reads satisfying these criteria are reported in the .BAM and .Wig files at UCSC.

Post-Mapping Filtration of Mapped Reads:

Some offending reads survive the pre-mapping filter, so we apply additional post-mapping filters prior to element calling.

(1) If the mapped read contains 5 or more consecutive A’s in its sequence, it is discarded. We believe these are artifacts escaping the current pre-mapping filter.

Other types of false positives can only be identified post-mapping and also need to be filtered. The shorter a sequence is, the more likely it is to map non-specifically to multiple locations, hence we filter:

(2) all reads that map to fewer than 16 nucleotides in length, and

(3) all reads that map to more than one locus (i.e. all multimappers). Note: We don’t think that all the multimappers are artifacts. Indeed, many of the small RNA classes are by nature heavily repeated. We simply don’t yet know how to treat multimappers, and their removal streamlines analysis.

(4) RT prefers to use RNA as a template, but it is also active on DNA. Anchored oligo-dT is used to prime the RT reaction, so reads mapping upstream of genomically encoded poly-A tracts could represent RT priming off of residual DNA. We filter mapped reads if there are 4 or more consecutive A’s in the 7 nucleotides 3’ of the mapped sequence.

Filtered data from both biological replicates are then pooled and contigs are called.
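
The four post-mapping filters can be sketched in Python as below. This is a minimal illustration, not the production code: the MappedRead record, its field names, and the function passes_post_mapping_filters are hypothetical, and the 7 nucleotides downstream of the alignment are assumed to be supplied already oriented on the strand of the read.

```python
from dataclasses import dataclass


@dataclass
class MappedRead:
    """Hypothetical record for one alignment (field names are assumptions)."""
    seq: str            # read sequence as mapped, 5' to 3'
    mapped_length: int  # number of nucleotides that aligned
    n_loci: int         # number of genomic loci the read maps to
    downstream: str     # genomic sequence immediately 3' of the alignment


def passes_post_mapping_filters(read):
    """Return True if the read survives filters (1) to (4)."""
    # (1) discard reads with 5 or more consecutive A's in their sequence
    if "A" * 5 in read.seq:
        return False
    # (2) discard reads that map to fewer than 16 nucleotides
    if read.mapped_length < 16:
        return False
    # (3) discard multimappers
    if read.n_loci > 1:
        return False
    # (4) discard reads with 4 or more consecutive A's in the 7 nt 3' of the
    #     mapped sequence (possible oligo-dT priming off residual DNA)
    if "A" * 4 in read.downstream[:7]:
        return False
    return True


if __name__ == "__main__":
    good = MappedRead("ACGTACGTACGTACGTAC", 18, 1, "CGTACGT")
    primed = MappedRead("ACGTACGTACGTACGTAC", 18, 1, "CAAAAGT")
    print(passes_post_mapping_filters(good), passes_post_mapping_filters(primed))
```

Reads from both replicates that return True would then be pooled for contig calling.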

Contig Calling (Annotation Agnostic):

Contigs represent the union of the qualified (post-filtered) mapped reads from each biological replicate. Each contig gets a score representing the sum of the reads from both bioreplicates. Contigs with only a single read (replicate counts of 0,1 or 1,0) are discarded. The surviving contigs are then independently scored for each replicate according to their mapped read count. These values are used to compute Non-parametric IDR (A Dobin, CSHL) values for each contig. These results are reported in .BED format.

.BED Format:
Column 1: Chromosome ID
Column 2: Contig Start
Column 3: Contig Stop
Column 4: Contig Name (named by co-ordinates)
Column 5: Contig Expression Score (RPKM computed using the summed reads from both bioreplicates. Value capped at 1,000.)
Column 6: Strand
Column 7: Real RPKM (non-capped)
Column 8: IDR score ("." if there is only 1 replicate)
Column 9: Sum of the read counts for both biological replicates

Contact:
Carrie A. Davis (davisc at cshl dot edu), Wei Lin (wlin at cshl dot edu)
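
As an illustration of the contig-calling step described in this README, the Python sketch below merges pooled reads into contigs, drops single-read contigs, and computes the capped and un-capped RPKM. It rests on several assumptions not stated in the README: reads are half-open intervals, overlapping or book-ended reads on the same chromosome and strand are merged, and the RPKM denominator is the total number of pooled post-filter reads. The function name call_contigs is hypothetical, and the IDR computation is not shown.

```python
from collections import defaultdict


def call_contigs(reads, cap=1000.0):
    """Merge reads into contigs and score them (hypothetical sketch).

    reads: iterable of (chrom, start, end, strand, replicate) tuples with
    half-open coordinates and replicate equal to 0 or 1.
    """
    by_strand = defaultdict(list)
    total_reads = 0
    for chrom, start, end, strand, rep in reads:
        by_strand[(chrom, strand)].append((start, end, rep))
        total_reads += 1

    contigs = []
    for (chrom, strand), intervals in sorted(by_strand.items()):
        current = None
        for start, end, rep in sorted(intervals):
            if current is not None and start <= current["end"]:
                # Read overlaps or touches the open contig: extend it.
                current["end"] = max(current["end"], end)
            else:
                current = {"chrom": chrom, "start": start, "end": end,
                           "strand": strand, "counts": [0, 0]}
                contigs.append(current)
            current["counts"][rep] += 1

    kept = []
    for contig in contigs:
        n_reads = sum(contig["counts"])
        if n_reads < 2:
            continue  # single-read contigs (0,1 or 1,0) are discarded
        length = contig["end"] - contig["start"]
        # RPKM = reads per kilobase of contig per million mapped reads.
        contig["rpkm"] = n_reads * 1e9 / (length * total_reads)
        contig["score"] = min(contig["rpkm"], cap)  # capped value for column 5
        kept.append(contig)
    return kept


if __name__ == "__main__":
    pooled = [("chr1", 100, 120, "+", 0), ("chr1", 110, 130, "+", 1),
              ("chr1", 500, 520, "+", 0)]
    for c in call_contigs(pooled):
        print(c["chrom"], c["start"], c["end"], c["strand"], c["counts"])
```

The per-replicate counts kept in each contig are the values that would be passed on to the IDR calculation.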