Expression Values for Gencode v3c Annotated Smaller RNAs (Annotation Dependent):  All annotated exons in Gencode v3c are used for this set of elements:

The exon boundaries for the above biotypes are used to define element boundaries.  Similar filtering criteria are applied here. Namely, prior to determining the expression values for the above exons, we filter out multi-mappers, reads whose mapped length is less than 16 nucleotides, mapped reads with internal A-stretches, and mapped reads that have a genomically encoded A-stretch at their 3’ ends. The exons are then scored independently for each biological replicate according to their mapped read counts.  A read is counted only if its 5’ end maps to the exon.  Exons with a score of zero in both biological replicates are removed.  Non-parametric IDR is then calculated.  The results are reported in GFF format.
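
The scoring rule above can be sketched in Python as follows. This is only an illustration: the record layouts, the 1-based inclusive coordinates, the same-strand requirement, and all function names are assumptions made for the example, not the pipeline's actual code.

```python
from collections import namedtuple

# Hypothetical record types; coordinates are assumed 1-based and inclusive.
Exon = namedtuple("Exon", "chrom start end strand")
Read = namedtuple("Read", "chrom start end strand")


def five_prime_pos(read):
    """Genomic position of the read's 5' end (start on +, end on -)."""
    return read.start if read.strand == "+" else read.end


def score_exons(exons, reads):
    """Count, per exon, the reads whose 5' end falls inside the exon.

    Assumption: a read must be on the same chromosome and strand as the exon.
    """
    counts = {exon: 0 for exon in exons}
    for read in reads:
        pos = five_prime_pos(read)
        for exon in exons:
            if (exon.chrom == read.chrom and exon.strand == read.strand
                    and exon.start <= pos <= exon.end):
                counts[exon] += 1
    return counts


def drop_unexpressed(counts_rep1, counts_rep2):
    """Remove exons that score zero in both biological replicates."""
    return {
        exon: (counts_rep1[exon], counts_rep2[exon])
        for exon in counts_rep1
        if counts_rep1[exon] > 0 or counts_rep2[exon] > 0
    }
```

Each replicate is scored separately with score_exons, and the two count tables are then passed to drop_unexpressed before IDR is computed.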

.GFF Format:
Column 1: Chromosome ID
Column 2: Gencode Annotation version (v3c)
Column 3: Gencode Annotation type (exon)
Column 4: Start
Column 5: Stop 
Column 6:  Exon Expression Score (RPM computed using the summed reads from both bioreplicates. Value capped at 1,000 for display purposes. RPM is the number of reads mapped to the exon per million mapped reads. It is not normalized by exon length). 
Column 7: Strand
Column 8: Frame
Column 9: Description (the uncapped RPM calculated from the pooled reads is shown in this field)
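
As a rough illustration of columns 6 and 9, the Python sketch below computes RPM from a pooled read count, applies the display cap of 1,000, and assembles one tab-delimited line. The choice of denominator (total pooled mapped reads), the "." placeholder for frame, the "RPM=" layout of the description field, and the two-decimal formatting are assumptions, not taken from the pipeline.

```python
SCORE_CAP = 1000.0  # display cap stated for column 6


def rpm(read_count, total_mapped_reads):
    """Reads per million mapped reads; not normalized by exon length."""
    return read_count * 1_000_000 / total_mapped_reads


def gff_line(chrom, start, end, strand, pooled_count, total_mapped_reads):
    """Build one GFF line following the nine columns listed above.

    Assumptions: frame is written as ".", and the description field
    carries the uncapped value as "RPM=<value>".
    """
    value = rpm(pooled_count, total_mapped_reads)
    capped = min(value, SCORE_CAP)
    fields = [
        chrom, "v3c", "exon", str(start), str(end),
        f"{capped:.2f}", strand, ".", f"RPM={value:.2f}",
    ]
    return "\t".join(fields)
```

An exon with 5,000 pooled reads out of 2,000,000 mapped reads has an RPM of 2,500, so column 6 shows the capped value and column 9 keeps the uncapped one.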

--------------------------------------------------------------------------------
README for CSHL Production pipeline:
--------------------------------------------------------------------------------
CSHL ENCODE Small RNA Production Pipeline 
Jan 2011 Data Freeze

Carrie A. Davis, Wei Lin, Alex Dobin and Thomas Gingeras

Structure of Small RNA Library:  

5’ Adapter (This is the RNA ligated onto the 5’ end): “r” = ribose, RNA base
	5’- rArCrArCrUrCrUrUrUrCrCrCrUrArCrArCrGrArCrGrCrUrCrUrUrCrCrGrArUrCrU
Alternate Barcoded 5’ Adapter (This is the RNA ligated onto the 5’ end): “r” = ribose, RNA base
	5’- rArCrArCrUrCrUrUrUrCrCrCrUrArCrArCrGrArCrGrCrUrCrUrUrCrCrGrArUrCrUrNrNrNrCrG
3’ Adapter: Polyadenines added by Poly-A Polymerase
RT Primer:
	5’-TCTCGGCATTCCTGCTGAACCGCTCTTCCGATCTTTTTTTTTTTTVN
PE 5’ PCR (PCR Primer):
	5’-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATC
PE 3’ PCR (PCR Primer):
	5’-CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTC

Sequencing:  The libraries are sequenced in SE36 format from the 5’ end of the transcript.  

Pre-Mapping Filtration of the Reads:  Biochemistry is not perfect. Some residual DNA and small nonfunctional decay products are likely to be surreptitiously, yet reproducibly cloned in any protocol and will hence be mapped and survive IDR.  To avoid misinterpretation of these false positives we apply a variety of filters.  

It is important to remove the A-tails and Illumina 3’ linker added in vitro prior to mapping; otherwise the inclusion of these sequences can mismap the reads to sites that are genomically rich in polyadenine stretches and have homology to the Illumina adapter.  Prior to mapping, the reads are scanned and sequences with homology to the adapters are clipped off according to the following rule. The clipper algorithm first aligns the adapter sequence to the read and finds the alignment with the fewest mismatches. If the number of mismatches is less than 20% of the aligned length, it clips off the sequence from the position of the first aligned base to the end of the read.  After clipping, the remaining sequences, together with the reads that were not clipped, are mapped.
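
A literal reading of the clipping rule can be sketched in Python as below. This is not the STAR implementation. It assumes an ungapped alignment in which the adapter start is slid along the read, and it breaks ties in favor of the leftmost (longest) overlap, which the text above does not specify. The adapter string and the 0.2 threshold are the adapter3pSeq and adapter3pClipMMp values listed under Mapping Parameters.

```python
ADAPTER_3P = "AAAAAAAAAAAAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGA"  # adapter3pSeq
MAX_MISMATCH_FRACTION = 0.2  # adapter3pClipMMp


def clip_adapter(read, adapter=ADAPTER_3P, max_frac=MAX_MISMATCH_FRACTION):
    """Clip the read from the first aligned adapter base to the read end.

    Every ungapped placement of the adapter start within the read is tried.
    The placement with the fewest mismatches wins (leftmost on ties, an
    assumption). The read is clipped only if the mismatches are fewer than
    max_frac of the aligned length; otherwise it is returned unchanged.
    """
    best_pos, best_mm, best_len = None, None, 0
    for pos in range(len(read)):
        overlap = min(len(adapter), len(read) - pos)
        mismatches = sum(
            1 for i in range(overlap) if read[pos + i] != adapter[i]
        )
        if best_mm is None or mismatches < best_mm:
            best_pos, best_mm, best_len = pos, mismatches, overlap
    if best_pos is not None and best_mm < max_frac * best_len:
        return read[:best_pos]
    return read
```

Under this literal rule a very short perfect overlap at the read end can outrank a longer overlap with one mismatch, so a production clipper would likely weigh aligned length as well.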

Mapping Parameters:  STAR is used to map the reads with the following parameters: 

ntTrim5p 0
ntTrim3p 0
QasciiSubtract 64
Qsplit 10
Qtop 10
penGgap 1000
penNoncan 41
minDel 5
penDel 11
penDelBase 10
winFlank 250000
anchorMaxDist 501000
anchorMaxMapN 20
nMatch 16
nMM 10
pMM 0.1
multMapMaxScoreDiff 19
minDonorAcceptor 5
chimL 10000000
minChimericDA 20
maxChimReadGap 0
multMapNmaxOut 10
adapter3pSeq AAAAAAAAAAAAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGA
strandType 1
adapter3pClipMMp 0.2

All mapped reads satisfying these criteria are reported in the .BAM and .Wig files at UCSC.

Post-Mapping Filtration of Mapped Reads:  Some offending reads survive the pre-mapping filter so we apply additional post-mapping filters prior to element calling.
(1) If the mapped read contains 5 or more consecutive A’s in its sequence it is discarded.  We believe these are artifacts escaping the current pre-mapping filter.

Other types of false positives can only be identified post-mapping and also need to be filtered: 

The shorter a sequence is, the more likely it is to map non-specifically to multiple locations; hence we filter (2) all reads whose mapped length is less than 16 nucleotides and (3) all reads that map to more than one locus (i.e. all multimappers).  Note: We don’t think that all the multimappers are artifacts. Indeed, many of the small RNA classes are by nature heavily repeated. We simply don’t yet know how to treat multimappers, and their removal streamlines analysis.  

(4) Yes, RT prefers to use RNA as a template. However, it is also active on DNA.  Anchored oligo-dT is used to prime the RT reaction. Reads mapping upstream of genomically encoded poly-A tracts could represent RT priming off of residual DNA.  We filter mapped reads if there are 4 or more consecutive A’s in the 7 nucleotides 3’ of the mapped sequence. 
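
Filters (1) through (4) can be summarized in a short Python sketch. The function names and argument layout are hypothetical. In particular, the caller is assumed to supply the 7 genomic nucleotides immediately 3’ of the mapped read, already oriented on the read's strand.

```python
def has_a_run(seq, min_run):
    """True if seq contains at least min_run consecutive A's."""
    return "A" * min_run in seq.upper()


def passes_post_mapping_filters(read_seq, mapped_length, n_loci, downstream7):
    """Apply the four post-mapping filters described above.

    read_seq      -- sequence of the mapped read
    mapped_length -- number of nucleotides of the read that mapped
    n_loci        -- number of loci the read maps to
    downstream7   -- the 7 genomic nt 3' of the mapped sequence (assumed to
                     be given on the read's strand)
    """
    if has_a_run(read_seq, 5):          # (1) internal A-stretch
        return False
    if mapped_length < 16:              # (2) too short to map reliably
        return False
    if n_loci > 1:                      # (3) multimapper
        return False
    if has_a_run(downstream7[:7], 4):   # (4) genomically encoded A-stretch
        return False
    return True
```

Reads for which the function returns True are the ones carried forward to pooling and contig calling.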

Filtered data from both biological replicates is then pooled and contigs are called.

Contig Calling (Annotation Agnostic):  Contigs represent the union of the qualified/post-filtered mapped reads from each biological replicate.  Each contig gets a score representing the sum of the reads from both bioreps.  Contigs with only a single read (0,1; 1,0) are discarded.  The surviving contigs are then independently scored for each replicate according to their mapped read count.  These values are used to compute Non-parametric-IDR (A Dobin, CSHL) values for each contig.  These results are reported in .BED format.
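
The contig-calling steps can be illustrated with the Python sketch below. It assumes reads are (chrom, strand, start, end) tuples with inclusive coordinates, that only reads on the same chromosome and strand are merged, and that reads must overlap by at least one base to join a contig. None of these details are stated above, and the IDR step is not shown.

```python
def call_contigs(rep1_reads, rep2_reads):
    """Merge filtered reads from both replicates into contigs.

    Returns a list of dicts with the contig coordinates and per-replicate
    read counts. Contigs supported by a single read in total are dropped.
    """
    # Tag each read with its replicate index, then sort by position.
    tagged = [(r, 0) for r in rep1_reads] + [(r, 1) for r in rep2_reads]
    tagged.sort(key=lambda item: item[0])

    contigs = []
    for (chrom, strand, start, end), rep in tagged:
        last = contigs[-1] if contigs else None
        if (last is not None and last["chrom"] == chrom
                and last["strand"] == strand and start <= last["end"]):
            # Read overlaps the current contig: extend it.
            last["end"] = max(last["end"], end)
        else:
            last = {"chrom": chrom, "strand": strand,
                    "start": start, "end": end, "counts": [0, 0]}
            contigs.append(last)
        last["counts"][rep] += 1

    # Discard contigs with only a single read, i.e. counts (0,1) or (1,0).
    return [c for c in contigs if sum(c["counts"]) > 1]
```

The two entries of "counts" are the per-replicate scores that would feed the IDR calculation, and their sum is the pooled contig score.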

.BED Format:
Column 1: Chromosome ID
Column 2: Contig Start
Column 3: Contig Stop
Column 4: Contig Name (named by co-ordinates)
Column 5: Contig Expression Score (RPKM computed using the summed reads from both bioreplicates. Value capped at 1,000). 
Column 6: Strand
Column 7:  Real RPKM (non-capped)
Column 8: IDR score (“.” if there is only 1 replicate)
Column 9: Sum of the read counts for both biological replicates. 
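
For readers consuming these files, a minimal Python sketch of a parser for the nine columns is given below, together with the conventional RPKM formula. Tab delimiters, the dictionary keys, and the helper names are assumptions, and the README does not spell out its exact RPKM computation.

```python
def rpkm(read_count, length_nt, total_mapped_reads):
    """Conventional RPKM: reads per kilobase per million mapped reads."""
    return read_count * 1e9 / (length_nt * total_mapped_reads)


def parse_contig_bed_line(line):
    """Parse one line of the 9-column contig .BED file described above.

    Assumption: fields are tab-delimited. An IDR of "." (single
    replicate) is returned as None.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, found {len(fields)}")
    return {
        "chrom": fields[0],
        "start": int(fields[1]),
        "end": int(fields[2]),
        "name": fields[3],
        "score": float(fields[4]),       # RPKM capped at 1,000
        "strand": fields[5],
        "rpkm": float(fields[6]),        # uncapped RPKM
        "idr": None if fields[7] == "." else float(fields[7]),
        "read_count": int(fields[8]),    # pooled over both replicates
    }
```

Comparing "score" with "rpkm" shows whether a contig was capped: the two differ only when the uncapped RPKM exceeds 1,000.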


Contact: Carrie A. Davis (davisc at cshl dot edu), Wei Lin (wlin at cshl dot edu)