HOWTO

batch-select and download data files from the ENCODE RNA dashboard




Overview and format specification


All data files linked from the ENCODE RNA Dashboard are listed in an easily parsable index file (example), in order to facilitate custom filtering and batch downloads.

The format of the index files follows that of the "files.txt" metadata files provided by the ENCODE DCC (example). It consists of one line per data file.

Each line consists of two tab-separated columns:

Column 1 : URL of the file.

Column 2 : File attributes, provided as a semicolon-separated list of "key=value" pairs, which comply with UCSC's ENCODE controlled vocabulary. Note that the key-value pairs are listed in no particular order within column 2. Data files can be either selected or filtered out from the list based on their associated attributes.
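To make the two-column layout concrete, here is a minimal sketch that breaks each index line into its URL and its individual attributes. The function name "list_attributes" is hypothetical (it is not provided by the dashboard), and the sketch assumes that pairs are separated by a semicolon optionally followed by whitespace.

```shell
# Hypothetical helper: for each index line, print a "url<TAB>URL" row
# followed by one "key<TAB>value" row per attribute.
# Assumption: column 2 holds "key=value" pairs separated by ";" and
# optional whitespace.
list_attributes() {
    awk -F'\t' '{
        print "url\t" $1
        n = split($2, pairs, ";[ \t]*")
        for (i = 1; i <= n; i++) {
            eq = index(pairs[i], "=")
            if (eq == 0) continue    # skip empty fields and anything that is not key=value
            print substr(pairs[i], 1, eq - 1) "\t" substr(pairs[i], eq + 1)
        }
    }' "$@"
}
```

It can be given a file name (list_attributes hg19_RNA_dashboard_files.txt) or read index lines from standard input.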

Working example

Say you're interested in downloading all files containing RNA-Seq contigs from the long polyA+ extracts of K562. A possible way to achieve this, using a UNIX command line interface, would be:


1 - Download the index file (e.g. "hg19_RNA_dashboard_files.txt") to your hard disk.
     
2 - (Optional) Have a look at the values that the keys you're interested in can take. For example, to list the values of the "view" attribute, issue something like:

$ cut -f2 hg19_RNA_dashboard_files.txt                                                \
| perl -ne 'chomp; $_=~s/;\s+/;/g; @line=split ";"; print join("\n", @line)."\n"'   \
| grep "^view="                                                                     \
| sed 's/=/\t/'                                                                     \
| sort | uniq



3 - Select the lines fulfilling your criteria and write them to an output file ("hg19_RNA_dashboard_files_selected.txt"):

$ cat hg19_RNA_dashboard_files.txt     \
|  grep "dataType=RnaSeq;"       \
|  grep "view=Contigs;"          \
|  grep "rnaExtract=longPolyA;"  \
|  grep "cell=K562;"             \
> hg19_RNA_dashboard_files_selected.txt
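
The grep patterns above expect every pair to be followed by a semicolon, so a pair sitting at the very end of a line without one would be missed. As a possible alternative, the sketch below compares whole "key=value" pairs wherever they appear in column 2. The function name "select_files" is hypothetical, and the same separator assumption applies (a semicolon optionally followed by whitespace).

```shell
# Hypothetical helper: read index lines from standard input and keep those
# whose attribute column contains every "key=value" pair given as argument.
select_files() {
    awk -F'\t' '
        BEGIN {
            # Record the requested pairs, counting each distinct pair once.
            for (i = 1; i < ARGC; i++) {
                if (!(ARGV[i] in wanted)) nwanted++
                wanted[ARGV[i]] = 1
            }
            ARGC = 1    # make awk read standard input instead of the arguments
        }
        {
            found = 0
            split("", seen)    # reset the per-line bookkeeping
            n = split($2, pairs, ";[ \t]*")
            for (i = 1; i <= n; i++) {
                if ((pairs[i] in wanted) && !(pairs[i] in seen)) {
                    seen[pairs[i]] = 1
                    found++
                }
            }
            if (found == nwanted) print
        }' "$@"
}
```

For the working example, this could be called as:
select_files dataType=RnaSeq view=Contigs rnaExtract=longPolyA cell=K562 < hg19_RNA_dashboard_files.txt > hg19_RNA_dashboard_files_selected.txt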


4 - Loop over the selected file URLs and download them into the current directory using wget:

$ for url in $(cut -f1 hg19_RNA_dashboard_files_selected.txt)
do
wget --mirror -nd "$url"
done
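
Instead of looping in the shell, the URL list can also be handed to wget in a single call. The sketch below is one way to do this, not the only one. Both function names ("list_urls" and "download_selected") are hypothetical, and it uses "-c" (resume partial downloads) in place of "--mirror", which is a different choice from the loop above.

```shell
# Hypothetical helper: print the unique URLs (column 1) of an index file.
list_urls() {
    cut -f1 "$@" | sort -u
}

# Hypothetical helper: download every URL of the given index file into the
# current directory. "-i -" makes wget read the URL list from standard
# input, "-nd" keeps it from recreating the remote directory tree, and
# "-c" resumes partially downloaded files.
download_selected() {
    list_urls "$1" | wget -nd -c -i -
}
```

Typical use would be: download_selected hg19_RNA_dashboard_files_selected.txt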

Contact

Julien Lagarde, CRG, Barcelona (julienlag at gmail dot com).