Overview and format specification
All data files linked from the ENCODE RNA Dashboard are listed in an easily parsable index file (example), to facilitate custom filtering and batch downloads.
The format of the index files follows that of the "files.txt" metadata files provided by the ENCODE DCC (example).
It consists of one line per data file. Each line is tab-separated and has two columns:
Column 1: URL of the file.
Column 2: File attributes, provided as a semicolon-separated list of "key=value" pairs, which comply with UCSC's ENCODE controlled vocabulary. Note that the key-value pairs are listed in no particular order within column 2.
Data files can be either selected or filtered out from the list based on their associated attributes.
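To make the two-column layout concrete, here is a minimal sketch that splits one index line into its URL and its key-value pairs. The sample line is made up for illustration: the URL and the "ExampleView" value are placeholders, and parse_index_line is a hypothetical helper name, not something shipped with the dashboard.

```shell
# A made-up index line: URL, a tab, then "key=value" pairs.
# The URL and the view value are placeholders, not real entries.
line=$(printf 'http://example.org/data/file1.gtf.gz\tcell=K562; rnaExtract=longPolyA; view=ExampleView')

# parse_index_line (hypothetical helper): print the URL, then one
# "key<TAB>value" row per attribute found in column 2.
parse_index_line() {
    printf '%s\n' "$1" | awk -F '\t' '{
        print "url\t" $1
        n = split($2, pairs, /;[ \t]*/)      # tolerate spaces after semicolons
        for (i = 1; i <= n; i++) {
            if (pairs[i] == "") continue     # skip an empty trailing field
            eq = index(pairs[i], "=")        # split on the first "=" only
            print substr(pairs[i], 1, eq - 1) "\t" substr(pairs[i], eq + 1)
        }
    }'
}

parsed=$(parse_index_line "$line")
printf '%s\n' "$parsed"
```

Because the pairs come in no particular order, any script reading column 2 should look attributes up by key rather than by position.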
Working example
Say you're interested in downloading all files containing RNA-Seq contigs from the long polyA+ extracts of K562. A possible way to achieve this, using a UNIX command line interface, would be:
1 - Download the index file (e.g. "hg19_RNA_dashboard_files.txt") to your hard disk.
2 - (Optional): Have a look at the possible values the keys you're interested in can take. If you want to filter on the "view" attribute, issue something like:
$ cat hg19_RNA_dashboard_files.txt \
| cut -f2 \
| perl -ne 'chomp; $_=~s/;\s+/;/g; @line=split ";"; print join("\n", @line)."\n"' \
| grep -P "^view=" \
| sed 's/=/\t/' \
| sort | uniq
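The same inspection can be wrapped in a small reusable function that works for any key, using only cut, tr, sed, grep and sort. This is a sketch, not part of the dashboard tooling: list_values is a hypothetical name, and the demo index below contains made-up URLs and a placeholder "ExampleView" value.

```shell
# list_values FILE KEY (hypothetical helper): print the distinct values
# that KEY takes in column 2 of the index FILE, one per line.
list_values() {
    cut -f2 "$1" \
    | tr ';' '\n' \
    | sed 's/^[[:space:]]*//' \
    | grep "^$2=" \
    | sed 's/^[^=]*=//' \
    | sort -u
}

# Demo on a tiny made-up index (placeholder URLs and view value).
demo_index=$(mktemp)
printf '%s\t%s\n' \
    'http://example.org/a.gtf.gz' 'cell=K562; rnaExtract=longPolyA; view=ExampleView' \
    'http://example.org/b.gtf.gz' 'view=ExampleView; rnaExtract=longPolyA; cell=K562' \
    'http://example.org/c.gtf.gz' 'cell=HepG2; rnaExtract=longPolyA; view=ExampleView' \
    > "$demo_index"

cell_values=$(list_values "$demo_index" cell)
printf '%s\n' "$cell_values"
rm -f "$demo_index"
```

On the real index, the call would be something like: list_values hg19_RNA_dashboard_files.txt view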
3 - Select the lines fulfilling your criteria and write them to an output file ("hg19_RNA_dashboard_files_selected.txt"):
$ cat hg19_RNA_dashboard_files.txt \
| grep "rnaExtract=longPolyA;" \
| grep "cell=K562;" \
> hg19_RNA_dashboard_files_selected.txt
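The grep patterns above end with a semicolon, so they would miss a pair that happens to come last on its line with no semicolon after it. Whether that occurs in a given index file is an assumption worth checking. The sketch below matches whole key-value pairs regardless of their position. select_lines is a hypothetical helper, the demo URLs and "ExampleView" are placeholders, and values containing spaces are not supported.

```shell
# select_lines FILE KEY=VALUE... (hypothetical helper): print the index
# lines whose column 2 contains every requested pair, in any order.
select_lines() {
    file=$1; shift
    awk -F '\t' -v wanted="$*" '
    BEGIN { nw = split(wanted, want, " ") }
    {
        attrs = $2
        gsub(/^[ \t]+|[ \t]+$/, "", attrs)   # trim the column
        split("", seen)                       # portable way to empty an array
        n = split(attrs, pairs, /;[ \t]*/)
        for (i = 1; i <= n; i++) seen[pairs[i]] = 1
        for (j = 1; j <= nw; j++) if (!(want[j] in seen)) next
        print
    }' "$file"
}

# Demo on a tiny made-up index; in line "b" the cell pair comes last.
demo_index=$(mktemp)
printf '%s\t%s\n' \
    'http://example.org/a.gtf.gz' 'cell=K562; rnaExtract=longPolyA; view=ExampleView' \
    'http://example.org/b.gtf.gz' 'view=ExampleView; rnaExtract=longPolyA; cell=K562' \
    'http://example.org/c.gtf.gz' 'cell=HepG2; rnaExtract=longPolyA; view=ExampleView' \
    > "$demo_index"

selected=$(select_lines "$demo_index" cell=K562 rnaExtract=longPolyA | cut -f1)
printf '%s\n' "$selected"
rm -f "$demo_index"
```

To restrict the selection to contig files as well, a further argument of the form view=VALUE can be appended, where VALUE is whichever "view" value step 2 reports for contigs.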
4 - Loop over the selected file URLs and download them into the current directory using wget:
$ for url in `cut -f1 hg19_RNA_dashboard_files_selected.txt`; do wget --mirror -nd $url; done;
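A variant of this loop reads the URLs line by line and quotes each one, which is safer if a URL contains characters the shell would otherwise interpret. In this sketch, fetch_all and the FETCH_CMD override are hypothetical names introduced here to allow a dry run; the default command is the same wget call as above.

```shell
# fetch_all FILE (hypothetical helper): run the download command on each
# URL found in column 1 of FILE. FETCH_CMD is an illustrative override;
# when unset, the wget call from step 4 is used.
fetch_all() {
    cut -f1 "$1" | while IFS= read -r url; do
        [ -n "$url" ] || continue            # skip blank lines
        ${FETCH_CMD:-wget --mirror -nd} "$url"
    done
}

# Dry run on a made-up selection: print the URLs instead of fetching them.
demo_list=$(mktemp)
printf '%s\t%s\n' \
    'http://example.org/a.gtf.gz' 'cell=K562' \
    'http://example.org/b.gtf.gz' 'cell=K562' \
    > "$demo_list"

dry_run=$(FETCH_CMD=echo fetch_all "$demo_list")
printf '%s\n' "$dry_run"
rm -f "$demo_list"
```

For a real download, the call would be: fetch_all hg19_RNA_dashboard_files_selected.txt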
Contact
Julien Lagarde, CRG, Barcelona (julienlag at gmail dot com).