Session 4. File manipulation (GAWK scripting)



gawk program file

gawk '{STATEMENTS}' file

gawk 'BEGIN{STATEMENTS} {STATEMENTS} END {STATEMENTS}' file


Accessing fields:

$0 Entire line
$1 First field from current line
$2 Second field from current line
$n N-th field from current line
$NF Last field from current line

Built-in vars:

NF Number of fields in current line
NR Number of records (processed lines)
FS Field separator (default " ")
OFS Output field separator (default " ")
ORS Output record separator (between out lines, "\n")

Statements:


Filters:



Practice 4. File manipulation (GAWK scripting)



  1. % cd

  2. % mkdir work4

  3. % cd work4

  4. Save this sequence (FASTA) file

  5. Length of the sequence:
      % grep -v ">" data/M23768.fa | fold -1 | wc

  6. GC content:
      % grep -v ">" data/M23768.fa | fold -1 | sort | uniq -c | gawk '{print $2,$1/500}' | grep [CG] | gawk 'BEGIN{i=0}{i=i+$2;} END {print i}'



Access to the UCSC human genome browser at http://genome.cse.ucsc.edu. Go to the section Downloads (left frame) and select Annotation database. Then, save the files refGene.txt.gz and refLink.txt.gz

  1. % cd

  2. % cd work4

  3. Uncompress both files

  4. % head -5 refGene.txt

  5. % gawk '{print $0}' refGene.txt | more

  6. Number of genes:
     % gawk '{print $0}' refGene.txt | wc

  7. Number of unique genes:
     gawk '{print $1}' refGene.txt | sort | uniq | wc

  8. Genes with more than one isoform:
     % gawk '{print $1}' refGene.txt | sort | uniq -c | gawk '{if ($1 > 1) print $0;}' | sort -n

  9. % gawk '{print "Gene",NR,"\t",$0}' refGene.txt | more

  10. For each gene, print the ID (NM), chromosome location, number of exons (sort):
     % gawk '{print $1,$2,$8}' refGene.txt | sort +2nr | more

  11. The average number of exons:
     % gawk 'BEGIN{l=0;} {l=l+$8;} END {print "AvExons =",l/NR}' refGene.txt

  12. How many genes have more than 10 exons:
     % gawk '{if ($8>10) print $0}' refGene.txt | wc

  13. The average transcript length:
     gawk 'BEGIN{l=0;} {l=l+($5-$4)} END {print "AvTLength =",l/NR}' refGene.txt


  14. % head -5 refLink.txt

  15. For each gene, print the name, the gene and protein IDs, the strand and the chromosome location:
     % gawk 'BEGIN{OFS="\t";}{print $1,$2,$3,$8}' refGene.txt | sort > refGene.sort.txt

     % gawk 'BEGIN{FS="\t";OFS="\t";}{print $3,$4,$1}' refLink.txt | sort > refLink.sort.txt

     % head -5 *sort*

     % wc *sort*

     % join refGene.sort.txt refLink.sort.txt | gawk 'BEGIN{OFS="\t";}{print $6,$1,$5,$2,$3,$4}' | sort | more



Enrique Blanco © 2004 eblanco@imim.es