Manipulation on FASTA format file

· Read in about 4 min · (789 Words)

FASTA  format is a basic sequence format in the field of Bioinformatics. It’s easy to manipulate and parse.

Please use my another cool tool, SeqKit – a cross-platform and ultrafast toolkit for FASTA/Q file manipulation!

In my practice, I do a lot of work with FASTA format file. And I wrote some scripts to parse and analyze it. These are also some great tools like Bio series package like BioPerl, Biopython and BioJava.

Reading.  For coding practice, I wrote several editions of FastaReader in Perl and Golang. It’s easy to implement, just read line by line and check the “>” character and append the sequence in multiple lines. Recently, I realize that the FastaReader should support to parse FASTA sequence from standard input, so the tools could be used in shell pipe.

Writing. Output of FASTA format is pretty easy. Some tools need the sequences in a single line. But most of the time, for readability, sequences should be format in multiple lines of  equal length.

Simple statistics, including length, base content (usually GC content). We may also want to plot a distribution figure.

Transformation, like reverse complement  sequence, translating nucleotides to amino acide, removinggaps.

Query and Filtering. They are very common operations of FASTA files. For example, 1) query record by partial or full length of FASTA record name, 2) find restrict enzyme digest site, 3) extract records longer  than a given length or GC content, 4) find common sequences in several files, 5) remove duplicate records.

In the past several years, I wrote a lot of single scripts to do a specific job. However, it get hard to manage the code. So I refactored some packages to do the reproducibility tiny job.

THE SMALL TOOLS MANIFESTO FOR BIOINFORMATICS, a project on Github describes motives, rules and recommendations for designing software and pipelines for current day biological and biomedical research. This guide together with some other great tools like BWA  give me a lot of guidance to write reproductive command line tools.

In my bio_scriptes repository, I wrote several effective tools to manipulate FASTA format file:

fasta2tab and tab2fasta are used in pair. fasta2tab  transforms the FASTA fromat to two-column table, fist column is the header and the second is sequence. Its could also compute the reverse complement sequence and remove gaps. Sequence length and GC content could be outputted as another column, which could be used for filtering and sorting. tab2fasta just tranform the table back to FASTA format. Combining with shell tool like awk and sed, it’s easy to filter, sort FASTA files. Examples:

1) sort fasta by sequnece length

cat seq.fa | fasta2tab -t -l | sort -r -t"`echo -e '\t'`" -n -k3,3 \
| tab2fasta -l 70 > seq.sorted.fa

2) extract sub sequence

fasta2tab -t -sub 3,10 -rc seq.fa | tab2fasta

3) extract sequence longer than 1000 bp

cat seq.fa | fasta2tab -t -l | awk -F'\t' '$3 >= 1000' | tab2fasta -l 70

4) extract aligned sequence of which the original sequence is longer than 1000 bp

cat seq.fa | fasta2tab -l2 | awk -F'\t' '$3 >= 1000' | tab2fasta -l 70

5) reverse complement sequence, uppercase, and trim gaps

zcat seq.fa.gz | fasta2tab -uc -rc -t | tab2fasta

Length and GC content of sequence could be passed to a plotting scripts, which could read data form STDIN or file to plot distribution figure with Seaborn. could extract FASTA sequences by header  or sequence, excact matching or regular expression  matching are both supported. The query pattern could comes from files. And negation the result is also easy to do. What the most important, it could read from STDIN.  Examples:

1) sequences WITH “bacteria” in header -r -p Bacteria *.fa > result.fa

2) sequences WITHOUT bacteria in header -r -n -p Bacteria seq1.fa seq2.fa > result.fa

3) sequences with TTSAA (AgsI digest site) in SEQUENCE.  Base S stands for C or G. -r -s -p 'TT[C|G]AA' seq.fa > result.fa

4) sequences (read from STDIN ) with header that matches any patterns in list file

zcat seq.fa.gz | -pf name_list.txt > result.fa is used to find common sequences in multiple files. It supports comparing by header or sequence. By storing the MD5 value of sequences, it has a low memory usage. It’s also could be used to remove duplicated records, by finding common sequencing from the file and its copy or soft link.  could remove duplicated records from file or STDIN, by both sequence and header. could find restrict enzyme recognition site or other motif location by plain text or regular expression from fasta file. and fasta_gc_skew.plot.R could plot GC content/GC skew