Manipulation on CSV/TSV files

· Read in about 2 min · (306 Words)

CSV (Comma Separated Values) files and TSV (Tab Separated Values) files are common data transfer files in many fields, including Bioinformatics. CSV files is more powerful, because content in quoting characters (double quotation marks,  most of the time) could contains field separators (comma). Therefore we could not just split one line by filed separators.

Common manipulations on CSV/TSV files include grepping, joining, sorting, sub-sampling. Handling of TSV files and some of CSV can be implemented easily by *nix tools including sed , cut , awk , paste  and so on. But full manipulations need specialized tools. For example, csvkit brings tools including csvcut , csvgrep , csvstat , csvsql , csvmerge  and so on.

In my Bioinformatcis analysis work, I need grepping and joining more often. And in some cases, I need specify multiple indice (keys) , for example, start+end+strand in GFF files refers to an unique record. However, it’s not supported by csvkit. So I implemented csv_grep (csv_grep.go, golang version, faster with concurrency.) and csv_join, hosted on Github.

Being a data tranfer formats,  CSV/TSV has be an intermedium format in data science.  It could be easily reformated from and to other formats. Relevant tools should also be developed to reach this demand.

For example, the simplest FASTA sequence format could be reformated to TSV format, which make it very convenient for grepping, sorting and extraction. See tools fasta2tab - tab2fasta,  and fastq2tab - tab2fastq, and more about  Manipulation on FASTA format file.

Another example is GFF file, which is one kind of TSV file too. These two days, I wrote bam2gff.py and gff2fa.py to extract inserts from BAM files, and a BEDTools intersect like tool gff_intersect.py which brings more personalized functions. All these tools are relevant to GFF files, they are small individually but powerful when combining together, this is philosophy of UNIX small tools.