CSV (Comma-Separated Values) and TSV (Tab-Separated Values) files are common data exchange formats in many fields, including bioinformatics. CSV is the more powerful of the two, because content inside quoting characters (double quotation marks, most of the time) may contain the field separator (a comma). Therefore we cannot simply split a line by the field separator.
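The quoting problem above can be shown with a minimal sketch using Python's standard csv module (the sample line and field values are made up for illustration):

```python
import csv
import io

# A line whose quoted field contains the separator:
line = 'gene1,"NC_000913.3, Escherichia coli",4641652\n'

# Naive splitting breaks the quoted field into two pieces:
naive = line.strip().split(',')
print(len(naive))  # 4 fields instead of 3

# csv.reader handles the quoting correctly:
row = next(csv.reader(io.StringIO(line)))
print(row)  # ['gene1', 'NC_000913.3, Escherichia coli', '4641652']
```

For TSV files, pass `delimiter="\t"` to `csv.reader`; splitting on tabs is usually safe only when fields are guaranteed never to contain tabs or quotes.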
Any information transmitted over the Internet can be maliciously intercepted. Nevertheless, we still keep a lot of important data online, such as private email and bank transactions. That is because something called SSL/TLS/HTTPS protects our information by encrypting the communication between us and the website's server. If a website considers its users' data sensitive and plans to use SSL/TLS/HTTPS encryption, it must first apply for a certificate from a company or organization with CA (Certificate Authority) authority. Companies and organizations with CA authority have passed global audits and are considered trustworthy.
CNNIC could forge a fake certificate for any website at will, replace the website's real certificate, and thereby steal any of our data! As for the motive, well, you know why.
I did some analysis of high-throughput sequencing data this week. Here is some experience worth sharing.
1. Make scripts flexible and reproducible
Use an argument parser to handle different running conditions/steps. Make sure parameters are easy to change from the command line.
Use a counting option "verbose" together with the logging module for multiple levels of output, e.g. "quiet" -> "info" -> "verbose output" -> "debug info", to avoid repeatedly changing debugging code.
FASTA format is a basic sequence format in the field of Bioinformatics. It’s easy to manipulate and parse.
Please also try my other cool tool, SeqKit – a cross-platform and ultrafast toolkit for FASTA/Q file manipulation!
In my practice, I do a lot of work with FASTA format files, and I wrote some scripts to parse and analyze them. There are also some great tools, such as the Bio* packages: BioPerl, Biopython and BioJava.
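For simple cases, a FASTA parser does not require a whole Bio* package; a minimal plain-Python sketch (not any of the scripts mentioned above) looks like this:

```python
import io

def parse_fasta(handle):
    """Yield (header, sequence) tuples from a FASTA file handle."""
    header, seq = None, []
    for line in handle:
        line = line.rstrip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []   # drop the leading '>'
        elif line:
            seq.append(line)             # sequence may span many lines
    if header is not None:
        yield header, "".join(seq)

fasta = io.StringIO(">seq1 test\nACGT\nACGT\n>seq2\nGGCC\n")
print(list(parse_fasta(fasta)))
# [('seq1 test', 'ACGTACGT'), ('seq2', 'GGCC')]
```

When you need quality scores, alphabets or format conversion, Biopython's `SeqIO.parse(handle, "fasta")` provides the same iteration with full-featured record objects.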
I plan to use Python as one of my main programming languages because of its huge number of libraries.
last update 2015-8-14
data.setdefault('names', []).append('Ruby')
data.get('foo', 0) # not data['foo']
[3*x for x in vec if x > 3] # [(x, x**2) for x in vec]
for x, y in data.iteritems():
for x, y in zip(a, b):
Python is very popular. A lot of bioinformatics software is written in Python, so we have to learn how to install Python applications, particularly when we have no root privileges.
wget https://www.python.org/ftp/python/2.7.8/Python-2.7.8.tgz
tar -zxf Python-2.7.8.tgz
cd Python-2.7.8
./configure --prefix=/db/home/shenwei/local/app/python
make
make install
echo export LOCAL_PYTHON=/db/home/shenwei/local/app/python >> ~/.bashrc
echo export PYTHONPATH=\$LOCAL_PYTHON:\$LOCAL_PYTHON/lib/python2.7/site-packages:\$PYTHONPATH >> ~/.bashrc
echo export PATH=\$LOCAL_PYTHON/bin:\$PATH >> ~/.bashrc
. ~/.bashrc
As the title says, I lost my code, which cost me more than half a day.
All possible rescue solutions failed: mv was used, which overwrote the source file. It's incurable.
Some common practices summarized from other great tools.
last update: 2014-09-08
Name
The name should be descriptive and easy to remember.
Options
Use an option parser library to parse them. Long-named options are recommended to make them readable.
Some necessary options
Additional options
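As a sketch of the option conventions above, here is how a tool might declare long-named options with Python's argparse (the tool name, options and defaults are hypothetical examples, not a real program):

```python
import argparse

# Every short option gets a descriptive long name; -h/--help is added
# automatically, and --version is one extra line.
parser = argparse.ArgumentParser(prog="mytool",
                                 description="demo of descriptive long options")
parser.add_argument("--version", action="version", version="mytool 0.1")
parser.add_argument("-o", "--out-file", default="-",
                    help="output file, '-' for stdout [default: -]")
parser.add_argument("-t", "--threads", type=int, default=1,
                    help="number of threads [default: 1]")

args = parser.parse_args(["--out-file", "result.txt", "-t", "8"])
print(args.out_file, args.threads)  # result.txt 8
```

On the command line, `mytool --out-file result.txt` reads much better in a shared pipeline script than a bare `-o result.txt`.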
This is a follow-up to the previous post, "Using the Job Distribution System Gearman to Speed up Your Computing Tasks".
Yesterday I was handling some parallelizable, CPU-intensive computing tasks: each job contains sub-jobs with ordering dependencies, some of which can run in parallel, while the following sub-jobs must wait until all of them have finished. The workflow looks like this:
|-> subjob2 |
JOB1: subjob1 -> |-> subjob3 | -> subjob5 -> subjob6
|-> subjob4 |
|-> subjob2 |
JOB2: subjob1 -> |-> subjob3 | -> subjob5 -> subjob6
|-> subjob4 |
|-> subjob2 |
JOB3: subjob1 -> |-> subjob3 | -> subjob5 -> subjob6
|-> subjob4 |
Tired of long blocks of shell commands, I wrote a very practical script, crun, to execute partially parallelizable chained tasks, i.e., to run a single JOB.
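The fan-out/fan-in pattern in the diagram above can be sketched in a few lines of Python. This is not the crun implementation, just an illustration of the dependency logic; the subjob function is a placeholder for the real CPU-intensive commands, and a thread-based pool is used here for simplicity:

```python
from multiprocessing.dummy import Pool  # thread-based; same API as multiprocessing.Pool

def subjob(n):
    # Placeholder for a real CPU-intensive sub-task
    return "subjob%d done" % n

def run_job():
    results = [subjob(1)]                 # subjob1 must finish first
    pool = Pool(processes=3)
    parallel = pool.map(subjob, [2, 3, 4])  # subjob2-4 run in parallel
    pool.close()
    pool.join()                           # barrier: wait for all three
    results.extend(parallel)
    results.append(subjob(5))             # then the chain continues
    results.append(subjob(6))
    return results

if __name__ == "__main__":
    print(run_job())
```

The key point is the barrier: `pool.join()` guarantees that subjob5 only starts after subjob2, subjob3 and subjob4 have all completed, which is exactly the constraint in the workflow above.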