CSV (Comma-Separated Values) and TSV (Tab-Separated Values) files are common data exchange formats in many fields, including bioinformatics. CSV is the more powerful of the two, because content inside quoting characters (double quotation marks, most of the time) may contain the field separator (a comma). Therefore we cannot simply split a line by the field separator.
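The quoting problem above can be shown with a minimal sketch using Python's standard csv module (the sample line and field values are made up for illustration):

```python
import csv
import io

# A line whose quoted field contains the separator:
line = 'gene1,"NC_000913.3, Escherichia coli",4641652\n'

# Naive splitting breaks the quoted field into two pieces:
naive = line.strip().split(',')
print(len(naive))  # 4 fields instead of 3

# csv.reader handles the quoting correctly:
row = next(csv.reader(io.StringIO(line)))
print(row)  # ['gene1', 'NC_000913.3, Escherichia coli', '4641652']
```

For TSV files, pass `delimiter="\t"` to `csv.reader`; splitting on tabs is usually safe only when fields are guaranteed never to contain tabs or quotes.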
Any information transmitted over the Internet can be maliciously intercepted. Nevertheless, we still keep a lot of important data online, such as private email and bank transactions. That is because something called SSL/TLS/HTTPS protects our information by encrypting the communication between us and the website's server. If a website considers its users' data sensitive and plans to use SSL/TLS/HTTPS encryption, it must first apply for a certificate from a company or organization with CA (Certificate Authority) authority. Companies and organizations with CA authority have passed global audits and are considered trustworthy.
CNNIC could forge a fake certificate for any website at will, replace the website's real certificate, and thereby steal any of our data! As for the motive, well, you know why.
I did some analysis of high-throughput sequencing data this week. Here is some experience worth sharing.
1. Make scripts flexible and reproducible
Use an argument parser to handle different running conditions/steps. Make sure parameters are easy to change from the command line.
Use a counting option "verbose" together with the logging module for multiple levels of output, e.g. "quiet" -> "info" -> "verbose output" -> "debug info", to avoid repeatedly changing debugging code.
FASTA format is a basic sequence format in the field of Bioinformatics. It’s easy to manipulate and parse.
Please also try my other cool tool, SeqKit – a cross-platform and ultrafast toolkit for FASTA/Q file manipulation!
In my practice, I do a lot of work with FASTA format files, and I wrote some scripts to parse and analyze them. There are also some great tools, such as the Bio* packages: BioPerl, Biopython and BioJava.
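For simple cases, a FASTA parser does not require a whole Bio* package; a minimal plain-Python sketch (not any of the scripts mentioned above) looks like this:

```python
import io

def parse_fasta(handle):
    """Yield (header, sequence) tuples from a FASTA file handle."""
    header, seq = None, []
    for line in handle:
        line = line.rstrip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []   # drop the leading '>'
        elif line:
            seq.append(line)             # sequence may span many lines
    if header is not None:
        yield header, "".join(seq)

fasta = io.StringIO(">seq1 test\nACGT\nACGT\n>seq2\nGGCC\n")
print(list(parse_fasta(fasta)))
# [('seq1 test', 'ACGTACGT'), ('seq2', 'GGCC')]
```

When you need quality scores, alphabets or format conversion, Biopython's `SeqIO.parse(handle, "fasta")` provides the same iteration with full-featured record objects.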
I plan to use Python as one of my main programming languages because of its huge number of libraries.
last update 2015-8-14
data.setdefault('names', []).append('Ruby')
data.get('foo', 0) # not data['foo']
[3*x for x in vec if x > 3] # [(x, x**2) for x in vec]
for x, y in data.iteritems():
for x, y in zip(a, b):
Python is very popular. A lot of bioinformatics software is written in Python, so we have to learn how to install Python applications, particularly when we have no root privileges.
wget https://www.python.org/ftp/python/2.7.8/Python-2.7.8.tgz
tar -zxf Python-2.7.8.tgz
cd Python-2.7.8
./configure --prefix=/db/home/shenwei/local/app/python
make
make install
echo export LOCAL_PYTHON=/db/home/shenwei/local/app/python >> ~/.bashrc
echo export PYTHONPATH=\$LOCAL_PYTHON:\$LOCAL_PYTHON/lib/python2.7/site-packages:\$PYTHONPATH >> ~/.bashrc
echo export PATH=\$LOCAL_PYTHON/bin:\$PATH >> ~/.bashrc
. ~/.bashrc
As the title says, I lost my code, which cost me more than half a day.
All possible rescue solutions failed: mv was used, which overwrote the source file. It's incurable.
Some common practices summarized from other great tools.
last update: 2014-09-08
Name
The name should be descriptive and easy to remember.
Options
Use an option parser library to parse them. Long-named options are recommended to make them readable.
Some necessary options
Additional options
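As a sketch of the option conventions above, here is how a tool might declare long-named options with Python's argparse (the tool name, options and defaults are hypothetical examples, not a real program):

```python
import argparse

# Every short option gets a descriptive long name; -h/--help is added
# automatically, and --version is one extra line.
parser = argparse.ArgumentParser(prog="mytool",
                                 description="demo of descriptive long options")
parser.add_argument("--version", action="version", version="mytool 0.1")
parser.add_argument("-o", "--out-file", default="-",
                    help="output file, '-' for stdout [default: -]")
parser.add_argument("-t", "--threads", type=int, default=1,
                    help="number of threads [default: 1]")

args = parser.parse_args(["--out-file", "result.txt", "-t", "8"])
print(args.out_file, args.threads)  # result.txt 8
```

On the command line, `mytool --out-file result.txt` reads much better in a shared pipeline script than a bare `-o result.txt`.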
This is a follow-up to the previous post, "Using the Job Distribution System Gearman to Speed up Your Computing Tasks".
Yesterday I was handling some parallelizable, CPU-intensive computing tasks: each job contains sub-jobs with ordering dependencies, some of which can run in parallel, while the following sub-jobs must wait until all of them have finished. The workflow looks like this:
|-> subjob2 |
JOB1: subjob1 -> |-> subjob3 | -> subjob5 -> subjob6
|-> subjob4 |
|-> subjob2 |
JOB2: subjob1 -> |-> subjob3 | -> subjob5 -> subjob6
|-> subjob4 |
|-> subjob2 |
JOB3: subjob1 -> |-> subjob3 | -> subjob5 -> subjob6
|-> subjob4 |
Tired of long blocks of shell commands, I wrote a very practical script, crun, to execute partially parallelizable chained tasks, i.e., to run a single JOB.
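The fan-out/fan-in pattern in the diagram above can be sketched in a few lines of Python. This is not the crun implementation, just an illustration of the dependency logic; the subjob function is a placeholder for the real CPU-intensive commands, and a thread-based pool is used here for simplicity:

```python
from multiprocessing.dummy import Pool  # thread-based; same API as multiprocessing.Pool

def subjob(n):
    # Placeholder for a real CPU-intensive sub-task
    return "subjob%d done" % n

def run_job():
    results = [subjob(1)]                 # subjob1 must finish first
    pool = Pool(processes=3)
    parallel = pool.map(subjob, [2, 3, 4])  # subjob2-4 run in parallel
    pool.close()
    pool.join()                           # barrier: wait for all three
    results.extend(parallel)
    results.append(subjob(5))             # then the chain continues
    results.append(subjob(6))
    return results

if __name__ == "__main__":
    print(run_job())
```

The key point is the barrier: `pool.join()` guarantees that subjob5 only starts after subjob2, subjob3 and subjob4 have all completed, which is exactly the constraint in the workflow above.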