Map is not the fastest in Go

golang

I wrote a bioinformatics package in Go that uses a function to check whether a letter (a byte) is a valid DNA/RNA/protein letter.

The easy way is to store the letters of the alphabet in a map and check whether a letter exists in it. However, when I profiled the code with go tool pprof, I found that the map's hash functions (mapaccess2, memhash8, memhash) took a large share of the time (see figure below).

Then I found a faster way: storing the letters in a slice. Specifically, each letter (a byte) is saved at position int(letter) of the slice. To check a letter, just check the value of slice[int(letter)]; a non-zero value means the letter is valid.
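A minimal sketch of the lookup-table idea, written in Python purely for illustration. The package itself is in Go, so the timings in this post do not apply to this snippet. The alphabet and the names build_table and is_valid are hypothetical, not taken from the package.

```python
# Hypothetical alphabet, for illustration only.
DNA_LETTERS = b"ACGTNacgtn"


def build_table(letters: bytes) -> bytearray:
    """Return a 256-entry table where table[b] == b for every valid byte b."""
    table = bytearray(256)  # all zeros: every byte starts out as invalid
    for b in letters:
        table[b] = b        # store the letter at the index given by its own value
    return table


def is_valid(table: bytearray, letter: int) -> bool:
    """A non-zero entry means the byte belongs to the alphabet."""
    return table[letter] != 0


DNA_TABLE = build_table(DNA_LETTERS)
```

A check is then a single index operation, e.g. is_valid(DNA_TABLE, ord("A")), with no hashing involved. Note that this scheme cannot mark the zero byte as valid, which is fine for sequence alphabets.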

See the benchmark result:

BenchmarkCheckLetterWithMap 2000000000 0.20 ns/op

BenchmarkCheckLetterWithSlice 2000000000 0.01 ns/op

Read more →

Fetch taxon information by species name or taxid

python bioinf

[updates] I wrote a tool to do the same job and it’s even more powerful.

gTaxon - a fast cross-platform NCBI taxonomy data querying tool, with cmd client and REST API server for both local and remote servers. http://github.com/shenwei356/gtaxon

This post presents a script for fetching taxon information by species name or taxid.

Take-home messages:

1). Use a cache to avoid repeated searches.

2). The object returned by Entrez.read(Entrez.efetch()) can be treated as a list, but it cannot be pickled correctly. Using JSON does not work either. The right way is to cache the XML text.

search = Entrez.efetch(id=taxid, db="taxonomy", retmode="xml")
# data = Entrez.read(search)
##  read and parse xml
data_xml = search.read()
data = list(Entrez.parse(StringIO(data_xml)))

3). Pickle files are fragile. A flag file can be used to detect whether the data was dumped correctly.

4). Use multiple threads to speed up fetching.
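Points 1, 3 and 4 can be sketched roughly as follows. This is not the script from the post: the function names, the ".ok" flag-file suffix and the fetch_xml callback are assumptions made for illustration. In practice fetch_xml would wrap the Entrez.efetch call shown above and return the XML text.

```python
import os
import pickle
from concurrent.futures import ThreadPoolExecutor


def dump_cache(cache, path):
    """Pickle the cache, then create a flag file marking the dump as complete."""
    flag = path + ".ok"          # hypothetical naming convention
    if os.path.exists(flag):
        os.remove(flag)          # the old dump is about to be overwritten
    with open(path, "wb") as fh:
        pickle.dump(cache, fh)
    with open(flag, "w"):
        pass                     # an empty file is enough as a marker


def load_cache(path):
    """Return the cached dict, or an empty one if the dump is missing or unflagged."""
    if not (os.path.exists(path) and os.path.exists(path + ".ok")):
        return {}
    with open(path, "rb") as fh:
        return pickle.load(fh)


def fetch_all(taxids, fetch_xml, cache, workers=4):
    """Fetch XML text for uncached taxids in parallel and store it in the cache."""
    # Deduplicate while keeping order, and skip anything already cached.
    missing = [t for t in dict.fromkeys(taxids) if t not in cache]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for taxid, xml_text in zip(missing, pool.map(fetch_xml, missing)):
            cache[taxid] = xml_text
    return [cache[t] for t in taxids]
```

If the program dies while pickling, the flag file is never written, so the next run ignores the partial dump and starts from an empty cache. Keeping the worker count small avoids hammering the remote server.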

Read more →

Data science learning path on python

python

Basic

  1. NumPy - NumPy is the fundamental package for scientific computing with Python
  2. pandas - pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
  3. matplotlib - matplotlib is a python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms
  4. seaborn - Statistical data visualization using matplotlib

Further

  1. Statsmodels - Statsmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests.
  2. scikit-learn - Machine Learning in Python
  3. theano - Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently

Practice

Read more →

Install Python Numba

python

Official site: http://numba.pydata.org/

Easiest way

conda install numba

Manually

Read more →

Make a Linux service auto start

linux

Steps

  1. Create a service script in /etc/init.d/ and make it executable with chmod a+x.
  2. Test it: sudo service xxxxx start
  3. Finally, enable it at boot: sudo systemctl enable xxxxx

Read more →

Migrate from wordpress to hugo

network

Why abandon wordpress

It’s too heavy for my VPS, even with a cache plugin. Static pages are much faster!

Why Hugo

See official doc: http://gohugo.io/

  • Hugo is written in Go. It’s just a tiny executable binary, available for most popular operating systems.
  • All pages are written in Markdown.
  • Hugo includes a super fast web server that can monitor file changes and sync the content in almost real time (~200 ms for 100 pages).

Read more →

Different implementations of nested dict in Python

python

Different implementations of nested dict in Python.

Code is also available on gist
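The excerpt does not list which implementations the post compares, so the following is only a sketch of two common approaches, not necessarily the ones in the gist: a recursive defaultdict and a dict subclass that defines __missing__. The names nested_dict and NestedDict are illustrative.

```python
from collections import defaultdict


def nested_dict():
    """Recursive defaultdict: accessing a missing key creates another level."""
    return defaultdict(nested_dict)


class NestedDict(dict):
    """dict subclass that creates a child NestedDict for any missing key."""

    def __missing__(self, key):
        child = self[key] = NestedDict()
        return child


# Both allow assignment at arbitrary depth without creating each level by hand.
a = nested_dict()
a["x"]["y"]["z"] = 1

b = NestedDict()
b["x"]["y"]["z"] = 1
```

In both variants, merely reading a missing key creates an empty entry, so membership should be tested with `in` rather than by indexing.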

Read more →

How to write obscure, hard-to-read Perl code

perl

How to write obscure, hard-to-read Perl code

Read more →

DO NOT use infinality to render fonts!

linux

DO NOT use [infinality][1] to render fonts!

DO NOT use [infinality][1] to render fonts!

DO NOT use [infinality][1] to render fonts!

Read more →

Downloading files at home: 4M ADSL + router + portable hard drive + Xunlei remote download + Xunlei Kuainiao

network

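I needed to download several hundred GB of files, but all I had at home was a narrow China Telecom 4M ADSL line with a maximum speed of only 540 kb/s.

I bought a NetGear WNDR 4300 router, flashed it with the OpenWRT firmware built by "明月永在", and mounted a 1 TB portable hard drive over USB.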

Read more →