Illustration of FASTA/Q file parsing strategies. (A) and (C) The main thread parses one sequence, waits (blocked) for it to be processed, and then parses the next one. (B) A sequence-parsing thread continuously (non-blocking) parses sequences and passes them to the main thread. The width of the rectangles representing sequence parsing and sequence processing is proportional to running time. The sequence-parsing speeds in (A) and (B) are the same, and both are much slower than that in (C). The sequence-processing speeds are identical in (A), (B) and (C). In (B), chunks of sequences in the buffer can be processed in parallel, but most of the time the main thread has to manipulate the sequences serially.
Here is some experience I have gained.
```
src         # source code
docs        # documents
tests       # tests
benchmarks  # benchmark results
examples    # examples
LICENSE
README.md
```
I wrote a bioinformatics package in Go, in which I used a function to check whether a letter (byte) is a valid DNA/RNA/protein letter. The easy way is to store the letters of the alphabet in a map and check for the existence of a letter. However, when I used go tool pprof to profile the performance, I found that the hash functions of the map cost much time (see figure below).
Then I found a faster way: storing the letters in a slice; in detail, saving a letter (byte) at position int(letter) of the slice. To check a letter, just check the value of slice[int(letter)]; a non-zero value means the letter is in the alphabet.
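A minimal sketch of the slice-lookup idea (the alphabet and function names here are illustrative, not the package's actual API):

```go
package main

import "fmt"

// buildTable marks every valid letter in a 256-entry lookup table,
// indexed directly by the byte value. Checking a letter is then a
// single slice index, with no hashing involved.
func buildTable(alphabet []byte) []byte {
	table := make([]byte, 256)
	for _, c := range alphabet {
		table[int(c)] = 1
	}
	return table
}

func main() {
	// DNA letters, upper and lower case (illustrative alphabet).
	table := buildTable([]byte("ACGTacgt"))

	fmt.Println(table[int(byte('A'))] > 0) // true: valid letter
	fmt.Println(table[int(byte('X'))] > 0) // false: invalid letter
}
```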
[update at 2016-06-02] Two switch versions were also tested. They were faster than the map version, but still slower than the slice version. Besides, the switch version was affected by the number of case clauses, i.e. the bigger the alphabet size is, the slower it runs.
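For comparison, a switch version might look like this (a sketch; the actual benchmarked code may differ). Each additional case value adds a comparison, which is why larger alphabets make this approach slower:

```go
package main

import "fmt"

// isDNALetter reports whether c is a DNA letter, using a switch.
// Unlike the slice lookup, the cost grows with the alphabet size.
func isDNALetter(c byte) bool {
	switch c {
	case 'A', 'C', 'G', 'T', 'a', 'c', 'g', 't':
		return true
	}
	return false
}

func main() {
	fmt.Println(isDNALetter('G')) // true
	fmt.Println(isDNALetter('N')) // false
}
```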
See the benchmark result:
[update] I wrote a tool to do the same job, and it's even more powerful.
This post presents a script for fetching taxon information by species name or taxid.
Take-home messages:
1) Use a cache to avoid repeated searches.
2) The object returned by Entrez.read(Entrez.efetch()) can be treated as a list/dict, but it cannot be correctly pickled, and using JSON does not work either. The right way is to cache the XML text.
```python
search = Entrez.efetch(id=taxid, db="taxonomy", retmode="xml")
# data = Entrez.read(search)  # read and parse XML
data_xml = search.read()
data = list(Entrez.parse(StringIO(data_xml)))
```
3) The pickle file was fragile. A flag file can be used to detect whether the data was dumped correctly.
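The flag-file idea can be sketched as follows (file names are made up for illustration): write the cache first, create an empty flag file only after the dump succeeds, and on loading treat the cache as valid only if the flag exists.

```python
import os
import pickle

CACHE_FILE = "taxon_cache.pkl"   # illustrative file names
FLAG_FILE = CACHE_FILE + ".ok"

def dump_cache(data):
    # Remove the flag first, so a crash mid-dump leaves no stale flag.
    if os.path.exists(FLAG_FILE):
        os.remove(FLAG_FILE)
    with open(CACHE_FILE, "wb") as fh:
        pickle.dump(data, fh)
    # The flag is created only after the dump completed successfully.
    open(FLAG_FILE, "w").close()

def load_cache():
    # Without the flag, the cache may be truncated or corrupt.
    if not os.path.exists(FLAG_FILE):
        return None
    with open(CACHE_FILE, "rb") as fh:
        return pickle.load(fh)
```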
4) Use multiple threads to accelerate fetching.
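Multi-threaded fetching can be sketched with concurrent.futures; fetch_taxon here is a stand-in for the real Entrez call, since the network I/O is what the threads overlap:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_taxon(taxid):
    # Stand-in for the actual Entrez.efetch call; in the real script
    # this would perform a (slow) network request.
    return "taxon-%s" % taxid

taxids = ["9606", "10090", "7227"]
# pool.map preserves input order, so results line up with taxids.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(zip(taxids, pool.map(fetch_taxon, taxids)))
```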
It was too heavy for my VPS, even with a cache plugin. Static pages are much faster!
See the official docs: http://gohugo.io/