Fetch taxon information by species name or taxid

[Update] I have since written a tool that does the same job and is more powerful.

gTaxon - a fast cross-platform NCBI taxonomy data querying tool, with a command-line client and a REST API server, for both local and remote use. http://github.com/shenwei356/gtaxon

This post presents a script for fetching taxon information by species name or taxid.

Take-home messages:

1). Use a cache to avoid repeating searches that have already been done.

2). The object returned by Entrez.read(Entrez.efetch()) can be treated as a list, but it cannot be pickled correctly. Serializing it as JSON does not work either. The right way is to cache the XML text and parse it when needed:

search = Entrez.efetch(id=taxid, db="taxonomy", retmode="xml")
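# data = Entrez.read(search)  # the parsed result cannot be pickled correctly
# Instead, read the raw XML text (this is what gets cached), then parse it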
data_xml = search.read()
data = list(Entrez.parse(StringIO(data_xml)))

3). Pickle files are fragile. A flag file can be used to detect whether the data was dumped correctly.

4). Use multiple threads to speed up fetching.
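
Point 4 could be sketched roughly as follows with the standard library's thread pool. This is only an illustration, not the code in taxon_fetch.py: the names fetch_many, fetch_one and max_workers are hypothetical, and fetch_one stands for any function that takes a taxid and returns its record (for example, one that wraps the Entrez.efetch call shown above). NCBI limits request rates, so a small number of workers is probably the safer choice.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_many(taxids, fetch_one, max_workers=4):
    """Fetch several taxids concurrently and return {taxid: result}.

    fetch_one is a hypothetical callable: taxid -> record (e.g. raw XML).
    """
    taxids = list(taxids)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with taxids.
        results = list(pool.map(fetch_one, taxids))
    return dict(zip(taxids, results))
```

Threads fit here because the work is dominated by waiting on the network rather than by computation.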

Source is available: taxon_fetch.py
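
For points 1 to 3, here is a minimal sketch of caching the raw XML text and guarding the pickle file with a flag file. It is an assumption about how this could look, not the actual implementation in taxon_fetch.py: the function names, the ".ok" flag suffix and the fetch_xml parameter are all made up for illustration.

```python
import os
import pickle


def save_cache(cache, path):
    """Dump the cache, then write a flag file marking the dump as complete."""
    flag = path + ".ok"  # hypothetical naming convention for the flag file
    if os.path.exists(flag):
        os.remove(flag)  # the dump is about to be rewritten, so un-mark it
    with open(path, "wb") as fh:
        pickle.dump(cache, fh)
    with open(flag, "w") as fh:
        fh.write("ok")  # only reached if pickle.dump finished


def load_cache(path):
    """Return the cached dict, or an empty dict if the dump looks incomplete."""
    flag = path + ".ok"
    if not (os.path.exists(path) and os.path.exists(flag)):
        return {}
    with open(path, "rb") as fh:
        return pickle.load(fh)


def fetch_xml_cached(taxid, cache, fetch_xml):
    """Return the raw XML for taxid, calling fetch_xml only on a cache miss.

    fetch_xml is a hypothetical callable, e.g. one that returns
    Entrez.efetch(id=taxid, db="taxonomy", retmode="xml").read()
    """
    key = str(taxid)
    if key not in cache:
        cache[key] = fetch_xml(key)
    return cache[key]
```

Because the cache holds plain XML text rather than parsed Entrez objects, it pickles without trouble, and a missing flag file simply means the cache is rebuilt from scratch.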