[Update] I have since written a tool that does the same job and is even more powerful:
gTaxon - a fast, cross-platform NCBI taxonomy data querying tool, with a command-line client and a REST API server for both local and remote use. http://github.com/shenwei356/gtaxon
This post presents a script for fetching taxon information by species name or taxid.
Take-home messages:
1) Use a cache to avoid repeating the same searches.
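A minimal sketch of this caching idea, assuming an in-memory dict keyed by taxid; the function name fetch_taxon_xml and the email address are illustrative, not part of the original script:

from Bio import Entrez

Entrez.email = "you@example.com"  # NCBI asks for a contact address

cache = {}  # taxid -> raw XML text returned by efetch

def fetch_taxon_xml(taxid):
    """Return the taxonomy XML for a taxid, querying NCBI only on a cache miss."""
    if taxid not in cache:
        handle = Entrez.efetch(id=taxid, db="taxonomy", retmode="xml")
        cache[taxid] = handle.read()
        handle.close()
    return cache[taxid]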
2) The object returned by Entrez.read(Entrez.efetch(...)) can be treated as a list, but it cannot be pickled correctly, and serializing it as JSON does not work either. The right way is to cache the raw XML text:
from io import StringIO
from Bio import Entrez

search = Entrez.efetch(id=taxid, db="taxonomy", retmode="xml")
# data = Entrez.read(search)  # the parsed object cannot be pickled reliably
# Read the raw XML text (this is what gets cached), then parse it.
data_xml = search.read()
data = list(Entrez.parse(StringIO(data_xml)))
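Continuing the snippet above, a minimal sketch of persisting the raw XML and re-parsing it later; the file name is illustrative, and recent Biopython versions prefer a binary handle for Entrez.read:

# Cache: write the raw XML text to disk once.
with open("taxon_9606.xml", "w") as fh:   # hypothetical cache file name
    fh.write(data_xml)

# Reuse: reload the cached XML and parse it again instead of re-fetching.
with open("taxon_9606.xml", "rb") as fh:  # binary mode for Entrez.read
    records = Entrez.read(fh)
print(records[0]["ScientificName"], records[0]["Lineage"])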
3) The pickle file is fragile (an interrupted run can leave a truncated dump). A flag file can be used to detect whether the data was dumped correctly, as sketched below.
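A minimal sketch of the flag-file idea; the file names and helper functions are hypothetical:

import os
import pickle

CACHE_FILE = "taxon_cache.pkl"   # hypothetical pickle dump of the cache
FLAG_FILE = CACHE_FILE + ".ok"   # written only after a successful dump

def dump_cache(cache):
    """Dump the cache, then write the flag file to mark the dump as complete."""
    if os.path.exists(FLAG_FILE):
        os.remove(FLAG_FILE)
    with open(CACHE_FILE, "wb") as fh:
        pickle.dump(cache, fh)
    open(FLAG_FILE, "w").close()

def load_cache():
    """Load the cache only if the flag file confirms the last dump finished."""
    if os.path.exists(CACHE_FILE) and os.path.exists(FLAG_FILE):
        with open(CACHE_FILE, "rb") as fh:
            return pickle.load(fh)
    return {}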
4) Use multiple threads to accelerate fetching, as sketched below.
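A minimal sketch of multi-threaded fetching with concurrent.futures; the worker function and taxids are illustrative. NCBI limits the request rate (roughly three requests per second without an API key), so the pool should stay small:

from concurrent.futures import ThreadPoolExecutor
from Bio import Entrez

Entrez.email = "you@example.com"  # NCBI asks for a contact address

def fetch_xml(taxid):
    """Fetch the taxonomy XML for a single taxid."""
    handle = Entrez.efetch(id=taxid, db="taxonomy", retmode="xml")
    xml = handle.read()
    handle.close()
    return taxid, xml

taxids = ["9606", "10090", "562"]  # example taxids: human, mouse, E. coli
with ThreadPoolExecutor(max_workers=3) as pool:
    results = dict(pool.map(fetch_xml, taxids))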
The full source is available: taxon_fetch.py