Convenience function that performs all necessary preparation: it downloads the names, nodes and accession2taxid data from NCBI and preprocesses them into a SQLite database for downstream use.
Usage
prepareDatabase(
sqlFile = "nameNode.sqlite",
tmpDir = ".",
getAccessions = TRUE,
vocal = TRUE,
...
)
Arguments
- sqlFile
character string giving the file location to store the SQLite database
- tmpDir
location for storing the downloaded files from NCBI. (Note that it may be useful to store these somewhere convenient to avoid redownloading)
- getAccessions
if TRUE download the very large accession2taxid files necessary to convert accessions to taxonomic IDs
- vocal
if TRUE output messages describing progress
- ...
Arguments passed on to getNamesAndNodes, getAccession2taxid, read.accession2taxid
url
the url where taxdump.tar.gz is located
fileNames
the filenames desired from the tar.gz file
protocol
the protocol to be used for downloading. Probably either 'http' or 'ftp'. Overridden if url is provided directly
resume
if TRUE attempt to resume downloading an interrupted file without starting over from the beginning
baseUrl
the url of the directory where accession2taxid.gz files are located
types
the types of accession2taxid.gz files desired, where type is the prefix of xxx.accession2taxid.gz. The default is to download all nucl_ accessions. For protein accessions, try types=c('prot').
extraSqlCommand
for advanced use. A string giving a command to be called on the SQLite database before loading data. For example, "pragma temp_store = 2;" keeps all SQLite temp files in memory. Don't do this unless you have a lot (>100 Gb) of RAM
indexTaxa
if TRUE add an index for taxa ID. This would only be necessary if you want to look up accessions by taxa ID, e.g. with getAccessions
overwrite
if TRUE, delete the accessionTaxa table in the database if present and regenerate it
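Examples
The following is a minimal sketch of two common ways to call prepareDatabase, using only the arguments described above. The wrapper names buildNamesAndNodesDb and buildProteinDb and the file name "protein.sqlite" are hypothetical and chosen for illustration; they are not part of the package. The calls are wrapped in functions so that nothing is downloaded until one of them is invoked.

```r
# Sketch only: both helpers are hypothetical wrappers, not package functions.
# Defining them downloads nothing; the download starts when one is called.

# Build a database from names and nodes only, skipping the very large
# accession2taxid downloads.
buildNamesAndNodesDb <- function(sqlFile = "nameNode.sqlite", tmpDir = ".") {
  taxonomizr::prepareDatabase(
    sqlFile = sqlFile,
    tmpDir = tmpDir,
    getAccessions = FALSE,
    vocal = TRUE
  )
}

# Build a database of protein accessions. `types` is not a named argument of
# prepareDatabase itself; it is passed on through `...`.
buildProteinDb <- function(sqlFile = "protein.sqlite", tmpDir = ".") {
  taxonomizr::prepareDatabase(
    sqlFile = sqlFile,
    tmpDir = tmpDir,
    getAccessions = TRUE,
    types = c("prot")
  )
}
```

Calling buildNamesAndNodesDb() is the lighter option when only taxonomy lookups by taxonomic ID are needed. buildProteinDb() triggers the large accession2taxid download, so pointing tmpDir at a persistent location may help avoid redownloading.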