Adding and Updating Custom Databases
You can add or update custom reference databases using Toolchest. The interface is the same as running any tool, except that inputs is replaced by database_path.
Adding and updating a database both work as Async Runs. The add_database and update_database functions return once the data has been transferred, with:
- a unique ID that you can use with .get_status() to track status
- the new database_name and database_version
Adding a Custom Database
You can add a custom database for any tool that exists in Toolchest.
The custom database must already be generated for the specific tool. For example, a custom database for Kraken 2 must be a database generated by kraken2-build rather than a collection of FASTQ files.
Arguments:
- database_path: Path or list of paths (local or S3) to file(s) containing the custom database. This can also be a single path to a directory; see Preserving File Structure.
- tool: Toolchest tool class with which you use the database (e.g. toolchest.tools.DiamondBlastp, toolchest.tools.Kraken2).
- database_name: Name of the new custom database.
- database_primary_name: If you are uploading multiple database files and your tool takes a certain file or prefix, instead of the whole directory, as its command-line argument, use this to specify the name of that file or prefix. See Using database_primary_name for more details.
The return value is a Toolchest.api.Output object, containing:
- database_name
- database_version
- run_id
Here's an example of adding a custom database for Kraken 2 using an S3 prefix URI.
import toolchest_client as tc

tc.set_key("YOUR_KEY")
tc.add_database(
    # arbitrary-directory is a prefix containing files like arbitrary-directory/db.files.1
    database_path="s3://example-s3-bucket/arbitrary-directory/",
    tool=tc.tools.Kraken2,
    database_name="example-db-name",
    database_primary_name="db.files",
)
When will my database be ready to use?
It takes time for the database to transfer to our system. Once the add or update run has a status of ready_to_transfer_to_client, the database is ready to use. You can check the status with .get_status() and the ID returned by the call.
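Waiting for that status can be sketched as a small polling loop. This is a minimal illustration, not part of the Toolchest client: wait_until_ready and its parameters are hypothetical names, and get_status stands for any zero-argument callable you supply that returns the current status string (for example, a thin wrapper around the .get_status() call for your run ID).

```python
import time
from typing import Callable

# Status named in the text above as the signal that the database is usable.
READY_STATUS = "ready_to_transfer_to_client"


def wait_until_ready(
    get_status: Callable[[], str],
    poll_seconds: float = 30.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll get_status() until it returns READY_STATUS.

    Returns True once the ready status is seen, or False if max_polls
    checks pass without seeing it. Hypothetical helper for illustration.
    """
    for attempt in range(max_polls):
        if get_status() == READY_STATUS:
            return True
        # Avoid a pointless sleep after the final check.
        if attempt < max_polls - 1:
            sleep(poll_seconds)
    return False
```

The sketch only looks for the ready status; it does not distinguish a failed run from one that is still in progress, so a real script would likely also want to stop early on a failure status.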
Updating a Custom Database
You can create a new custom version of any database for any tool in Toolchest. This is very similar to adding a custom database, except that the database_name must already exist.
Arguments:
- database_path: Path or list of paths (local or S3) to file(s) containing the custom database. This can also be a single path to a directory; see Preserving File Structure.
- tool: Toolchest tool class with which you use the database (e.g. toolchest.tools.DiamondBlastp, toolchest.tools.Kraken2).
- database_name: Name of the existing database.
- database_primary_name: If you are uploading multiple database files and your tool takes a certain file or prefix, instead of the whole directory, as its command-line argument, use this to specify the name of that file or prefix. See Using database_primary_name for more details.
database_primary_name is optional for update_database. If omitted, the new version uses the same database_primary_name as the previous version of the custom database.
Returns a Toolchest.api.Output object, containing:
- database_name
- database_version (auto-incremented from the latest version)
- run_id
Here's an example update of the standard Kraken 2 database:
import toolchest_client as tc

tc.set_key("YOUR_KEY")
tc.update_database(
    database_path=[
        "s3://example-s3-bucket-for-databases/my_db_directory/db.files.1",
        "s3://example-s3-bucket-for-databases/my_db_directory/db.files.2",
    ],
    tool=tc.tools.Kraken2,
    database_name="standard",
    database_primary_name="db.files",
)
Preserving File Structure
You can use a directory as your database_path. For both local and S3 paths, using a directory places all files in their implied subdirectories.
If a list of paths is used instead, file structure is not preserved. All files are placed into the same directory.
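The difference between the two input styles can be illustrated with a short sketch. It only models the behavior described above and is not Toolchest's implementation; the function names and the example file names are hypothetical.

```python
from pathlib import PurePosixPath


def layout_from_directory(directory, files):
    """Directory input: each file keeps its path relative to the directory."""
    root = PurePosixPath(directory)
    return sorted(str(PurePosixPath(f).relative_to(root)) for f in files)


def layout_from_list(files):
    """List input: only file names are kept, so all files share one directory."""
    return sorted(PurePosixPath(f).name for f in files)


# Hypothetical database files, two of them inside a subdirectory.
files = [
    "my_db/taxonomy/nodes.dmp",
    "my_db/taxonomy/names.dmp",
    "my_db/hash.k2d",
]

print(layout_from_directory("my_db", files))
# ['hash.k2d', 'taxonomy/names.dmp', 'taxonomy/nodes.dmp']
print(layout_from_list(files))
# ['hash.k2d', 'names.dmp', 'nodes.dmp']
```

If your tool expects files inside specific subdirectories, this suggests passing the directory rather than a list. The sketch does not model what happens when two listed files share a name, since the text above does not say how that case is handled.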
Using database_primary_name
Let's say you have a tool that you would run from the command line like this:
some_command --database databases/my_db
From the command alone, it is ambiguous what databases/my_db actually is. my_db could represent a directory, a single file, or a prefix shared by multiple files, depending on the tool.
To pass that context to Toolchest, use database_primary_name. If --database refers to a path that isn't a directory, use database_path for the directory and database_primary_name for the filename or prefix.
Here's an example call for Bowtie 2, where my_db is a prefix for multiple files within databases/:
import toolchest_client as tc

tc.add_database(
    tool=tc.tools.Bowtie2,
    database_name="example_name_for_my_new_database",
    database_path="databases/",
    database_primary_name="my_db",
)
Use cases
- If your database is a directory itself, such as databases/my_db/{various database files}, then use database_primary_name=None:
tc.add_database(
    database_path="databases/my_db",
    database_primary_name=None,
    ...
)
- If your database is a single database file, such as databases/my_db, or a prefix, such as:
databases/
├── my_db.file1
├── my_db.file2
└── my_db.file3
then use database_primary_name="my_db":
tc.add_database(
    database_path="databases/",
    database_primary_name="my_db",
    ...
)
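For local databases, the two use cases above can be told apart by checking whether the command-line value is a directory. The helper below is a minimal sketch of that rule; split_database_argument is a hypothetical name, not a Toolchest function, and it assumes a local POSIX-style path rather than an S3 URI.

```python
import os


def split_database_argument(cli_path):
    """Return (database_path, database_primary_name) for a local --database value.

    Hypothetical helper: it applies the rule from the use cases above to the
    path you would otherwise pass to the tool on the command line.
    """
    cleaned = cli_path.rstrip("/")
    if os.path.isdir(cleaned):
        # The value is the database directory itself, so no primary name is needed.
        return cleaned, None
    # Otherwise the value is a file or a prefix: upload the parent directory
    # and pass the last path component as the primary name.
    parent, name = os.path.split(cleaned)
    return (parent or ".") + "/", name
```

The two returned values can then be passed as database_path and database_primary_name to add_database. With the layouts shown above, a databases/my_db directory yields ("databases/my_db", None), while a my_db file or prefix inside databases/ yields ("databases/", "my_db").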
Behavior differences for update_database
For update_database, database_primary_name is optional. If omitted, the new version uses the same value as the previous version of the custom database.
If database_primary_name needs to be changed to None for future versions of the database, contact Toolchest to update the primary name.
Security
By default, the privacy setting of all custom databases is the equivalent of "unlisted". This means that if anybody else knows the name and version of your database, they can access it.
We support fully private custom databases as a part of the managed-hosted and on-prem versions of Toolchest.