
Adding and Updating Custom Databases

You can add or update custom reference databases using Toolchest – the interface is the same as running any tool, except inputs is replaced by database_path.

Adding and updating databases function as Async Runs. Once the data has been transferred, the add_database and update_database functions return:

  • a unique ID that you can use with .get_status() to track status
  • the new database_name and database_version

Adding a Custom Database

You can add a custom database for any tool that exists in Toolchest.

The custom database must already be built in the format the tool expects. For example, a custom database for Kraken 2 must be the output of kraken2-build; a raw collection of FASTQ files will not work.

Arguments:

  • database_path: Path or list of paths (local or S3) to file(s) containing the custom database. This can also be a single path to a directory; see Preserving File Structure.
  • tool: Toolchest tool class with which you use the database (e.g. toolchest.tools.DiamondBlastp, toolchest.tools.Kraken2).
  • database_name: Name of the new custom database.
  • database_primary_name: If you are uploading multiple database files and your tool takes a specific file or prefix, rather than the whole directory, as its command-line argument, use this to specify the name of that file or prefix. See Using database_primary_name for more details.

Returns a Toolchest.api.Output object, containing:

  • database_name
  • database_version
  • run_id

Here's an example of adding a custom database for Kraken 2 using an S3 prefix URI.

import toolchest_client as tc

tc.set_key("YOUR_KEY")

tc.add_database(
  # arbitrary-directory is a prefix containing files like arbitrary-directory/db.files.1
  database_path="s3://example-s3-bucket/arbitrary-directory/",
  tool=tc.tools.Kraken2,
  database_name="example-db-name",
  database_primary_name="db.files",
)

When will my database be ready to use?

It takes time for the database to transfer to our system. Once the add / update run has a status of ready_to_transfer_to_client, you're good to go. You can check the status with these steps.
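
As a rough sketch of that wait, the helper below polls a status callable until it reports ready_to_transfer_to_client. The helper name wait_until_ready, the polling interval, and the timeout are hypothetical and not part of Toolchest; the callable is injected so the sketch runs without a key or network access.

```python
import time
from typing import Callable

# Status named in the text above as the point where the database is usable.
READY_STATUS = "ready_to_transfer_to_client"


def wait_until_ready(
    get_status: Callable[[], str],
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
) -> bool:
    """Poll get_status() until it returns READY_STATUS.

    Returns True once the ready status is seen, or False if the timeout
    elapses first. Hypothetical helper; failure statuses are not handled.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        if get_status() == READY_STATUS:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)
```

With the real client, the callable might look like lambda: tc.get_status(run_id=output.run_id), assuming get_status accepts the run ID as run_id and that output is the object returned by add_database or update_database.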

Updating a Custom Database

You can create a new custom version for any tool and database in Toolchest. This is very similar to adding a custom database, except that database_name must refer to a database that already exists.

Arguments:

  • database_path: Path or list of paths (local or S3) to file(s) containing the custom database. This can also be a single path to a directory; see Preserving File Structure.
  • tool: Toolchest tool class with which you use the database (e.g. toolchest.tools.DiamondBlastp, toolchest.tools.Kraken2).
  • database_name: Name of the existing database.
  • database_primary_name: If you are uploading multiple database files and your tool takes a specific file or prefix, rather than the whole directory, as its command-line argument, use this to specify the name of that file or prefix. See Using database_primary_name for more details.

database_primary_name is optional for update_database

If omitted, it assumes the same database_primary_name as the previous version of the custom database.

Returns a Toolchest.api.Output object, containing:

  • database_name
  • database_version (auto-incremented from the latest version)
  • run_id

Here's an example update of the standard Kraken 2 database:

import toolchest_client as tc

tc.set_key("YOUR_KEY")

tc.update_database(
  database_path=[
    "s3://example-s3-bucket-for-databases/my_db_directory/db.files.1",
    "s3://example-s3-bucket-for-databases/my_db_directory/db.files.2",
  ],
  tool=tc.tools.Kraken2,
  database_name="standard",
  database_primary_name="db.files",
)

Preserving File Structure

You can use a directory as your database_path. For both local and S3 paths, the files then keep their locations relative to that directory, including any subdirectories.

If a list of paths is used instead, file structure will not be preserved. All files will be placed into the same directory.
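
The difference can be illustrated with a small sketch. It only models the two behaviors described above and is not Toolchest's implementation; the function names and file names are made up for the example.

```python
from pathlib import PurePosixPath
from typing import List


def layout_from_directory(directory: str, files: List[str]) -> List[str]:
    """Model a directory upload: keep each path relative to the directory."""
    root = PurePosixPath(directory)
    return sorted(str(PurePosixPath(f).relative_to(root)) for f in files)


def layout_from_list(files: List[str]) -> List[str]:
    """Model a list-of-paths upload: keep only file names (flat layout)."""
    return sorted(PurePosixPath(f).name for f in files)


# Hypothetical database files, one of them in a subdirectory.
files = ["my_db/index.bin", "my_db/taxonomy/nodes.txt"]

print(layout_from_directory("my_db", files))
print(layout_from_list(files))
```

In this model, the directory form keeps taxonomy/nodes.txt under its subdirectory, while the list form places nodes.txt next to index.bin.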

Using database_primary_name

Let's say you have a tool that you would run from the command line like this:

some_command --database databases/my_db

For custom databases, we can't be sure what databases/my_db actually is. my_db could be a directory, a single file, or a prefix shared by multiple files; it depends on the tool.

To pass that context to Toolchest, use database_primary_name. If --database refers to a path that isn't a directory, use database_path for the directory and database_primary_name for the filename or prefix.

Here's an example call for Bowtie 2, where my_db is a prefix for multiple files within databases/:

import toolchest_client as tc
tc.add_database(
    tool=tc.tools.Bowtie2,
    database_name="example_name_for_my_new_database",
    database_path="databases/",
    database_primary_name="my_db",
)

Use cases

  1. If your database is a directory itself, such as databases/my_db/{various database files}, then use database_primary_name=None:
    tc.add_database(
        database_path="databases/my_db",
        database_primary_name=None,
        ...
    )
    
  2. If your database is a single database file, such as databases/my_db, or a prefix, such as:
    databases/
    ├── my_db.file1
    ├── my_db.file2
    └── my_db.file3
    
    then use database_primary_name="my_db":
    tc.add_database(
        database_path="databases/",
        database_primary_name="my_db",
        ...
    )
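
The two use cases above can be summarized as a small helper that turns the value you would pass to --database into the two arguments. This is a sketch only: split_database_argument is a hypothetical name, not a Toolchest function, and the caller has to say whether the tool treats the value as a directory.

```python
import posixpath
from typing import Optional, Tuple


def split_database_argument(
    cli_argument: str, is_directory: bool
) -> Tuple[str, Optional[str]]:
    """Return (database_path, database_primary_name) for a --database value."""
    cleaned = cli_argument.rstrip("/")
    if is_directory:
        # Use case 1: the whole directory is the database.
        return cleaned, None
    # Use case 2: a single file, or a prefix shared by several files.
    parent, primary_name = posixpath.split(cleaned)
    parent = parent or "."  # a bare name lives in the current directory
    return parent.rstrip("/") + "/", primary_name


path, primary = split_database_argument("databases/my_db", is_directory=False)
print(path, primary)
```

The resulting pair could then be passed as database_path and database_primary_name in the calls shown above.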
    

Behavior differences for update_database

For update_database, database_primary_name is optional. If omitted, it assumes the same value as the previous version of the custom database.

If database_primary_name needs to be changed to None for future versions of the database, contact Toolchest to update the primary name.
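
One way to picture this rule is the sketch below, which assumes that an omitted database_primary_name is represented as None. Under that assumption, None always means "inherit", so it cannot also mean "clear the name". The function is hypothetical and only models the documented behavior.

```python
from typing import Optional


def resolve_primary_name(
    requested: Optional[str], previous: Optional[str]
) -> Optional[str]:
    """Use the requested name if given; otherwise inherit the previous one."""
    return previous if requested is None else requested


# Omitting the name keeps the previous version's value.
print(resolve_primary_name(None, "db.files"))
# Passing a name overrides it for the new version.
print(resolve_primary_name("new.prefix", "db.files"))
```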

Security

By default, the privacy setting of all custom databases is the equivalent of "unlisted". This means that if anybody else knows the name and version of your database, they can access it.

We support fully private custom databases as a part of the managed-hosted and on-prem versions of Toolchest.