Centrifuge is a rapid and memory-efficient classifier of DNA sequences from microbial samples. Centrifuge requires a relatively small genome index (e.g., 4.3 GB for ~4,100 bacterial genomes) and can process a typical DNA sequencing run within an hour. For more information, see the tool's website and GitHub repo.

Function Call

tc.centrifuge(
    output_path=None,
    tool_args="",
    database_name="centrifuge_refseq_bacteria_archaea_viral_human",
    database_version="1",
    read_one=None,
    read_two=None,
    unpaired=None,
    is_async=False,
)

Function Arguments

Argument	Use in place of:	Description
`read_one`	`-1`	(optional) Path(s) to R1 of paired-end read input files. The files can be local or remote, see Using Files.
`read_two`	`-2`	(optional) Path(s) to R2 of paired-end read input files. The files can be local or remote, see Using Files.
`unpaired`	`-U`	(optional) Path(s) to unpaired input files. The files can be local or remote, see Using Files.
`output_path`	output arguments (`-S`, `--report`)	(optional) Path to the directory where the output files will be downloaded. If omitted, the download is skipped. The path can be local or remote, see Using Files.
`tool_args`	all other arguments	(optional) Additional arguments to be passed to Centrifuge. This should be a single string of arguments, written as they would appear on the command line. See Supported Additional Arguments for more details.
`database_name`	`-x`*	(optional) Name of database to use for Centrifuge classification. Defaults to `"centrifuge_refseq_bacteria_archaea_viral_human"` (Refseq bacteria / archaea / viral / human).
`database_version`	`-x`*	(optional) Version of database to use for Centrifuge classification. Defaults to `"1"`.
`is_async`		(optional) Whether to run the job asynchronously. Defaults to `False`. See Async Runs for more.

*See the Databases section for more details.

Output Files

A Centrifuge run will output these files into output_path:

centrifuge_output.txt: Centrifuge output (captured from stdout), from the -S argument.
centrifuge_report.tsv: Centrifuge report file, from the --report argument.
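
Once the files are downloaded, the report can be inspected with the standard library. The following is a minimal sketch, not part of Toolchest: `top_taxa` is a hypothetical helper, and it assumes the report is tab-separated with a header row that includes a `numReads` column, as in Centrifuge's usual report layout. Check the header of your own file before relying on it.

```python
import csv


def top_taxa(handle, n=10, count_column="numReads"):
    """Return the n report rows with the highest read counts.

    Hypothetical helper. `handle` is an open text file (or any iterable of
    lines) containing a tab-separated report with a header row.
    `count_column` is an assumption about the header; change it if your
    report names the column differently.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    rows = list(reader)
    # Sort by the integer read count, largest first.
    rows.sort(key=lambda row: int(row[count_column]), reverse=True)
    return rows[:n]


# Usage, assuming output_path was "./out":
# with open("./out/centrifuge_report.tsv", newline="") as fh:
#     for row in top_taxa(fh, n=5):
#         print(row)
```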

Notes

Paired-end reads

For each paired-end input, make sure the corresponding read is in the same position in the input list. For example, two pairs of paired-end files – one_R1.fastq, one_R2.fastq, two_R1.fastq, two_R2.fastq – should be passed to Toolchest as:

tc.centrifuge(
  read_one=["one_R1.fastq", "two_R1.fastq"],
  read_two=["one_R2.fastq", "two_R2.fastq"],
  ...
)
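
Because pairing is positional, a list in the wrong order pairs the wrong files without raising an error. A small pre-flight check can catch this before a job is submitted. The sketch below is not part of Toolchest: `check_pairs` is a hypothetical helper, and it assumes file names follow the `_R1` / `_R2` convention used in the example above.

```python
def check_pairs(read_one, read_two):
    """Verify that read_one and read_two line up by position.

    Hypothetical helper, not a Toolchest function. Assumes each R1 file name
    contains "_R1" and its mate has the same name with "_R2" in its place.
    Returns a list of (R1, R2) tuples, or raises ValueError on a mismatch.
    """
    if len(read_one) != len(read_two):
        raise ValueError(
            f"read_one has {len(read_one)} files, read_two has {len(read_two)}"
        )
    for r1, r2 in zip(read_one, read_two):
        if "_R1" not in r1:
            raise ValueError(f"{r1!r} does not look like an R1 file")
        # Swap only the last "_R1" so directory names are left alone.
        expected = "_R2".join(r1.rsplit("_R1", 1))
        if r2 != expected:
            raise ValueError(f"{r1!r} is paired with {r2!r}, expected {expected!r}")
    return list(zip(read_one, read_two))
```

The validated lists can then be passed unchanged as `read_one` and `read_two`.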

Tool Versions

Toolchest currently supports version 1.0.4 of Centrifuge.

Databases

Toolchest currently supports the following databases for Centrifuge:

`database_name`	`database_version`	Description
`centrifuge_refseq_bacteria_archaea_viral_human`	`1`	RefSeq, bacteria / archaea / viral / human, JHU source¹

¹These database indexes were generated by the Langmead Lab at Johns Hopkins and can be found on the lab's database index page.

Supported Additional Arguments

Most additional arguments not related to input, output, or multithreading are supported:

-q
--qseq
-f
-r
-s, --skip
-u, --upto
-5, --trim5
-3, --trim3
--phred33
--phred64
--int-quals
--ignore-quals
--nofw
--norc
--min-hitlen
-k
--host-taxids
--exclude-taxids
--out-fmt
--tab-fmt-cols
-t, --time
--qc-filter
--seed
--non-deterministic

Set additional arguments with tool_args. For example: tool_args="-f -k 10"
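
Since `tool_args` is a single string, it can be convenient to assemble it from a list and reject flags outside the supported set before submitting a job. The sketch below is one way to do that; `build_tool_args` and `SUPPORTED_FLAGS` are hypothetical names, not part of Toolchest, and the set is simply copied from the list above. How Toolchest itself validates `tool_args` is not covered here.

```python
import shlex

# Copied from the list of supported additional arguments above.
SUPPORTED_FLAGS = {
    "-q", "--qseq", "-f", "-r", "-s", "--skip", "-u", "--upto",
    "-5", "--trim5", "-3", "--trim3", "--phred33", "--phred64",
    "--int-quals", "--ignore-quals", "--nofw", "--norc", "--min-hitlen",
    "-k", "--host-taxids", "--exclude-taxids", "--out-fmt",
    "--tab-fmt-cols", "-t", "--time", "--qc-filter", "--seed",
    "--non-deterministic",
}


def build_tool_args(tokens):
    """Join tokens into a tool_args string after checking each flag.

    Hypothetical helper. Any token starting with "-" is treated as a flag,
    so a value that itself begins with "-" would be rejected.
    """
    for token in tokens:
        if token.startswith("-"):
            # Accept the "--flag=value" form by checking only the flag part.
            flag = token.split("=", 1)[0]
            if flag not in SUPPORTED_FLAGS:
                raise ValueError(f"unsupported argument: {flag}")
    # shlex.join quotes tokens that contain spaces or shell metacharacters.
    return shlex.join(tokens)


tool_args = build_tool_args(["-f", "-k", "10"])  # "-f -k 10"
```

The resulting string can be passed as `tool_args` in the `tc.centrifuge` call.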