Usage

Quickstart (for the impatient)

To type polysaccharide loci from genome assemblies:

kaptive assembly <db> assemblies/*.fasta > results.tsv

This runs kaptive assembly with the default parameters and writes a table to the file results.tsv, detailing the best match locus, predicted phenotype, confidence score and detailed typing information for each input genome assembly.
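Once the table has been written, it can be summarised with Python's standard csv module. The snippet below is a minimal sketch and not part of Kaptive itself: the column names "Assembly" and "Best match locus" are assumptions about the header line and should be checked against your own file, and the rows in the demo table are made-up placeholders.

```python
import csv
import io
from collections import Counter


def count_loci(handle, locus_column="Best match locus"):
    """Count how many assemblies were assigned to each locus.

    `locus_column` is an assumed column name; check it against the
    header line of your own results table.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row[locus_column] for row in reader)


# Made-up placeholder table standing in for a real results file
demo = io.StringIO(
    "Assembly\tBest match locus\n"
    "sample_1\tKL1\n"
    "sample_2\tKL2\n"
    "sample_3\tKL1\n"
)
locus_counts = count_loci(demo)
print(locus_counts.most_common())
```

For a real run, open the output file instead of the in-memory demo, for example with open("results.tsv", newline="") as handle, and pass the handle to count_loci.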

Detailed usage

We designed Kaptive 3 to be easier to use on the command-line than previous versions by structuring the program as a series of sub-commands that follow the general pattern of kaptive <mode> <database> <input>. There are three modes:

assembly: type assemblies
convert: convert Kaptive results to different formats
extract: extract features from Kaptive databases in different formats

Note

To see the full list of commands and options, run kaptive -h/--help.

kaptive assembly

Given a Kaptive database and a bacterial genome assembly, kaptive assembly performs three main tasks:

Determine the most likely locus type of the genome assembly.
Reconstruct the biosynthetic gene cluster from the assembly contig sequences.
Predict the corresponding serotype/phenotype of the genome assembly.

Note

As of version 3, Kaptive no longer supports allelic (wzi, wzc) typing.

To perform K locus typing on a directory of Klebsiella pneumoniae assemblies, you would run:

kaptive assembly kpsc_k assemblies/*.fasta > results.tsv

Here we have told Kaptive to perform typing of assemblies with assembly and used the database keyword kpsc_k to specify the Klebsiella pneumoniae K locus database. All other parameters are set to the default.

Database keywords are a handy short-cut for using the databases distributed with Kaptive and located in the Kaptive package data. Alternatively, you can specify the full path to your own database.

You may also want to specify the locations and/or filenames of the output files using the following options:

Note, text outputs accept '-' for stdout

-o , --out           Output file to write/append tabular results to (default: stdout)
-f [], --fasta []    Turn on fasta output
                     Accepts a single file or a directory (default: cwd)
-j [], --json []     Turn on JSON lines output
                     Optionally choose file (can be existing) (default: kaptive_results.json)
-s [], --scores []   Dump locus score matrix to tsv (typing will not be performed!)
                     Optionally choose file (can be existing) (default: stdout)
--no-header          Suppress header line

Example:

kaptive assembly kpsc_k assemblies/*.fasta -o kaptive_results.tsv -f -j

This will output a tabular file called kaptive_results.tsv, a fasta file for each assembly called {assembly}_kaptive_results.fna and a JSON lines file called kaptive_results.json.

Warning

It is possible to write all text formats (TSV, JSON and FASTA) to the same file (including stdout); however, this is not recommended for downstream analysis.

Advanced options

Advanced users may wish to customise Kaptive's scoring options (for picking the best match locus), confidence options (for marking matches as 'Typeable' or 'Untypeable') or database parsing options. We recommend keeping the default options for standard typing using the Klebsiella and/or A. baumannii databases distributed with Kaptive.

Scoring options:

--min-cov            Minimum gene %coverage (blen/q_len*100) to be used for scoring (default: 50.0)
--score-metric       Metric for scoring each locus (default: 0)
                       0: AS (alignment score of genes found)
                       1: mlen (matching bases of genes found)
                       2: blen (aligned bases of genes found)
                       3: q_len (query length of genes found)
--weight-metric      Weighting for the 1st stage of the scoring algorithm (default: 3)
                       0: No weighting
                       1: Number of genes found
                       2: Number of genes expected
                       3: Proportion of genes found
                       4: blen (aligned bases of genes found)
                       5: q_len (query length of genes found)
--n-best             Number of best loci from the 1st round of scoring to be
                     fully aligned to the assembly (default: 2)
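The --min-cov filter and the default weighting can be illustrated in a few lines of Python. This is a simplified sketch under stated assumptions, not Kaptive's implementation: the GeneHit container and its field names are hypothetical, and multiplying the summed alignment score by the proportion of genes found is only one plausible reading of weight metric 3.

```python
from typing import NamedTuple


class GeneHit(NamedTuple):  # hypothetical container, not a Kaptive class
    alignment_score: int  # AS
    blen: int             # aligned bases
    q_len: int            # query (gene) length


def percent_coverage(hit):
    """Gene %coverage as described for --min-cov: blen / q_len * 100."""
    return hit.blen / hit.q_len * 100


def locus_score(hits, n_expected, min_cov=50.0):
    """Sum AS over genes passing min_cov, weighted by proportion of genes found.

    Assumption: the weight simply multiplies the summed score.
    """
    found = [h for h in hits if percent_coverage(h) >= min_cov]
    if not found or n_expected == 0:
        return 0.0
    return sum(h.alignment_score for h in found) * (len(found) / n_expected)


# Illustrative values only: the second hit covers 40% of its gene
# and is dropped by the default 50% cut-off.
hits = [GeneHit(900, 950, 1000), GeneHit(300, 400, 1000), GeneHit(500, 600, 1000)]
score = locus_score(hits, n_expected=4)
print(score)
```

In this toy case two of the four expected genes pass the coverage filter, so the summed alignment score of 1400 is halved.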

Confidence options:

--gene-threshold     Species-level locus gene identity threshold (default: database specific)
--max-other-genes    Typeable if <= other genes (default: 1)
--percent-expected   Typeable if >= % expected genes (default: 50)
--below-threshold    Typeable if any genes are below threshold (default: False)
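The way these thresholds could combine into a confidence call is sketched below. This is an illustration under assumptions rather than Kaptive's actual logic: the function and argument names are hypothetical, the criteria are assumed to be combined with a logical AND, and --below-threshold is read as a switch that permits a 'Typeable' call even when some genes fall below the identity threshold.

```python
def confidence(n_other_genes, percent_expected_found, n_genes_below_threshold,
               max_other_genes=1, percent_expected=50.0, below_threshold=False):
    """Return 'Typeable' or 'Untypeable' for one locus match.

    Hypothetical helper; defaults mirror the documented option defaults.
    """
    if n_other_genes > max_other_genes:
        return "Untypeable"  # too many genes from other loci
    if percent_expected_found < percent_expected:
        return "Untypeable"  # too few of the expected genes were found
    if n_genes_below_threshold > 0 and not below_threshold:
        return "Untypeable"  # low-identity genes are not tolerated by default
    return "Typeable"


print(confidence(n_other_genes=0, percent_expected_found=95.0,
                 n_genes_below_threshold=0))
print(confidence(n_other_genes=2, percent_expected_found=95.0,
                 n_genes_below_threshold=0))
```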

The database options are listed under kaptive extract below. Other options:

-V, --verbose         Print debug messages to stderr
-v , --version        Show version number and exit
-h , --help           Show this help message and exit
-t , --threads        Number of threads for alignment (default: maximum available CPUs / 32)

kaptive convert

The convert command allows you to convert the Kaptive results JSON file into a range of useful formats, including:

tsv: Tabular output (tsv)
json: JSON lines format (same as input but optionally filtered)
fna: Locus nucleotide sequences in fasta format.
ffn: Gene nucleotide sequences in fasta format.
faa: Protein sequences in fasta format.

Warning

The convert command is only compatible with JSON files from Kaptive v3.0.0 onwards.

Usage

kaptive convert -h
...

General usage is as follows:

kaptive convert <db> <json> [formats] [options]

Inputs:

db path/keyword       Kaptive database path or keyword
json                  Kaptive JSON lines file or - for stdin

Formats:

Note, text outputs accept '-' for stdout

-t [], --tsv []       Convert to tabular format in file (default: stdout)
-j [], --json []      Convert to JSON lines format in file (default: stdout)
--fna []              Convert to locus nucleotide sequences in fasta format
                      Accepts a single file or a directory (default: cwd)
--ffn []              Convert to locus gene nucleotide sequences in fasta format
                      Accepts a single file or a directory (default: cwd)
--faa []              Convert to locus gene protein sequences in fasta format
                      Accepts a single file or a directory (default: cwd)
--no-header           Suppress header line

Filter options:

-r , --regex          Python regular-expression to select JSON lines (default: All)
-l  [ ...], --loci  [ ...]
                    Space-separated list to filter locus names (default: All)
-s  [ ...], --samples  [ ...]
                    Space-separated list to filter sample names (default: All)

Note

Filters take precedence in descending order
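To get a feel for how a regular-expression filter selects JSON lines, the same idea can be reproduced in plain Python. This is a minimal sketch, assuming the pattern is searched against the raw text of each line; the keys in the demo records are hypothetical placeholders and not Kaptive's actual JSON schema.

```python
import json
import re


def select_json_lines(lines, pattern):
    """Yield parsed records whose raw JSON line matches `pattern`.

    Assumption: the regular expression is searched against the raw text
    of each line, as re.search would do.
    """
    regex = re.compile(pattern)
    for line in lines:
        line = line.strip()
        if line and regex.search(line):
            yield json.loads(line)


# Hypothetical records; the keys are placeholders, not Kaptive's schema
demo_lines = [
    '{"sample": "sample_1", "locus": "KL1"}',
    '{"sample": "sample_2", "locus": "KL107"}',
    '{"sample": "sample_3", "locus": "KL2"}',
]
selected = list(select_json_lines(demo_lines, r'"KL(1|2)"'))
print([record["sample"] for record in selected])
```

Including the surrounding quotes in the pattern keeps KL107 from being picked up by a pattern meant for KL1.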

For example, to convert the JSON file to a tabular format, run either of the following commands:

kaptive convert kpsc_k kaptive_results.json --tsv kaptive_results.tsv

cat *.json | kaptive convert kpsc_k - --tsv - > kaptive_results.tsv

To output multiple formats, you can run:

kaptive convert kpsc_k kaptive_results.json --tsv kaptive_results.tsv --fna - --faa proteins/

Here, the tabular results are written to kaptive_results.tsv, the locus nucleotide sequences are written to stdout, and the protein sequences are written to the directory proteins/ with the filenames {assembly}_kaptive_results.faa.

Warning

It is possible to write all text formats (TSV, JSON, FNA, FAA and FFN) to the same file (including stdout); however, this is not recommended for downstream analysis.

kaptive extract

Kaptive 3.0.0 and above includes a new command-line mode extract that allows you to extract features from a Kaptive database in the following formats:

fna: Locus nucleotide sequences in fasta format.
ffn: Gene nucleotide sequences in fasta format.
faa: Protein sequences in fasta format.

Usage

General usage is as follows:

kaptive extract <db> [formats] [options]

Formats:

Note, text outputs accept '-' for stdout

--fna []         Convert to locus nucleotide sequences in fasta format
                 Accepts a single file or a directory (default: cwd)
--ffn []         Convert to locus gene nucleotide sequences in fasta format
                 Accepts a single file or a directory (default: cwd)
--faa []         Convert to locus gene protein sequences in fasta format
                 Accepts a single file or a directory (default: cwd)

Database options:

--locus-regex    Python regular-expression to match locus names in db source note
--type-regex     Python regular-expression to match locus types in db source note
--filter         Python regular-expression to select loci to include in the database

Note

These options are useful for customising the database to your needs, for example, to include only a subset of loci or to change the way locus names and types are parsed from the source note.

Other options:

-V, --verbose    Print debug messages to stderr
-v , --version   Show version number and exit
-h , --help      Show this help message and exit

For example, to extract the locus nucleotide sequences from the Klebsiella pneumoniae K locus primary reference database in fasta format, run:

kaptive extract kp_k --fna k_loci.fna

To extract all protein sequences from KL1 and KL2, run either one of the following:

kaptive extract kp_k --filter "^KL(1|2)$" --faa KL1_KL2_proteins.faa
kaptive extract kp_k --filter "^KL(1|2)$" --faa - > KL1_KL2_proteins.faa

To do the same but output each locus to a separate file, run either:

kaptive extract kp_k --filter "^KL(1|2)$" --faa
kaptive extract kp_k --filter "^KL(1|2)$" --faa protein_files/

Either command creates two files: KL1.faa and KL2.faa.
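The ^ and $ anchors in the --filter pattern matter, because many locus names share a prefix. The short sketch below shows the difference using Python's re module; it assumes the pattern is applied to the locus name with search semantics, and the list of names is only there for illustration.

```python
import re

anchored = re.compile(r"^KL(1|2)$")
unanchored = re.compile(r"KL(1|2)")

# A handful of locus names used purely for illustration
locus_names = ["KL1", "KL2", "KL10", "KL21", "KL102"]

anchored_hits = [name for name in locus_names if anchored.search(name)]
unanchored_hits = [name for name in locus_names if unanchored.search(name)]

print(anchored_hits)    # only the exact names KL1 and KL2
print(unanchored_hits)  # every name that merely contains KL1 or KL2
```

Without the anchors, names such as KL10 or KL21 would also be selected, so the output would contain more loci than intended.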

kaptive assembly kpsc_k assembly.fasta -j kaptive_results.json

Warning

It is possible to write all text formats (fna, faa and ffn) to the same file (including stdout); however, this is not recommended for downstream analysis.

API

Whilst Kaptive isn't designed to be a fully-fledged API, it is possible to use it as a module in your own Python scripts. For typing assemblies, you can use the kaptive.assembly.typing_pipeline function, which takes an assembly and a kaptive.database.Database object as input and returns a kaptive.typing.TypingResult object.

from kaptive.assembly import typing_pipeline
from kaptive.database import load_database
from pathlib import Path

db = load_database('kpsc_k')  ## Load the Klebsiella K locus database once and pass it to the typing pipeline
for result in map(lambda a: typing_pipeline(a, db), Path('assemblies').glob('*.fna.gz')):
    if result:  ## If the assembly was successfully typed
        print(result.format('tsv'), end='')  ## TSV format will end in a newline, so we set end to ''

For example, if you wanted to perform K and O locus typing on a single assembly, you could do the following:

from kaptive.assembly import typing_pipeline

## Here, we pass the database keywords directly; each database is loaded inside the typing pipeline
for result in map(lambda d: typing_pipeline('test/kpsc/2018-01-389.fasta', d), ['kpsc_k', 'kpsc_o']):
    if result:  ## If the assembly was successfully typed
        print(result.format('tsv'), end='')  ## TSV format will end in a newline, so we set end to ''

Note

By default, typing_pipeline runs minimap2 on all available CPUs; however, this can be controlled with the threads parameter.