AlleleCall - Determine the allelic profiles of a set of genomes
What is an allele?
In biology, an allele is a specific sequence variant that occurs at a given locus. However, given a DNA sequence, the assignment of a putative allele to a locus is influenced by several factors:
- Quality of the sequence assembly (influenced by several aspects, such as the sequencing method, the assembler used, etc.);
- Whether the alleles must correspond to Coding DNA Sequences (CDSs) and open reading frames (ORFs);
- Presence of possibly homologous loci (this situation can result in an allele being assigned to the wrong locus, given the difficulty in distinguishing closely related homologs).
Therefore, in gene-by-gene methods, the definition of an allele is determined by the sequence similarity search method and all the parameters used to decide if an allele can be identified as a bona fide allele of a given locus.
In chewBBACA, by default, an allele needs to be a CDS defined by Prodigal (for chewBBACA <= 3.2.0) or Pyrodigal (for chewBBACA >= 3.3.0). To ensure reproducibility of the CDS prediction, the same Prodigal training file should be used for each bacterial species and provided as input. Users can also provide input files with CDSs, in which case the gene prediction step with Prodigal or Pyrodigal is skipped.
Important
chewBBACA >=3.3.0 uses Pyrodigal for gene prediction. Pyrodigal is a Python module that provides bindings to Prodigal, including several bug fixes and performance improvements.
Important
Please read the Prodigal wiki for more information about the requirements to create a training file.
The allele calling algorithm has the following main steps:
1. Gene prediction with Prodigal, followed by CDS extraction to create FASTA files that contain all CDSs extracted from the inputs (there is also the option to provide FASTA files with CDSs and the `--cds` parameter to skip the gene prediction step with Prodigal).
2. Identification of the distinct CDSs (chewBBACA stores information about the distinct CDSs and the genomes that contain them in a hashtable that maps each CDS SHA-256 hash to the list of unique integer identifiers of the inputs that contain that CDS, compressed with polyline encoding adapted from numcompress).
3. Exact matching at DNA level between the alleles of each locus in the schema and the CDSs identified in the input files (information about the exact matches found for each locus is saved to classification files, one per locus in the schema. The classification files are updated throughout the process with information about the matches and classifications at each step. If `--mode` equals 1, the process skips the next classification steps and goes directly to the last step).
4. Translation of the distinct CDSs that were not an exact match in the previous step (this step identifies and excludes CDSs that contain ambiguous bases or whose length is below the threshold defined by the `--l` parameter).
5. Protein deduplication to identify the distinct set of proteins and keep information about the inputs that contain CDSs encoding each distinct protein (a hashtable maps each protein SHA-256 hash to the list of unique integer identifiers of the distinct CDSs, encoded with polyline encoding).
6. Exact matching at protein level between the translated alleles of each locus in the schema and the translated CDSs identified in the input files (classification files are updated. If `--mode` equals 2, the process skips the next classification steps and goes directly to the last step).
7. Minimizer-based clustering. The distinct proteins are sorted in order of decreasing length and clustered with the schema representative alleles based on the percentage of shared distinct minimizers (default >= 20%, interior minimizers selected based on lexicographic order, k=5, w=5).
8. Alignment, using BLASTp, of each cluster representative against all proteins added to its cluster to identify and classify matches with a BLAST Score Ratio (BSR) > 0.7 (classification files are updated. If `--mode` equals 3, the process skips the next step).
9. Alignment of the schema representatives against the distinct proteins that have not been classified. The BSR is computed for each alignment and proteins with a BSR >= 0.6 are classified. For each locus, new representative alleles are determined from the set of proteins with a BSR between 0.6 and 0.7. The new representatives are realigned against the remaining unclassified proteins, and the process iterates until no proteins are classified for any locus (classification files are updated in each iteration).
10. Assignment of novel allele identifiers, addition of novel alleles to the schema, and writing of the output files based on the information stored in the classification files. If `--no-inferred` is passed, this step is bypassed and no novel alleles are added to the schema. This keeps the schema stable to make allele calls against. When in doubt about the quality of the genomes being analyzed, this parameter prevents adding alleles to the schema that may result from poor quality data rather than from actual alleles present in the bacterial population.
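The minimizer selection used in the clustering step can be sketched in a few lines. This is a simplified illustration, not chewBBACA's implementation: it keeps, in each window of `w` consecutive k-mers, the lexicographically smallest k-mer, and compares the sets of distinct minimizers of two proteins (the toy sequence below is an arbitrary example):

```python
def minimizers(protein: str, k: int = 5, w: int = 5) -> set:
    """Return the set of distinct minimizers of a protein sequence:
    in each window of w consecutive k-mers, keep the lexicographically
    smallest k-mer (a simplified take on the scheme described above)."""
    kmers = [protein[i:i + k] for i in range(len(protein) - k + 1)]
    return {min(kmers[i:i + w]) for i in range(len(kmers) - w + 1)}

def shared_fraction(query: str, representative: str) -> float:
    """Fraction of the query's distinct minimizers that are also found in
    the representative; clustering keeps pairs with >= 0.2 by default."""
    q, r = minimizers(query), minimizers(representative)
    return len(q & r) / len(q)

# Toy protein sequence for illustration only.
a = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
print(shared_fraction(a, a))  # 1.0 -> identical sequences share all minimizers
```

Because only a small set of minimizers is compared instead of full alignments, this pre-clustering greatly reduces the number of BLASTp alignments performed in the following steps.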
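The BLAST Score Ratio that drives the classification thresholds above is simply the raw alignment score of a match divided by the score of aligning the query against itself. A minimal sketch (the raw scores below are made-up values for illustration only):

```python
def blast_score_ratio(alignment_score: float, self_score: float) -> float:
    """Compute the BLAST Score Ratio (BSR): the raw score of an alignment
    divided by the score of the query aligned against itself."""
    return alignment_score / self_score

# Hypothetical raw BLASTp scores, not real output.
self_score = 620.0   # representative allele aligned against itself
match_score = 451.0  # representative aligned against a candidate protein

bsr = blast_score_ratio(match_score, self_score)
print(round(bsr, 3))  # 0.727 -> above 0.7, so the match would be classified
```

Because the BSR normalizes by the self-alignment score, it is comparable across loci of very different lengths, which is why a single threshold can be applied schema-wide.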
Perform allele calling
Having defined a cgMLST or wgMLST schema, we can proceed to use it to call alleles on target genomes. chewBBACA’s allele calling algorithm can also use schemas and alleles from existing databases such as Ridom cgMLST, BIGSdb, BIGSdb-Pasteur, Enterobase and Chewie-NS. External schemas can be adapted for usage with chewBBACA through the PrepExternalSchema module.
Warning
If you installed chewBBACA v3 and want to use a schema created with chewBBACA v2, please use the PrepExternalSchema module to convert the schema to a format fully compatible with chewBBACA v3.
Warning
If you want to run chewBBACA in an environment with multiple processes/users accessing the same schema, please read the section about concurrent allele calling at the end of this page.
Important
Although the use of a training file is optional, it is highly recommended to ensure consistent results.
Basic Usage
chewBBACA.py AlleleCall -i /path/to/InputAssembliesFolder -g /path/to/SchemaFolder -o /path/to/OutputFolder --cpu 4
Parameters
-i, --input-files (Required) Path to the directory that contains the input FASTA files or to a file
with a list of full paths to FASTA files, one per line.
-g, --schema-directory (Required) Path to the schema directory. The schema directory contains the loci
FASTA files and a folder named "short" that contains the FASTA files with the
loci representative alleles.
-o, --output-directory (Required) Output directory where the process will store intermediate files and
allele calling results (will create a subdirectory named "results_<TIMESTAMP>"
if the path passed by the user already exists).
--ptf, --training-file (Optional) Path to the Prodigal training file used by Pyrodigal to predict genes.
Default is to use the training file included in the schema's directory (default: None)
--gl, --genes-list (Optional) Path to a file with the list of genes/loci to perform allele calling.
The file must include the full paths to the loci FASTA files or the loci IDs, one
per line. The process will perform allele calling only for the subset of genes
provided in the file (default: False).
--bsr, --blast-score-ratio (Optional) BLAST Score Ratio (BSR) value. The BSR is computed for each BLASTp
alignment and aligned sequences with a BSR greater than or equal to the defined
value are considered to be alleles of the same gene (default: uses value defined
in the schema config file).
--l, --minimum-length (Optional) Minimum sequence length value. Predicted coding sequences (CDSs)
shorter than this value are excluded (default value added to the config file is 0).
--t, --translation-table (Optional) Genetic code used to predict genes and to translate coding sequences.
Must match the genetic code used to create the training file (default: uses value
defined in schema config).
--st, --size-threshold (Optional) Coding sequence (CDS) size variation threshold. At the default value of
0.2, CDSs with a size that deviates +-20 percent from the locus length mode are
classified as ASM/ALM (default: uses value defined in schema config).
--cpu, --cpu-cores (Optional) Number of CPU cores that will be used to run the process (chewie resets
to a lower value if it is equal to or exceeds the total number of available CPU cores)
(default: 1).
--b, --blast-path (Optional) Path to the directory that contains the BLAST executables. (default: assumes
BLAST executables were added to PATH).
--pm, --prodigal-mode (Optional) Prodigal running mode ("single" for finished genomes, reasonable
quality draft genomes and big viruses. "meta" for metagenomes, low quality
draft genomes, small viruses, and small plasmids) (default: single).
--cds, --cds-input (Optional) If provided, chewBBACA skips the gene prediction step and assumes the
input FASTA files contain coding sequences (one FASTA file per strain) (default: False).
--no-inferred (Optional) If provided, the process will not add the sequences of inferred alleles
(INF) to the schema. Allelic profiles will still include the allele identifiers
attributed to the inferred alleles. Use this parameter if the schema is being
accessed by multiple processes/users simultaneously (default: False).
--output-unclassified (Optional) Create a FASTA file with the coding sequences (CDSs) that were not
classified (default: False).
--output-missing (Optional) Create a FASTA file with coding sequences classified as NIPH, NIPHEM,
ASM, ALM, PLOT3, PLOT5 and LOTSC (default: False).
--output-novel (Optional) Create a FASTA file with the novel alleles inferred during allele calling.
The sequence headers include the locus and allele identifiers attributed by
chewBBACA based on the allele calling results (default: False).
--no-cleanup (Optional) If provided, intermediate files generated during process execution are
not removed at the end (default: False).
--hash-profile (Optional) Create a TSV file with hashed allelic profiles. Profiles can be hashed
with any of the hashing algorithms implemented in the hashlib and zlib Python libraries
(default: None).
--force-continue (Optional) If provided, chewie will not warn users and ask for permission to
continue if any of the provided argument values does not match the values in the
config file (default: False).
--mode (Optional) Execution mode (1: only exact matches at DNA level; 2: exact matches
at DNA and Protein level; 3: exact matches and minimizer-based clustering to find
similar alleles based on BSR+0.1; 4: runs the full process to find exact matches
and similar matches based on BSR value, including the determination of new
representative alleles to add to the schema) (default: 4).
Important
By default, the AlleleCall module uses the Prodigal training file included in the schema’s
directory and it is not necessary to pass a training file to the --ptf
parameter.
Important
If you provide the `--cds-input` parameter, chewBBACA assumes that the input FASTA files contain
CDSs and skips the gene prediction step. To avoid issues related to the format of the sequence
headers, chewBBACA renames the sequence headers based on the unique basename prefix determined
for each input file and on the order of the CDSs (e.g.: CDSs inside a file named
`GCF_000007125.1_ASM712v1_cds_from_genomic.fna` are renamed to `GCF_000007125-protein1`,
`GCF_000007125-protein2`, …, `GCF_000007125-proteinN`).
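The renaming pattern can be mimicked with a short helper. This is a hypothetical function written for illustration, not part of chewBBACA:

```python
def rename_cds_headers(basename_prefix: str, num_cds: int) -> list:
    """Generate sequence identifiers following the renaming pattern
    described above: <prefix>-protein1 ... <prefix>-proteinN."""
    return [f"{basename_prefix}-protein{i}" for i in range(1, num_cds + 1)]

print(rename_cds_headers("GCF_000007125", 3))
# ['GCF_000007125-protein1', 'GCF_000007125-protein2', 'GCF_000007125-protein3']
```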
Note
If a text file that contains a list of full paths to loci FASTA files or loci IDs, one per line, is
passed to the `--gl` parameter, the process will only perform allele calling for the loci in that list.
Outputs
OutputFolderName
├── cds_coordinates.tsv
├── invalid_cds.txt
├── loci_summary_stats.tsv
├── results_statistics.tsv
├── results_contigsInfo.tsv
├── results_alleles.tsv
├── paralogous_counts.tsv
├── paralogous_loci.tsv
└── logging_info.txt
- The `cds_coordinates.tsv` file contains the coordinates (genome unique identifier, contig identifier, start position, stop position, protein identifier attributed by chewBBACA, and coding strand; chewBBACA <= 3.2.0 assigns 1 to the forward strand and 0 to the reverse strand, while chewBBACA >= 3.3.0 assigns 1 and -1 to the forward and reverse strands, respectively) of the CDSs identified in each genome.
- The `invalid_cds.txt` file contains the list of alleles predicted by Prodigal that were excluded based on the minimum sequence size value and the presence of ambiguous bases.
- The `loci_summary_stats.tsv` file contains the classification type counts (EXC, INF, PLOT3, PLOT5, LOTSC, NIPH, NIPHEM, ALM, ASM, PAMA, LNF) and the total number of classified CDSs (non-LNF) per locus.
- The `results_statistics.tsv` file contains the classification type counts (EXC, INF, PLOT3, PLOT5, LOTSC, NIPH, NIPHEM, ALM, ASM, PAMA, LNF), the total number of invalid CDSs, the total number of classified CDSs (non-LNF) and the total number of predicted CDSs per genome.
The column headers stand for:
EXC - EXaCt matches (100% DNA identity) with previously identified alleles.
INF - INFerred new alleles that had no exact match in the schema but are highly similar to loci in the schema. The INF- prefix in the allele identifier indicates that such allele was newly inferred in that genome, and the number following the prefix is the allele identifier attributed to such allele. Inferred alleles are added to the FASTA file of the locus they share high similarity with.
LNF - Locus Not Found. No allele was found for the locus. This means that there were no BLAST hits for that locus or that the hits were not within the BSR threshold for allele assignment.
PLNF - Probable Locus Not Found. Attributed when a locus is not found during execution modes 1, 2 and 3. Those modes do not perform the complete analysis, that is only performed in mode 4 (default), and the distinct classification indicates that a more thorough analysis might have found a match for the loci that were not found.
PLOT3/PLOT5 - Possible Locus On the Tip of the query genome contigs (see image below). A locus is classified as PLOT when the CDS of the query genome has a BLAST hit with a known larger allele that entirely covers the CDS sequence, and the unaligned regions of the larger allele extend beyond one of the contig ends of the query genome. A locus is classified as PLOT5 or PLOT3 depending on whether the CDS matching the schema locus is located in the 5' end or the 3' end of the contig, respectively. This could be an artifact caused by genome fragmentation resulting in a shorter CDS prediction by Prodigal. To avoid locus misclassification, loci in such situations are classified as PLOT.
LOTSC - A locus is classified as LOTSC when the contig of the query genome is smaller than the matched allele.
NIPH - Non-Informative Paralogous Hit (see image below). When ≥2 CDSs in the query genome match one locus in the schema with a BSR > 0.6, that locus is classified as NIPH. This suggests that such locus can have paralogous (or orthologous) loci in the query genome and should be removed from the analysis due to the potential uncertainty in allele assignment (for example, due to the presence of multiple copies of the same mobile genetic element (MGE) or as a consequence of gene duplication followed by pseudogenization). A high number of NIPH may also indicate a poorly assembled genome due to a high number of smaller contigs which result in partial CDS predictions. These partial CDSs may contain conserved domains that match multiple loci.
NIPHEM - similar to the NIPH classification, but specifically referring to exact matches. Whenever several CDSs from the same genome match a single or multiple alleles of the same locus with 100% DNA similarity during the first DNA sequence comparison, the NIPHEM tag is attributed.
PAMA - PAralogous MAtch. Attributed to CDSs that are highly similar to more than one locus. This type of classification allows the identification of groups of similar loci in the schema that are classified as paralogous loci and listed in the `paralogous_counts.tsv` and `paralogous_loci.tsv` files.
ALM - Alleles 20% Larger than the length Mode of the distribution of the matched loci (CDS length > (locus length mode + locus length mode * 0.2)) (see image below). This determination is based on the currently identified set of alleles for a given locus. It is important to remember that, although infrequently, the mode may change as more alleles for a given locus are called and added to a schema.
ASM - similar to ALM but for Alleles 20% Smaller than the length Mode distribution of the matched loci (CDS length < (locus length mode - locus length mode * 0.2)). As with ALMs it is important to remember that, although infrequently, the mode may change as more alleles for a given locus are called and added to a schema.
Note
The ALM and ASM classifications impose a limit on size variation since for the majority of loci the allele lengths are quite conserved. However, some loci can have larger variation in allele length and those should be manually curated.
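The ASM/ALM size rule can be expressed as a small helper. This is a sketch assuming the default `--st` value of 0.2, not chewBBACA's actual code:

```python
def classify_length(cds_length: int, locus_mode: int, threshold: float = 0.2) -> str:
    """Classify a CDS length against a locus length mode using the +-20%
    size-threshold rule described above (ASM below, ALM above the range)."""
    lower = locus_mode - locus_mode * threshold
    upper = locus_mode + locus_mode * threshold
    if cds_length < lower:
        return "ASM"
    if cds_length > upper:
        return "ALM"
    return "in range"

# For a locus with a length mode of 900 bp, the accepted range is 720-1080 bp.
print(classify_length(700, 900))   # ASM
print(classify_length(1100, 900))  # ALM
print(classify_length(903, 900))   # in range
```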
The statistics file also helps the user to identify low-quality draft genomes among the analyzed genomes, since with a proper schema most identified loci should be exact matches or inferred alleles. A high number of PLOT, ASM, ALM and/or NIPH classifications usually indicates low-quality or contaminated assemblies.
The `results_contigsInfo.tsv` file contains the loci coordinates in the analyzed genomes. The first column contains the identifier of the genome used in the allele calling and the other columns (with loci names in the headers) contain the locus coordinate information or, if the match was not an exact match or inferred allele, the classification attributed by chewBBACA.
| FILE | locus1 | locus2 | … |
| --- | --- | --- | --- |
| SAMD00008628 | contig2&162560-161414&0 | LNF | … |
| SAMD00053744 | contig4&268254-269400&1 | contig3&272738-274082&1 | … |
Example for the `SAMD00008628` genome:

- locus1, with the information `contig2&162560-161414&0`, was found in this genome. The `&` character is the field delimiter:
  - it is located in the sequence with identifier `contig2`;
  - between 161,414 bp and 162,560 bp (reported as `162560-161414` because the CDS is encoded in the reverse strand). These nucleotide positions are inclusive and include the stop codon;
  - in the reverse strand (represented by `0`; `1` means that the CDS is encoded in the direct strand).
- locus2 was not found (LNF).
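A cell of `results_contigsInfo.tsv` can be parsed with a few lines of Python. `parse_locus_position` is a hypothetical helper written for illustration, not part of chewBBACA:

```python
def parse_locus_position(value: str):
    """Parse a results_contigsInfo.tsv cell such as 'contig2&162560-161414&0'.
    Cells without the '&' delimiter hold a classification (LNF, PLOT3, ...)
    instead of coordinates, so None is returned for those."""
    if "&" not in value:
        return None  # cell holds a classification, not coordinates
    contig, positions, strand = value.split("&")
    start, stop = (int(p) for p in positions.split("-"))
    return {"contig": contig, "start": start, "stop": stop,
            "reverse_strand": strand == "0"}

print(parse_locus_position("contig2&162560-161414&0"))
# {'contig': 'contig2', 'start': 162560, 'stop': 161414, 'reverse_strand': True}
print(parse_locus_position("LNF"))
# None
```

Note that the strand test above follows the chewBBACA <= 3.2.0 encoding (0 for reverse); adjust it if your files use the 1/-1 encoding of chewBBACA >= 3.3.0.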
The `results_alleles.tsv` file contains the allelic profiles determined for the input samples. The first column has the identifiers of the genome assemblies for which the allele call was performed. The remaining columns contain the allele call data for loci present in the schema, with the column headers being the locus identifiers.
| FILE | locus1 | locus2 | locus3 | locus4 | locus5 | … |
| --- | --- | --- | --- | --- | --- | --- |
| SAMD00008628 | INF-2 | 1 | 3 | ASM | PLOT3 | … |
| SAMD00053744 | 10 | 1 | 3 | ALM | PLOT5 | … |
Note
The allelic profile output can be transformed and imported into PHYLOViZ to generate and visualize a Minimum Spanning Tree.
Important
The ExtractCgMLST module was designed to determine the set of loci that
constitute the core genome based on a given threshold, but it can also be used to
convert the TSV file with allelic profiles into a suitable format that can be imported
into PHYLOViZ by substituting all non-numeric classifications by 0. To convert an
allelic profile output, simply run the ExtractCgMLST module with a threshold
value, `--t`, of 0.
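For a quick manual conversion, the substitution performed by ExtractCgMLST can be approximated in a few lines. This sketch assumes that INF- prefixes are stripped so the allele number is kept and that every other non-numeric classification becomes 0; check the ExtractCgMLST output if exact equivalence matters:

```python
import csv
import io

def mask_profiles(tsv_text: str) -> str:
    """Replace non-numeric classifications (LNF, ASM, PLOT3, ...) with 0 and
    strip INF- prefixes, keeping numeric allele identifiers untouched."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    rows = [next(reader)]  # header row is kept as-is
    for row in reader:
        masked = [row[0]]  # first column is the genome identifier
        for cell in row[1:]:
            if cell.startswith("INF-"):   # inferred alleles keep their number
                masked.append(cell[4:])
            elif cell.isdigit():
                masked.append(cell)
            else:
                masked.append("0")       # LNF, ASM, ALM, PLOT..., NIPH...
        rows.append(masked)
    return "\n".join("\t".join(r) for r in rows)

profile = "FILE\tlocus1\tlocus2\nSAMD00008628\tINF-2\tLNF\n"
print(mask_profiles(profile))
# FILE    locus1  locus2
# SAMD00008628    2       0
```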
- The `paralogous_counts.tsv` file contains the list of paralogous loci and the number of times those loci matched a CDS that was also similar to other loci in the schema.
- The `paralogous_loci.tsv` file contains the sets of paralogous loci identified per genome (genome identifier, identifiers of the paralogous loci and the coordinates of the CDS that is similar to the group of paralogous loci).
- The `logging_info.txt` file contains summary information about the allele calling process.
- If the `--output-unclassified` parameter is provided, the process creates a FASTA file, `unclassified_sequences.fasta`, with the DNA sequences of the distinct CDSs that were not classified.
- If the `--output-missing` parameter is provided, the process creates a FASTA file, `missing_classes.fasta`, and a TSV file with information about the classified sequences that led to a locus being classified as ASM, ALM, PLOT3, PLOT5, LOTSC, NIPH, NIPHEM or PAMA.
- If the `--hash-profile` parameter is provided, the process uses the provided hash algorithm to create a TSV file, `results_alleles_hashed.tsv`, with hashed profiles (each allele identifier is substituted by the hash of the DNA sequence).
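Hashing an allele sequence with any algorithm from `hashlib` looks like the sketch below. The exact bytes chewBBACA hashes (e.g. whether sequences are upper-cased or whitespace-stripped first) are an assumption here; the toy sequence is for illustration only:

```python
import hashlib

def hash_allele(dna_sequence: str, algorithm: str = "sha256") -> str:
    """Hash a DNA sequence with any algorithm available in hashlib,
    mirroring how hashed profiles substitute allele identifiers."""
    hasher = hashlib.new(algorithm)
    hasher.update(dna_sequence.encode("utf-8"))
    return hasher.hexdigest()

# Toy sequence; real alleles come from the schema FASTA files.
seq = "ATGACGTTAGCCTGA"
print(hash_allele(seq))        # SHA-256 digest, the default
print(hash_allele(seq, "md5"))  # any other hashlib algorithm also works
```

Because the hash depends only on the sequence itself, hashed profiles from different schemas or runs can be compared directly, which is the main appeal of this output.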
Identify genetic clusters
We recommend that you use ReporTree to identify genetic clusters
based on the allelic profiles (contained in the results_alleles.tsv
output file) determined by chewBBACA. ReporTree
includes functionalities to identify genetic clusters at any distance threshold level(s), obtain summary reports with relevant
statistics computed based on sample metadata, identify regions of cluster stability, etc. Cluster nomenclature can be maintained
and updated in subsequent analyses, which is especially useful in surveillance-oriented workflows. Check the
publication and the
GitHub repository to know more about ReporTree.
Concurrent allele calling
In its default mode, mode 4, and in the execution modes 2 and 3, the AlleleCall module updates the schema with the
novel alleles inferred during the allele calling. This is incompatible with concurrent access to the same schema.
If you run chewBBACA in an environment with multiple processes/users accessing the same schema, please use the
`--no-inferred` parameter. By providing this parameter, chewBBACA will still identify novel alleles but will not
update the schema files with the information about those novel alleles. When you create a new schema, adapt an external
schema or download a schema from Chewie-NS, you must perform a single allele calling before using the schema for
concurrent allele calling. You can use a single genome assembly; it's only essential to generate the pre-computed data
that chewBBACA uses to speed up the allele calling. After that, multiple users can concurrently perform allele calling
based on the same schema if they pass the `--no-inferred` parameter. chewBBACA will still identify novel alleles and
include them in the final results, but those alleles will not be added to the schema, and the pre-computed files will
not be updated. If you ever want to add new alleles to the schema, you'll have to perform allele calling without the
`--no-inferred` parameter and ensure that there's only one process working with the schema while it is updated.
Warning
The schema will most likely become corrupted and unusable if you attempt to run multiple concurrent processes
with the same schema without providing the `--no-inferred` parameter.