ComputeMSA - Compute multiple sequence alignments (MSAs)
The ComputeMSA module is used to compute MSAs based on allelic profiles generated by the AlleleCall module or on FASTA files provided by the user. The module includes the following functionalities:
- Compute loci, sample and complete MSAs based on the allelic profiles determined by chewBBACA (e.g. at the wg/cgMLST level). Gap sequences (the character used to represent gaps is `-`) are added whenever a locus was not identified in a sample (e.g. when working at the wgMLST level).
- Compute a MSA for each FASTA file in a folder (just a way to run MAFFT to compute MSAs).
- MSAs can be computed both at the protein and DNA level (i.e. by converting protein MSAs back to DNA).
- The `--output-variable` option identifies the variable positions (SNVs) and creates MSAs only for those positions. When determining variable positions, positions with gaps or ambiguous bases can be excluded (`--gaps exclude` and `--ambiguous exclude`) or included (`--gaps ignore` and `--ambiguous ignore`) in the MSA if the sequences have other variable non-gap and non-ambiguous nucleotides or amino acids.

The SchemaEvaluator and AlleleCallEvaluator modules use the ComputeMSA module to compute the loci MSAs (SchemaEvaluator) and the complete MSA used by FastTree to compute a tree (AlleleCallEvaluator).
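The gap padding described above, where a locus missing from a sample is filled with `-`, can be illustrated with a short Python sketch. This is not chewBBACA's implementation: the function name `join_loci_msas` and the dictionary layout are assumptions made for illustration. Each locus MSA maps sample identifiers to aligned sequences, and a sample missing from a locus receives a run of gaps as long as that locus alignment.

```python
def join_loci_msas(loci_msas, samples, gap_char="-"):
    """Concatenate per-locus alignments into one aligned sequence per sample.

    loci_msas: dict mapping locus id -> dict of sample id -> aligned sequence.
    samples: ordered list of sample ids expected in the complete MSA.
    """
    complete = {sample: [] for sample in samples}
    for locus, alignment in loci_msas.items():
        lengths = {len(seq) for seq in alignment.values()}
        if len(lengths) != 1:
            raise ValueError(f"Sequences for {locus} are not aligned.")
        width = lengths.pop()
        for sample in samples:
            # Locus not identified in this sample: pad with gaps.
            complete[sample].append(alignment.get(sample, gap_char * width))
    return {sample: "".join(parts) for sample, parts in complete.items()}


loci_msas = {
    "locus1": {"sampleA": "MKT-L", "sampleB": "MKTAL"},
    "locus2": {"sampleA": "MSQ"},  # locus2 was not identified in sampleB
}
complete_msa = join_loci_msas(loci_msas, ["sampleA", "sampleB"])
for sample, sequence in complete_msa.items():
    print(f">{sample}\n{sequence}")
```

Because every sample contributes the same number of columns per locus, the concatenated sequences stay aligned regardless of how many loci are missing.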
Basic Usage
chewBBACA.py ComputeMSA -i /path/to/AlleleCallResultsFolder/results_alleles.tsv -g /path/to/SchemaFolder -o /path/to/OutputFolder
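For scripted pipelines, the same call can be assembled in Python. This is a minimal sketch: the helper `build_computemsa_command` is hypothetical, the paths are the placeholders from the command above, and only options documented on this page are used. The command is printed rather than executed.

```python
import shlex


def build_computemsa_command(input_path, output_dir, schema_dir=None,
                             dna_msa=False, output_variable=False,
                             mafft_params=None):
    """Return the ComputeMSA command as an argument list (hypothetical helper)."""
    command = ["chewBBACA.py", "ComputeMSA", "-i", input_path, "-o", output_dir]
    if schema_dir is not None:
        # Only needed when the input is a TSV file with allelic profiles.
        command += ["-g", schema_dir]
    if dna_msa:
        command.append("--dna-msa")
    if output_variable:
        command.append("--output-variable")
    if mafft_params is not None:
        # MAFFT parameters are passed as one single string.
        command += ["--custom-mafft-params", mafft_params]
    return command


command = build_computemsa_command(
    "/path/to/AlleleCallResultsFolder/results_alleles.tsv",
    "/path/to/OutputFolder",
    schema_dir="/path/to/SchemaFolder",
    dna_msa=True,
    output_variable=True,
)
print(shlex.join(command))
# To run it: subprocess.run(command, check=True)
```

Building the command as a list avoids shell quoting problems, which matters most for options whose value contains spaces.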
Parameters
-i, --input-path (Required) Path to a TSV file containing allelic profiles or to a folder containing
FASTA files. If a TSV file containing allelic profiles is provided, the
path to the schema must be passed through the `--schema-directory`
parameter. The module will create a FASTA file with the alleles
identified in the samples for each schema locus and compute a MSA. The
loci MSAs are joined to create the complete MSA based on the allele
calling results. If a path to a folder is provided, the module computes a
MSA for each FASTA file in the folder, but will not attempt to join the
MSAs as it does not have the sample information (in this case, it is not
necessary to pass the schema path) (default: None).
-o, --output-directory (Required) Path to the output directory where the process will store intermediate
and final results (default: None).
-g, --schema-directory (Optional) Path to the schema's directory. This parameter is only required if the
input is a TSV file with allelic profiles (default: None).
--dna-msa (Optional) Converts the protein MSA back to DNA to create an additional output file
with the DNA MSA (default: False).
--output-variable (Optional) Output a reduced MSA including only the variable positions. If the
`--dna-msa` parameter is provided, the process will output a reduced MSA
for both the protein and DNA MSAs (default: False).
--t, --translation-table (Optional) Genetic code used for sequence translation (default: 11).
--cpu, --cpu-cores (Optional) Number of CPU cores/threads that will be used to run the process (chewie
resets to a lower value if it is equal to or exceeds the total number of
available CPU cores/threads) (default: 1).
--only-loci-msas (Optional) Do not compute the full MSA when the input file is a TSV file containing
allelic profiles (this is already the default when the input is a path to
a folder with FASTA files) (default: False).
--gaps (Optional) How to treat gaps when determining the reduced MSA for the variable
positions. The default value, "exclude", removes variable positions if
any of the aligned sequences contains a gap. The "ignore" option keeps
variable positions that include gaps in some sequences as long as the
other sequences include variable non-gap characters. The character
used to represent gaps is "-" (default: exclude).
--ambiguous (Optional) How to treat ambiguous amino acids or nucleotides when determining the
reduced MSA for the variable positions. The default value, "exclude",
removes variable positions if any of the aligned sequences contains an
ambiguous amino acid or nucleotide. The "ignore" option keeps
variable positions that include ambiguous amino acids or nucleotides
in some sequences as long as the other sequences include variable
non-ambiguous characters. The characters interpreted as ambiguous amino acids
are [B, Z, X, J]. The characters interpreted as ambiguous nucleotides are
[R, Y, S, W, K, M, B, D, H, V, N] (default: exclude).
--custom-mafft-params (Optional) Custom parameters to pass to MAFFT when computing the loci MSAs. The
value must be a single string with all parameters enclosed in quotes
(e.g. "--retree 1 --maxiterate 0") (default: None).
--protein-input (Optional) Input files contain protein sequences. This option is only valid for
cases when users provide a path to a directory containing FASTA files (default: False).
--no-cleanup (Optional) Keep intermediate files with locus/file MSAs and sample MSAs if input is
a TSV file containing allelic profiles (default: False).
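One way to read the `--gaps` and `--ambiguous` semantics described above is sketched below in Python. This is an interpretation of the parameter descriptions, not chewBBACA's code, and the function name `variable_positions` and its arguments are hypothetical. A column is kept when it contains more than one distinct unambiguous, non-gap character. Under "exclude", any gap or ambiguous character discards the column; under "ignore", those characters are simply left out of the comparison.

```python
AMBIGUOUS_DNA = set("RYSWKMBDHVN")
AMBIGUOUS_PROTEIN = set("BZXJ")


def variable_positions(sequences, gaps="exclude", ambiguous="exclude",
                       ambiguous_chars=AMBIGUOUS_DNA, gap_char="-"):
    """Return the 0-based indices of variable columns in an alignment.

    Pass ambiguous_chars=AMBIGUOUS_PROTEIN for protein alignments, since
    several DNA ambiguity codes are ordinary amino acid letters.
    """
    if len({len(seq) for seq in sequences}) > 1:
        raise ValueError("All sequences must have the same length.")
    positions = []
    for index, column in enumerate(zip(*sequences)):
        chars = set(column)
        if gaps == "exclude" and gap_char in chars:
            continue
        if ambiguous == "exclude" and chars & ambiguous_chars:
            continue
        # Under "ignore", gaps and ambiguous characters are not counted.
        informative = chars - {gap_char} - ambiguous_chars
        if len(informative) > 1:
            positions.append(index)
    return positions


alignment = ["ACGTA", "AC-TC", "ATANG"]
strict = variable_positions(alignment)
relaxed = variable_positions(alignment, gaps="ignore", ambiguous="ignore")
reduced = ["".join(seq[i] for i in relaxed) for seq in alignment]
print(strict, relaxed, reduced)
```

In this toy alignment the third column (G, -, A) is dropped under "exclude" but kept under "ignore", because the two non-gap characters differ. The fourth column (T, T, N) is dropped in both modes, since ignoring the ambiguous N leaves only T.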
Outputs
When the input is a TSV file containing allelic profiles, the output folder structure is as follows:
OutputFolderName
├── dna_msa.fasta (if --dna-msa is used)
├── dna_msa_variable.fasta (if --output-variable and --dna-msa are used)
├── protein_msa.fasta
├── protein_msa_variable.fasta (if --output-variable is used)
└── summary_stats.tsv
- The `dna_msa.fasta` contains the complete MSA at the DNA level based on the allele calling results (only if the `--dna-msa` option is provided).
- The `dna_msa_variable.fasta` contains the reduced MSA at the DNA level including only the variable positions (only if both the `--output-variable` and `--dna-msa` options are provided).
- The `protein_msa.fasta` contains the complete MSA at the protein level based on the allele calling results.
- The `protein_msa_variable.fasta` contains the reduced MSA at the protein level including only the variable positions (only if the `--output-variable` option is provided).
- The `summary_stats.tsv` file contains, for each locus, the number of alleles in the schema, the number of samples in which the locus was identified, and the number of distinct alleles identified in the dataset.
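Converting a protein MSA back to DNA, as done to produce `dna_msa.fasta`, amounts to replacing each aligned amino acid with the codon that encoded it and each gap with three gap characters. The Python sketch below illustrates the idea under the assumption that the unaligned coding sequence has exactly three nucleotides per aligned residue (no trailing stop codon). The name `backtranslate` is hypothetical, and this is not chewBBACA's implementation.

```python
def backtranslate(aligned_protein, coding_dna, gap_char="-"):
    """Thread the codons of coding_dna onto a gapped protein sequence."""
    residues = len(aligned_protein) - aligned_protein.count(gap_char)
    if len(coding_dna) != 3 * residues:
        raise ValueError("DNA length must be three times the residue count.")
    codons = (coding_dna[i:i + 3] for i in range(0, len(coding_dna), 3))
    aligned_dna = []
    for char in aligned_protein:
        # A protein gap spans one codon, so it becomes three gap characters.
        aligned_dna.append(gap_char * 3 if char == gap_char else next(codons))
    return "".join(aligned_dna)


aligned = backtranslate("MK-L", "ATGAAACTG")
print(aligned)
```

Threading the original codons, rather than re-deriving them from the amino acids, preserves synonymous differences between alleles that the protein alignment cannot show.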
If the --only-loci-msas option is used, the process exits after computing the loci MSAs. In that case, the output folder includes the summary_stats.tsv file and a folder named MSAs containing the loci MSAs at the protein level and, optionally, the DNA MSAs (if the --dna-msa option is used) and the variable-position MSAs (if the --output-variable option is used).
If the --no-cleanup option is provided, the intermediate files will not be deleted. Depending on the options selected by the user, the intermediate files may include the loci FASTA files, loci translated FASTA files, loci MSAs, and sample MSAs.
When the input is a folder containing FASTA files, the output folder structure is as follows:
OutputFolderName
└── MSAs
    ├── dna (if --dna-msa is used)
    │   ├── file1.fasta
    │   ├── ...
    │   └── fileN.fasta
    ├── protein
    │   ├── file1_protein.fasta
    │   ├── ...
    │   └── fileN_protein.fasta
    ├── variable_dna (if --output-variable and --dna-msa are used)
    │   ├── file1.fasta
    │   ├── ...
    │   └── fileN.fasta
    └── variable_protein (if --output-variable is used)
        ├── file1_protein.fasta
        ├── ...
        └── fileN_protein.fasta
- The `MSAs/dna/` folder contains the DNA MSAs for each FASTA file in the input folder (only if the `--dna-msa` option is provided).
- The `MSAs/protein/` folder contains the protein MSAs for each FASTA file in the input folder.
- The `MSAs/variable_dna/` folder contains the reduced DNA MSAs including only the variable positions for each FASTA file in the input folder (only if both the `--output-variable` and `--dna-msa` options are provided).
- The `MSAs/variable_protein/` folder contains the reduced protein MSAs including only the variable positions for each FASTA file in the input folder (only if the `--output-variable` option is provided).
Each MSA file is named after the corresponding input FASTA file. When the input is a folder containing FASTA files, no summary_stats.tsv file is generated.
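Since the MSAs are written as FASTA files, a quick sanity check is to load one and confirm that all sequences have the same length. The Python sketch below uses only the standard library. The helper names `read_fasta` and `is_aligned` are illustrative, and the example reads from an in-memory string rather than a real output file.

```python
import io


def read_fasta(handle):
    """Parse FASTA records from a text handle into a dict of id -> sequence."""
    records, current = {}, None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].strip()
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def is_aligned(records):
    """True when every sequence has the same length, as expected for an MSA."""
    return len({len(seq) for seq in records.values()}) <= 1


# In practice, pass an open file handle for one of the FASTA files
# in the output folder instead of this in-memory example.
example = io.StringIO(">sampleA\nMKT-L\nMSQ\n>sampleB\nMKTAL\n---\n")
records = read_fasta(example)
print(records, is_aligned(records))
```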