ComputeMSA - Compute multiple sequence alignments (MSAs)

The ComputeMSA module is used to compute MSAs based on allelic profiles generated by the AlleleCall module or on FASTA files provided by the user. The module includes the following functionalities:

Compute loci, sample and complete MSAs based on the allelic profiles determined by chewBBACA (e.g. at the wg/cgMLST level). Gap sequences (the character used to represent gaps is -) are added whenever a locus was not identified in a sample (e.g. when working at the wgMLST level).
Compute an MSA for each FASTA file in a folder (effectively a wrapper to run MAFFT on each file).
MSAs are computed at the protein level and can optionally be converted back to DNA (--dna-msa).
The --output-variable option identifies the variable positions (SNVs) and creates reduced MSAs containing only those positions. When determining variable positions, positions with gaps or ambiguous characters are either discarded (--gaps exclude and --ambiguous exclude) or kept (--gaps ignore and --ambiguous ignore), provided that the remaining sequences differ in non-gap, non-ambiguous nucleotides or amino acids at that position.
The SchemaEvaluator and AlleleCallEvaluator modules use the ComputeMSA module to compute the loci MSAs (SchemaEvaluator) and the complete MSA used by FastTree to compute a tree (AlleleCallEvaluator).
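
The gap filling described above for loci that were not identified in a sample can be sketched in a few lines of Python. This is a minimal illustration, not chewBBACA's implementation: the function name join_loci_msas, the dictionary-based data layout, and the sample and sequence values are all hypothetical.

```python
def join_loci_msas(loci_msas, samples, gap="-"):
    """Concatenate per-locus alignments into one aligned sequence per sample.

    loci_msas: list of dicts mapping sample id -> aligned sequence (one dict per locus).
    A sample missing from a locus receives a run of gap characters as long as
    that locus alignment.
    """
    joined = {sample: [] for sample in samples}
    for msa in loci_msas:
        # Every aligned sequence within one locus MSA must have the same length.
        lengths = {len(seq) for seq in msa.values()}
        if len(lengths) != 1:
            raise ValueError("sequences in a locus MSA must have equal length")
        width = lengths.pop()
        for sample in samples:
            joined[sample].append(msa.get(sample, gap * width))
    return {sample: "".join(parts) for sample, parts in joined.items()}


# Hypothetical toy data: locus2 was not identified in sampleB.
locus1 = {"sampleA": "MKT-L", "sampleB": "MKTAL"}
locus2 = {"sampleA": "MSS"}
full = join_loci_msas([locus1, locus2], ["sampleA", "sampleB"])
print(full["sampleA"])  # MKT-LMSS
print(full["sampleB"])  # MKTAL---
```

Because every sample receives the same number of columns per locus, the joined sequences stay aligned regardless of how many loci are missing.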

Basic Usage

chewBBACA.py ComputeMSA -i /path/to/AlleleCallResultsFolder/results_alleles.tsv -g /path/to/SchemaFolder -o /path/to/OutputFolder
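
For scripted runs, the same command line can be assembled in Python before handing it to subprocess. The sketch below only builds and prints the argument list; the helper name build_compute_msa_command and the /path/to/FastaFolder placeholder are assumptions for illustration, and only flags documented in the Parameters section are used.

```python
import shlex


def build_compute_msa_command(input_path, output_dir, schema_dir=None,
                              dna_msa=False, output_variable=False, cpu_cores=1):
    """Build the argument list for a ComputeMSA run (hypothetical helper)."""
    cmd = ["chewBBACA.py", "ComputeMSA", "-i", input_path, "-o", output_dir]
    # The schema path is only needed when the input is a TSV file with allelic profiles.
    if schema_dir is not None:
        cmd += ["-g", schema_dir]
    if dna_msa:
        cmd.append("--dna-msa")
    if output_variable:
        cmd.append("--output-variable")
    cmd += ["--cpu", str(cpu_cores)]
    return cmd


# Folder input: no schema path is passed.
cmd = build_compute_msa_command("/path/to/FastaFolder", "/path/to/OutputFolder",
                                dna_msa=True, cpu_cores=4)
print(shlex.join(cmd))
# To execute: subprocess.run(cmd, check=True)
```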

Parameters

-i, --input-path         (Required) Path to a TSV file containing allelic profiles or to a folder containing
                         FASTA files. If a TSV file containing allelic profiles is provided, the
                         path to the schema must also be passed to the `--schema-directory`
                         parameter. The module will create a FASTA file with the alleles
                         identified in the samples for each schema locus and compute an MSA. The
                         loci MSAs are joined to create the complete MSA based on the allele
                         calling results. If a path to a folder is provided, the module computes an
                         MSA for each FASTA file in the folder, but will not attempt to join the
                         MSAs as it does not have the sample information (in this case, it is not
                         necessary to pass the schema path) (default: None).

-o, --output-directory   (Required) Path to the output directory where the process will store intermediate
                         and final results (default: None).

-g, --schema-directory   (Optional) Path to the schema's directory. This parameter is only required if the
                         input is a TSV file with allelic profiles (default: None).

--dna-msa                (Optional) Converts the protein MSA back to DNA to create an additional output file
                         with the DNA MSA (default: False).

--output-variable        (Optional) Output a reduced MSA including only the variable positions. If the
                         `--dna-msa` parameter is provided, the process will output a reduced MSA
                         for both the protein and DNA MSAs (default: False).

--t, --translation-table (Optional) Genetic code used for sequence translation (default: 11).

--cpu, --cpu-cores       (Optional) Number of CPU cores/threads that will be used to run the process (chewie
                         resets to a lower value if it is equal to or exceeds the total number of
                         available CPU cores/threads) (default: 1).

--only-loci-msas         (Optional) Do not compute the full MSA when the input file is a TSV file containing
                         allelic profiles (this is already the default when the input is a path to
                         a folder with FASTA files) (default: False).

--gaps                   (Optional) How to treat gaps when determining the reduced MSA for the variable
                         positions. The default value, "exclude", removes variable positions if
                         any of the aligned sequences contains a gap. The "ignore" option keeps
                         variable positions that include gaps in some sequences as long
                         as other sequences include variable non-gap characters. The character
                         used to represent gaps is "-" (default: exclude).

--ambiguous              (Optional) How to treat ambiguous amino acids or nucleotides when determining the
                         reduced MSA for the variable positions. The default value, "exclude",
                         removes variable positions if any of the aligned sequences contains an
                         ambiguous amino acid or nucleotide. The "ignore" option keeps
                         variable positions that include ambiguous amino acids or
                         nucleotides in some sequences as long as other sequences include variable
                         non-ambiguous characters. The characters interpreted as ambiguous amino acids
                         are [B, Z, X, J]. The characters interpreted as ambiguous nucleotides are
                         [R, Y, S, W, K, M, B, D, H, V, N] (default: exclude).

--custom-mafft-params    (Optional) Custom parameters to pass to MAFFT when computing the loci MSAs. The
                         value must be a single string with all parameters enclosed in quotes
                         (e.g. "--retree 1 --maxiterate 0") (default: None).

--protein-input          (Optional) Input files contain protein sequences. This option is only valid for
                         cases when users provide a path to a directory containing FASTA files (default: False).

--no-cleanup             (Optional) Keep intermediate files with locus/file MSAs and sample MSAs if input is
                         a TSV file containing allelic profiles (default: False).
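
The interaction between --output-variable, --gaps and --ambiguous can be illustrated with a small Python sketch. It follows the behaviour described in the parameter list above but is not taken from the chewBBACA source: the function names, the assumption of uppercase sequences, and the toy alignment are all hypothetical.

```python
# Ambiguous character sets as listed in the --ambiguous description.
AMBIGUOUS_DNA = set("RYSWKMBDHVN")
AMBIGUOUS_PROTEIN = set("BZXJ")


def variable_positions(seqs, gaps="exclude", ambiguous="exclude",
                       ambiguous_chars=AMBIGUOUS_DNA, gap_char="-"):
    """Return the 0-based indices of variable columns in an alignment."""
    positions = []
    for index, column in enumerate(zip(*seqs)):
        chars = set(column)
        if gap_char in chars:
            if gaps == "exclude":
                continue  # any gap disqualifies the column
            chars.discard(gap_char)  # "ignore": judge only the non-gap characters
        if chars & ambiguous_chars:
            if ambiguous == "exclude":
                continue  # any ambiguous character disqualifies the column
            chars -= ambiguous_chars  # "ignore": judge only unambiguous characters
        if len(chars) > 1:
            positions.append(index)
    return positions


def reduce_msa(seqs, positions):
    """Keep only the selected columns of each aligned sequence."""
    return ["".join(seq[i] for i in positions) for seq in seqs]


msa = ["ACGTA", "AC-TN", "ATCTG"]  # toy DNA alignment
strict = variable_positions(msa)  # defaults: exclude/exclude
relaxed = variable_positions(msa, gaps="ignore", ambiguous="ignore")
print(strict)                    # [1]
print(relaxed)                   # [1, 2, 4]
print(reduce_msa(msa, relaxed))  # ['CGA', 'C-N', 'TCG']
```

In this toy alignment, column 2 contains a gap and column 4 contains N, so both are dropped under the defaults and kept under "ignore" because the remaining characters still differ. For protein alignments, AMBIGUOUS_PROTEIN would be passed as ambiguous_chars.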

Outputs

When the input is a TSV file containing allelic profiles, the output folder structure is as follows:

OutputFolderName
├── dna_msa.fasta (if --dna-msa is used)
├── dna_msa_variable.fasta (if --output-variable and --dna-msa are used)
├── protein_msa.fasta
├── protein_msa_variable.fasta (if --output-variable is used)
└── summary_stats.tsv

The dna_msa.fasta file contains the complete MSA at the DNA level based on the allele calling results (only if the --dna-msa option is provided).
The dna_msa_variable.fasta file contains the reduced MSA at the DNA level including only the variable positions (only if both the --output-variable and --dna-msa options are provided).
The protein_msa.fasta file contains the complete MSA at the protein level based on the allele calling results.
The protein_msa_variable.fasta file contains the reduced MSA at the protein level including only the variable positions (only if the --output-variable option is provided).
The summary_stats.tsv file contains, for each locus, the number of alleles in the schema, the number of samples in which the locus was identified, and the number of distinct alleles identified in the dataset.
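
The relationship between the protein and DNA MSAs can be sketched as threading each original coding sequence onto its aligned protein sequence: every aligned residue is replaced by its codon and every gap by three gap characters. The snippet below is an illustrative assumption about how such a conversion can work, not chewBBACA's code; the function name and the example sequences are hypothetical, and any trailing codon (such as a stop codon) is simply left out here.

```python
def back_translate_alignment(protein_aln, dna_seq, gap="-"):
    """Convert one aligned protein sequence back to an aligned DNA sequence."""
    residues = sum(1 for aa in protein_aln if aa != gap)
    if len(dna_seq) < 3 * residues:
        raise ValueError("DNA sequence is too short for the aligned protein")
    codons = []
    pos = 0  # position of the next unused codon in the DNA sequence
    for aa in protein_aln:
        if aa == gap:
            codons.append(gap * 3)  # one protein gap spans three DNA columns
        else:
            codons.append(dna_seq[pos:pos + 3])
            pos += 3
    return "".join(codons)


aligned_protein = "MK-L"
coding_sequence = "ATGAAACTGTAA"  # M, K, L followed by a stop codon
dna_aln = back_translate_alignment(aligned_protein, coding_sequence)
print(dna_aln)  # ATGAAA---CTG
```

A DNA MSA built this way is three times as wide as the protein MSA and keeps codons intact, which is why gaps always appear in multiples of three.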

If the --only-loci-msas option is used, the process exits after computing the loci MSAs. In that case, the output folder includes the summary_stats.tsv file and a folder named MSAs containing the loci MSAs at the protein level and, optionally, the DNA MSAs (if the --dna-msa option is used) and the variable-position MSAs (if the --output-variable option is used). If the --no-cleanup option is provided, the intermediate files are not deleted. Depending on the options selected, the intermediate files may include the loci FASTA files, the translated loci FASTA files, the loci MSAs, and the sample MSAs.

When the input is a folder containing FASTA files, the output folder structure is as follows:

OutputFolderName
└── MSAs
    ├── dna (if --dna-msa is used)
    |   ├── file1.fasta
    |   ├── ...
    |   └── fileN.fasta
    ├── protein
    |   ├── file1_protein.fasta
    |   ├── ...
    |   └── fileN_protein.fasta
    ├── variable_dna (if --output-variable and --dna-msa are used)
    |   ├── file1.fasta
    |   ├── ...
    |   └── fileN.fasta
    └── variable_protein (if --output-variable is used)
        ├── file1_protein.fasta
        ├── ...
        └── fileN_protein.fasta

The MSAs/dna/ folder contains the DNA MSAs for each FASTA file in the input folder (only if the --dna-msa option is provided).
The MSAs/protein/ folder contains the protein MSAs for each FASTA file in the input folder.
The MSAs/variable_dna/ folder contains the reduced DNA MSAs including only the variable positions for each FASTA file in the input folder (only if both the --output-variable and --dna-msa options are provided).
The MSAs/variable_protein/ folder contains the reduced protein MSAs including only the variable positions for each FASTA file in the input folder (only if the --output-variable option is provided).

Each MSA file is named after the corresponding input FASTA file. When the input is a folder containing FASTA files, no summary_stats.tsv file is generated.