ComputeMSA - Compute multiple sequence alignments (MSAs)
========================================================

The *ComputeMSA* module computes MSAs based on allelic profiles generated by the
*AlleleCall* module or on FASTA files provided by the user. The module includes the
following functionalities:

- Compute loci, sample and complete MSAs based on the allelic profiles determined by
  chewBBACA (e.g. at the wg/cgMLST level). Gap sequences (the character used to represent
  gaps is ``-``) are added whenever a locus was not identified in a sample (e.g. when
  working at the wgMLST level).
- Compute an MSA for each FASTA file in a folder (in this mode, the module is simply a
  convenient way to run MAFFT on a set of FASTA files).
- MSAs can be computed both at the protein and DNA level (DNA MSAs are obtained by
  converting the protein MSAs back to DNA).
- The ``--output-variable`` option identifies the variable positions (SNVs) and creates
  MSAs that include only those positions. When determining variable positions, positions
  with gaps or ambiguous characters can be excluded (``--gaps exclude`` and
  ``--ambiguous exclude``) or kept (``--gaps ignore`` and ``--ambiguous ignore``). In the
  latter case, a position is only kept if the remaining sequences contain different
  non-gap and non-ambiguous nucleotides or amino acids at that position.
- The :doc:`SchemaEvaluator` and :doc:`AlleleCallEvaluator` modules use the ComputeMSA
  module to compute the loci MSAs (SchemaEvaluator) and the complete MSA used by FastTree
  to compute a tree (AlleleCallEvaluator).

Basic Usage
-----------

::

    chewBBACA.py ComputeMSA -i /path/to/AlleleCallResultsFolder/results_alleles.tsv -g /path/to/SchemaFolder -o /path/to/OutputFolder

Parameters
----------

::

    -i, --input-path            (Required) Path to a TSV file containing allelic profiles or
                                to a folder containing FASTA files. If a TSV file containing
                                allelic profiles is provided, the path to the schema must be
                                passed to the `--schema-directory` parameter. The module will
                                create a FASTA file with the alleles identified in the samples
                                for each schema locus and compute an MSA. The loci MSAs are
                                joined to create the complete MSA based on the allele calling
                                results. If a path to a folder is provided, the module computes
                                an MSA for each FASTA file in the folder, but will not attempt
                                to join the MSAs as it does not have the sample information
                                (in this case, it is not necessary to pass the schema path)
                                (default: None).

    -o, --output-directory      (Required) Path to the output directory where the process will
                                store intermediate and final results (default: None).

    -g, --schema-directory      (Optional) Path to the schema's directory. This parameter is
                                only required if the input is a TSV file with allelic profiles
                                (default: None).

    --dna-msa                   (Optional) Converts the protein MSA back to DNA to create an
                                additional output file with the DNA MSA (default: False).

    --output-variable           (Optional) Output a reduced MSA including only the variable
                                positions. If the `--dna-msa` parameter is provided, the
                                process will output a reduced MSA for both the protein and
                                DNA MSAs (default: False).

    --t, --translation-table    (Optional) Genetic code used for sequence translation
                                (default: 11).

    --cpu, --cpu-cores          (Optional) Number of CPU cores/threads that will be used to
                                run the process (chewie resets to a lower value if it is equal
                                to or exceeds the total number of available CPU cores/threads)
                                (default: 1).

    --only-loci-msas            (Optional) Do not compute the full MSA when the input file is
                                a TSV file containing allelic profiles (this is already the
                                default when the input is a path to a folder with FASTA files)
                                (default: False).

    --gaps                      (Optional) How to treat gaps when determining the reduced MSA
                                for the variable positions. The default value, "exclude",
                                removes variable positions if any of the aligned sequences
                                contains a gap. The "ignore" option keeps variable positions
                                that include gaps in some sequences, as long as the other
                                sequences include different non-gap characters. The character
                                used to represent gaps is "-" (default: exclude).

    --ambiguous                 (Optional) How to treat ambiguous amino acids or nucleotides
                                when determining the reduced MSA for the variable positions.
                                The default value, "exclude", removes variable positions if
                                any of the aligned sequences contains an ambiguous amino acid
                                or nucleotide. The "ignore" option keeps variable positions
                                that include ambiguous amino acids or nucleotides in some
                                sequences, as long as the other sequences include different
                                non-ambiguous characters. The characters interpreted as
                                ambiguous amino acids are [B, Z, X, J]. The characters
                                interpreted as ambiguous nucleotides are
                                [R, Y, S, W, K, M, B, D, H, V, N] (default: exclude).

    --custom-mafft-params       (Optional) Custom parameters to pass to MAFFT when computing
                                the loci MSAs. The value must be a single string with all
                                parameters enclosed in quotes
                                (e.g. "--retree 1 --maxiterate 0") (default: None).

    --protein-input             (Optional) Input files contain protein sequences. This option
                                is only valid when users provide a path to a directory
                                containing FASTA files (default: False).

    --no-cleanup                (Optional) Keep intermediate files with locus/file MSAs and
                                sample MSAs if input is a TSV file containing allelic profiles
                                (default: False).

Outputs
:::::::

When the input is a TSV file containing allelic profiles, the output folder structure is as follows:

::

    OutputFolderName
    ├── dna_msa.fasta (if --dna-msa is used)
    ├── dna_msa_variable.fasta (if --output-variable and --dna-msa are used)
    ├── protein_msa.fasta
    ├── protein_msa_variable.fasta (if --output-variable is used)
    └── summary_stats.tsv

- The ``dna_msa.fasta`` file contains the complete MSA at the DNA level based on the
  allele calling results (only if the ``--dna-msa`` option is provided).
- The ``dna_msa_variable.fasta`` file contains the reduced MSA at the DNA level including
  only the variable positions (only if both the ``--output-variable`` and ``--dna-msa``
  options are provided).
- The ``protein_msa.fasta`` file contains the complete MSA at the protein level based on
  the allele calling results.
- The ``protein_msa_variable.fasta`` file contains the reduced MSA at the protein level
  including only the variable positions (only if the ``--output-variable`` option is
  provided).
- The ``summary_stats.tsv`` file contains, for each locus, the number of alleles in the
  schema, the number of samples in which the locus was identified, and the number of
  distinct alleles identified in the dataset.

If the ``--only-loci-msas`` option is used, the process will exit after computing the loci
MSAs. In that case, the output folder will include the ``summary_stats.tsv`` file and a
folder named ``MSAs`` containing the loci MSAs at the protein level and, optionally, the
DNA MSAs (if the ``--dna-msa`` option is used) and the variable positions MSAs (if the
``--output-variable`` option is used).

If the ``--no-cleanup`` option is provided, the intermediate files will not be deleted.
Depending on the options selected by the user, the intermediate files may include the loci
FASTA files, loci translated FASTA files, loci MSAs, and sample MSAs.

When the input is a folder containing FASTA files, the output folder structure is as follows:

::

    OutputFolderName
    └── MSAs
        ├── dna (if --dna-msa is used)
        │   ├── file1.fasta
        │   ├── ...
        │   └── fileN.fasta
        ├── protein
        │   ├── file1_protein.fasta
        │   ├── ...
        │   └── fileN_protein.fasta
        ├── variable_dna (if --output-variable and --dna-msa are used)
        │   ├── file1.fasta
        │   ├── ...
        │   └── fileN.fasta
        └── variable_protein (if --output-variable is used)
            ├── file1_protein.fasta
            ├── ...
            └── fileN_protein.fasta

- The ``MSAs/dna/`` folder contains the DNA MSAs for each FASTA file in the input folder
  (only if the ``--dna-msa`` option is provided).
- The ``MSAs/protein/`` folder contains the protein MSAs for each FASTA file in the input
  folder.
- The ``MSAs/variable_dna/`` folder contains the reduced DNA MSAs including only the
  variable positions for each FASTA file in the input folder (only if both the
  ``--output-variable`` and ``--dna-msa`` options are provided).
- The ``MSAs/variable_protein/`` folder contains the reduced protein MSAs including only
  the variable positions for each FASTA file in the input folder (only if the
  ``--output-variable`` option is provided).

Each MSA file is named after the corresponding input FASTA file. When the input is a
folder containing FASTA files, no ``summary_stats.tsv`` file is generated.
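
The effect of the ``--gaps`` and ``--ambiguous`` options on the reduced MSAs can be
illustrated with a small Python sketch. This is not chewBBACA's implementation; it is a
minimal reading of the option descriptions in the Parameters section, and the function
names (``variable_positions``, ``reduce_msa``) and the example alignment are hypothetical.
It assumes uppercase aligned sequences of equal length.

```python
GAP = "-"
# Ambiguity codes as listed in the --ambiguous parameter description.
AMBIGUOUS_DNA = frozenset("RYSWKMBDHVN")
AMBIGUOUS_PROTEIN = frozenset("BZXJ")


def variable_positions(seqs, gaps="exclude", ambiguous="exclude",
                       ambiguous_chars=AMBIGUOUS_DNA):
    """Return the 0-based indices of the variable columns of an alignment.

    Hypothetical helper: "exclude" drops any column containing a gap or an
    ambiguous character; "ignore" keeps such a column when the remaining
    characters in that column still differ from each other.
    """
    if len({len(s) for s in seqs}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    kept = []
    for index, column in enumerate(zip(*seqs)):
        chars = set(column)
        if gaps == "exclude" and GAP in chars:
            continue
        if ambiguous == "exclude" and chars & ambiguous_chars:
            continue
        # Only unambiguous, non-gap characters count towards variability.
        informative = chars - {GAP} - ambiguous_chars
        if len(informative) > 1:
            kept.append(index)
    return kept


def reduce_msa(records, **options):
    """Keep only the variable columns of a {sample_id: aligned_sequence} dict."""
    positions = variable_positions(list(records.values()), **options)
    return {name: "".join(seq[i] for i in positions)
            for name, seq in records.items()}


# Made-up DNA alignment: column 3 contains a gap, columns 2 and 5 contain
# ambiguous bases, and columns 1 and 4 are variable without gaps or ambiguity.
example_msa = {
    "s1": "ATGCAR",
    "s2": "ATG-AA",
    "s3": "ACGTAG",
    "s4": "ACNTTA",
}

print(variable_positions(list(example_msa.values())))
# [1, 4]
print(variable_positions(list(example_msa.values()), gaps="ignore"))
# [1, 3, 4]
print(reduce_msa(example_msa, gaps="ignore", ambiguous="ignore"))
# {'s1': 'TCAR', 's2': 'T-AA', 's3': 'CTAG', 's4': 'CTTA'}
```

In this sketch, column 2 is never reported as variable: once the ambiguous ``N`` is set
aside, all remaining sequences share the same base. For protein alignments, pass
``ambiguous_chars=AMBIGUOUS_PROTEIN``.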