ExtractCgMLST - Determine the set of loci that constitute the core genome
Requirements to define a core genome MLST (cgMLST) schema
A cgMLST schema is defined as the set of loci that are present in all strains under analysis or, to account for sequencing/assembly limitations, in >95% of the strains analyzed. To define a robust cgMLST schema for a given bacterial species, a set of strains that is representative of the diversity of that species should be selected. Furthermore, since cgMLST schema definition is based on pre-defined thresholds, the schema can only be considered stable once a sufficient number of strains have been analyzed. This number will always depend on the population structure and diversity of the species in question: non-recombinant monomorphic species may require a smaller number of strains to define a cgMLST schema than panmictic, highly recombinogenic species that are prone to have large numbers of accessory genes and mobile genetic elements. The strategy described here can also be used to define lineage-specific schemas for more detailed analysis within a given bacterial lineage. By definition, all loci that are not considered part of the core genome can be classified as part of an accessory genome MLST (agMLST) schema.
Determine the loci that constitute the cgMLST
Determine the set of loci that constitute the core genome based on loci presence thresholds.
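The threshold-based selection can be illustrated with a minimal sketch. This is not the module's implementation: the function name core_loci, the dictionary layout and the toy data are assumptions made for illustration only. A locus is kept when the proportion of genomes in which it is present is at least equal to the threshold.

```python
def core_loci(presence, threshold):
    """Return the loci present in at least `threshold` (a proportion) of genomes.

    `presence` maps genome ID -> {locus ID: 1 (present) or 0 (absent)}.
    Hypothetical helper, not part of chewBBACA.
    """
    genomes = list(presence)
    loci = list(presence[genomes[0]])
    selected = []
    for locus in loci:
        present = sum(presence[genome][locus] for genome in genomes)
        if present / len(genomes) >= threshold:
            selected.append(locus)
    return selected


# Toy presence/absence matrix (made-up data): 4 genomes, 3 loci.
toy = {
    "genomeA": {"locus1": 1, "locus2": 1, "locus3": 0},
    "genomeB": {"locus1": 1, "locus2": 1, "locus3": 1},
    "genomeC": {"locus1": 1, "locus2": 0, "locus3": 0},
    "genomeD": {"locus1": 1, "locus2": 1, "locus3": 1},
}

for t in (0.5, 0.75, 1):
    print(t, core_loci(toy, t))
```

In this toy example, locus1 is present in all genomes, locus2 in 3 of 4 and locus3 in 2 of 4, so raising the threshold from 0.5 to 1 shrinks the selected set from three loci to one.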
Basic Usage
chewBBACA.py ExtractCgMLST -i /path/to/AlleleCallResultsFolder/results_alleles.tsv -o /path/to/OutputFolder
Parameters
-i, --input-file (Required) Path to the TSV file that contains the allelic profiles determined
by the AlleleCall module.
-o, --output-directory (Required) Path to the directory where the process will store the output
files.
--t, --threshold (Optional) Genes that constitute the core genome must be in a proportion
of genomes that is at least equal to this value. Users can provide multiple
values to compute the core genome for multiple threshold values (default:
[0.95, 0.99, 1]).
--s, --step (Optional) The allele calling results are processed iteratively to evaluate
the impact of adding subsets of the results in computing the core genome. The
step value controls the number of profiles added in each iteration until all
profiles are included (default: 1).
--r, --genes2remove (Optional) Path to a file with a list of gene IDs to exclude from the
analysis (one gene identifier per line) (default: False).
--g, --genomes2remove (Optional) Path to a file with a list of genome IDs to exclude from
the analysis (one genome identifier per line) (default: False).
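As a rough illustration of how the threshold and step values interact, the sketch below recomputes the number of core loci as profiles are added in batches. It is only a sketch under stated assumptions: the function name core_size_by_step is hypothetical, profiles are assumed to be added in the order they are listed, and the toy matrix is made up.

```python
def core_size_by_step(presence_rows, threshold, step=1):
    """Return (profiles included, number of core loci) pairs.

    `presence_rows` is a list of per-genome rows of 1/0 values,
    one column per locus. Hypothetical helper, not part of chewBBACA.
    """
    sizes = []
    total = len(presence_rows)
    n = step
    while True:
        n = min(n, total)  # the last batch may be smaller than `step`
        subset = presence_rows[:n]
        # Number of genomes in the subset that carry each locus.
        counts = [sum(column) for column in zip(*subset)]
        core = sum(1 for count in counts if count / n >= threshold)
        sizes.append((n, core))
        if n == total:
            break
        n += step
    return sizes


# Toy presence/absence rows (made-up data): 5 genomes, 3 loci.
rows = [
    [1, 1, 0],
    [1, 1, 1],
    [1, 0, 0],
    [1, 1, 1],
    [1, 1, 1],
]

print(core_size_by_step(rows, threshold=1, step=2))
print(core_size_by_step(rows, threshold=0.75, step=2))
```

With a step of 2, the core genome is evaluated after 2, 4 and finally all 5 profiles. A larger step reduces the number of evaluations at the cost of a coarser view of how the core genome size changes.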
Outputs
The output folder contains the following files:
Presence_Abscence.tsv - allele presence and absence matrix (1 or 0, respectively) for all the loci found in the -i file (excluding the loci and genomes that were flagged to be excluded).
mdata_stats.tsv - total number and percentage of loci missing from each genome.
cgMLST<threshold>.tsv - a file for each specified threshold that contains the matrix with the allelic profiles for the cgMLST (already excluding the list of loci and list of genomes passed to the --r and --g parameters, respectively).
cgMLSTschema<threshold>.txt - a file for each specified threshold that contains the list of loci that constitute the cgMLST schema.
cgMLST.html - HTML file with a line plot for the number of loci in the cgMLST per threshold. Also includes a black line with the number of loci present in each genome that is added to the analysis.
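The relationship between the presence/absence matrix and the missing data statistics can be sketched as follows. This is an illustrative assumption about how such values could be derived, not the module's code: the function name missing_stats and the toy data are hypothetical, and the exact columns and number formatting of the real file may differ.

```python
def missing_stats(presence):
    """Return {genome ID: (loci missing, percentage of loci missing)}.

    `presence` maps genome ID -> {locus ID: 1 (present) or 0 (absent)}.
    Hypothetical helper, not part of chewBBACA.
    """
    stats = {}
    for genome, row in presence.items():
        missing = sum(1 for value in row.values() if value == 0)
        stats[genome] = (missing, 100 * missing / len(row))
    return stats


# Toy presence/absence matrix (made-up data): 2 genomes, 4 loci.
toy_presence = {
    "genomeA": {"locus1": 1, "locus2": 1, "locus3": 0, "locus4": 1},
    "genomeB": {"locus1": 1, "locus2": 0, "locus3": 0, "locus4": 1},
}

for genome, (missing, percentage) in missing_stats(toy_presence).items():
    print(genome, missing, percentage)
```

Genomes with a high percentage of missing loci are natural candidates for the list of genomes to exclude from the analysis.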
Important
The ExtractCgMLST module converts/masks all non-integer classifications in the profile matrix to 0 and removes all the INF- prefixes.
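The masking step can be pictured with a small sketch. The function names and the sample profile are assumptions for illustration, and the sketch is not the module's code; it only mirrors the behaviour described above: INF- prefixes are stripped and anything that is not an integer allele identifier becomes 0.

```python
def mask_classification(value):
    """Return an integer allele ID, or 0 for a non-integer classification.

    Hypothetical helper, not part of chewBBACA.
    """
    value = value.strip()
    if value.startswith("INF-"):
        value = value[len("INF-"):]  # keep the allele ID that follows the prefix
    return int(value) if value.isdecimal() else 0


def mask_profile(row):
    """Mask every classification in one allelic profile."""
    return [mask_classification(value) for value in row]


# Illustrative profile mixing allele IDs, an inferred allele and
# non-integer classifications.
profile = ["INF-12", "3", "LNF", "NIPH", "ASM", "7"]
print(mask_profile(profile))  # [12, 3, 0, 0, 0, 7]
```

After masking, a value of 0 marks a locus as absent from a genome and any other value marks it as present.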
Example of the plot created by the ExtractCgMLST module based on the allelic profiles for 680 Streptococcus agalactiae genomes:
Important
The cgMLSTschema<threshold>.txt file can be passed to the --gl parameter of the AlleleCall module to perform allele calling only for the loci in the cgMLST schema.
Note
The matrix with allelic profiles created by the ExtractCgMLST process can be imported into PHYLOViZ to visualize and explore typing results.