Step-by-step Tutorial
Objective
The objective of this tutorial is to illustrate a complete wg/cgMLST analysis using chewBBACA. The tutorial includes step-by-step instructions to create a schema from scratch for Streptococcus agalactiae, how to use the created schema to perform allele calling to classify strains of interest and add newly identified alleles to the schema, and how to use the functionalities provided by chewBBACA to evaluate the schema and allele calling results. The analysis described in this tutorial is divided into the following sections:
Creating a schema seed: creation of a wgMLST schema seed based on a set of 32 Streptococcus agalactiae complete genomes.
Determining allelic profiles and adding new alleles to the schema through allele calling: allele calling for the 32 complete genomes to identify the loci and alleles for each strain, as well as expanding the schema with newly inferred alleles.
Determining the core genome: identification of the core loci based on the allele calling results for the 32 complete genomes to define a subschema for cgMLST.
Allele calling with additional genomes: allele calling for 680 draft genomes at the cgMLST level.
Annotating the schema loci: annotation of the loci included in the schema.
Generating an interactive report for schema evaluation: generate an interactive report to explore the diversity of the schema loci.
Generating an interactive report to evaluate allele calling results: generate an interactive report to evaluate strain similarity based on the allele calling results.
Using other software to analyse the data generated by chewBBACA: examples of additional analyses that can be performed using the data generated by chewBBACA.
Before starting, please make sure to go through the following preparatory steps:
Install chewBBACA. Check Installation for instructions on how to install chewBBACA.
Download the ZIP file with the datasets and expected results for the tutorial here.
Uncompress the ZIP file (this will create a folder named
chewBBACA_tutorialthat has all the necessary data).
The simplified representation of the folder structure after uncompressing the ZIP file is the following:
chewBBACA_tutorial
|── genomes : Folder containing the genomes used in the tutorial
│ ├── sagalactiae_32_complete_genomes.zip : Compressed folder with 32 complete genomes
│ └── sagalactiae_680_draft_genomes.zip : Compressed folder with 680 draft genomes
├── expected_results : Folder containing a subfolder with the expected results for each section of the tutorial
│ ├── Creating_a_schema_seed : Folder with the expected results for the "Creating a schema seed" section
│ └── ... : Folders with the expected results for the other sections
└── Streptococcus_agalactiae.trn : Prodigal training file used for gene prediction
chewBBACA includes Prodigal training files for several species. You can check the list of available training files here. We have included the training file for Streptococcus agalactiae, Streptococcus_agalactiae.trn, in the tutorial data for convenience.
Important
The paths and commands used in this tutorial assume that the working directory is the top-level directory of the folder “chewBBACA_tutorial” created after uncompressing the ZIP file. The commands should be modified if they are executed from a different working directory.
The expected results for each section were included in the
expected_resultsfolder for reference (each subfolder has the name of one of the sections).The execution time for each command will vary depending on the version of chewBBACA that is installed and the specifications of the computer used to perform the analyses. The
--cpuparameter controls the number of CPU cores used for each analysis step and should be adjusted based on the specifications of your machine.
Creating a schema seed
We will start by using chewBBACA’s CreateSchema module to create a wgMLST schema seed based on 32 Streptococcus agalactiae complete genomes. The schema creation process identifies the set of distinct loci in the input genomes and defines them as target loci for a new schema. To create the schema seed, uncompress the genomes/sagalactiae_32_complete_genomes.zip file to create the folder genomes/sagalactiae_32_complete_genomes and run the following command:
chewBBACA.py CreateSchema -i genomes/sagalactiae_32_complete_genomes/ -o tutorial_schema --ptf Streptococcus_agalactiae.trn --cpu 6
- Where:
chewBBACA.py CreateSchemaselects the CreateSchema module.-i genomes/sagalactiae_32_complete_genomes/points to the folder containing the 32 complete genomes.-o tutorial_schemaspecifies the output folder where the created schema seed and other output files will be stored.--ptf Streptococcus_agalactiae.trnpoints to a Prodigal training file used for gene prediction and included in the folder of the newly created schema seed to be used in future analyses.--cpu 6specifies the number of CPU cores to be used for schema creation. Remember to adjust this parameter based on your machine’s specifications.
The CreateSchema module will analyze the genomes to identify the set of distinct loci and create a wgMLST schema seed. The results will be stored in the tutorial_schema folder, which will have the following (simplified) structure:
tutorial_schema
├── schema_seed : Folder containing the FASTA files for the loci included in the schema seed
│ ├── GCA-000007265-protein1.fasta : FASTA file used to store all alleles for locus GCA-000007265-protein1
│ ├── ... : FASTA files for the other loci included in the schema
| └── short : Subfolder containing FASTA files with the representative alleles for each locus
| ├── GCA-000007265-protein1_short.fasta : FASTA file used to store the representative alleles for locus GCA-000007265-protein1
| └── ... : FASTA files used to store the representative alleles for the other loci included in the schema
├── cds_coordinates.tsv : TSV file with the coordinates of all CDSs identified in the input genomes during gene prediction
└── invalid_cds.txt : File listing CDSs that were removed during schema creation due to not meeting quality criteria
The wgMLST schema seed is available at tutorial_schema/schema_seed. The process created a wgMLST schema seed with 3,127 loci. The schema seed includes one main FASTA file for each locus where all the alleles for that locus will be stored and a second FASTA file inside the short subfolder used to store only the alleles selected as representatives. At this point, each locus has a single allele that was also selected as its first representative allele. For example, both FASTA files for locus GCA-000007265-protein1 have the following content:
>GCA-000007265-protein1_1
ATGGTACAATATAACAATAATTATCCACAAGACAATAAGGAAGAAGCTATGACGGAAAACGAACAACTATTTTGGAATAGAGTACTAGAGCTATCTCGTTCTCAAATAGCACCAGCAGCTTATGAATTTTTTGTTCTAGAGGCTAGACTCCTCAAAATTGAACATCAAACTGCAGTTATTACTTTAGATAACATTGAAATGAAAAAGCTATTCTGGGAACAAAATTTGGGGCCTGTTATCCTAACAGCTGGTTTTGAAATTTTCAATGCTGAAATTACAGCTAACTATGTCTCAAACGATTTACATTTACAAGAAACTAGTTTTTCTAACTACCAGCAATCTAGCAATGAAGTAAATACTTTACCAATTAGAAAAATCGACTCTAACCTTAAAGAGAAATATACTTTTGCTAATTTTGTTCAAGGAGATGAAAATAGATGGGCTGTTTCAGCATCAATTGCTGTAGCTGATAGTCCTGGCACGACTTATAATCCTCTATTTATCTGGGGGGGACCTGGTCTAGGAAAGACGCATCTACTAAATGCTATTGGAAATCAAGTCTTAAGAGATAATCCAAACGCGAGGGTTTTATACATCACTGCTGAGAATTTTATTAATGAATTTGTCAGTCATATTCGTTTAGATTCGATGGAAGAATTAAAAGAAAAGTTTCGCAACTTAGACTTACTCCTGATTGATGATATTCAGTCGCTTGCTAAGAAAACCTTAGGGGGGACCCAAGAGGAGTTCTTCAATACTTTCAATGCTTTACATACAAACGATAAACAAATCGTATTGACCAGTGACCGAAATCCAAATCAATTAAATGATCTAGAAGAACGTCTAGTCACGCGCTTTAGTTGGGGACTCCCAGTAAATATCACACCACCTGATTTTGAAACACGAGTTGCTATTTTAACCAATAAAATTCAAGAATATCCTTATGATTTTCCTCAAGATACCATTGAATACTTAGCAGGAGAATTTGATTCCAACGTACGTGAATTAGAAGGAGCCTTGAAAAATATTAGTCTAGTTGCTGACTTTAAGCATGCTAAAACTATTACAGTAGATATAGCTGCAGAAGCTATCAGAGCACGTAAAAATGATGGTCCTATTGTTACTGTCATTCCTATAGAAGAAATTCAAATACAAGTTGGTAAATTCTATGGCGTAACTGTAAAAGAGATAAAAGCAACTAAAAGAACACAAGATATTGTCCTTGCAAGACAGGTAGCCATGTACTTAGCTCGTGAGATGACAGATAACAGTCTCCCAAAAATAGGTAAAGAATTTGGGGGACGAGATCACTCAACTGTTCTCCACGCTTATAATAAAATAAAAAATATGGTTGCTCAAGATGACAACTTACGAATTGAGATAGAAACTATCAAAAATAAAATCAGATAG
Note
The schemas created with chewBBACA’s CreateSchema module are referred to as “schema seeds” because they represent the initial version of a schema, containing only one allele per locus. After the schema seed has been used to perform allele calling for a set of genomes, and has been expanded with newly identified alleles, it is no longer considered a “schema seed”.
In the next section we will explain how to use the created schema seed to perform allele calling for the 32 complete genomes to determine the allelic profile of each genome and to expand the schema with new alleles identified in the genomes.
Determining allelic profiles and adding new alleles to the schema through allele calling
We can use the schema seed created in the previous section to perform alelle calling for the 32 complete genomes. Allele calling is a process that uses a schema to determine the allelic profiles of a set of genomes, identifying the schema loci and corresponding alleles in each genome. New alleles identified in the analyzed genomes are assigned an allele identifier and added to the schema to expand the schema with the allelic diversity identified in the dataset. To perform allele call for the 32 complete genomes with chewBBACA’s AlleleCall module, run the following command:
chewBBACA.py AlleleCall -i genomes/sagalactiae_32_complete_genomes -g tutorial_schema/schema_seed -o results32_wgMLST --cpu 6
- Where:
chewBBACA.py AlleleCallselects the AlleleCall module.-i genomes/sagalactiae_32_complete_genomespoints to the folder containing the 32 complete genomes.-g tutorial_schema/schema_seedpoints to the folder containing the wgMLST schema seed created in the previous section.-o results32_wgMLSTspecifies the output folder where the allele calling results will be stored.--cpu 6specifies the number of CPU cores to be used for schema creation. Remember to adjust this parameter based on your machine’s specifications.
The AlleleCall module will analyze the genomes to identify the loci included in the schema and determine the allelic profile for each genome. New alleles identified in the analyzed genomes will be added to the schema to expand it. The results will be stored in the results32_wgMLST folder, which will have the following (simplified) structure:
results32_wgMLST
├── results_alleles.tsv : TSV file with the allelic profiles for the analyzed genomes
├── loci_summary.tsv : TSV file summarizing the classification counts per locus
├── results_statistics.tsv : TSV file summarizing the classification counts per genome
└── ... : Additional files with relevant results determined by the allele calling process
The page about the AlleleCall module provides a detailed description of all the output files generated by the module and the parameters used to identify and classify alleles. Allele calling identified 14,704 new alleles, including 105 representative alleles, for 2,255 loci and added those alleles to the schema, increasing the number of alleles in the schema from 3,127 to 17,831. In the case of the GCA-000007265-protein1 locus, allele calling identified 10 new alleles (alleles 2 to 11) in addition to the original allele (allele 1) that was already present in the schema seed. The structure of the main FASTA file for that locus after allele calling is the following:
>GCA-000007265-protein1_1
ATGGTACAATATAACAATAATTATCCACAAGACAATAAGGAAGAAGCTATGACGGAAAACGAACAACTATTTTGGAATAGAGTACTAGAGCTATCTCGTTCTCAAATAGCACCAGCAGCTTATGAATTTTTTGTTCTAGAGGCTAGACTCCTCAAAATTGAACATCAAACTGCAGTTATTACTTTAGATAACATTGAAATGAAAAAGCTATTCTGGGAACAAAATTTGGGGCCTGTTATCCTAACAGCTGGTTTTGAAATTTTCAATGCTGAAATTACAGCTAACTATGTCTCAAACGATTTACATTTACAAGAAACTAGTTTTTCTAACTACCAGCAATCTAGCAATGAAGTAAATACTTTACCAATTAGAAAAATCGACTCTAACCTTAAAGAGAAATATACTTTTGCTAATTTTGTTCAAGGAGATGAAAATAGATGGGCTGTTTCAGCATCAATTGCTGTAGCTGATAGTCCTGGCACGACTTATAATCCTCTATTTATCTGGGGGGGACCTGGTCTAGGAAAGACGCATCTACTAAATGCTATTGGAAATCAAGTCTTAAGAGATAATCCAAACGCGAGGGTTTTATACATCACTGCTGAGAATTTTATTAATGAATTTGTCAGTCATATTCGTTTAGATTCGATGGAAGAATTAAAAGAAAAGTTTCGCAACTTAGACTTACTCCTGATTGATGATATTCAGTCGCTTGCTAAGAAAACCTTAGGGGGGACCCAAGAGGAGTTCTTCAATACTTTCAATGCTTTACATACAAACGATAAACAAATCGTATTGACCAGTGACCGAAATCCAAATCAATTAAATGATCTAGAAGAACGTCTAGTCACGCGCTTTAGTTGGGGACTCCCAGTAAATATCACACCACCTGATTTTGAAACACGAGTTGCTATTTTAACCAATAAAATTCAAGAATATCCTTATGATTTTCCTCAAGATACCATTGAATACTTAGCAGGAGAATTTGATTCCAACGTACGTGAATTAGAAGGAGCCTTGAAAAATATTAGTCTAGTTGCTGACTTTAAGCATGCTAAAACTATTACAGTAGATATAGCTGCAGAAGCTATCAGAGCACGTAAAAATGATGGTCCTATTGTTACTGTCATTCCTATAGAAGAAATTCAAATACAAGTTGGTAAATTCTATGGCGTAACTGTAAAAGAGATAAAAGCAACTAAAAGAACACAAGATATTGTCCTTGCAAGACAGGTAGCCATGTACTTAGCTCGTGAGATGACAGATAACAGTCTCCCAAAAATAGGTAAAGAATTTGGGGGACGAGATCACTCAACTGTTCTCCACGCTTATAATAAAATAAAAAATATGGTTGCTCAAGATGACAACTTACGAATTGAGATAGAAACTATCAAAAATAAAATCAGATAG
>GCA-000007265-protein1_2
ATGGTACAATATAACAATAATTATCCACAAGACAATAAGGAAGAAGCTATGACGGAAAACGAACAACTATTTTGGAATAGAGTACTAGAGCTATCTCGTTCTCAAATAGCACCAGCAGCTTATGAATTTTTTGTTCTAGAGGCTAGACTCCTCAAAATTGAACATCAAACTGCAGTTATTACTTTAGATAACATTGAAATGAAAAAGCTATTCTGGGAACAAAATTTGGGGCCTGTTATCCTAACAGCTGGTTTTGAAATTTTCAATGCTGAAATTACAGCTAACTATGTCTCAAACGATTTACATTTACAAGAAACTAGTTTTTCTAACTACCAGCAATCTAGCAATGAAGTAAATACTTTACCAATTAGAAAAATCGACTCTAACCTTAAAGAGAAATATACTTTTGCTAATTTTGTTCAAGGAGATGAAAATAGATGGGCTGTTTCAGCATCAATTGCTGTAGCTGATAGTCCTGGCACGACTTATAATCCTCTATTTATCTGGGGGGGACCTGGTCTAGGAAAGACGCATCTACTAAATGCTATTGGAAATCAAGTTTTAAGAGATAATCCAAACGCGAGGGTTTTATACATCACTGCTGAGAATTTTATTAATGAATTTGTCAGTCATATTCGTTTAGATTCGATGGAAGAATTAAAAGAAAAGTTTCGCAACTTGGACTTACTCCTGATTGATGATATTCAGTCGCTTGCTAAGAAAACCTTAGGGGGGACCCAAGAGGAGTTCTTCAATACTTTCAATGCTTTACATACAAACGATAAACAAATCGTATTGACCAGTGACCGAAATCCAAATCAATTAAATGATCTAGAAGAACGTCTAGTCACGCGCTTTAGTTGGGGACTCCCAGTAAATATCACACCACCTGATTTTGAAACACGAGTTGCTATTTTAACCAATAAAATTCAAGAATATCCTTATGATTTTCCTCAAGATACCATTGAATACTTAGCAGGAGAATTTGATTCCAACGTACGTGAATTAGAAGGAGCCTTGAAAAATATTAGTCTAGTTGCTGACTTTAAGCATGCTAAAACTATTACAGTAGATATAGCTGCAGAAGCTATCAGAGCACGTAAAAATGATGGTCCTATTGTTACTGTCATTCCTATAGAAGAAATTCAAATACAAGTTGGTAAATTCTATGGCGTAACTGTAAAAGAGATAAAAGCAACTAAAAGAACACAAGATATTGTCCTTGCAAGACAGGTAGCCATGTACTTAGCTCGTGAGATGACAGATAACAGTCTCCCAAAAATAGGTAAAGAATTTGGGGGACGAGATCACTCAACTGTTCTCCACGCTTATAATAAAATAAAAAATATGGTTGCTCAAGATGACAACTTACGAATTGAGATAGAAACTATCAAAAATAAAATCAGATAG
>GCA-000007265-protein1_3
ATGGTACAATATAACAATAATTATCCACAAGACAATAAGGAAGAAGCTATGACGGAAAACGAACAACTATTTTGGAATAGAGTACTAGAGCTATCTCGTTCTCAAATAGCACCAGCAGCTTATGAATTTTTTGTTCTAGAGGCTAGACTCCTCAAAATTGAACATCAAACTGCAGTTATTACTTTAGATAACATTGAAATGAAAAAGCTATTCTGGGAACAAAATTTGGGGCCTGTTATCCTAACAGCTGGTTTTGAAATTTTCAATGCTGAAATTACAGCTAACTATGTCTCAAACGATTTACATTTACAAGAAACTAGTTTTTCTAACTACCAGCAATCTAGCAATGAAGTAAATACTTTACCAATTAGAAAAATCGACTCTAACCTTAAAGAGAAATATACTTTTGCTAATTTTGTTCAAGGAGATGAAAATAGATGGGCTGTTTCAGCATCAATTGCTGTAGCTGATAGTCCTGGCACGACTTATAATCCTCTATTTATCTGGGGGGGACCTGGTCTAGGAAAGACGCATCTACTAAATGCTATTGGAAATCAAGTTTTAAGAGATAATCCAAACGCGAGGGTTCTATACATCACTGCTGAGAATTTTATTAATGAATTTGTCAGTCATATTCGTTTAGATTCGATGGAAGAATTAAAAGAAAAGTTTCGCAACTTGGACTTACTCCTGATTGATGATATTCAGTCGCTTGCTAAGAAAACCTTAGGGGGGACCCAAGAGGAGTTCTTCAATACTTTCAATGCTTTACATACAAACGATAAACAAATCGTATTGACCAGTGACCGAAATCCAAATCAATTAAATGATCTAGAAGAACGTCTAGTCACGCGCTTTAGTTGGGGACTCCCAGTAAATATCACACCACCTGATTTTGAAACACGAGTTGCTATTTTAACCAATAAAATTCAAGAATATCCTTATGATTTTCCTCAAGATACCATTGAATACTTAGCAGGAGAATTTGATTCCAACGTACGTGAATTAGAAGGAGCCTTGAAAAATATTAGTCTAGTTGCTGACTTTAAGCATGCTAAAACTATTACAGTAGATATAGCTGCAGAAGCTATCAGAGCACGTAAAAATGACGGTCCTATTGTTACTGTTATTCCTATAGAAGAAATTCAAATACAAGTTGGTAAATTCTATGGCGTAACTGTAAAAGAGATAAAAGCAACTAAAAGAACACAAGATATTGTCCTTGCAAGACAGGTAGCCATGTACTTAGCTCGTGAGATGACAGATAACAGTCTCCCCAAAATAGGTAAAGAATTTGGGGGACGAGATCACTCAACTGTTCTCCACGCTTATAATAAAATAAAAAATATGGTTGCTCAAGATGACAACTTACGAATTGAGATAGAAACTATCAAAAATAAAATCAGGTAG
>GCA-000007265-protein1_4
ATGGTACAATATAACAATAATTATCCACAAGACAATAAGGAAGAAGCTATGACGGAAAACGAACAACTATTTTGGAATAGAGTACTAGAGCTATCTCGTTCTCAAATAGCACCAGCAGCTTATGAATTTTTTGTTCTAGAGGCTAGACTCCTCAAAATTGAACATCAAACTGCAGTTATTACTTTAGATAACATTGAAATGAAAAAGCTATTCTGGGAACAAAATTTGGGGCCTGTTATCCTAACAGCTGGTTTTGAAATTTTCAATGCTGAAATTACAGCTAACTATGTCTCAAACGATTTACATTTACAAGAAACTAGTTTTTCTAACTACCAGCAATCTAGCAATGAAGTAAATACTTTACCAATTAGAAAAATCGACTCTAATCTTAAAGAGAAATATACTTTTGCTAATTTTGTTCAAGGAGATGAAAATAGATGGGCTGTTTCAGCATCAATTGCTGTAGCTGATAGTCCTGGCACGACTTATAATCCTCTATTTATCTGGGGGGGACCTGGTCTAGGAAAGACGCATCTACTAAATGCTATTGGAAATCAAGTCTTAAGAGATAATCCAAACGCGAGGGTTTTATACATCACTGCTGAGAATTTTATTAATGAATTCGTCAGTCATATTCGTTTAGATTCGATGGAAGAATTAAAAGAAAAGTTTCGCAACTTGGACTTACTCCTGATTGATGATATTCAGTCGCTTGCTAAGAAAACCTTAGGGGGGACCCAAGAGGAGTTCTTCAATACTTTCAATGCTTTACATACAAACGATAAACAAATCGTATTGACCAGTGACCGAAATCCAAATCAATTAAATGATCTAGAAGAACGTCTAGTCACGCGCTTTAGTTGGGGACTCCCAGTAAATATCACACCACCTGATTTTGAAACACGAGTTGCTATTTTAACCAATAAAATTCAAGAATATCCTTATGATTTTCCTCAAGATACCATTGAATACTTAGCAGGAGAATTTGATTCCAACGTACGTGAATTAGAAGGAGCCTTGAAAAATATTAGTCTAGTTGCTGACTTTAAGCATGCTAAAACTATTACAGTAGATATAGCTGCAGAAGCTATCAGAGCACGTAAAAATGACGGTCCTATTGTTACTGTCATTCCTATAGAAGAAATTCAAATACAGGTTGGTAAATTCTATGGCGTAACTGTAAAAGAGATAAAAGCAACTAAAAGAACACAAGATATTGTCCTTGCAAGACAGGTAGCCATGTACTTAGCTCGTGAGATGACAGATAACAGTCTCCCCAAAATAGGTAAAGAATTTGGGGGACGAGATCACTCAACTGTTCTCCACGCTTATAATAAAATAAAAAATATGGTTGCTCAAGATGACAACTTACGAATTGAGATAGAAACTATCAAAAATAAAATCAGGTAG
>GCA-000007265-protein1_5
ATGGTACAATATAACAATAATTATCCACAAGACAATAAGGAAGAAGCTATGACGGAAAACGAACAACTATTTTGGAATAGAGTACTAGAGCTATCTCGTTCTCAAATAGCACCAGCAGCTTATGAATTTTTTGTTCTAGAGGCTAGACTCCTCAAAATTGAACATCAAACTGCAGTTATTACTTTAGATAACATTGAAATGAAAAAGCTATTCTGGGAACAAAATTTGGGGCCTGTTATCCTAACAGCTGGTTTTGAAATTTTCAATGCTGAAATTACAGCTAACTATGTCTCAAACGATTTACATTTACAAGAAACTAGTTTTTCTAACTACCAGCAATCTAGCAATGAAGTAAATACTTTACCAATTAGAAAAATCAACTCTAACCTTAAAGAGAAATATACTTTTGCTAATTTTGTTCAAGGAGATGAAAATAGATGGGCTGTTTCAGCATCAATTGCTGTAGCTGATAGTCCTGGCACGACTTATAATCCTCTATTTATCTGGGGGGGACCTGGTCTAGGAAAGACGCATCTACTAAATGCTATTGGAAATCAAGTTTTAAGAGATAATCCAAACGCGAGGGTTTTATACATCACTGCTGAGAATTTTATTAATGAATTTGTCAGTCATATTCGTTTAGATTCGATGGAAGAATTAAAAGAAAAGTTTCGCAACTTGGACTTACTCCTGATTGATGATATTCAGTCGCTTGCTAAGAAAACCTTAGGGGGAACCCAAGAGGAGTTCTTCAATACTTTCAATGCTTTACATACAAACGATAAACAAATCGTATTGACCAGTGACCGAAATCCAAATCAATTAAATGATCTAGAAGAACGTCTAGTCACGCGCTTTAGTTGGGGACTCCCAGTAAATATCACACCACCTGATTTTGAAACACGAGTTGCTATTTTAACCAATAAAATTCAAGAATATCCTTATGATTTTCCTCAAGATACCATTGAATACTTAGCAGGAGAATTTGATTCCAACGTACGTGAATTAGAAGGAGCCTTGAAAAATATTAGTCTAGTTGCTGACTTTAAGCATGCTAAAACTATTACAGTAGATATAGCTGCAGAAGCTATCAGAGCACGTAAAAATGACGGTCCTATTGTTACTGTCATTCCTATAGAAGAGATTCAAATACAGGTTGGTAAATTCTATGGTGTAACTGTAAAAGAGATAAAAGCAACTAAAAGAACACAAGATATTGTCCTTGCAAGACAGGTAGCCATGTACTTAGCTCGTGAGATGACAGATAACAGTCTCCCCAAAATAGGTAAAGAATTTGGGGGACGAGATCACTCAACTGTTCTCCACGCTTATAATAAAATAAAAAATATGGTTGCTCAAGATGACAACTTACGAATTGAGATAGAAACTATCAAAAATAAAATCAGGTAG
>GCA-000007265-protein1_6
ATGGTACAATATAACAATAATTATCCACAAGACAATAAGGAAGAAGCTATGACGGAAAACGAACAACTATTTTGGAATAGAGTACTAGAGCTATCTCGTTCTCAAATAGCACCAGCAGCTTATGAATTTTTTGTTCTAGAGGCTAGACTCCTCAAAATTGAACATCAAACTGCAGTTATTACTTTAGATAACATTGAAATGAAAAAGCTATTCTGGGAACAAAATTTGGGGCCTGTTATCCTAACAGCTGGTTTTGAAATTTTCAATGCTGAAATTACAGCTAACTATGTCTCAAACGATTTACATTTACAAGAAACTAGTTTTTCTAACTATCAGCAATCTAGCAATGAAGTAAATACTTTACCAATTAGAAAAATCGACTCTAATCTTAAAGAGAAATATACTTTTGCTAATTTTGTTCAAGGAGATGAAAATAGATGGGCTGTTTCAGCATCAATTGCTGTAGCTGATAGTCCTGGCACGACTTATAATCCTCTATTTATCTGGGGGGGACCTGGTCTAGGAAAGACGCATCTACTAAATGCTATTGGAAATCAAGTTTTAAGAGATAATCCAAACGCGAGGGTTTTATACATCACTGCTGAGAATTTTATTAATGAATTTGTCAGTCATATTCGTTTAGATTCGATGGAAGAATTAAAAGAAAAGTTTCGCAACTTGGACTTACTCCTGATTGATGATATTCAGTCGCTTGCTAAGAAAACCTTAGGGGGAACCCAAGAGGAGTTCTTCAATACTTTCAATGCTTTACATACAAACGATAAACAAATCGTATTGACCAGTGACCGAAATCCAAATCAATTAAATGATCTAGAAGAACGTCTAGTCACGCGCTTTAGTTGGGGACTCCCAGTAAATATCACACCACCTGATTTTGAAACACGAGTTGCTATTTTAACCAATAAAATTCAAGAATATCCTTATGATTTTCCTCAAGATACCATTGAATACTTAGCAGGAGAATTTGATTCCAACGTACGCGAATTAGAAGGAGCCTTGAAAAATATTAGTCTAGTTGCTGACTTTAAGCATGCTAAAACTATTACAGTAGATATAGCTGCAGAAGCTATCAGAGCACGTAAAAATGATGGTCCTATTGTTACTGTCATTCCTATAGAAGAAATTCAAATACAAGTTGGTAAATTCTATGGCGTAACTGTAAAAGAGATAAAAGCAACTAAAAGAACACAAGATATTGTCCTTGCAAGACAGGTAGCCATGTACTTAGCTCGTGAGATGACAGATAACAGTCTCCCCAAAATAGGTAAAGAATTTGGGGGACGAGATCACTCAACTGTTCTCCACGCTTATAATAAAATAAAAAATATGGTTGCTCAAGATGACAACTTACGAATTGAGATAGAAACTATCAAAAATAAAATCAGGTAG
>GCA-000007265-protein1_7
ATGGTACAATATAACAATAATTATCCACAAGACAATAAGGAAGAAGCTATGACGGAAAACGAACAACTATTTTGGAATAGAGTACTAGAGCTATCTCGTTCTCAAATAGCACCAGCAGCTTATGAATTTTTTGTTCTAGAGGCTAGACTCCTCAAAATTGAACATCAAACTGCAGTTATTACTTTAGATAACATTGAAATGAAAAAGCTATTCTGGGAACAAAATTTGGGGCCTGTTATCCTAACAGCTGGTTTTGAAATTTTCAATGCTGAAATTACAGCTAACTATGTCTCAAACGATTTACATTTACAAGAAACTAGTTTTTCTAACTACCAGCAATCTAGCAATGAAGTAAATACTTTACCAATTAGAAAAATCGACTCTAACCTTAAAGAGAAATATACTTTTGCTAATTTTGTTCAAGGAGATGAAAATAGATGGGCTGTTTCAGCATCAATTGCTGTAGCTGATAGTCCTGGCACGACTTATAATCCTCTATTTATCTGGGGGGGACCTGGTCTAGGAAAGACGCATCTACTAAATGCTATTGGAAATCAAGTTTTAAGAGATAATCCAAACGCGAGGGTTCTATACATCACTGCTGAGAATTTTATTAATGAATTTGTCAGTCATATTCGTTTAGATTCGATGGAAGAATTAAAAGAAAAGTTTCGCAACTTGGACTTACTCCTGATTGATGATATTCAGTCGCTTGCTAAGAAAACCTTAGGGGGGACCCAAGAGGAGTTCTTCAATACTTTCAATGCTTTACATACAAACGATAAACAAATCGTATTGACCAGTGACCGAAATCCAAATCAATTAAATGATCTAGAAGAACGTCTAGTCACGCGCTTTAGTTGGGGACTCCCAGTAAATATCACACCACCTGATTTTGAAACACGAGTTGCTATTTTAACCAATAAAATTCAAGAATATCCTTATGATTTTCCTCAAGATACCATTGAATACTTAGCAGGAGAATTTGATTCCAACGTACGTGAATTAGAAGGAGCCTTGAAAAATATTAGTCTAGTTGCTGACTTTAAGCATGCTAAAACTATTACAGTAGATATAGCTGCAGAAGCTATCAGAGCACGTAAAAATGACGGTCCTATTGTTACTGTTATTCCTATAGAAGAAATTCAAATACAAGTTGGTAAATTCTATGGGGTAACTGTAAAAGAGATAAAAGCAACTAAAAGAACACAAGATATTGTCCTTGCAAGACAGGTAGCCATGTACTTAGCTCGTGAGATGACAGATAACAGTCTCCCCAAAATAGGTAAAGAATTTGGGGGACGAGATCACTCAACTGTTCTCCACGCTTATAATAAAATTAAAAATATGGTTGCTCAAGATGACAACTTACGAATTGAGATAGAAACTATCAAAAATAAAATCAGGTAG
>GCA-000007265-protein1_8
ATGGTACAATATAACAATAATTATCCACAAGACAATAAGGAAGAAGCTATGACGGAAAACGAACAACTATTTTGGAATAGAGTACTAGAGCTATCTCGTTCTCAAATAGCACCAGCAGCTTATGAATTTTTTGTTCTAGAGGCTAGACTCCTCAAAATTGAACATCAAACTGCAGTTATTACTTTAGATAACATTGAAATGAAAAAGCTATTCTGGGAACAAAATTTGGGGCCTGTTATCCTAACAGCTGGTTTTGAAATTTTCAATGCTGAAATTACAGCTAACTATGTCTCAAACGATTTACATTTACAAGAAACTAGTTTTTCTAACTACCAGCAATCTAGCAACGAAGTAAATACTTTACCAATTAGAAAAATCGACTCTAACCTTAAAGAGAAATATACTTTTGCTAATTTTGTTCAAGGAGATGAAAATAGATGGGCTGTTTCAGCATCAATTGCTGTAGCTGATAGTCCTGGCACGACTTATAATCCTCTATTTATCTGGGGGGGACCTGGTCTAGGAAAGACGCATCTACTAAATGCTATTGGAAATCAAGTTTTAAGAGATAATCCAAACGCGAGGGTTTTATACATCACTGCTGAGAATTTTATTAATGAATTTGTCAGTCATATTCGTTTAGATTCGATGGAAGAATTAAAAGAAAAGTTTCGCAACTTGGACTTACTCCTGATTGATGATATTCAGTCGCTTGCTAAGAAAACCTTAGGGGGGACCCAAGAGGAGTTCTTCAATACTTTCAATGCTTTACATACAAACGATAAACAAATCGTATTGACCAGTGACCGAAATCCAAATCAATTAAATGATCTAGAAGAACGTCTAGTCACGCGCTTTAGTTGGGGACTCCCAGTAAATATCACACCACCTGATTTTGAAACACGAGTTGCTATTTTAACCAATAAAATTCAAGAATATCCTTATGATTTTCCTCAAGATACCATTGAATACTTAGCAGGAGAATTTGATTCCAACGTACGTGAATTAGAAGGAGCCTTGAAAAATATTAGTCTAGTTGCTGACTTTAAGCATGCTAAAACTATTACAGTAGATATAGCTGCAGAAGCTATCAGAGCACGTAAAAATGACGGTCCTATTGTTACTGTTATTCCTATAGAAGAAATTCAAATACAAGTTGGTAAATTCTATGGCGTAACTGTAAAAGAGATAAAAGCAACTAAAAGAACACAAGATATTGTCCTTGCAAGACAGGTAGCCATGTACTTAGCTCGTGAGATGACAGATAACAGTCTCCCCAAAATAGGTAAAGAATTTGGGGGACGAGATCACTCAACTGTTCTCCACGCTTATAATAAAATAAAAAATATGGTTGCTCAAGATGACAACTTACGAATTGAGATAGAAACTATCAAAAATAAAATCAGGTAG
>GCA-000007265-protein1_9
ATGGTACAATATAACAATAATTATCCACAAGACAATAAGGAAGAAGCTATGACGGAAAACGAACAACTATTTTGGAATAGAGTACTAGAGCTATCTCGTTCTCAAATAGCACCAGCAGCTTATGAATTTTTTGTTCTAGAGGCTAGACTCCTCAAAATTGAACATCAAACTGCAGTTATTACTTTAGATAACATTGAAATGAAAAAGCTATTCTGGGAACAAAATTTGGGGCCTGTTATCCTAACAGCTGGTTTTGAAATTTTCAATGCTGAAATTACAGCTAACTATGTCTCAAACGATTTACATTTACAAGAAACTAGTTTTTCTAACTACCAGCAATCTAGCAATGAAGTAAATACTTTACCAATTAGAAAAATCGACTCTAACCTTAAAGAGAAATATACTTTTGCTAATTTTGTTCAAGGAGATGAAAATAGATGGGCTGTTTCAGCATCAATTGCTGTAGCTGATAGTCCTGGCACGACTTATAATCCTCTATTTATCTGGGGGGGACCTGGTCTAGGAAAGACGCATCTACTAAATGCTATTGGAAATCAAGTTTTAAGAGATAATCCAAACGCGAGGGTTTTATACATCACTGCTGAGAATTTTATTAATGAATTTGTCAGTCATATTCGTTTAGATTCGATGGAAGAATTAAAAGAAAAGTTTCGCAACTTGGACTTACTCCTGATTGATGATATTCAGTCGCTTGCTAAGAAAACCTTAGGGGGAACCCAAGAGGAGTTCTTCAATACTTTCAATGCTTTACATACAAACGATAAACAAATCGTATTGACCAGTGACCGAAATCCAAATCAATTAAATGATCTAGAAGAACGTCTAGTCACGCGCTTTAGTTGGGGACTCCCAGTAAATATCACACCACCTGATTTTGAAACACGAGTTGCTATTTTAACCAATAAAATTCAAGAATATCCTTATGATTTTCCTCAAGATACCATTGAATACTTAGCAGGAGAATTTGATTCCAACGTACGTGAATTAGAAGGAGCCTTGAAAAATATTAGTCTAGTTGCTGACTTTAAGCATGCTAAAACTATTACAGTAGATATAGCTGCAGAAGCTATCAGAGCACGTAAAAATGACGGTCCTATTGTTACTGTCATTCCTATAGAAGAGATTCAAATACAGGTTGGTAAATTCTATGGTGTAACTGTAAAAGAGATAAAAGCAACTAAAAGAACACAAGATATTGTCCTTGCAAGGCAGGTAGCCATGTACTTAGCTCGTGAGATGACAGATAACAGTCTCCCCAAAATAGGTAAAGAATTTGGGGGACGAGATCACTCAACTGTTCTCCACGCTTATAATAAAATAAAAAATATGGTTGCTCAAGATGACAACTTACGAATTGAGATAGAAACTATCAAAAATAAAATCAGGTAG
>GCA-000007265-protein1_10
ATGGTACAATATAACAATAATTATCCACAAGACAATAAGGAAGAAGCTATGACGGAAAACGAACAACTATTTTGGAATAGAGTACTAGAGCTATCTCGTTCTCAAATAGCACCAGCAGCTTATGAATTTTTTGTTCTAGAGGCTAGACTCCTCAAAATTGAACATCAAACTGCAGTTATTACTTTAGATAACATTGAAATGAAAAAGCTATTCTGGGAACAAAATTTGGGGCCTGTTATCCTAACAGCTGGTTTTGAAATTTTCAATGCTGAAATTACAGCTAACTATGTCTCAAACGATTTACATTTACAAGAAACTAGTTTTTCTAACTACCAGCAATCTAGCAATGAAGTAAATACTTTACCAATTAGAAAAATCGACTCTAACCTTAAAGAGAAATATACTTTTGCTAATTTTGTTCAAGGAGATGAAAATAGATGGGCTGTTTCAGCATCAATTGCTGTAGCTGATAGTCCTGGCACGACTTATAATCCTCTATTTATCTGGGGGGGACCTGGTCTAGGAAAGACGCATCTACTAAATGCTATTGGAAATCAAGTTTTAAGAGATAATCCAAACGCGAGGGTTTTATACATCACTGCTGAGAATTTTATTAATGAATTTGTCAGTCATATTCGTTTAGATTCGATGGAAGAATTAAAAGAAAAGTTTCGCAACTTGGACTTACTCCTGATTGATGATATTCAGTCGCTTGCTAAGAAAACCTTAGGGGGAACCCAAGAGGAGTTCTTCAATACTTTCAATGCTTTACATACAAACGATAAACAAATCGTATTGACCAGTGACCGAAATCCAAATCAATTAAATGATCTAGAAGAACGTCTAGTCACGCGCTTTAGTTGGGGACTCCCAGTAAATATCACACCACCTGATTTTGAAACACGAGTTGCTATTTTAACCAATAAAATTCAAGAATATCCTTATGATTTTCCTCAAGATACCATTGAATACTTAGCAGGAGAATTTGATTCCAACGTACGTGAATTAGAAGGAGCCTTGAAAAATATTAGTCTAGTTGCTGACTTTAAGCATGCTAAAACTATTACAGTAGATATAGCTGCAGAAGCTATCAGAGCACGTAAAAATGACGGTCCTATTGTTACTGTCATTCCTATAGAAGAGATTCAAATACAGGTTGGTAAATTCTATGGTGTAACTGTAAAAGAGATAAAAGCAACTAAAAGAACACAAGATATTGTCCTTGCTAGACAGGTAGCCATGTACTTAGCTCGTGAGATGACAGATAATAGTCTCCCCAAAATAGGTAAAGAATTTGGGGGACGAGATCACTCAACTGTTCTCCACGCTTATAATAAAATAAAAAATATGGTTGCTCAAGATGACAACTTACGAATTGAGATAGAAACTATCAAAAATAAAATCAGGTAG
>GCA-000007265-protein1_11
ATGGTACAATATAACAATAATTATCCACAAGACAATAAGGAAGAAGCTATGACGGAAAACGAACAACTATTTTGGAATAGAGTACTAGAGCTATCTCGTTCTCAAATAGCACCAGCAGCTTATGAATTTTTTGTTCTAGAGGCTAGACTCCTCAAAATTGAACATCAAACTGCAGTTATTACTTTAGATAACATTGAAATGAAAAAGCTATTCTGGGAACAAAATTTGGGGCCTGTTATCCTAACAGCTGGTTTTGAAATTTTCAATGCTGAAATTACAGCTAACTATGTCTCAAACGATTTACATTTACAAGAAACTAGTTTTTCTAACTATCAGCAATCTAGCAATGAAGTAAATACTTTACCAATTAGAAAAATCGACTCTAATCTTAAAGAGAAATATACTTTTGCTAATTTTGTTCAAGGAGATGAAAATAGATGGGCTGTTTCAGCATCAATTGCTGTAGCTGATAGTCCTGGCACGACTTATAATCCTCTATTTATCTGGGGGGGACCTGGTCTAGGAAAGACGCATCTACTAAATGCTATTGGAAATCAAGTTTTAAGAGATAATCCAAACGCGAGGGTTTTATACATCACTGCTGAGAATTTTATTAATGAATTTGTCAGTCATATTCGTTTAGATTCGATGGAAGAATTAAAAGAAAAGTTTCGCAACTTGGACTTACTCCTGATTGATGATATTCAGTCGCTTGCTAAGAAAACCTTAGGGGGAACCCAAGAGGAGTTCTTCAATACTTTCAATGCTTTACATACAAACGATAAACAAATCGTATTGACCAGTGACCGAAATCCAAATCAATTAAATGATCTAGAAGAACGTCTAGTCACGCGCTTTAGTTGGGGACTCCCAGTAAATATCACACCACCTGATTTTGAAACACGAGTTGCTATTTTAACCAATAAAATTCAAGAATATCCTTATGATTTTCCTCAAGATACCATTGAATACTTAGCAGGAGAATTCGATTCCAACGTACGTGAATTAGAAGGAGCCTTGAAAAATATTAGTCTAGTTGCTGACTTTAAGCATGCTAAAACTATTACAGTAGATATAGCTGCAGAAGCTATCAGAGCACGTAAAAATGACGGTCCTATTGTTACTGTTATTCCTATAGAAGAAATTCAAATACAAGTTGGTAAATTCTATGGCGTAACTGTAAAAGAGATAAAAGCAACTAAAAGAACACAAGATATTGTCCTTGCAAGGCAGGTAGCCATGTACTTAGCTCGTGAGATGACAGATAACAGTCTCCCCAAAATAGGTAAAGAATTTGGGGGACGAGATCACTCAACTGTTCTCCACGCTTATAATAAAATAAAAAATATGGTTGCTCAAGATGACAACTTACGAATTGAGATAGAAACTATCAAAAATAAAATCAGGTAG
Allele calling did not identify new representative alleles for the locus, so the FASTA file in the short subfolder still contains only the original representative allele.
In the next section we will demonstrate how to determine the set of core loci, the loci identified in all or the majority of the genomes, based on the allele calling results to be able to perform allele calling at the cgMLST level.
Determining the core genome
We can use chewBBACA’s ExtractCgMLST module to determine the set of loci that constitute the core genome based on the allele calling results for a specific dataset. The set of core loci is determined based on a loci presence threshold, which defines the minimum percentage of genomes in which a locus must be present to be considered part of the core genome. By default, chewBBACA’s ExtractCgMLST module determines the set of core loci based on three different loci presence thresholds: 95%, 99% and 100%. The list of core loci can be used to perform allele calling at the cgMLST level. cgMLST is widely used for epidemiological surveillance and outbreak detection, providing high resolution for strain discrimination. To determine the cgMLST based on the allele calling results for the 32 complete genomes, run the following command:
chewBBACA.py ExtractCgMLST -i results32_wgMLST/results_alleles.tsv -o results32_wgMLST/cgMLST
- Where:
chewBBACA.py ExtractCgMLSTselects the ExtractCgMLST module.-i results32_wgMLST/results_alleles.tsvpoints to the TSV file with the allelic profiles.-o results32_wgMLST/cgMLSTspecifies the output folder to store the cgMLST results.
The results will be stored in the results32_wgMLST/cgMLST folder, which will have the following (simplified) structure:
cgMLST
├── cgMLSTschema95.txt : list of loci present in at least 95% of genomes
├── cgMLST95.tsv : cgMLST allelic profiles including only loci present in at least 95% of genomes
├── cgMLSTschema99.txt : list of loci present in at least 99% of genomes
├── cgMLST99.tsv : cgMLST allelic profiles including only loci present in at least 99% of genomes
├── cgMLSTschema100.txt : list of loci present in 100% of genomes
├── cgMLST100.tsv : cgMLST allelic profiles including only loci present in 100% of genomes
├── cgMLST.html : interactive plot showing the variation in the number of loci per threshold
└── ... : Additional files with relevant data used to generate the cgMLST results
The ExtractCgMLST module creates a file with the list of core loci and the cgMLST allelic profiles for each threshold (95%, 99% and 100%). Additionally, it generates a HTML file with an interactive line plot that displays the variation in the number of core loci as genomes are added to the analysis for each threshold. The plot also includes a black line with the number of loci identified in each genome that is added to the analysis, allowing users to visualize how individual genomes contribute to the overall core genome determination and identify possible outlier genomes with an unexpected number of loci that may be affecting the cgMLST determination and should be excluded from the analysis (e.g. genomes with a very low number of loci compared to the rest of the dataset may correspond to low-quality genomes with a high number of truncated loci due to misassembly, contributing to a decrease in the frequency of loci that are in fact present in all or the majority of the genomes).
Note
The ExtractCgMLST module removes all INF- prefixes and converts/masks all non-EXC classifications in the profile matrix to 0. This means that when determining the core genome, only loci assigned a single valid allele identifier (EXC and INF-) are considered as present in a genome. Loci assigned other classifications are treated as absent (0) and are not counted as present when determining the core genome.
We will use the cgMLST determined at the 95% threshold for further analyses. The 95% allows to strike a balance between including a sufficient number of loci for high-resolution typing and accounting for potential missing data due to sequencing or assembly issues. The list with the 1,271 core loci is available in results32_wgMLST/cgMLST/cgMLSTschema95.txt. This file can be passed to the --gl parameter of the AlleleCall module to perform allele calling at the cgMLST level.
In the next section we will demonstrate how one can use the list of core loci determined by the ExtractCgMLST module to perform allele calling for a larger set of drat genomes at the cgMLST level.
Allele calling with additional genomes
The file with the list of core loci for the 95% threshold can be passed to the --gl parameter of the AlleleCall module to perform allele calling at the cgMLST level. We will use it to perform allele calling at the cgMLST level for a set of 680 draft genomes. Uncompress the genomes/sagalactiae_680_draft_genomes.zip file to create a folder named sagalactiae_680_draft_genomes and run the following command:
chewBBACA.py AlleleCall -i genomes/sagalactiae_680_draft_genomes/ -g tutorial_schema/schema_seed --gl results32_wgMLST/cgMLST/cgMLSTschema95.txt -o results680_cgMLST --cpu 6
The structure of the command is similar to the one used to perform allele calling for the 32 complete genomes, with the addition of the --gl parameter to specify the list of core loci. The output folder will also have the same structure, but the files include results at the cgMLST level, including results only for the 1,271 core loci listed in the file passed to the --gl parameter. The process added 23,765 novel alleles to the schema.
The set of 1,271 core loci used for allele calling at the cgMLST level was defined based on the allele calling results for the 32 complete genomes and may include loci that are not present in the majority of the genomes for more diverse datasets, such as the one with 680 draft genomes that we have just analyzed. To take into account the diversity of the more diverse set of genomes that we have analyzed, we will concatenate the cgMLST results for the 32 complete genomes with the cgMLST results for the 680 draft genomes and redetermine the cgMLST based on the allele calling results for the combined dataset. We can concatenate the allelic profiles for both datasets with the JoinProfiles module:
chewBBACA.py JoinProfiles -p results32_wgMLST/cgMLST/cgMLST95.tsv results680_cgMLST/results_alleles.tsv -o cgMLST_712.tsv
- Where:
chewBBACA.py JoinProfilesselects the JoinProfiles module.-p results32_wgMLST/cgMLST/cgMLST95.tsv results680_cgMLST/results_alleles.tsvpoints to the TSV files containing the allelic profiles for both datasets.-o cgMLST_712.tsvspecifies the output file to store the concatenated profiles.
To redetermine the cgMLST based on the allele calling results for the combined dataset of 712 genomes (32 complete + 680 draft), we can pass the cgMLST_712.tsv file to the ExtractCgMLST module:
chewBBACA.py ExtractCgMLST -i cgMLST_712.tsv -o cgMLST_712
The number of loci present in 95% of genomes based on the results for the 712 genomes is 1,194, a slight decrease from the number of loci present in 95% of the 32 complete genomes.
Annotating the schema loci
The schema we have created can be used for high-resolution typing, but it does not include any information that can help us identify loci of interest, such as gene and protein names. To add a little bit of context when looking for specific loci, we can use the UniprotFinder module to compare the schema loci against reference proteomes available at UniProt and retrieve annotations for the best hits. To annotate the loci included in the schema based on reference proteomes for Streptococcus agalactiae, run the following command:
chewBBACA.py UniprotFinder -g tutorial_schema/schema_seed -o schema_annotations -t tutorial_schema/cds_coordinates.tsv --taxa "Streptococcus agalactiae" --no-sparql --cpu 6
- Where:
chewBBACA.py UniprotFinderselects the UniprotFinder module.-g tutorial_schema/schema_seedpoints to the schema’s directory.-o schema_annotationsspecifies the output folder to store the annotation results.-t tutorial_schema/cds_coordinates.tsvpoints to the TSV file with the genomic coordinates of the representative alleles initially chosen for the schema seed.--taxa "Streptococcus agalactiae"specifies that the process should download reference proteomes for Streptococcus agalactiae from UniProt and compare the schema loci against them.--no-sparqldisables looking for annotations using UniProt’s SPARQL endpoint, which may take a a while or may be unavailable at times.--cpu 6specifies the number of CPU cores to use. Remember to adjust this parameter based on your machine’s specifications.
The loci annotations will be stored in the schema_annotations/schema_annotations.tsv file. A more detailed explanation about each column in the output file is available in the UniprotFinder module documentation.
The analyses we have performed up until now allowed us to determine the allelic profiles at the cgMLST level for 712 genomes and add tens of thousands of new alleles to the schema. However, how can we analyse the schema files and the allele calling results to explore the loci diversity and infer strain similarity? It certainly is not easy to do so by just looking at the files generated by chewBBACA so far. And that is why chewBBACA includes modules to analyze schemas and allele calling results and generate interactive reports that help users explore and interpret the results. We will demonstrate how to use those modules in the next sections.
Generating an interactive report for schema evaluation
Typically, schemas are only used to determine the allelic profiles of strains of interest. The allelic profiles can be used to estimate strain relatedness based on, for example, loci presence-absence analysis and pairwise distance computation. However, as a schema is used by chewBBACA to perform allele calling, new alleles are added to the schema, increasing the loci diversity captured by the schema. Therefore, schemas can also be a source of valuable data to study the diversity of the loci for a given species.
To allow users to explore the loci diversity contained in the schemas, chewBBACA includes the SchemaEvaluator module. The SchemaEvaluator module generates an interactive report with various plots and statistics that help users to explore the diversity of the loci included in a schema. We can use the SchemaEvaluator module to evaluate the diversity of the loci included in the schema after performing allele calling and adding the alleles identified in the 712 genomes. To analyze the schema and generate an interactive report, run the following command:
chewBBACA.py SchemaEvaluator -g tutorial_schema/schema_seed -o schema_report -a schema_annotations/schema_seed_annotations.tsv --loci-reports --add-sequences --cpu 6
- Where:
chewBBACA.py SchemaEvaluatorselects the SchemaEvaluator module.-g tutorial_schema/schema_seedpoints to the schema’s directory.-o schema_reportspecifies the name of the output folder.-a schema_annotations/schema_seed_annotations.tsvpoints to the file with the locus annotations generated by the UniprotFinder module.--loci-reportsgenerates individual report pages with a detailed analysis of each locus.--add-sequencesadds the DNA and translated allele sequences to the locus reports.--cpu 6specifies the number of CPU cores to use. Remember to adjust this parameter based on your machine’s specifications.
The results will be stored in the schema_report folder, which will have the following (simplified) structure:
schema_report
├── schema_report.html : HTML file for the main page of the report.
├── loci_reports : folder with the individual HTML reports for each locus in the schema.
| ├── GCA-000007265-protein1.html : HTML report for locus GCA-000007265-protein1.
| └── ... : Additional HTML files for the other loci in the schema.
└── report_bundle.js : JavaScript file required for the interactive plots.
To access the main page of the report, open the schema_report/schema_report.html file in a web browser.
The main page of the report includes various plots and statistics that summarize the diversity of the loci included in the schema. Additionally, the loci_reports subfolder contains individual HTML reports for each locus in the schema, providing a detailed analysis of the diversity observed for each locus, including allele frequency distributions, sequence alignments, and annotations (which are available because we provided the annotations file created by the UniprotFinder module). The page for the SchemaEvaluator module includes detailed information about the components and functionalities in the report.
In the next section, we will use the AlleleCallEvaluator module to analyze the allele calling results for the 680 draft genomes at the cgMLST level.
Generating an interactive report to evaluate allele calling results
To provide built-in functionalities to analyze allele calling results and enable users to explore results intuitively, chewBBACA includes the AlleleCallEvaluator module. The AlleleCallEvaluator module generates an interactive report with plots and statistics for the classifications distribution per genome and per locus, as well as heatmaps representing loci presence-absence and allelic differences/pairwise distances, and a NJ tree based on the MSA of the concatenated cgMLST loci. We can use the AlleleCallEvaluator module to analyze the allele calling results at the cgMLST level for the 680 draft genomes by running the following command:
chewBBACA.py AlleleCallEvaluator -i results680_cgMLST -g tutorial_schema/schema_seed -o results_report --a schema_annotations/schema_seed_annotations.tsv --cpu 6
- Where:
chewBBACA.py AlleleCallEvaluatorselects the AlleleCallEvaluator module.-i results680_cgMLSTpoints to the folder with the allele calling results.-g tutorial_schema/schema_seedpoints to the schema’s directory.-o results_reportspecifies the output directory.--a schema_annotations/schema_seed_annotations.tsvpoints to the TSV file with loci annotations generated by the UniprotFinder module.--cpu 6specifies the number of CPU cores to use. Remember to adjust this parameter based on your machine’s specifications.
The results will be stored in the results_report folder, which will have the following (simplified) structure:
results_report
├── allelecall_report.html : HTML file for the main page of the report.
├── cgMLST_MSA.fasta : Multi-FASTA file with the MSA for the concatenated cgMLST loci.
├── distance_matrix_symmetric.tsv : TSV file with the symmetric distance matrix computed based on the cgMLST profiles.
└── ... : Additional output files with relevant data included in the report and that users can use for further analyses.
Similarly to the SchemaEvaluator module, to access the main page of the report, open the results_report/allelecall_report.html file in a web browser. The “Classification Counts” and “Detailed Statistics” components of the report show counts and statistics related to the different classifications assigned during allele calling for each genome and locus. These components can help users identify possible low-quality genomes (e.g. genomes with a high number of contigs, invalid CDSs, or special classifications) or problematic loci (e.g. loci with a high number of special classifications). The “Loci Presence-Absence” component allows to visualize the loci presence-absence profiles per genome, which can help identify sets of genomes that share the same loci, but can also help identify low-quality genomes and problematic loci when observing large blocks or stretches of missing loci. The “Allelic Distances” component enables to quickly identify clusters of similar genomes based on the pairwise allelic differences computed from the cgMLST profiles. Finally, the “Core-genome Neighbor-Joining Tree” component provides a phylogenetic tree based on the MSA of the concatenated cgMLST loci, providing further insights into the relatedness of the analyzed genomes. The page for the AlleleCallEvaluator module includes detailed information about the components and functionalities in the report.
With the analysis of the allele calling results with the AlleleCallEvaluator module, we conclude the tutorial for a complete wg/cgMLST analysis with chewBBACA. While chewBBACA provides functionalities to analyze schemas and allele calling results, it still does not include functionalities to perform all types of analyses that users might want to perform on the generated data. Thus, we think it is important to include one last section with a list of tools users can use to perform additional analyses.
Using other software to analyse the data generated by chewBBACA
Visualization of Minimum Spanning Trees
You can upload the allelic profiles generated by chewBBACA and associated metadata to software such as PHYLOViZ Online to visualize Minimum Spanning Trees (MSTs) and perform various dataset operations that allow you to explore and analyse the results.
It is also possible to use other software such as GrapeTree to visualize MSTs based on the allelic profiles generated by chewBBACA, but this requires some data formatting to convert the allelic profiles generated by chewBBACA to a format compatible with GrapeTree (more information about the format supported by GrapeTree here).
Tree visualization and spatio-temporal analysis
You can also upload the Newick tree files exported from chewBBACA’s interactive reports and associated metadata to SPREAD, a program based on GrapeTree that enables tree visualization and spatio-temporal analyses. A Newick tree file can be downloaded from the “Neighbor-Joining Tree” and the “Core-genome Neighbor-Joining Tree” components of the interactive reports generated by the SchemaEvaluator and AlleleCallEvaluator modules, respectively.
Profile clustering
You can use ReporTree to identify genetic clusters based on the allelic profiles (contained in the results_alleles.tsv output file) determined by chewBBACA. ReporTree includes functionalities to identify genetic clusters at any distance threshold level(s), obtain summary reports with relevant statistics computed based on sample metadata, identify regions of cluster stability, etc. Cluster nomenclature can be maintained and updated in subsequent analyses, which is especially useful in surveillance-oriented workflows.