AlleleCallEvaluator - Build an interactive report for allele calling results evaluation

The AlleleCallEvaluator module generates an interactive HTML report to evaluate the allele calling results produced by the AlleleCall module. The report provides summary statistics to evaluate the results per sample and per locus, and can include a table with loci annotations if a TSV file with annotations is provided. The report also includes a heatmap representing the loci presence-absence matrix, a heatmap representing the distance matrix based on allelic differences, and a Neighbor-Joining (NJ) tree based on the multiple sequence alignment (MSA) of the core genome loci.

Warning

Peak memory usage and run time increase considerably for datasets with thousands of strains, mainly due to the computation of the multiple sequence alignments (MSAs) of the core loci and of the NJ tree.

Important

The AlleleCallEvaluator module uses the data in the results_statistics.tsv, loci_summary_stats.tsv and results_alleles.tsv files. These files are available in the results folder of each AlleleCall run.

Basic Usage

chewBBACA.py AlleleCallEvaluator -i /path/to/AlleleCallResultsFolder -g /path/to/SchemaFolder -o /path/to/OutputFolder --cpu 4

Parameters

-i, --input-files           (Required) Path to the directory that contains the allele calling results
                            generated by the AlleleCall module.

-g, --schema-directory      (Required) Path to the schema's directory.

-o, --output-directory      (Optional) Path to the output directory where the module will store intermediate
                            files and create the report HTML files (default: None).

-a, --annotations           (Optional) Path to the TSV file created by the UniprotFinder module
                            (default: None).

--cpu, --cpu-cores          (Optional) Number of CPU cores/threads that will be used to run the
                            process (chewie resets to a lower value if it is equal to or
                            exceeds the total number of available CPU cores/threads) (default: 1).

--light                     (Optional) Do not compute the presence-absence matrix, the distance matrix and
                            the Neighbor-Joining tree (default: False).

--no-pa                     (Optional) Do not compute the presence-absence matrix (default: False).

--no-dm                     (Optional) Do not compute the distance matrix (default: False).

--no-tree                   (Optional) Do not compute the Neighbor-Joining tree (default: False).

--cg-alignment              (Optional) Compute the MSA of the core genome loci, even if `--no-tree` is
                            provided (default: False).
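
When the module is called from a script, it can help to assemble the command from the documented options before running it. The following is a minimal Python sketch, not part of chewBBACA: the `build_command` helper and its argument names are hypothetical, and only the options listed above are used.

```python
import shlex


def build_command(results_dir, schema_dir, output_dir, cpu=1,
                  annotations=None, light=False):
    """Assemble an AlleleCallEvaluator command line (hypothetical helper).

    Only options documented in the Parameters section are emitted.
    """
    cmd = ["chewBBACA.py", "AlleleCallEvaluator",
           "-i", results_dir,
           "-g", schema_dir,
           "-o", output_dir,
           "--cpu", str(cpu)]
    if annotations is not None:
        # TSV file created by the UniprotFinder module
        cmd += ["-a", annotations]
    if light:
        # Skip the presence-absence matrix, distance matrix and NJ tree
        cmd.append("--light")
    return cmd


command = build_command("/path/to/AlleleCallResultsFolder",
                        "/path/to/SchemaFolder",
                        "/path/to/OutputFolder",
                        cpu=4, light=True)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a machine where chewBBACA is installed.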

Outputs

OutputFolderName
├── distance_matrix_symmetric.tsv
├── cgMLST_MSA.fasta
├── cgMLST_profiles.tsv
├── allelecall_report.html
├── masked_profiles.tsv
├── presence_absence.tsv
└── report_bundle.js

  • A TSV file, distance_matrix_symmetric.tsv, that contains the symmetric distance matrix. The distances are computed by determining the number of allelic differences from the set of core loci (shared by 100% of the samples) between each pair of samples.

  • A FASTA file, cgMLST_MSA.fasta, that contains the MSA of the core loci. For each locus in the core genome, the alleles found in all samples are translated and aligned with MAFFT. The alignment files are concatenated to generate the full alignment.

  • A TSV file, cgMLST_profiles.tsv, that contains the allelic profiles for the set of core loci.

  • An HTML report, allelecall_report.html, that contains the following components:

    • A table with the total number of samples, total number of loci, total number of coding sequences (CDSs) extracted from the samples, total number of CDSs classified and totals per classification type.

    • A tab panel with stacked bar charts for the classification type counts per sample and per locus.

    • A tab panel with detailed sample and locus statistics.

    • If a TSV file with annotations is provided to the --annotations parameter, the report will also include a table with the provided annotations. Otherwise, it will display a warning informing that no annotations were provided.

    • A Heatmap chart representing the loci presence-absence matrix for all samples in the dataset.

    • A Heatmap chart representing the allelic distance matrix for all samples in the dataset.

    • A tree drawn with Phylocanvas.gl based on the Neighbor-Joining (NJ) tree computed by FastTree.

  • A TSV file, masked_profiles.tsv, that contains the masked allelic profiles (results from masking the allelic profiles in the results_alleles.tsv file generated by the AlleleCall module).

  • A TSV file, presence_absence.tsv, that contains the loci presence-absence matrix.

  • A JavaScript bundle file, report_bundle.js, necessary to visualize the report.

Warning

The JS bundle is necessary to visualize the HTML report. Do not move or delete this file.

Note

To share the report, compress the output folder and share the compressed archive. The recipient only needs to uncompress the archive and open the HTML file in a browser.
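
Compressing the folder can be done with any archiving tool. As one possibility, the sketch below uses Python's standard library to zip the whole output folder so that the HTML file and the JS bundle stay together. The `archive_report` helper and the file contents are placeholders created only for the demonstration.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def archive_report(output_dir, archive_base):
    """Zip the whole report folder (hypothetical helper).

    The folder itself is kept as the top-level entry of the archive so the
    HTML report and report_bundle.js are extracted side by side.
    """
    output_dir = Path(output_dir)
    return shutil.make_archive(str(archive_base), "zip",
                               root_dir=str(output_dir.parent),
                               base_dir=output_dir.name)


# Demonstration with placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    report_dir = Path(tmp) / "OutputFolderName"
    report_dir.mkdir()
    (report_dir / "allelecall_report.html").write_text("<html></html>")
    (report_dir / "report_bundle.js").write_text("// bundle")
    archive_path = archive_report(report_dir, Path(tmp) / "report_to_share")
    with zipfile.ZipFile(archive_path) as archive:
        archived_names = sorted(archive.namelist())

print(archived_names)
```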

Report Components

The first component gives a small introduction that details the type of information contained in each component of the report.

../../_images/allelecall_report_description.png

Results Summary Data

The second component is a table with summary statistics about the allele calling results, such as:

  • Total Samples: Total number of samples in the dataset.

  • Total Loci: Total number of loci used to perform allele calling.

  • Total CDSs: Total number of CDSs identified in all the samples.

  • Total CDSs Classified: Total number of CDSs that were classified.

  • EXC: Total number of CDSs classified as EXC.

  • INF: Total number of CDSs classified as INF.

  • PLOT3: Total number of CDSs classified as PLOT3.

  • PLOT5: Total number of CDSs classified as PLOT5.

  • LOTSC: Total number of CDSs classified as LOTSC.

  • NIPH: Total number of NIPH classifications (the NIPH classification includes multiple CDSs).

  • NIPHEM: Total number of NIPHEM classifications (the NIPHEM classification includes multiple CDSs).

  • ALM: Total number of CDSs classified as ALM.

  • ASM: Total number of CDSs classified as ASM.

  • PAMA: Total number of PAMA classifications.

../../_images/allelecall_report_summary.png

Please visit the section about the AlleleCall module if you want to know more about the classification types.

Classification Counts

The third component contains two panels with stacked bar charts displaying the classification type counts per sample and per locus.

  • Panel A, Counts Per Sample, displays the stacked bar charts for the sample classification type counts.

../../_images/allelecall_report_sample_counts.png

  • Panel B, Counts Per Locus, displays the stacked bar charts for the loci classification type counts.

../../_images/allelecall_report_loci_counts.png

The plot area displays data for at most 300 samples/loci at a time. Click the left/right arrows to view the previous/next 300 samples/loci, and the double left/right arrows to view the first/last 300 samples/loci. The component includes a slider to select the range of sample/locus bars that are visible.
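
The paging behaviour can be pictured as slicing the list of samples or loci into windows of 300. The Python sketch below only illustrates that arithmetic and is not the report's actual code; the function names are hypothetical.

```python
PAGE_SIZE = 300  # maximum number of samples/loci shown at once


def page_count(total_items, page_size=PAGE_SIZE):
    """Number of pages needed to show all items (ceiling division)."""
    return -(-total_items // page_size)


def page_bounds(total_items, page_index, page_size=PAGE_SIZE):
    """Return the (start, end) slice bounds for a zero-based page index."""
    start = page_index * page_size
    if start < 0 or start >= total_items:
        raise IndexError("page index out of range")
    end = min(start + page_size, total_items)
    return start, end


# A hypothetical dataset with 750 samples needs three pages;
# the last page holds the remaining 150 samples.
print(page_count(750))
print(page_bounds(750, 2))
```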

Detailed Statistics

The fourth component contains two panels with tables with detailed statistics about the results per sample and per locus.

The Sample Stats table includes the following columns:

  • Sample: The sample unique identifier.

  • Total Contigs: Total number of contigs in the sample FASTA file.

  • Total CDSs: Total number of CDSs identified in the sample.

  • Proportion of Classified CDSs: The proportion of CDSs identified in the sample that were classified.

  • Identified Loci: The number of schema loci identified in the sample.

  • Proportion of Identified Loci: The proportion of schema loci that were identified in the sample.

  • Valid Classifications: Total number of valid classifications (EXC and INF).

  • Invalid Classifications: Total number of invalid classifications (PLOT3, PLOT5, LOTSC, NIPH, NIPHEM, ALM, ASM and PAMA).

The Loci Stats table includes the following columns:

  • Locus: The locus unique identifier.

  • Total CDSs: Total number of CDSs classified for that locus.

  • Valid Classifications: Total number of valid classifications (EXC and INF).

  • Invalid Classifications: Total number of invalid classifications (PLOT3, PLOT5, LOTSC, NIPH, NIPHEM, ALM, ASM and PAMA).

  • Proportion Samples: The proportion of samples the locus was identified in.
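
The split between valid and invalid classifications described for both tables can be sketched in Python as follows. The grouping of classification types comes from the lists above; the `summarize` helper and the example counts are hypothetical and do not come from a real run.

```python
VALID_TYPES = ("EXC", "INF")
INVALID_TYPES = ("PLOT3", "PLOT5", "LOTSC", "NIPH", "NIPHEM",
                 "ALM", "ASM", "PAMA")


def summarize(counts):
    """Return (valid, invalid) totals from a dict of classification counts.

    Classification types missing from the dict are counted as zero.
    """
    valid = sum(counts.get(name, 0) for name in VALID_TYPES)
    invalid = sum(counts.get(name, 0) for name in INVALID_TYPES)
    return valid, invalid


# Hypothetical counts for a single sample
sample_counts = {"EXC": 1200, "INF": 35, "PLOT3": 1, "NIPH": 4, "ASM": 2}
valid_total, invalid_total = summarize(sample_counts)
print(valid_total, invalid_total)
```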

Note

You can use the table View Columns feature to display columns with the count for each classification type.

The dropdown menu below the tables allows the selection of a single column to generate a histogram for the values in the selected column.

../../_images/allelecall_report_detailed_stats.png

Loci Annotations

If a TSV file with loci annotations is provided, the fifth component of the report is a table with the list of annotations. Otherwise, it will display a warning informing that no annotations were provided.

../../_images/allelecall_report_annotations.png

If a column name includes URL, the AlleleCallEvaluator module assumes that the values in that column are URLs and creates links to the web pages.

Important

The first column in the TSV file with annotations must be named Locus and contain the identifiers of the loci (the basename of the locus FASTA file without the .fasta extension).

You can use the UniprotFinder module to annotate the loci in a schema created with chewBBACA. If you want to annotate an external schema, you can adapt it with the PrepExternalSchema module followed by annotation with the UniprotFinder module.
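
Before running the module, it can be useful to check that an annotations file meets the requirement above. The Python sketch below reads a TSV, verifies that the first column is named Locus, and lists the columns whose name includes URL. It is an illustration only: the `inspect_annotations` helper, the column names other than Locus, and the example rows are hypothetical, and the case-sensitive match on "URL" is an assumption.

```python
import csv
import io


def inspect_annotations(handle):
    """Check the header of an annotations TSV (hypothetical helper).

    Returns the header, the columns assumed to hold URLs, and the data rows.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    if header[0] != "Locus":
        raise ValueError("The first column must be named 'Locus'.")
    # Assumption: the match on 'URL' is a case-sensitive substring test
    url_columns = [name for name in header if "URL" in name]
    rows = list(reader)
    return header, url_columns, rows


# Hypothetical file content; locus identifiers are FASTA basenames
example_tsv = (
    "Locus\tProduct\tReference_URL\n"
    "locus_0001\thypothetical protein\thttps://example.org/locus_0001\n"
    "locus_0002\tDNA gyrase subunit A\thttps://example.org/locus_0002\n"
)
header, url_columns, rows = inspect_annotations(io.StringIO(example_tsv))
print(url_columns, len(rows))
```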

Loci Presence-Absence

The sixth component displays a heatmap representing the loci presence-absence matrix for all samples in the dataset. Blue cells (z=1) correspond to loci presence and grey cells (z=0) to loci absence. The Select Sample dropdown menu enables the selection of a single sample to display its heatmap on top of the main heatmap. The Select Locus dropdown menu enables the selection of a single locus to display its heatmap on the right of the main heatmap.

../../_images/allelecall_report_pa_heatmap.png
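
The way a presence-absence matrix can be derived from allelic profiles is sketched below in Python. This is not the module's implementation. It assumes that a locus counts as present when the profile cell holds a positive allele identifier, optionally prefixed with INF-, and as absent for any other value; the sample names and profiles are made up.

```python
def is_present(cell):
    """Return True if a profile cell holds an allele identifier.

    Assumption: valid cells are positive integers, optionally with the
    'INF-' prefix; every other value is treated as absence.
    """
    value = cell[4:] if cell.startswith("INF-") else cell
    return value.isdecimal() and int(value) > 0


def presence_absence(profiles):
    """Map each sample's profile to a list of 1 (present) / 0 (absent)."""
    return {sample: [1 if is_present(cell) else 0 for cell in row]
            for sample, row in profiles.items()}


# Hypothetical profiles for three loci
profiles = {
    "sampleA": ["1", "INF-7", "ASM"],
    "sampleB": ["2", "NIPH", "3"],
}
pa_matrix = presence_absence(profiles)
print(pa_matrix)
```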

Allelic Distances

The seventh component displays a heatmap representing the symmetric distance matrix. The distances are computed by determining the number of allelic differences from the set of core loci (shared by 100% of the samples) between each pair of samples. The Select Sample dropdown menu enables the selection of a single sample to display its heatmap on top of the main heatmap. The menu below the heatmap enables the selection of a single sample and of a distance threshold to display a table with the list of samples at a distance equal to or smaller than the specified value.

../../_images/allelecall_report_dm_heatmap.png
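
The distance computation and the threshold lookup can be illustrated with a short Python sketch. It is not the module's code. It assumes masked profiles in which each cell is an integer allele identifier and 0 marks a locus without a valid allele, and it excludes the selected sample from its own neighbour list; the function names and the example profiles are hypothetical.

```python
def core_loci_indices(profiles):
    """Indices of loci with a valid allele (non-zero) in every sample."""
    n_loci = len(next(iter(profiles.values())))
    return [i for i in range(n_loci)
            if all(row[i] != 0 for row in profiles.values())]


def distance_matrix(profiles):
    """Pairwise number of allelic differences over the core loci."""
    core = core_loci_indices(profiles)
    samples = list(profiles)
    return {s: {t: sum(1 for i in core if profiles[s][i] != profiles[t][i])
                for t in samples}
            for s in samples}


def neighbours(matrix, sample, threshold):
    """Other samples at a distance equal to or smaller than the threshold."""
    return sorted(t for t, d in matrix[sample].items()
                  if t != sample and d <= threshold)


# Hypothetical masked profiles for four loci; the last locus is missing
# in s1, so only the first three loci belong to the core.
profiles = {
    "s1": [1, 1, 2, 0],
    "s2": [1, 2, 2, 3],
    "s3": [2, 2, 2, 3],
}
matrix = distance_matrix(profiles)
print(matrix["s1"])
print(neighbours(matrix, "s1", 1))
```

Restricting the comparison to core loci means that a missing locus in any sample removes that locus from every pairwise comparison, which keeps all distances measured over the same set of loci.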

Core-genome Neighbor-Joining Tree

The last component displays a tree drawn with Phylocanvas.gl based on the Neighbor-Joining (NJ) tree computed by FastTree (with the options -fastest, -nosupport and -noml). The tree is computed from the MSA of the set of loci that constitute the core genome. The MSA for each core locus is determined with MAFFT (with the options --retree 1 and --maxiterate 0), and the MSAs for all the core loci are concatenated to create the full MSA.

../../_images/allelecall_report_cgMLST_tree.png
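
The concatenation step can be sketched in Python as shown below. The sketch assumes that each core locus has already been aligned and that every alignment contains one sequence per sample; it does not call MAFFT or FastTree. The helper names, sample names and sequences are hypothetical.

```python
def concatenate_alignments(locus_alignments):
    """Join per-locus alignments into one full alignment per sample.

    locus_alignments: list of dicts mapping sample -> aligned sequence,
    one dict per core locus, in the order they should be concatenated.
    """
    samples = sorted(locus_alignments[0])
    parts = {sample: [] for sample in samples}
    for alignment in locus_alignments:
        if set(alignment) != set(samples):
            raise ValueError("every alignment must include all samples")
        if len({len(seq) for seq in alignment.values()}) != 1:
            raise ValueError("aligned sequences must have equal length")
        for sample in samples:
            parts[sample].append(alignment[sample])
    return {sample: "".join(seqs) for sample, seqs in parts.items()}


def to_fasta(records):
    """Format a dict of sample -> sequence as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records.items())


# Two hypothetical aligned core loci for two samples
alignments = [
    {"s1": "MKT-", "s2": "MKTA"},
    {"s1": "MLV", "s2": "M-V"},
]
full_msa = concatenate_alignments(alignments)
print(to_fasta(full_msa))
```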