AlleleCallEvaluator - Build an interactive report for allele calling results evaluation
=======================================================================================

The *AlleleCallEvaluator* module generates an interactive HTML report to evaluate allele calling results generated by the :doc:`AlleleCall </user/modules/AlleleCall>` module. The report provides summary statistics to evaluate results per sample and per locus (optionally, a TSV file with loci annotations can be provided to include them in a table). The report also includes a heatmap representing the loci presence-absence matrix, a heatmap representing the distance matrix based on allelic differences and a Neighbor-Joining (NJ) tree based on the Multiple Sequence Alignment (MSA) of the core genome loci.

.. warning::
  For datasets with thousands of strains, the peak memory usage increases considerably and the process takes longer to finish (mainly due to the computation of the Multiple Sequence Alignments of the core loci and of the NJ tree).

.. important::
  The *AlleleCallEvaluator* module uses the data in the ``results_statistics.tsv``, ``loci_summary_stats.tsv`` and ``results_alleles.tsv`` files. These files are available in the results folder of each *AlleleCall* run.
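
Before running the module, it can help to confirm that the results folder contains the three files listed above. The following is a minimal sketch of such a check; the function name ``missing_input_files`` is hypothetical and not part of chewBBACA.

```python
from pathlib import Path

# File names taken from the note above.
REQUIRED_FILES = (
    "results_statistics.tsv",
    "loci_summary_stats.tsv",
    "results_alleles.tsv",
)


def missing_input_files(results_dir):
    """Return the required file names that are absent from results_dir."""
    folder = Path(results_dir)
    return [name for name in REQUIRED_FILES if not (folder / name).is_file()]
```

An empty list means that all three files were found in the folder.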

Basic Usage
:::::::::::

::

	chewBBACA.py AlleleCallEvaluator -i /path/to/AlleleCallResultsFolder -g /path/to/SchemaFolder -o /path/to/OutputFolder --cpu 4
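
If the module is launched from a Python script, the same command can be assembled as an argument list. This is only a sketch: the helper names ``build_command`` and ``run_evaluator`` are hypothetical, and the flags are the ones shown in the command above.

```python
import subprocess


def build_command(results_dir, schema_dir, output_dir, cpu=1):
    """Assemble the argument list for the Basic Usage command."""
    return [
        "chewBBACA.py", "AlleleCallEvaluator",
        "-i", str(results_dir),
        "-g", str(schema_dir),
        "-o", str(output_dir),
        "--cpu", str(cpu),
    ]


def run_evaluator(results_dir, schema_dir, output_dir, cpu=1):
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    command = build_command(results_dir, schema_dir, output_dir, cpu)
    return subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.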

Parameters
::::::::::

::

    -i, --input-files           (Required) Path to the directory that contains the allele calling results
                                generated by the AlleleCall module.

    -g, --schema-directory      (Required) Path to the schema's directory.

    -o, --output-directory      (Optional) Path to the output directory where the module will store intermediate
                                files and create the report HTML files (default: None).

    -a, --annotations           (Optional) Path to the TSV file created by the UniprotFinder module
                                (default: None).

    --cpu, --cpu-cores          (Optional) Number of CPU cores/threads that will be used to run the
                                process (chewie resets to a lower value if it is equal to or
                                exceeds the total number of available CPU cores/threads) (default: 1).

    --light                     (Optional) Do not compute the presence-absence matrix, the distance matrix,
                                and the Neighbor-Joining tree (default: False).

    --no-pa                     (Optional) Do not include the presence-absence matrix in the report.
                                The presence-absence matrix is still computed and included in the
                                output directory. Use this parameter if the presence-absence
                                matrix is not relevant for the evaluation or if the number
                                of genomes is too high and displaying the matrix would affect
                                the responsiveness of the report (default: False).

    --no-dm                     (Optional) Do not compute the distance matrix. Use this parameter if the
                                distance matrix is not relevant for the evaluation or if the number
                                of genomes is too high and computing and/or displaying the distance
                                matrix would become impracticable with the available resources (default: False).

    --no-tree                   (Optional) Do not compute the Neighbor-Joining tree. Use this parameter if the
                                tree is not relevant for the evaluation or if the number of genomes
                                is too high and computing the tree would become impracticable with
                                the available resources (default: False).

    --cg-alignment              (Optional) Compute the MSA of the core genome loci, even if `--no-tree` is
                                provided (default: False).

Outputs
:::::::

::

   OutputFolderName
   ├── allelecall_report.html
   ├── masked_profiles.tsv
   ├── presence_absence.tsv
   ├── cgMLST_profiles.tsv
   ├── distance_matrix_symmetric.tsv
   ├── protein_msa.fasta
   ├── cgMLST.tree
   └── report_bundle.js

Brief description of each output file:

- ``allelecall_report.html``: The report HTML file that can be opened with a web browser. The report contains the following components:

  - A table with the total number of samples, total number of loci, total number of coding sequences (CDSs) extracted from the samples, total number of CDSs classified and totals per classification type.
  - A tab panel with stacked bar charts for the classification type counts per sample and per locus.
  - A tab panel with detailed sample and locus statistics.
  - If a TSV file with annotations is provided to the ``--annotations`` parameter, the report will also include a table with the provided annotations. Otherwise, it will display a warning stating that no annotations were provided.
  - A Heatmap chart representing the loci presence-absence matrix for all samples in the dataset.
  - A Heatmap chart representing the allelic distance matrix for all samples in the dataset.
  - A tree drawn with `Phylocanvas.gl <https://www.npmjs.com/package/@phylocanvas/phylocanvas.gl>`_ based on the Neighbor-Joining (NJ) tree computed by `FastTree <http://www.microbesonline.org/fasttree/>`_.

- ``masked_profiles.tsv``: A TSV file containing the masked input profiles (the allelic profiles in the ``results_alleles.tsv`` file are masked to remove *INF-* prefixes and substitute all *non-EXC* and *non-INF* classifications with ``0``).

- ``presence_absence.tsv``: A TSV file containing the loci presence-absence matrix computed based on the input profiles.

- ``cgMLST_profiles.tsv``: A TSV file that contains the allelic profiles for the set of core loci.

- ``distance_matrix_symmetric.tsv``: A TSV file that contains the symmetric distance matrix. The matrix contains the pairwise allelic distances computed based on the set of core loci (shared by 100% of the samples).

- ``protein_msa.fasta``: A FASTA file containing the protein MSA for the core loci. For each locus in the core genome, the alleles found in all samples are translated and aligned with `MAFFT <https://mafft.cbrc.jp/alignment/software/>`_. The alignment files are concatenated to generate the full protein MSA. This file is created by calling the :doc:`ComputeMSA </user/modules/ComputeMSA>` module.

- ``cgMLST.tree``: A file that contains the Neighbor-Joining tree computed by FastTree in Newick format.

- ``report_bundle.js``: A JavaScript bundle file, necessary to generate the interactive HTML report.

.. warning::
  The JS bundle is necessary to generate the interactive HTML report. Do not move or delete this file.
  
.. note::
  If you want to share the report, simply compress the output folder and share the compressed archive. The receiver can simply uncompress the archive and open the report HTML file with a web browser.
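
The masking applied to create ``masked_profiles.tsv`` can be illustrated with a short sketch. It assumes that exact matches are stored as plain integers in ``results_alleles.tsv`` and that inferred alleles carry the *INF-* prefix; the function names are hypothetical and this is not the module's own code.

```python
def mask_value(value):
    """Mask one cell of an allelic profile."""
    # Strip the INF- prefix so that inferred alleles keep their identifier.
    if value.startswith("INF-"):
        value = value[len("INF-"):]
    # Assumption: any value that is not a plain integer is a special
    # classification and is replaced by "0".
    return value if value.isdigit() else "0"


def mask_profile(profile):
    """Mask every cell in a list of allele calls for one sample."""
    return [mask_value(value) for value in profile]
```

For example, ``mask_profile(["12", "INF-7", "NIPH"])`` gives ``["12", "7", "0"]``.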

Report Components
-----------------

The first component gives a small introduction that details the type of information contained in each component of the report.

.. image:: /_static/images/allelecall_report_description.png
   :width: 1400px
   :align: center

Results Summary Data
....................

The second component is a table with summary statistics about the allele calling results, such as:

  - **Total Samples**: Total number of samples in the dataset.
  - **Total Loci**: Total number of loci used to perform allele calling.
  - **Total CDSs**: Total number of CDSs identified in all the samples.
  - **Total CDSs Classified**: Total number of CDSs that were classified.
  - **EXC**: Total number of CDSs classified as EXC.
  - **INF**: Total number of CDSs classified as INF.
  - **PLOT3**: Total number of CDSs classified as PLOT3.
  - **PLOT5**: Total number of CDSs classified as PLOT5.
  - **LOTSC**: Total number of CDSs classified as LOTSC.
  - **NIPH**: Total number of NIPH classifications (the NIPH classification includes multiple CDSs).
  - **NIPHEM**: Total number of NIPHEM classifications (the NIPHEM classification includes multiple CDSs).
  - **ALM**: Total number of CDSs classified as ALM.
  - **ASM**: Total number of CDSs classified as ASM.
  - **PAMA**: Total number of PAMA classifications.

.. image:: /_static/images/allelecall_report_summary.png
   :width: 1400px
   :align: center

Please visit the section about the :doc:`AlleleCall </user/modules/AlleleCall>` module if you want to know more about the classification types.

Classification Counts
.....................

The third component contains two panels with stacked bar charts displaying the classification type counts per sample and per locus.

- Panel A, ``Counts Per Sample``, displays the stacked bar charts for the sample classification type counts.

.. image:: /_static/images/allelecall_report_sample_counts.png
   :width: 1400px
   :align: center

- Panel B, ``Counts Per Locus``, displays the stacked bar charts for the loci classification type counts.

.. image:: /_static/images/allelecall_report_loci_counts.png
   :width: 1400px
   :align: center

The plot area displays data for at most 300 samples/loci at a time. You can click the left/right arrows to view the previous/next 300 samples/loci and the double left/right arrows to view the data for the first/last 300 samples/loci. The component includes a slider to select the range of sample/loci bars that are visible.

Detailed Statistics
...................

The fourth component contains two panels with tables with detailed statistics about the results per sample and per locus.

The ``Sample Stats`` table includes the following columns:
  
  - **Sample**: The sample unique identifier.
  - **Total Contigs**: Total number of contigs in the sample FASTA file.
  - **Total CDSs**: Total number of CDSs identified in the sample.
  - **Proportion of Classified CDSs**: The proportion of CDSs identified in the sample that were classified.
  - **Identified Loci**: The number of schema loci identified in the sample.
  - **Proportion of Identified Loci**: The proportion of schema loci that were identified in the sample.
  - **Valid Classifications**: Total number of valid classifications (EXC and INF).
  - **Invalid Classifications**: Total number of invalid classifications (PLOT3, PLOT5, LOTSC, NIPH, NIPHEM,
    ALM, ASM and PAMA).

The ``Loci Stats`` table includes the following columns:

  - **Locus**: The locus unique identifier.
  - **Total CDSs**: Total number of CDSs classified for that locus.
  - **Valid Classifications**: Total number of valid classifications (EXC+INF).
  - **Invalid Classifications**: Total number of invalid classifications (PLOT3, PLOT5, LOTSC, NIPH, NIPHEM,
    ALM, ASM and PAMA).
  - **Proportion Samples**: The proportion of samples the locus was identified in.

.. note::
   You can use the table **View Columns** feature to display columns with the count for each classification type.
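
The split between valid and invalid classifications can be sketched as follows. This is an illustration only, assuming that EXC calls appear as plain integers and INF calls as ``INF-`` followed by the allele identifier; the function names are hypothetical.

```python
from collections import Counter

# Classification types listed as invalid in the tables above.
INVALID = ("PLOT3", "PLOT5", "LOTSC", "NIPH", "NIPHEM", "ALM", "ASM", "PAMA")


def classify(value):
    """Map one allele call to its classification type."""
    if value.startswith("INF-"):
        return "INF"
    if value.isdigit():
        return "EXC"  # assumption: exact matches are plain integers
    return value


def count_valid_invalid(profile):
    """Return (valid, invalid) classification counts for a list of calls."""
    counts = Counter(classify(value) for value in profile)
    valid = counts["EXC"] + counts["INF"]
    invalid = sum(counts[label] for label in INVALID)
    return valid, invalid
```

Values that belong to neither group are left out of both counts.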

The dropdown menu below the tables allows the selection of a single column to generate a histogram for the values in the selected column.

.. image:: /_static/images/allelecall_report_detailed_stats.png
   :width: 1400px
   :align: center

Loci Annotations
................

If a TSV file with loci annotations is provided, the fifth component of the report is a table with the list of annotations. Otherwise, it will display a warning stating that no annotations were provided.

.. image:: /_static/images/allelecall_report_annotations.png
   :width: 1400px
   :align: center

If a column name includes ``URL``, the AlleleCallEvaluator module assumes that the values in that column are URLs and creates links to the web pages.

.. important::
  The first column in the TSV file with annotations must be named ``Locus`` and contain the identifiers of the loci (the basename of the locus FASTA file without the ``.fasta`` extension).
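
The two rules above (a first column named ``Locus`` and link creation for columns whose name includes ``URL``) can be checked with a small sketch like the one below. The function name and the example column names are hypothetical; the module may validate the file differently.

```python
import csv


def read_annotations(handle):
    """Read a TSV with annotations and return (rows, url_columns)."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, None)
    if not header or header[0] != "Locus":
        raise ValueError("the first column must be named 'Locus'")
    # Columns whose name includes "URL" are treated as links.
    url_columns = [name for name in header if "URL" in name]
    rows = [dict(zip(header, row)) for row in reader]
    return rows, url_columns
```

The function accepts any open text file, for example ``open("annotations.tsv", newline="")``.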

You can use the :doc:`UniprotFinder </user/modules/UniprotFinder>` module to annotate the loci in a schema created with chewBBACA. If you want to annotate an external schema, you can adapt it with the :doc:`PrepExternalSchema </user/modules/PrepExternalSchema>` module followed by annotation with the :doc:`UniprotFinder </user/modules/UniprotFinder>` module.

Loci Presence-Absence
.....................

The sixth component displays a heatmap representing the loci presence-absence matrix for all samples in the dataset. Blue cells (z=1) correspond to loci presence and grey cells (z=0) to loci absence. The **Select Sample** dropdown menu enables the selection of a single sample to display its heatmap on top of the main heatmap. The **Select Locus** dropdown menu enables the selection of a single locus to display its heatmap on the right of the main heatmap.
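
As a rough sketch of how such a matrix can be derived, the code below converts masked profiles (where ``0`` marks an absent locus) into presence-absence values and lists the loci present in every sample. The function names and the dictionary layout are assumptions made for illustration.

```python
def presence_absence(masked_profiles):
    """Convert {sample: [masked allele ids]} into {sample: [0 or 1]}."""
    return {
        sample: [0 if value == "0" else 1 for value in profile]
        for sample, profile in masked_profiles.items()
    }


def core_loci(loci, pa_matrix):
    """Return the loci present (value 1) in 100% of the samples."""
    rows = list(pa_matrix.values())
    return [
        locus
        for index, locus in enumerate(loci)
        if all(row[index] == 1 for row in rows)
    ]
```

With two samples ``A = ["1", "0", "3"]`` and ``B = ["2", "4", "3"]`` over the loci ``l1``, ``l2`` and ``l3``, only ``l1`` and ``l3`` are present in both samples.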

.. image:: /_static/images/allelecall_report_pa_heatmap.png
   :width: 1400px
   :align: center

Allelic Distances
.................

The seventh component displays a heatmap representing the symmetric distance matrix. The distances are computed by determining the number of allelic differences from the set of core loci (shared by 100% of the samples) between each pair of samples. The **Select Sample** dropdown menu enables the selection of a single sample to display its heatmap on top of the main heatmap. The menu after the heatmap enables the selection of a single sample and of a distance threshold to display a table with the list of samples at a distance equal to or smaller than the specified distance value.
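
The distance computation and the threshold filter can be sketched as shown below. The code assumes that the profiles only contain core loci (so every position holds a valid allele identifier); the function names are hypothetical.

```python
def allelic_distance(profile_a, profile_b):
    """Count the core loci at which two samples have different alleles."""
    return sum(a != b for a, b in zip(profile_a, profile_b))


def distance_matrix(core_profiles):
    """Build a symmetric {sample: {sample: distance}} matrix."""
    samples = list(core_profiles)
    return {
        a: {b: allelic_distance(core_profiles[a], core_profiles[b]) for b in samples}
        for a in samples
    }


def samples_within(matrix, sample, threshold):
    """List (distance, sample) pairs at a distance <= threshold."""
    return sorted(
        (distance, other)
        for other, distance in matrix[sample].items()
        if other != sample and distance <= threshold
    )
```

The last function mirrors the table shown after the heatmap: given a sample and a threshold, it returns the other samples at or below that distance, closest first.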

.. image:: /_static/images/allelecall_report_dm_heatmap.png
   :width: 1400px
   :align: center

Core-genome Neighbor-Joining Tree
.................................

The last component displays a tree drawn with `Phylocanvas.gl <https://www.npmjs.com/package/@phylocanvas/phylocanvas.gl>`_ based on the Neighbor-Joining (NJ) tree computed by `FastTree <http://www.microbesonline.org/fasttree/>`_ (with the options ``-fastest``, ``-nosupport`` and ``-noml``). The tree is computed from the MSA for the set of loci that constitute the core genome: the MSA for each core locus is determined with `MAFFT <https://mafft.cbrc.jp/alignment/software/>`_ (with the options ``--retree 1`` and ``--maxiterate 0``), and the MSAs for all the core loci are concatenated to create the full MSA.
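
The concatenation step can be illustrated with a short sketch. It assumes that each per-locus alignment is available as a dictionary mapping sample identifiers to aligned sequences of equal length; the function names and this data layout are hypothetical, not the module's implementation.

```python
def concatenate_alignments(locus_alignments):
    """Join per-locus alignments ({sample: aligned seq}) into one MSA."""
    parts = {}
    for alignment in locus_alignments:
        # All sequences in one locus alignment must have the same length.
        if len({len(seq) for seq in alignment.values()}) > 1:
            raise ValueError("aligned sequences must share one length")
        # Core loci are expected to include every sample.
        if parts and set(alignment) != set(parts):
            raise ValueError("every locus must include the same samples")
        for sample, seq in alignment.items():
            parts.setdefault(sample, []).append(seq)
    return {sample: "".join(seqs) for sample, seqs in parts.items()}


def to_fasta(msa):
    """Format {sample: sequence} as FASTA text."""
    return "".join(f">{sample}\n{seq}\n" for sample, seq in msa.items())
```

Each sample ends up with a single record whose length is the sum of the per-locus alignment lengths.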

.. image:: /_static/images/allelecall_report_cgMLST_tree.png
   :width: 1400px
   :align: center

Workflow of the AlleleCallEvaluator module
::::::::::::::::::::::::::::::::::::::::::

.. image:: /_static/images/AlleleCallEvaluator.png
   :width: 1000px
   :align: center

The AlleleCallEvaluator module analyses allele calling results to create a report that allows users to explore results interactively. Brief description of the workflow:

	- The process starts by importing and computing sample and loci statistics based on the allele calling results. The sample and loci statistics are included in interactive data tables and charts on the main page of the HTML report.

	- Loci annotations are imported and included in the main page of the report if provided. If the ``--light`` option is provided, the process does not add more information to the report.

	- Otherwise, the allelic profiles are imported and masked to remove *INF-* prefixes and substitute special classifications by ``0``.

	- The masked profiles serve as the basis for computing a presence-absence (PA) matrix, enabling the determination of the set of loci that constitute the core genome.

	- The profile data for the core loci are used to compute a matrix of allelic distances.

	- The core loci alleles identified per strain and locus are imported to compute the cgMLST alignment that FastTree uses to compute a Neighbor-Joining (NJ) tree.

	- The PA and distance matrices and NJ tree data are included in the report to be displayed and explored interactively.
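
The way the optional parameters shape the workflow can be summarised in a small sketch. The function and key names are hypothetical, and the sketch assumes that ``--light`` takes precedence over every other option (including ``--cg-alignment``), which the parameter descriptions do not state explicitly.

```python
STEPS = ("compute_pa", "show_pa", "compute_dm", "compute_msa", "compute_tree")


def plan_steps(light=False, no_pa=False, no_dm=False, no_tree=False,
               cg_alignment=False):
    """Return which optional workflow steps would run for the given flags."""
    if light:
        # Assumption: --light skips the matrices, the MSA and the tree.
        return dict.fromkeys(STEPS, False)
    return {
        "compute_pa": True,          # still computed when --no-pa is set
        "show_pa": not no_pa,        # --no-pa only hides it in the report
        "compute_dm": not no_dm,
        "compute_msa": (not no_tree) or cg_alignment,
        "compute_tree": not no_tree,
    }
```

For instance, ``plan_steps(no_tree=True, cg_alignment=True)`` keeps the MSA step but skips the tree.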