UniprotFinder - Retrieve annotations for loci in a schema

The UniprotFinder module retrieves annotations for the loci in a schema in two ways: through requests to UniProt’s SPARQL endpoint, and through alignment with BLASTp against UniProt’s reference proteomes for a set of taxa.

Basic Usage

chewBBACA.py UniprotFinder -g /path/to/SchemaFolder -o /path/to/OutputFolder -t /path/to/cds_coordinates.tsv --taxa "Species Name" --cpu 4
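
When the process is launched from a script, the command above can be assembled programmatically. The following is a minimal sketch, not part of chewBBACA: the helper name build_uniprotfinder_command is hypothetical, it only uses flags listed in this document, and it assumes that each taxon name is passed to --taxa as a separate argument.

```python
import shlex


def build_uniprotfinder_command(schema_dir, output_dir, protein_table=None,
                                taxa=None, cpu=1):
    """Return the UniprotFinder command as a list of arguments.

    Hypothetical helper, not part of chewBBACA. It only uses flags
    that are listed in this document.
    """
    command = ["chewBBACA.py", "UniprotFinder",
               "-g", schema_dir, "-o", output_dir]
    if protein_table is not None:
        command += ["-t", protein_table]
    if taxa:
        # Assumption: each taxon name is passed as its own argument.
        command += ["--taxa", *taxa]
    command += ["--cpu", str(cpu)]
    return command


command = build_uniprotfinder_command(
    "/path/to/SchemaFolder", "/path/to/OutputFolder",
    protein_table="/path/to/cds_coordinates.tsv",
    taxa=["Species Name"], cpu=4)

# Print a shell-quoted version of the command for review.
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) once the placeholder paths are replaced with real ones.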

Parameters

-g, --schema-directory (Required) Path to the schema's directory.

-o, --output-directory (Required) Output directory where the process will store intermediate
                       files and save the final TSV file with the loci annotations.

--gl, --genes-list     (Optional) Path to a file with the list of loci in the schema that the process should
                       find annotations for (one per line, full paths or loci IDs) (default: False).

-t, --protein-table    (Optional) Path to the TSV file with coding sequence (CDS) coordinate data,
                       "cds_coordinates.tsv", created by the CreateSchema process. (default: None).

--bsr                  (Optional) BLAST Score Ratio (BSR) value. The BSR is only used when taxa names are
                       provided to the --taxa parameter and local sequences are aligned against reference
                       proteomes downloaded from UniProt. Annotations are selected based on a BSR
                       greater than or equal to the specified value (default: 0.6).

--cpu, --cpu-cores     (Optional) Number of CPU cores/threads that will be used to run the process
                       (chewie resets to a lower value if it is equal to or exceeds the total number
                       of available CPU cores/threads) (default: 1).

--taxa                 (Optional) List of scientific names for a set of taxa. The process will download
                       reference proteomes from UniProt associated with taxa names that contain any
                       of the provided terms. The schema representative alleles are aligned
                       against the reference proteomes to assign annotations based on high-BSR matches
                       (default: None).

--pm                   (Optional) Maximum number of proteome matches to report (default: 1).

--no-sparql            (Optional) Do not search for annotations through the UniProt SPARQL endpoint.

--no-cleanup           (Optional) If provided, intermediate files generated during process execution are not
                       removed at the end (default: False).

--b, --blast-path      (Optional) Path to the directory that contains the BLAST executables.
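
To illustrate how the --bsr and --pm values could interact, the sketch below computes the BSR in its usual form (the raw BLAST score of an alignment divided by the raw score of the query aligned against itself) and keeps the best matches per locus. It is an illustration under stated assumptions, not chewBBACA's implementation: the function names, the input layout, and the example scores are all made up.

```python
from collections import defaultdict


def blast_score_ratio(alignment_score, self_score):
    """Raw alignment score divided by the query's self-alignment score."""
    if self_score <= 0:
        raise ValueError("self-alignment score must be positive")
    return alignment_score / self_score


def select_matches(hits, self_scores, bsr_threshold=0.6, max_matches=1):
    """Keep, per locus, up to max_matches hits with BSR >= bsr_threshold.

    hits: iterable of (locus_id, proteome_protein_id, raw_score) tuples.
    self_scores: dict mapping locus_id to its self-alignment raw score.
    Both layouts are assumptions made for this example.
    """
    candidates = defaultdict(list)
    for locus, target, score in hits:
        bsr = blast_score_ratio(score, self_scores[locus])
        if bsr >= bsr_threshold:
            candidates[locus].append((target, bsr))

    selected = {}
    for locus, matches in candidates.items():
        # Highest BSR first, then truncate to the requested number.
        matches.sort(key=lambda match: match[1], reverse=True)
        selected[locus] = matches[:max_matches]
    return selected


# Made-up scores, for illustration only.
example_hits = [("locus1", "P1", 180.0),
                ("locus1", "P2", 90.0),
                ("locus2", "P3", 50.0)]
example_self_scores = {"locus1": 200.0, "locus2": 100.0}

print(select_matches(example_hits, example_self_scores))
```

With the default threshold of 0.6, only the first hit for locus1 is kept in this example; lowering the threshold or raising max_matches lets more matches through.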

Outputs

The process writes a TSV file, schema_annotations.tsv, with the annotations found for each locus to the directory passed to the -o parameter. By default, the process searches for annotations through UniProt’s SPARQL endpoint and reports the product name and UniProt URL for local loci with an exact match in UniProt’s database. If the cds_coordinates.tsv file is passed to the -t parameter, the output file also includes the loci coordinates.

If taxa names are passed to the --taxa parameter, the process searches for reference proteomes that match the provided terms and downloads them. The loci representative alleles are then aligned against the reference proteomes, and the output file also includes the product and gene name of the matches found in the reference proteomes.
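
The output file can be inspected with any TSV reader. The sketch below uses Python's csv module and makes no assumption about the column names in schema_annotations.tsv: it reads whatever header the file has. The helper names and the in-memory example (including its column names) are hypothetical placeholders, not the real output layout.

```python
import csv
import io


def read_annotations(handle):
    """Parse a tab-separated file object into (column names, row dicts)."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = list(reader)
    return reader.fieldnames, rows


def count_filled(rows, column):
    """Count rows that have a non-empty value in the given column."""
    return sum(1 for row in rows if row.get(column))


# Placeholder content; the column names here are made up and do not
# reflect the actual header of schema_annotations.tsv.
example = io.StringIO("Locus\tExample_Field\nlocus1\tvalue1\nlocus2\t\n")
columns, rows = read_annotations(example)

print(columns)
print(len(rows), "rows,", count_filled(rows, "Example_Field"), "with a value")
```

For a real run, open the file with open("/path/to/OutputFolder/schema_annotations.tsv", newline="") and pass the handle to read_annotations, then check the returned column names before relying on any of them.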