POTION: an end-to-end pipeline for positive Darwinian selection detection in genome-scale data through phylogenetic comparison of protein-coding genes

Detection of genes evolving under positive Darwinian selection in genome-scale data is now a prevailing strategy in comparative genomics studies to identify genes potentially involved in adaptation processes. Despite the many studies aiming to detect and contextualize such gene sets, there is virtually no software available to perform this task in a general, automated, large-scale and reliable manner. This is likely due to the computational challenges involved in the task, such as the appropriate modeling of the data under analysis, the computation time needed to perform several of the required steps when dealing with genome-scale data, and the highly error-prone nature of the sequence and alignment data structures required for genome-wide positive selection detection.


We present POTION, an open source, modular and end-to-end software for genome-scale detection of positive Darwinian selection in groups of homologous coding sequences. Our software represents a key step towards genome-scale, automated detection of positive selection, from predicted coding sequences and their homology relationships to high-quality groups of positively selected genes. POTION reduces false positives through several sophisticated sequence and group filters based on numeric, phylogenetic, quality and conservation criteria to remove spurious data and through multiple hypothesis corrections, and considerably reduces computation time thanks to a parallelized design. Our software achieved a high classification performance when used to evaluate a curated dataset of Trypanosoma brucei paralogs previously surveyed for positive selection. When used to analyze predicted groups of homologous genes of 19 strains of Mycobacterium tuberculosis as a case study, we demonstrated that the filters implemented in POTION remove sources of error that commonly inflate error rates in positive selection detection. A thorough literature review found no other software similar to POTION in terms of customization, scale and automation.


To the best of our knowledge, POTION is the first tool to allow users to construct and test hypotheses regarding the occurrence of site-based evidence of positive selection in non-curated, genome-scale data within a feasible time frame and with no human intervention after initial configuration. POTION is available at


Maturation of second-generation sequencing technologies has created a wealth of genomic data to be systematically analyzed through several comparative genomic techniques in order to extract biological information from the patterns of conservation and variation observed in genomic elements shared across genomes [1–3]. A classic analysis in the field of comparative genomics is the genome-scale computational search for groups of homologous genes evolving under positive Darwinian selection, often defined as genes having an elevated nonsynonymous substitution rate, since these groups of genes are of great interest to the understanding of how evolution works at the molecular level [4, 5].
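The "elevated nonsynonymous substitution rate" criterion is usually formalized as a ratio of two rates. As a brief sketch in standard notation (the symbols d_N, d_S and ω are not introduced in the text above and are used here only for illustration):

```latex
\omega = \frac{d_N}{d_S}, \qquad
\begin{cases}
\omega < 1 & \text{purifying (negative) selection} \\
\omega = 1 & \text{neutral evolution} \\
\omega > 1 & \text{positive (Darwinian) selection}
\end{cases}
```

Here d_N denotes the number of nonsynonymous substitutions per nonsynonymous site and d_S the number of synonymous substitutions per synonymous site.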

Studies of this nature are used to identify genes involved in speciation [6] and in the emergence of new phenotypic traits that increase evolutionary fitness [7–9]. Genome-scale searches for positive selection have also been widely used to detect genes involved in the host-pathogen co-evolutionary “arms race” in the genomes of several important pathogenic taxa such as Escherichia coli [10, 11], Salmonella [12], Staphylococcus [13], Streptococcus [14], Trypanosoma brucei [15] and Campylobacter [16], among many others. On the host side, a significantly high number of genes involved in immunity-related processes were also detected in genome-wide searches for positive selection in mammalian genomes [8].

While the considerable number of genome-scale positive selection detection (GSPSD) studies has produced a great deal of valuable biological information, there is a lack of specific software to perform this task in a general, automated, fast and statistically sound manner. Several factors are responsible for this situation. One important factor is that the automatic detection of positive selection in molecular data is not trivial from the computational point of view, since it requires data structures that are computationally expensive to generate. It is prohibitive to run the required analyses, such as multiple sequence alignment, phylogenetic tree reconstruction and fitting of distinct codon evolutionary models to the data, on thousands of groups of homologs using single-processor software within a feasible time frame [17].
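Because each group of homologs can be analyzed independently of the others, the per-group work lends itself to concurrent execution. The following is a minimal Python sketch of that idea, not POTION's actual implementation: `analyze_group` and `analyze_all` are hypothetical names, and the placeholder body merely counts sequences where a real pipeline would run alignment, tree reconstruction and codon model fitting.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze_group(group_id, sequences):
    """Hypothetical stand-in for the per-group analysis.

    A real pipeline would align the sequences, build a tree and fit
    codon models here; this placeholder only counts the sequences.
    """
    return group_id, len(sequences)


def analyze_all(groups, max_workers=4):
    """Run analyze_group on every group concurrently.

    Returns a dict mapping each group id to its result.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(analyze_group, group_id, seqs)
            for group_id, seqs in groups.items()
        ]
        return dict(future.result() for future in futures)


# Toy input: two groups of homologous coding sequences (illustrative only).
groups = {
    "group_1": ["ATGAAA", "ATGAAG"],
    "group_2": ["ATGCCC", "ATGCCA", "ATGCCG"],
}
results = analyze_all(groups)
```

Threads are adequate when each task mostly waits on an external program started as a subprocess; for CPU-bound work done in Python itself, `ProcessPoolExecutor` from the same module would be the more appropriate choice.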

Another important aspect is the highly error-prone nature of the sequence and alignment data structures needed for GSPSD [18]. Several sources of error that may produce spurious signals of positive selection are introduced during common bioinformatics procedures, such as genome assembly and gene prediction. Among these errors are frame shifts, sequence ambiguities, gene fragments, chimeric sequences and pseudogenes treated as functional coding regions. Another common source of error is the recruitment of excessively divergent sequences to groups of homologous genes during automatic homology prediction. All of the aforementioned errors can produce spurious alignment of non-homologous codons and significantly hamper the reliable detection of positive selection [18, 19]. The occurrence of recombination events within homologous sequences may also significantly hinder reliable GSPSD, since the codon evolution models commonly used to detect positive selection do not account for recombination as a source of variation at homologous positions and assume that all columns of a multiple codon alignment share the same evolutionary history [20]. Several predicted groups of homologous genes also contain mixed sets of 1-1 orthologs and paralogs, two biologically distinct gene groups that should be analyzed separately to investigate different biological questions [21]. Finally, the simultaneous search for recombination and/or positive selection in several groups of homologs creates a multiple hypothesis-testing scenario that requires proper statistical treatment to control the frequency of Type I errors [8, 22].
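To illustrate the multiple hypothesis-testing issue, the sketch below applies the Benjamini-Hochberg procedure to per-group p-values. This is one common correction, chosen here as an assumption: the text above does not state which method is applied, and Benjamini-Hochberg controls the false discovery rate rather than the family-wise Type I error rate. The group names, p-values and the 0.05 threshold are invented for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values, one test per group of homologs.
p_values = {"group_1": 0.001, "group_2": 0.04, "group_3": 0.03, "group_4": 0.5}
names = list(p_values)
q_values = benjamini_hochberg([p_values[name] for name in names])

# Groups that remain significant at a 5% false discovery rate.
significant = [name for name, q in zip(names, q_values) if q <= 0.05]
```

In this toy input, three groups have raw p-values below 0.05, but only `group_1` remains significant after correction, which shows how testing many groups at once inflates the number of apparent positives.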

Here we report POTION (POsitive selecTION), a novel end-to-end, modular, customizable and parallelized pipeline that overcomes the challenges reported above to detect positive selection in genome-scale data in batch mode. POTION allows users to easily and quickly analyze their own genomic data of interest (large numbers of predicted genes and their homology relationships) for signs of positive selection. We demonstrate that POTION is able to classify a curated dataset of T. brucei paralogs previously surveyed for positive selection with high precision. As a case study to illustrate some of the unique features found in POTION, such as the sophisticated sequence and group filters and the heavily parallelized design, we used our software to survey the complete set of coding sequences of 19 Mycobacterium tuberculosis strains using distinct configuration sets, to specifically emphasize how such features significantly change the quantity and quality of homologs predicted to evolve under positive selection, or the time needed to process genome-scale datasets. POTION detected several groups of positively selected homologous genes with known roles in the host-pathogen “arms race”, as expected for genes under Darwinian selection in a parasitic species. An extensive literature review found no single pipeline that contains all the software, features and flexibility tied together in an integrated environment to perform GSPSD in an automated manner. To researchers lacking bioinformatics expertise, POTION offers the first end-to-end workflow to perform GSPSD, although some bioinformatics skills are still needed to properly install and configure POTION. To bioinformaticians, POTION provides a customizable computational scaffold to perform GSPSD experiments in a controlled and integrated environment.
POTION is distributed under the GNU General Public License version 3.0 and can be downloaded at lmb.cnptia.embrapa.br/share/POTION/.