G4RNA_screener

G4RNA screener help


This page gives a short overview of the G4RNA screener web GUI, which is designed to provide fast and easy access to the tool described by Garant JM, Perreault JP, Scott MS. Bioinformatics. 2017. G4RNA screener identifies regions in RNA sequences that are prone to fold into a G-quadruplex structure (G4). Three scoring systems are available to describe how likely a G4 is to be observed:

cGcC (Consecutive G over consecutive C ratio)

This ratio was implemented to address the competition between G4 and Watson-Crick based structures. Cytosine runs in the vicinity of a potential G4 were shown to be an important feature to consider when identifying potential G4s, since base pairing of those C runs with the G runs involved in the potential G4 can hinder its formation. Because it is a ratio, this score varies on a logarithmic scale. For more information on the cGcC score, see Beaudoin JD, Jodoin R, Perreault JP. Nucleic Acids Res. 2014.
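The idea of a consecutive-G over consecutive-C ratio can be illustrated with a minimal sketch. The weighting used here (10 points times the stretch length for every stretch of consecutive bases) and the handling of sequences without any C are assumptions made for illustration; this page does not state them, and the cited paper remains the reference for the exact definition. The function names are hypothetical.

```python
import re


def run_score(seq: str, base: str) -> int:
    """Score every stretch of consecutive `base` found in `seq`.

    Assumed weighting: a stretch of length k is worth 10 * k, and a run
    of n bases contains (n - k + 1) stretches of length k.
    """
    total = 0
    for run in re.findall(f"{base}+", seq):
        n = len(run)
        total += sum(10 * k * (n - k + 1) for k in range(1, n + 1))
    return total


def cgcc(seq: str) -> float:
    """Ratio of the consecutive-G score over the consecutive-C score."""
    seq = seq.upper()
    g_score = run_score(seq, "G")
    c_score = run_score(seq, "C")
    # Assumption: with no C at all, the G score is returned unchanged
    # to avoid dividing by zero.
    return g_score / c_score if c_score else float(g_score)
```

With this sketch, `cgcc("GGC")` gives 4.0 and `cgcc("CCG")` gives 0.25, which shows why a ratio of this kind is best read on a logarithmic scale: swapping the G and C content inverts the score rather than negating it.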

G4H (G4Hunter)

This arithmetic score was designed in a similar way to the cGcC score but was built to analyze DNA sequences. We demonstrated its relevance for RNA as well. It attributes an increasing positive score to each G in a contiguous run of Gs and the negative counterpart to each C in a contiguous run of Cs. The sequence is then scored by the average of these values. For more information on the G4H score, see Bedrat A, Lacroix L, Mergny JL. Nucleic Acids Res. 2016.
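The description above can be sketched as follows. Each G receives the length of the run it belongs to, each C receives the negative of that length, every other base receives 0, and the score is the mean. The cap of 4 on the per-base value is an assumption taken from the usual description of G4Hunter and is not stated on this page; the function name is hypothetical.

```python
import re


def g4hunter(seq: str, cap: int = 4) -> float:
    """Average per-base score: +run length for G, -run length for C."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    values = [0] * len(seq)
    for match in re.finditer(r"G+|C+", seq):
        # Assumption: runs longer than `cap` score the same as `cap`.
        magnitude = min(len(match.group()), cap)
        sign = 1 if match.group()[0] == "G" else -1
        for i in range(match.start(), match.end()):
            values[i] = sign * magnitude
    return sum(values) / len(seq)
```

For example, `g4hunter("GGGA")` returns 2.25 (three Gs worth 3 each, averaged over four bases), while `g4hunter("GGCC")` returns 0.0 because the G and C contributions cancel.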

G4NN (G4 Neural Network)

Sequences of the G4RNA database were converted into vectors of their trinucleotide content to train an artificial neural network. The resulting tool evaluates the similarity of a given sequence to the sequences of G4RNA and expresses this similarity as a score between 0 and 1. For more information on the G4NN score, see Garant JM, Perreault JP, Scott MS. Bioinformatics. 2017.
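The encoding step alone (not the network) can be sketched as below. Whether the tool uses raw counts or frequencies, the order of the 64 trinucleotides, and the treatment of T as U are assumptions made for illustration; the names are hypothetical.

```python
from itertools import product

ALPHABET = "ACGU"
# All 64 possible trinucleotides, in an arbitrary but fixed order.
TRINUCLEOTIDES = ["".join(p) for p in product(ALPHABET, repeat=3)]


def trinucleotide_vector(seq: str) -> list[int]:
    """Count overlapping trinucleotides in `seq` as a 64-element vector."""
    seq = seq.upper().replace("T", "U")  # assumption: DNA input read as RNA
    counts = dict.fromkeys(TRINUCLEOTIDES, 0)
    for i in range(len(seq) - 2):
        trinucleotide = seq[i:i + 3]
        if trinucleotide in counts:  # skip anything outside A, C, G, U
            counts[trinucleotide] += 1
    return [counts[t] for t in TRINUCLEOTIDES]
```

A 60 nt window yields 58 overlapping trinucleotides, so the vector entries sum to 58 when the window contains only supported bases.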


Analysis parameters

FASTA input area

FASTA input can be typed or pasted directly into the text area. Both multi-line and single-line formats are supported. To ensure reliable computing speed and accessibility for all users, a FASTA submission is limited to a total of 20 000 characters. The sequence must be described entirely with G, C, A and U/T (the ambiguity codes R, Y, S, W, N, etc. are not supported). Any description line is accepted, but to benefit fully from the tool's automated retrieval of information, the RefSeq description line format is recommended. The Ensembl format is supported to a lesser extent*:

>Accession_number_or_any_ID range=chr11:62573802-62574000 strand=+
>Ensembl_ID dna:chromosome chromosome:GRCh38:19:18440663:18524127:1

* Indicating the negative strand will anchor the windows to the end of the genomic position range. The sequence provided must be the transcript sequence. G4RNA screener will not process the reverse complement of the sequence provided.
Providing a RefSeq accession number or Ensembl ID will trigger the retrieval of cross-references between RefSeq, Ensembl and HGNC.
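As an illustration of what the recommended description line carries, the sketch below extracts the identifier, range and strand from a RefSeq-style line. It is not the tool's own parser: the function name, the returned keys and the regular expressions are hypothetical, and the Ensembl-style line is deliberately not handled.

```python
import re

RANGE = re.compile(r"range=(\w+):(\d+)-(\d+)")
STRAND = re.compile(r"strand=([+-])")


def parse_description(line: str) -> dict:
    """Extract ID, chromosomal range and strand from a description line."""
    text = line.lstrip(">").strip()
    info = {
        "id": text.split()[0] if text else None,
        "chromosome": None,
        "start": None,
        "end": None,
        "strand": None,
    }
    match = RANGE.search(text)
    if match:
        info["chromosome"] = match.group(1)
        info["start"] = int(match.group(2))
        info["end"] = int(match.group(3))
    match = STRAND.search(text)
    if match:
        info["strand"] = match.group(1)
    return info
```

Applied to the first example line above, this returns the ID `Accession_number_or_any_ID`, chromosome `chr11`, start 62573802, end 62574000 and strand `+`. A line without `range=` or `strand=` simply leaves those fields as `None`.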

FASTA file import button

A FASTA file can be imported by using the browse button to select a file on the user's computer. The file must meet the same restrictions as the input area described above. The import function accepts files of up to 30 kB, slightly more than the 20 000 characters of the input area, since it requires a little less data formatting.
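Users who prepare submissions with a script may want to check the restrictions before uploading. The following is a minimal sketch of such a check, not the server's validation: it assumes the 20 000 character limit counts every character of the submission (description lines and line breaks included) and that lowercase letters are accepted. The function name is hypothetical.

```python
MAX_CHARS = 20_000
ALLOWED = set("GCAUT")


def check_fasta(text: str) -> list[str]:
    """Return a list of problems found; an empty list means none."""
    problems = []
    # Assumption: the limit applies to the whole submission text.
    if len(text) > MAX_CHARS:
        problems.append(f"{len(text)} characters exceeds {MAX_CHARS}")
    for number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # blank lines and description lines are not checked
        unsupported = set(line.upper()) - ALLOWED
        if unsupported:
            problems.append(
                f"line {number}: unsupported characters {sorted(unsupported)}"
            )
    return problems
```

For instance, `check_fasta(">seq1\nGGNA")` reports the `N` on line 2, while `check_fasta(">seq1\nGGGAUC\n")` returns an empty list.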

Window size

The sequences provided are processed using a sliding window. A size of 60 nt is recommended for G4NN as this is the size used for G4NN's training. G4H was originally reported to be used with a 25 nt window but we demonstrated its good predictive power with a 60 nt window. The cGcC score was originally used on sequences ranging between 44 and 131 nt.

Step size

The step size defines how far the window slides between two consecutive windows. The default value is 10 nt, which is an efficient compromise between resolution and computing time. Reducing the step size to 5 nt roughly doubles the number of windows processed and the computing time.
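The interaction between window size and step size can be sketched as follows. How the tool treats the tail of a sequence is not described here, so this sketch assumes that windows which would run past the end are dropped and that a sequence shorter than the window yields a single, shorter window. The function name is hypothetical.

```python
def sliding_windows(seq: str, size: int = 60, step: int = 10):
    """Yield (offset, window) pairs; offsets are 0-based."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    # Assumption: only full-size windows are kept, except for a
    # sequence shorter than `size`, which gives one short window.
    last_start = max(len(seq) - size, 0)
    for start in range(0, last_start + 1, step):
        yield start, seq[start:start + size]
```

Under these assumptions a 200 nt sequence gives 15 windows with the default 10 nt step and 29 windows with a 5 nt step, which is the near doubling mentioned above.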

Score thresholds

All scores are included by default but any of the three can be unselected for the required analysis. We provide thresholds that were shown to maximise specificity and sensitivity in the validation assay performed on the sequences of G4RNA database. All windows are displayed in the results. Each score above its associated threshold in a particular window is displayed in green for a fast identification.


Display parameters

Each selected parameter adds a column to the results table.

Description

Displays the description that follows the ">" character for each sequence of the FASTA. We recommend using a RefSeq accession number (starting with NM, XM, NP or XP) or an Ensembl ID (starting with ENSG or ENST) followed by a space character. This description pattern triggers G4RNA screener's automatic retrieval of information from the UCSC and Ensembl servers.

RefSeq mRNA accession

Automated retrieval of the RefSeq transcript accession number (NM_00000 or NR_00000) from UCSC's servers. Clickable link to the corresponding NCBI nucleotide search result.

RefSeq protein accession

Automated retrieval of the RefSeq protein accession number (NP_00000) from UCSC's servers. Clickable link to the corresponding NCBI protein search result.

Ensembl gene ID

Automated retrieval of the Ensembl gene ID (ENSG00000000) from Ensembl's servers. Clickable link to the corresponding Ensembl gene search result.

Ensembl transcript ID

Automated retrieval of the Ensembl transcript ID (ENST0000000) from Ensembl's servers. Clickable link to the corresponding Ensembl gene search result.

Gene full name

The full name field provides the complete description of the related gene, as retrieved from Ensembl's cross-reference to HGNC.

HGNC ID

The HGNC ID is the numerical ID used to link to the HGNC database. Clickable link to the corresponding HGNC ID search result.

Custom identifier

The identifier field can be used to retrieve a user-defined identifier if one is provided in the description line between the RefSeq accession or Ensembl ID and the chromosomal position.

Source

The source field returns the identity of the database corresponding to the description format (e.g. RefGene, Ensembl).

Genome assembly

Returns the genome assembly used to extract the submitted sequence (e.g. hg19, hg38).

Chromosome

The chromosome field retrieves the chromosome identity from the range if provided.

Start & End position

The start and end fields give the position of the window within the submitted sequence, unless a chromosomal position is provided as a range, in which case they give genomic positions.

The start field provides the starting position of each analyzed window. For a "+" strand sequence, positions start from the lowest value of the provided chromosomal range and increase; for a "-" strand sequence, they start from the highest value and decrease. This way, the position annotations of each window match the genomic positions. The end field complements the start field by giving the position where the window ends.
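The mapping from a window to genomic positions can be sketched as follows. The sketch assumes 1-based inclusive coordinates and that, on the "-" strand, the start is numerically larger than the end; neither convention is confirmed on this page. The function and parameter names are hypothetical.

```python
def window_coordinates(offset: int, length: int,
                       range_start: int, range_end: int,
                       strand: str = "+") -> tuple[int, int]:
    """Map a window at 0-based `offset` to (start, end) genomic positions."""
    if strand == "+":
        # Count up from the lowest value of the range.
        start = range_start + offset
        end = start + length - 1
    elif strand == "-":
        # Count down from the highest value of the range.
        start = range_end - offset
        end = start - length + 1
    else:
        raise ValueError("strand must be '+' or '-'")
    return start, end
```

With the range chr11:62573802-62574000 and a 60 nt window, the first window maps to (62573802, 62573861) on the "+" strand and to (62574000, 62573941) on the "-" strand.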

Strand

The strand is only available when it is provided in the description along with the chromosomal position.

Sequence length

The length provided is the window length.

Sequence

The sequence field provides the sequence of each window. These sequences usually overlap with one another, depending on the analysis parameters.


Results table

The windows are displayed in a dynamic table. Columns can be rearranged by drag and drop, and rows can be reordered by clicking a column header; ordering on several columns is possible with shift + click. The table is searchable, and the total number of hits is displayed. The table can be downloaded in either .csv or .xlsx format.


Contact and accessibility

The source code and the terminal version of the program are available on the GitLab server to clone or download.

G4RNA screener is developed in collaboration with the research group of Jean-Pierre Perreault Ph.D. and is supported by the RiboClub.

G4RNA screener is managed by Jean-Michel Garant. All comments, questions or suggestions should be sent by email: jean-michel(dot)garant(at)usherbrooke(dot)ca