Prediction of RNA Pseudoknotted Secondary Structure using Stochastic Context Free Grammars (SCFG) 1

Pseudoknots are a frequent RNA structure that assumes essential roles for varied biocatalyst cell’s fu nctions. One of the most challenging fields in bioinformatics is the pr ediction of this secondary structure based on the b ase-pair sequence that dictates it. Previously, a model adapted from computational linguistics ‐ Stochastic Context Free Grammars (SCFG) ‐ has been used to predict RNA secondary structure . However, to this date the SCFG approach impose a prohibitive complexity cost [O(n 4 )] when they are applied to the prediction of pseudo knots, mainly because a context-sensitive grammar is formally required to analyze them. Other hybrids approaches (energy maximization) give a O(n 3 ) complexity in the best case, besides having several restrictions in the maximum length of the sequence for practical analysis. Here we introduce a novel algorithm, based on patte rn matching techniques, that uses a sequential appr oximation strategy to solve the original problem. This algori thm not only reduces the complexity to O(n 2 logn), but also widens the maximum length of the sequence, as well as the capacity of analyzing several pseudoknots simultane ously.


Introduction
Bioinformatics is an applied science where mathematical and computational theories and technologies are used in order to process, relate and derive predictions and inferences from data obtained in molecular biology.Bioinformatic's goal is to understand and analyze the information control and flow within different organisms.There is a synergic interaction between computer science, mathematics and biology, each with its own richness and limitations.Within the bioinformatics scope, one of the most important fields is biological sequence analysis, which assumes the representation of the molecules that are essential for life (DNA, RNA, Proteins) as residues chains represented as symbols in a particular alphabet, thus allowing its study with grammatical analysis in order to find relations among these residues.These found relations determine, according to the central dogma 2 of molecular biology, the biological properties of such molecules [21][18].This sequential representation of a molecule is known as its primary structure.As many other proteins, RNA molecules presents distant pairings within a sequence.These bindings (secondary structure) manifest themselves as bonds within nucleotides located far and indistinctively in the sequence, resulting in a structural folding, known as Pseudoknot 3 .Since its discovery [17] these pseudoknots has drawn considerable attention because they give 3-D structure to the molecule, a structure that will determine in most cases its particular biological function 4 .Even though determining the molecular structure is of vital importance, it is also particularly hard and expensive to obtain structural data from RNA spectrometry and crystallography, so that an a priori prediction based on the residue sequence is an essential subject to bioinformatics [20].We describe briefly current approaches to achieve these predictions.In a molecular level it is plausible that RNA folding is dictated more by biophysical characteristics than by the mere count and relating of base pairs.Zuckers' minimization algorithm [27] assumes that the correct structural configuration is the one that presents the lowest equilibrium energy (∆G o ).This prediction can be refined by combining it with experimental data, and statistical and thermodynamic information [24].Nonetheless, given that the model for exact energy associated with the pseudoknots is not yet available, these existing models that use secondary structure energy approximations with relatively success [12] [9].These approaches can't determine an optimal solution and most of the time can't tell how far from the optimal solution is the obtained one [20].This is why nowadays no algorithm can predict in an accurate fashion pseudoknot classes; in fact the most important contributions present complexities that make them unusable 5 .There are several linguistic approaches to pseudoknots, most of them are variations of Stochastic Context Free Grammars (SCFG), which also aim to predict the structures and find them in databases.By its very definition, pseudoknots are described by a "copy language".This is the reason why formally it would be necessary a context sensitive grammar to analyze them, its corresponding problems would NP-Complete and their solutions have prohibitive complexity [6].In order to avoid this problem, linguistic methods try the following approaches: intersections of associated SCFG [2]; special non terminal symbols and specific productions for each problem [19]; or parallel grammars that communicate between them [4].Even so, its implementations require computational times that reach unfeasible levels [O(n 6 )] and are not yet effective nor trustworthy [20].
In this article we propose a novel algorithm to solve the classical problems described above.This proposal consists in altering an existing effective algorithm known as Covariance Models (CM) [7], which is used to analyze secondary structure, in such a way that it is able to analyze pseudoknots.This is achieved by disaggregating pseudoknot prediction in two predictions of consecutive and related secondary structures, managing to divide the analysis in two iterations, thus avoiding simultaneous analysis, as well as its corresponding complexity and computational cost.This method yields a structure prediction in O(n 2 logn), improving the complexities in the algorithms developed so far and enlarging the maximum window of the sequence being analyzed.This approach was inspired in the theory of information [22] and its handling of entropy, which states that the information is not evenly distributed across the message (sequence) but that there are regions or symbols that contain more information than others.In preserving or analyzing these "special" regions one can obtain a better comprehension of the whole message that most of the times is a cryptic message that encloses large quantities of unknown information -as in the case of the bioinformaticist's point of view -.The article structure is as follows: in the second section the basic concepts of grammars that describe RNA sequences and basic concepts of SCFGs are introduced.In the third section the elements of the covariance model are exposed along with an innovative approach to predict pseudoknots in secondary structure as an extension to such models.In the fourth section an experimental evaluation of the proposed approach is presented, and in the last section conclusions and further work are presented. 4Virtually, pseudoknots are presents in all RNA types (ARNm, ARNt, lsuARN, ssuARN, 7-SL ARN, U2-snARN, I-group of introns, viral ARN) and possess many biological functions (see [12], [8], [16] or [5]). 5For example, the Rivas-Eddy algorithm [18] have temporal complexity O(n 6 ), spatial complexity O(n 4 ) and a constrained set of sequences to analyze (the sequence cannot contains more than 150 bases) [25]; the Lyngsø-Pedersen algorithm [14] have temporal complexity O(n 4 ), spatial complexity O(n 3 ) and make predictions about H pseudoknots; or another comparative divide and conquer algorithm with temporal complexity O(n 2 ) and without constrains about the pseudoknot type.Though these ad hoc approaches needs manual and expert mediation [25].At last, the ILM algorithm combines the thermodynamic and comparative approaches to obtain a more desirable and specific one than predecessors, but their final result is not optimal [19].

Basic Concepts
One of the most promising techniques in bioinformatics is the analysis of stochastic grammars, since they allow the generation of sequence patterns in a natural way, besides having a broader range of action than other architectures [21].Stochastic grammars have its origins in formal grammars that were developed as a model to analyze natural languages; in fact, these were developed at the same time that the double-helix model was developed by Watson and Crick [1].Grammars are useful tools to model character sequences, in a certain way are useful to model molecular biology sequences [18] [1] [3].Many bioinformatics problems can be reformulated in terms of formal languages, producing the corresponding grammar from the available data [1].Among several utilities contributed by grammars, the main contribution is the ability to test by derivations if a sequence is syntactically correct, that is, if it belongs to a determined language.A derivation can be represented as a tree-like structure known as derivation tree.This tree reflects the syntactical structure of a sequence.It is possible that for a given sequence there are more than one derivation tree.In this case, we say that the grammar is ambiguous.In ambiguous grammars, complexity for the derivation rises given that the possible trees grows exponentially with the length of the sequence to be derived [1].For a complete revision on basic concepts of formal language theory, including the study of grammars, reading of [10] or [13] is recommended.

Stochastic Context Free Grammars.
There is a wide variety of palindrome examples present in RNA/DNA.Biological palindromes have a different connotation than the usual one, for its letters are not identical but base-complementary 6 , starting from the two ends.This is, in the same way that "a man, a plan, a canal, panama!" is usually seen as a palindrome (if the blanks and punctuation are removed), for its sequence mean the same thing even if we it is inverted, it is understood that AGAUUUCGAAUCU is a biological palindrome (given that if read from right to left a sequence of complementary bases to the original sequence).In other words, due to the complementary nature of the DNA double-helix structure, each half of a palindrome on a strand has its mirror image in the opposing strand7 [1].This distant pairings causes that the computational tools used commonly for protein analysis can't be used for RNA secondary structure.This is due to the nested structure observed frequently in RNA that can't be modeled in an efficient way using classic sequence correspondences (Neural Networks, regular languages or Hidden Markov Models (HMM)).For this reason an SCFG can be a better approach [6].An SCFG is built by adding a probabilistic structure to the production rules of a given grammar [1].In other words, each production rule ϕ→δ has an assigned probability P(ϕ→δ) P(ϕ→δ) denotes the probability of using rule ϕ→δ in a derivation, thus complying with the second axiom of probability Σ δ (P δ (ϕ→δ))=1.A stochastic grammar is then characterized by a set of probabilistic models that generate the corresponding language.In this way, to measure the probability P(O|w) 8 of the stochastic derivation of a sequence O according to a grammar M = M(w) with parameter w, is enough to compute: where π  is the sequence of derivations needed to produce O.

A modification to the Covariance Models as an approach to the prediction of secondary structure 3.1 Covariance Model construction
Using an approach inherited from HMM, covariance models specify an SCFG architecture adequate to model consensus RNA secondary structures.Consensus structures are repetitive patterns in a RNA family that share various structural motifs among them 9 [6].This method is inspired in the CYK dynamic programming algorithm, which has been the standard to analyze SCFG alignments.In synthesis, this algorithm finds an optimal derivation tree for a parameterized model in a sequence, being an optimization to the inside/outside algorithm [11].At the same time, this algorithm computes the best score assigned to a given sequence from the various alternatives that a given SCFG may generate.
To describe with a SCFG this multiple alignment between RNA homologous sequences, various types of non terminal symbols are needed to model the different known structures.
The Non-terminals of the grammar included in the CM, and their semantic are: (see Figure 1) [6]: • P: the paired columns in Watson-Crick bridges (bonds A-U C-G) are described by a non terminal that emits a base pairing.• L: The non paired columns are described by a non terminal that emits to the left (direction 5´→ 3´) 10  whenever possible; that is, when no possible ambiguous sequences may arise.• R: non terminal that emits to the right (direction 3' 5').Case that can occasionally happen in protuberances between stems and loops in the right part of the structure (strand 3') 11 .It is used when ambiguous sequences emerge when L is used.• B: Bifurcation non terminal used to split several stems or loops with various branches arising from it.
• S: Beginning non terminal that acts as immediate son to a bifurcation's derivation or a sequence start.
• E: Ending non terminal that finishes the derivation of sequences.
• D: Suppression non terminal that is used to describe a production that does not emit terminal symbols and does not describe one of the previous cases.
Figure 1.Types of grammatical rules and its probabilities in a Covariance Model 12 . 9Homologous structures in molecular vocabulary.. 10 In molecular biology, 5´ and 3´ are the ending points of the DNA/RNA sequences.The direction is associated with many celular functions such as duplication and replication. 11The secuence must be readed in 5´→ 3´ direction. 12Quoted from [6], pp 278.To build a CM it is necessary to begin with an alignment 15 of RNA sequences, with its correspondent annotation of the consensus secondary structure and the annotations of which columns must be considered insertions and which should be considered consensus columns (see Figure 2).From these columns a consensus structural tree is built.The CM will be a directed graph of M states with transitions given by t v (y), with each state numbered in a way that (y,z)→v.The CM can then be visualized as an array of transitions that run in only one direction, which guarantees two things: an iterative dynamic programming calculation through all the model states and that all transitions for state v can be maintained as a displacement and a count.

A Pseudoknot secondary structure prediction implementation as an extension of Covariance Models
Understanding CM, and generally speaking all SCFG, as computational tools that allow to probabilistically model distant relations between symbols in the molecular language, then it is possible to extend this concept to pseudoknots.This extension makes possible its prediction when in the original sequence the distant paired segments (copy language) are deleted, and then it is feasible to analyze the resulting sequence as usual.
Just as the relation that exist between residues of a stem is separated by non related residues, in the same way a pair of residues that form a pseudoknot will be separated by residues that are not necessarily related to the pseudoknot being analyzed 16 .Because of this, just as non related residues are set apart to determine the relation between base pairs that form the stem, in the same way it is possible to set apart existing relations between non related residues to determine the particular relation between the base pairs that form the pseudoknot.
Definition.If γ and δ are RNA sequences such that |γ|=|δ| and, always that 1≤i≤|γ| you have that γ i y (δ -1 ) i are base-complementary, where δ -1 is δ written backwards.Then it is stated that γδ is a biological palindrome.
Note that if x is a RNA sequence that can be written as the concatenation of five subsequences this is x=πγψωσ (π y σ possibly empty) such that γω is a biological palindrome, then the CM model must predict the pairing (γ,ω -1 ) as part of the secondary structure of x (see Figure 4).Now, if γω is a biological palindrome and πψσ can be decomposed as the concatenation of five substrings αβµνρ such that βν is a biological palindrome, then the pairing (β,ν -1 ) is also predicted by the CM model as part of the secondary structure of x.
On the other hand, if γω is a biological palindrome and πψσ can't be decomposed as in the last paragraph, then it is possible to infer that in x=πγψωσ one cannot find pairings that don't propose pseudoknots in the secondary structure of x.Let x'=πψσ be the sequences obtained of x by eliminating all paired base pairs according to the CM model.If there are sequences π', γ', ψ', ω' and σ' such that x' = πψσ = π'γ'ψ'ω'σ' and γ'ω' is a biological palindrome, then the pairing (γ',ω' -1 ), that is part of the secondary structure of x', report a pseudoknot present in the secondary structure of x.
The previous statement can be concluded because only two cases make sense:  Thus, it is intended that πψσ preserve the information from the previous analysis while relations between γ and ω are analyzed.In this way the problem is divided and a second iteration can be made to find the two levels of relations in a separate way.After this step, the two analysis can be fused together and give the complete analysis at the end.In synthesis, instead of doing a parallel analysis as in [4], we propose a sequential analysis of two instances of the same problem, with evident gain in computational resources, which were before shared and now can be entirely dedicated to each of the two instances; besides the scale of structural complexity is reduced in both cases in a dramatic way.

General description for the validation of the proposed extension
To test the proposed algorithm a program was written in C/C++, which was executed in two systems, the result diverging only in the time of the required analysis.The fastest implementation was executed in an INTEL/Linux platform with 1.8GHz processor and 1024 Mbytes of RAM.The second implementation was executed in an IRIX 6.5 platform (SGI Origin 200) with 4 parallel processors and 1024 Mbytes in RAM 17 .Such program analyze an alignment in STOCKHOLM 18 format, augmented with a marker PK_cons 19 , which indicates the relations of pseudoknots according to the standard implemented by PseudoBase [26].As an example figure 5 shows a theoretical alignment with a consensus pseudoknot for the sequences shown.The written software abstracts the two levels of secondary structure and pseudoknots in two separate levels, one to indicate the secondary structure SS_cons (see Figure 6) and other to indicate the structure of pseudoknots PK_cons (see Figure 7) abstracting the one previously analyzed.Once the information is processed in these two levels, the algorithm proceeds to analyze them separately to construct the consensus models in both structure levels.With these two models built it is possible to predict the structure of a nucleotide string without any included structure (see Figure 8).The software will then compare separately the same string with the two consensus models, predicting the secondary structure that would probably have that sequence, as well as the probable pseudoknot that it would contain.The probable pseudoknot is abstracted in the secondary structure.After this step, the data is unified by building the pseudoknot structure to give the final result with the complete structure.The secondary structure and the pseudoknot structure (see Figure 8).

Results and Experimental Evaluation
The set of source sequences for the test process has the same origin as the one used in [4]  23 .This set, which can be found in the tmRNA Database 24 , is already aligned and annotated for secondary structure and pseudoknots [28].The tmRNA has at most 4 rightly annotated pseudoknots.From this database 35 sequences were downloaded.These sequences were organized in a phylogenic tree in such way that a test group for the model training and another group for the model testing were both in equilibrium at a phylogenic level, without any over represented class in the samples.The source data were then aligned, deleting the non-shared columns between the bacteria sequences results in a complete alignment, obtaining a source suitable for bioinformatics analysis.The appropriate model for the two abstracted structures was built from the 5 bacteria sequences that are in the phylum 25 of the bacteria to be analyzed.The following structure was derived from these two models starting with a sequence without structural annotation (see Figure 9).,,,,,,,,,,,,,,,,,,,,,,  # If the predicted structure is compared to the structure provided by the tmRNA Database (see Figure 10), it is found that the algorithm predicts in general the whole structure, placing correctly the two pseudoknots that are found, even though it presents certain inconsistencies in the size of the pseudoknot or in the conforming bases in some region of the whole molecule.A false positive was never produced, that is, a pseudoknot placed by the algorithm where it should not be.This proves that the algorithm and it's prototype are able to analyze big quantities of data in a relatively trustworthy fashion, improving other algorithms which analyzed pseudoknots with similar results, but could only analyze little fragments, only one pseudoknot or totally ignoring the secondary structure that was not related to the pseudoknot [4].
It is quite important to point out that systems to date successfully identify pseudoknots simultaneously with sequences around 100 base pairs [4][5].The sequence above has more than 740 base pairs, which is a noteworthy improvement in the practical field for pseudoknot analysis.

Conclusions
Given the rising amount of data the genome projects produce, the need for new sensible filters to find adequate sequences is preponderant.Besides, given that in some cases there is a lack of information to analyze the relations between the sequences, this filters need new characteristics to compare the sequences.One of the factors that are only starting to be exploited to such end is pseudoknots, that is why the algorithm proposed here is a contribution in that sense.For the particular case of RNA and its secondary structural prediction, there is an attempt to decouple and disaggregate a complex problem by dividing it in sub problems of an easier analysis, paying special attention to preserving the coherence and in a rigorous way put the data back together.Because the results obtained were satisfactory, the success of this integrated approach is confirmed, approaches that have already been successful in other bioinformatics fields 28 .
Each one of the non terminals has, by its stochastic characteristic, a probability e v (a,b) of emitting one or two pairs of bases.Here v is the non terminal and a, b ∈ {A,G,C,U}.The symbol t v (Y) represents the probability to go from state v to state Y.

Figure 2 .
Figure2.13A fictitious alignment of three sequences where 24 consensus columns are modeled out of 28 possible columns.The annotated structure sequence is consensus for the structure on the right, a structure that corresponds to the human sequence14 .

Figure 3 .
Figure 3. Canonical sketch of a pseudoknotted structure, divided in substring αβγδεµφρωτκ, ready to subsequent abstraction.

Figure 4 .
Figure 4.Abstraction sketch of the pseudoknotted structure πγψωσ, ready to subsequent analysis.

Figure 9 .
Figure 9.A extract of the base sequence and corresponding predicted structure of E. coli built from an alignment retrieved from tmRNA DataBase 26 .

Fragment of the predictedFigure 10 .
Figure 10.Real structure of E. coli retrieved from tmRNA DataBase 27 .
Five RNA fragment alignment.Note that it is used <> to indicate relations between loops, and [] to indicate pseudoknots binding20.