Understanding the principles that determine mRNA translation efficiency is of primary value for deciphering translational control of gene expression and for optimization of protein synthesis in biotechnology. In this work we combined Flowseq method with randomization of a region within the spacer between the Shine‐Dalgarno box and AUG start codon to decipher an influence of this mRNA part on translation efficiency.
Protein synthesis in heterological expression systems, such as bacteria, is one of the major goals of biotechnology. A protein is synthesized by ribosomes following instructions encoded in the sequence of mRNA (Brenner et al. , 1961; Gros et al. , 1961). Non‐coding regions of mRNA, in particular the 5’‐untranslated region (5’‐UTR), contribute greatly to the efficiency of protein synthesis (Laursen et al. , 2005; Brenneis and Soppa, 2009). Computational and experimental analysis of bacterial 5’‐UTRs revealed a number of features affecting translation (Chen et al. , 1994; Salis et al. , 2009; Salis, 2011; Espah Borujeni et al. , 2014; Farasat et al. , 2014). The most important and best‐known 5’‐UTR sequence element of bacterial mRNAs is the Shine–Dalgarno (SD) box complementary to the 3’‐end region of the small subunit 16S rRNA (Shine and Dalgarno, 1974). The length of the complementary region in E. coli varies between 4 and 8 nucleotides (nt) (Shultzaberger et al. , 2001). The optimal distance between the SD box and the start codon for efficient gene expression is 5–9 nt (Hartz et al. , 1991; Osterman et al. , 2013).
Experimental analysis of 5’‐UTR sequence elements that affect protein synthesis is commonly done via monitoring the expression of a reporter gene placed downstream of a set of specific 5’‐UTR sequences (Vimberg et al. , 2007). High‐throughput analysis of the translation efficiency as a function of mRNA sequence became possible with development of the FlowSeq method (Kudla et al. , 2009; Goodman et al. , 2013; Evfratov et al. , 2017). This method is based on sorting of cells transformed by a library of reporter constructs encoding a fluorescent protein, followed by next‐generation sequencing of sorted fractions of the library. This method allowed us to identify multiple features of the 5’‐UTR influencing the translation efficiency (Evfratov et al. , 2017). However, complete randomization of 20 or 30 nt 5’‐UTR regions (Evfratov et al. , 2017) resulted in a library whose diversity exceeded the throughput of the FlowSeq method, precluding complete sampling of the sequence space and hence complicating in‐depth analysis of particular 5’‐UTR determinants of the translation efficiency.
Here, we have applied FlowSeq to assess the translational influence of the least studied region of the 5’‐UTR, the spacer between the SD box and the start codon.
To analyse the influence of the spacer region sequence on the translation efficiency, we used the plasmid encoding the red (RFP) and cerulean (CER) fluorescent proteins (Osterman et al. , 2012; Osterman et al. , 2013; Evfratov et al. , 2017). Both fluorescent protein genes were controlled by identical T5 promoters (Fig. 1, see Fig. S1A for details). 5’‐UTRs of the rfp gene were identical in all plasmids in the library, allowing us to use RFP as an internal standard, whereas 5’‐UTRs of cer were subjected to partial randomization. Randomized inserts of four nucleotides were placed into the 8 nt spacer region between the SD box and the start codon of the 22 nt long 5’‐UTR of the cer gene (Fig. 1, see Fig. S1B for details). E. coli cells transformed by this plasmid library were sorted into six fractions according to the CER/RFP ratio measured as the ratio of fluorescence intensities at 405/530 and 561/582 nm and indicated by the inclined frames of the fractions on the real FACS plot (Fig. 1).
Sorted cells from these fractions were collected, used for plasmid isolation and PCR amplification of the region containing the randomized 5’‐UTR part (Fig. S2). Next‐generation sequencing of the amplicons allowed us to deduce the distribution of cells carrying particular variants among the fractions separated by the CER/RFP ratio (Table S1) and hence to assign the translation efficiency value to each construct.
Sequencing of the cer 5’‐UTR regions from the sorted cells yielded 249 unique inserts out of 256 variants theoretically possible for a 4 nt randomized region (Tables S1 and S2). Unlike our FlowSeq analysis of reporter construct libraries with 5’‐UTRs containing 20 and 30 nt randomized regions (Evfratov et al. , 2017) demonstrating four orders of magnitude span of translation efficiencies, the reporters containing 4 nt randomized region between the SD box and the start codon demonstrated at most a 100‐fold difference between the highest and lowest CER protein yield (Fig. S3). This efficiency range provides for moderate, yet substantial contribution of the spacer region sequence to the overall translation rate.
Taking into account a relatively narrow distribution of the observed translation efficiencies and an uneven distribution of 5’‐UTR variants among six initially obtained fractions (Fig. S3), we combined these fractions into two groups, representing high and low translation efficiencies in such a way that both groups contained (almost) equal number of sequences. These two groups were subjected to further computational analysis.
The comparison of the nucleotide composition in the randomized spacer (Fig. 2A) for efficiently and poorly translated mRNAs revealed a significant difference in the nucleotide composition for all positions of the spacer (chi‐square test, P‐value < 10–6). At all four positions, adenosine appeared to be the most beneficial for the translation, while cytidines were unfavourable, in agreement with the results previously obtained for reporters carrying larger randomized regions (Evfratov et al. , 2017). Our results are also in consent with previous work (Mirzadeh et al. , 2015) where an influence of the spacer region positions −6 to −1 on translation efficiency was examined. Likewise, in these data, oligoadenosine tracks of 3 or 4 residues at positions −6 to −3 could be found predominantly in mRNAs possessing high expression efficiency. Many efficiently translated mRNAs in our library possess A‐rich sequences with U residue in the spacer region. For instance, mRNAs with AAAU, AAUA, AUAA, AUAU sequence variants have the mean fraction number more than five (Table S1) demonstrating one of the highest translation efficiency. Analysis of GC content of the randomized mRNA region for mRNAs possessing different translation efficiencies (Fig. S4) also suggests a positive role of AU content of the examined mRNA part on protein yield. This might reflect a positive influence of AU‐rich enhancers (Komarova et al. , 2005) that may bind ribosomal protein S1 (Boni et al. , 1991; Komarova et al. , 2005; Duval et al. , 2013; Osterman et al. , 2013), although these enhancers are generally assumed to be positioned upstream of the SD sequence. Alternative explanation for the preference towards adenosines in the spacer region of efficiently translated mRNAs is enhanced interaction with the ribosome. While spacer region of mRNA is not involved in base pairing with the 16S rRNA, nucleotides −1 to −3 of the 4 nt long spacer are stacked on each other and on top of the 16S rRNA nucleotide G926 (Hussain et al. , 2016). Nucleotide A1503 of the 16S rRNA contacts mRNA nucleotide −4 (relative to the first P‐site nucleotide) in several initiation complexes (Jenner et al. , 2010; Hussain et al. , 2016) and nucleotide −2 in the intermediate of translocation (Zhou et al. , 2013). While little is known about the structure of initiation complexes containing longer spacer regions, such as ones used in our study, it might be hypothesized that ribosome may form some sequence‐specific, most likely stacking interactions with this mRNA region.
The spacer region might be a part of a secondary structure that may mask other components of the translation initiation site, such as the SD box or the start codon. Indeed, formation of a secondary structure inhibits the initiation of translation (Smit and van Duin, 1990; Osterman et al. , 2012; Evfratov et al. , 2017). To analyse the influence of the spacer sequence on the folding energy of the translation initiation site, and hence on the translation efficiency, we modelled secondary structures for all mRNA sequences in our data set, using a window encompassing the entire 22 nt 5’‐UTR and the first 50 nt of the coding region. These sequences were used to calculate the minimal free energy (MFE), a standard measure of the predicted secondary structure stability, with lower MFE values corresponding to more stable secondary structures (Fig. 2B). The Kolmogorov–Smirnov test demonstrated that the difference in the secondary structure stability between efficiently and poorly translated mRNAs was significant at the level 10–15 . To show that this effect was not a trivial consequence of the difference in the nucleotide content, we performed permutational analysis (see Methods) and demonstrated that the difference of MFE among fractions was significantly larger than expected given the observed positional nucleotide frequencies (P ‐value = 0.001, see Fig. S5).
All 5’‐UTR sequences in our data set contained a four‐nucleotide SD box, located 8 nt upstream of the AUG start codon. To check whether extension of this standard box yielding additional regions complementary to the 16S rRNA 3’‐end region could influence the efficiency of translation, we calculated the free hybridization energy of the anti‐SD sequence CACCUCCU at the 3’‐terminal region of the 16S rRNA with CNNNNCAUA 5’‐UTR part containing the entire randomized region (Fig. 2C). Overall, two observed distributions of the free hybridization energy did not differ significantly (the Kolmogorov–Smirnov test p‐value > 0.05). However, spacers from poorly translated cer mRNAs had no SD‐like patches in addition to the standard SD box, whereas several spacers from the efficiently translated set contained a sequence that could form complementary interactions with the 16S rRNA 3’‐end region, such as the ones whose randomized regions are AAGG, AGGA, GAGG, GGAG, GGGG and GGGA. This observation corroborates our earlier result (Evfratov et al. , 2017) on an additive influence of multiple SD‐like sequences in the 5’‐UTR on the efficiency of translation. However, this mechanism may explain the observed high translation rate of only a limited subset of efficiently translated mRNAs.
Several tools for prediction of the translation efficiency are published for monocistronic (Salis et al. , 2009; Salis, 2011; Bonde et al. , 2016) and bicistronic constructs (Mutalik et al. , 2013; Nieuwkoop et al. , 2019). Comparison of the measured translation efficiency of mRNAs in our data set with that predicted by RBS Calculator (Salis et al. , 2009; Salis, 2011) (Table S1, Predicted translation efficiency) demonstrated relatively good, although not ideal correlation (r = 0.62).
The data obtained here allow us to suggest the following recommendations for the choice of spacer regions between the SD box and the start codon for bacterial expression systems. In order to boost translation and hence protein yield, it seems reasonable to use oligoadenylate or other A‐rich spacers while avoiding cytidine residues, although it cannot be ruled out that some particular mRNAs with A‐rich spacers might mask translation initiation site within the secondary structure if, e.g., coding region would be U‐rich. Our results might help to adjust the level of exogenous gene expression to a particular biotechnological need. For coexpression of genes whose products should be synthesized at a particular stoichiometry, e.g., if the proteins are subunits of a heteromultimeric complex, the expression levels might be fine‐tuned by proper selection of spacers between the SD boxes and the start codons of the designed mRNAs.
All cloning manipulations were carried out on the Escherichia coli strain JM109. Cells were grown at 37⁰C in LB medium with 100 µg ml−1 ampicillin added if required.
Plasmid pRFPCER (Osterman et al. , 2012) was used as the host vector for randomized library construction (Fig. S1), similar to our previously published protocol (Evfratov et al. , 2017) presented in the supporting information (Fig. S1B). The methodology was adapted from published procedure (Oliphant et al. , 1986). Briefly, oligonucleotide 5’‐ACTGCCGCGGCACACACC GGAGCNNNNCATATG‐3’ containing randomized region was self‐annealed and converted to double‐stranded form by Klenow fragment of DNA polymerase I. Double‐stranded inserts were cloned into the pRFPCER vector via SacII and NdeI recognition sites. Products of ligation were used for transformation of ultra‐competent JM109 cells (Sambrook and Russell, 2006). In parallel, the control plasmid with the same 5’‐UTR upstream the start codons of the both fluorescent protein genes and as a consequence with approximately the same ratio of CER to RFP proteins was transformed into JM109 E. coli strain and next prepared for sorting.
Cells transformed by the plasmid library were grown overnight at 37°C in the liquid LB medium supplemented with 100 µg ml−1 ampicillin with agitation. After washing in phosphate‐buffered saline and diluting in PBS to ca 0.004 A600 , the cells were sorted by Becton Dickinson FACSAria III while simultaneously monitoring CER and RFP fluorescence intensities at 405/530 and 561/582 nm correspondingly. Six fractions with different CER/RFP fluorescence intensity ratio (log scale) were collected according to the indicated inclined frames of the fractions on the real FACS plot (Fig. 1). Inclined frames are used since they represent cells of equal CER/RFP fluorescence ratio.
The number of cells collected for each fraction was proportional to the total abundance of cells with particular CER/RFP fluorescence intensity ratio, i.e. without artificial enrichment for rare variants. The total number of the sorted cells was 104 covering several times the diversity of 4 nt randomized library.
The sorted pools were grown overnight at 37°C in the liquid LB medium supplemented with 100 µg ml−1 ampicillin with agitation and used for plasmid preparation and PCR amplification using primers 5’‐CCATCTCATCCCTGCGTGTCTCATTTGCTTTCAGGAAAATTTTTCTG‐3’ and 5’‐CCACTACGCCTCCGCTTTCCTC NNNN TCACCAGGCCGCTCTCGTCC ‐3’, for the last one the region NNNN corresponds to six barcode variants for further sequencing, the CER coding region is underlined (Fig. S2). The sequencing of the amplicon library was conducted with Ion Torrent (Rothberg et al., 2011) PGM (Life Technologies) using Ion PGMTM Template OT2 200 Kit for emulsion PCR amplification and Ion Chips 314 or 318 along with the reagent kit Ion PGMTM Sequencing 200 Kit v2 following instructions of manufacturer.
To extract 5’‐UTR sequences from raw read data in the FASTQ format, all 50 to 100 nt long reads were used to search for regions that differed by at most two mismatches, including indels, from the GGCACACACCGGAGC and CATATGAAAGAGACGGACGAGAGCGGCCTGGTGA sequences flanking the 4 nt randomized region. Each sequence variant was searched for in all sorted fractions. The collected data were summarized as a table counting the occurrences of each sequence variant in each of the sorted fractions (Table S1). Whereas the same sequence variant is fractionally spread, each sequence variant in the experiment was assigned a mean weighted number of its fraction assuming a Gaussian distribution (the last column in Table S1). Due to strongly different numbers of variants in the fractions and a relatively narrow overall range of the translation efficiency in our experiment (Fig. S3), we divided all 5’‐UTR variants into two groups of roughly equal size. The group of low translation efficiency contains 125 out of 249 mRNA variants whose mean fraction numbers belong to the range from 1.47 to 3.07, while the group of high translation efficiency contains 124 out of 249 mRNA variants whose mean fraction numbers belong to the range from 3.08 to 5. Following analysis was conducted on these two classes.
Nucleotide frequencies were counted for each position within the 4 nt randomized regions separately for efficiently and poorly translated mRNAs. Minimal folding energies (MFE) were calculated for the region encompassing the entire 22 nt 5’‐UTR and 50 nt of the downstream cer coding region using RNAfold ver. 2.1.7 of the Vienna RNA package with default parameters (Lorenz et al. , 2011). To estimate the statistical significance of the difference in MFE between efficiently and poorly translated mRNAs, while controlling for the nucleotide content, we used a Monte Carlo modelling to generate 1000 pairs of sets of shuffled sequence variants by randomly permuting nucleotides in each position, separately for both groups. The MFE calculation was performed for each randomly shuffled variant, the distribution of MFE was constructed for each set, and then the Kolmogorov–Smirnov statistic was calculated for the pairs of these modelled distributions, yielding the distribution shown in Fig. S5. The Kolmogorov–Smirnov statistic for the real pair was calculated and used to estimate the p‐value of the observed difference.
To estimate a propensity of a particular spacer region to hybridize with the 16S rRNA 3’‐terminal region, we calculated the free hybridization energy of 5’‐UTR fragments CNNNNCAUA that contained the randomized part with the anti‐SD sequence CACCUCCU using the RNAfold programme of the Vienna RNA package (Lorenz et al. , 2011). The distributions of the energy of interaction with the anti‐SD sequence were constructed for both mRNAs groups.
This work was supported by RSF grant 19‐14‐00043 (P.S.) with library preparation supported by RFBR grant 17‐00‐00366 (P.S.) and computational analysis supported by RSF grant 18‐14‐00358 (M.G.). Moscow State University Scientific School (O.D.) and Institute of functional genomics government support (P.S.) were used for stipends of participants.