## Abstract

The use of *in vivo* Förster resonance energy transfer (FRET) data to determine the molecular architecture of a protein complex in living cells is challenging due to data sparseness, sample heterogeneity, signal contributions from multiple donors and acceptors, unequal fluorophore brightness, photobleaching, flexibility of the linker connecting the fluorophore to the tagged protein, and spectral cross-talk. We addressed these challenges by using a Bayesian approach that produces the posterior probability of a model, given the input data. The posterior probability is defined as a function of the dependence of our FRET metric FRET_{R} on a structure (forward model), a model of noise in the data, as well as prior information about the structure, relative populations of distinct states in the sample, forward model parameters, and data noise. The forward model was validated against kinetic Monte Carlo simulations and *in vivo* experimental data collected on nine systems of known structure. In addition, our Bayesian approach was validated by a benchmark of 16 protein complexes of known structure. Given the structures of each subunit of the complexes, models were computed from synthetic FRET_{R} data with a distance root-mean-squared deviation error of 14 to 17 Å. The approach is implemented in the open-source Integrative Modeling Platform, allowing us to determine macromolecular structures through a combination of *in vivo* FRET_{R} data and data from other sources, such as electron microscopy and chemical cross-linking.

Mapping the organization and function of the cell requires characterization of the structure and dynamics of biological assemblies (1, 2). However, the construction of models consistent with experimental data is often hampered by data sparseness due to incomplete measurements, data noise due to measurement errors, data ambiguity due to multiple copies of the same component in the assembly, and data mixture due to multiple structural states in a compositionally and conformationally heterogeneous sample.

Traditional modeling aims to find a single structural model by minimizing the difference between the data computed from the model and the experimental data. The noise in the data is typically not modeled accurately and thus biases the estimate of model precision. In contrast, Bayesian structural modeling (3, 4) interprets experimental data more objectively by explicitly accounting for data noise and prior knowledge about the system. Here, we developed a Bayesian approach that converts data from *in vivo* Förster resonance energy transfer (FRET) spectroscopy into quantitative distance restraints suitable for structural modeling. The approach is available as part of the open-source Integrative Modeling Platform (IMP) (5, 6). IMP is a platform for integrative structure determination of macromolecular assemblies, based on a variety of experimental data, such as electron microscopy images and density maps, chemically cross-linked residue pairs, small-angle X-ray scattering profiles, and various proteomics data (2, 7–10).

FRET is a powerful technique for studying protein–protein interactions both *in vitro* and in living cells (11, 12). FRET occurs when two spectrally matched fluorescent molecules are in close proximity and excitation energy is transferred from the donor to the acceptor fluorophore through nonradiative dipole–dipole coupling (Fig. 1*A*). The efficiency of this process (13) is a common experimentally derived variable of *in vitro* single-molecule experiments (14). It has been used to probe distances over the range of 1 to 10 nm, resulting in spatial restraints for modeling the structure of the studied complex (15, 16).

Compared with *in vitro* FRET, *in vivo* FRET measurements present several additional challenges (17) (Fig. 1*B*) that mainly originate from the use of donor–acceptor pairs of color variants of the green fluorescent protein (GFP) (18, 19). Despite significant progress (20), these proteins are not ideal FRET partners, and four sources of noise that affect *in vitro* FRET are amplified. First, the unequal brightness of the two fluorophores can lead to different saturation levels in the donor and acceptor images. Second, the emission and excitation wavelengths of the GFP variants are broad and lead to contamination of the emission from energy transfer with light derived from direct emission from both donor and acceptor (direct acceptor excitation and spectral cross-talk). Third, in the case of the common FRET pair CFP–YFP, YFP is photobleached with exposure to the CFP excitation light and thus becomes gradually inactive during data collection. Fourth, fluorescent proteins are often attached to the tagged protein by means of long, flexible linkers that increase the structural variability of the system. In addition, some complexes may be composed of proteins that do not have 1:1 stoichiometry, and this complicates the interpretation of FRET data in terms of distances between individual components. Many of these problems can be overcome with the use of an experimental approach that measures fluorescence lifetimes of FRET donors (21). However, in many situations in live cells in which a complex is in low abundance, fluorescence lifetime measurements are not feasible (22).

The measurement of additional observables has been proposed to supplement the FRET efficiency as a way to address some of these problems (23). Among these observables is the FRET_{R} index (24–26), a ratio that measures the fluorescence intensity at donor excitation and acceptor emission wavelengths relative to a calculated baseline expected in the absence of FRET. Our Bayesian approach computes this observable for a given structure while accounting for all sources of uncertainty of the *in vivo* FRET_{R} data listed above, as well as for the presence of multiple distinct conformations in the sample (28, 29). As a result, we can now use FRET_{R} data to determine the molecular architectures of protein complexes *in vivo*.

### Computational Methods and Experimental Procedures

##### The FRET_{R} Index

FRET_{R} (24, 25) is an index of relative FRET in cells, based on the measurement of fluorescence intensities *I*_{YFP}, *I*_{FRET}, and *I*_{CFP} by an epifluorescence microscope configured with three filter set combinations. In this work, we used filter sets from Chroma® that yielded the YFP (excitation filter at λ_{ex} = 500 nm, emission filter at λ_{em} = 535 nm), FRET (λ_{ex} = 430 nm, λ_{em} = 535 nm), and CFP (λ_{ex} = 430 nm, λ_{em} = 470 nm) images. The baseline fluorescence detected in the FRET image that is not the result of FRET is quantified by the spillover factors *S*_{d} and *S*_{a}, measured in two separate experiments in which CFP and YFP are expressed individually. The *S*_{d} factor quantifies the cross-talk between donor and acceptor emission spectra in the filter sets, and the *S*_{a} factor quantifies the direct excitation of the acceptor. In an experiment in which YFP and CFP are co-expressed and energy transfer is measured, FRET_{R} measures the fold-increase in the intensities in the FRET image relative to a computed and expected baseline,

$$\mathrm{FRET}_{R} = \frac{I_{\mathrm{FRET}}}{S_{tot}} \qquad (1)$$

where *S*_{tot} = *S*_{d} · *I*_{CFP} + *S*_{a} · *I*_{YFP}.
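As an illustration of Eq. 1, the following minimal Python sketch computes FRET_{R} from three background-subtracted intensities. The function name `fret_r` and the example intensities are hypothetical. The default spillover factors are the values quoted later in this paper for the KMC simulations (0.831 and 0.249); they are specific to the filter sets used here and would need to be remeasured for any other setup.

```python
def fret_r(i_fret, i_cfp, i_yfp, s_d=0.831, s_a=0.249):
    """Return FRET_R = I_FRET / (S_d * I_CFP + S_a * I_YFP).

    All intensities are assumed to be background-subtracted already.
    """
    # Baseline expected in the FRET image in the absence of energy transfer.
    s_tot = s_d * i_cfp + s_a * i_yfp
    if s_tot <= 0.0:
        raise ValueError("the projected baseline S_tot must be positive")
    return i_fret / s_tot


if __name__ == "__main__":
    # Hypothetical intensities in arbitrary units.
    print(fret_r(i_fret=265.8, i_cfp=100.0, i_yfp=200.0))
```

A value of 1 means that the FRET image contains only spillover; values above 1 report energy transfer as a fold-increase over that baseline.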

##### Bayesian Model of FRET_{R} Data

The Bayesian approach (3, 4) estimates the probability of a model given information available about the system, including prior knowledge and newly acquired experimental data. In the multi-state modeling of FRET_{R} data, the model *M* consists of a set of *N* modeled structures *X* = {*X*_{k}}, their relative populations in the sample {*w*_{k}}, and additional parameters defined below. The posterior probability *p*(*M*|*D*, *I*) of model *M*, given data *D* and prior information *I*, is

$$p(M \mid D, I) \propto p(D \mid M, I) \cdot p(M \mid I) \qquad (2)$$

where the likelihood function *p*(*D*|*M*, *I*) is the probability of observing data *D* given *M* and *I*, and the prior *p*(*M*|*I*) is the probability of model *M* given *I*. To define the likelihood function, one needs a forward model *f*(*X*) that predicts the data point that would have been observed for structure(s) *X* and a noise model that specifies the distribution of the deviation between the observed and predicted data points. The Bayesian scoring function *S*(*M*) is defined as *S*(*M*) = −log[*p*(*D*|*M*, *I*) · *p*(*M*|*I*)], which ranks alternative models the same as the posterior probability.
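The relationship between the score and the posterior can be made concrete with a short Python sketch. The function names and the numerical likelihood and prior values below are hypothetical and serve only to show that sorting by *S*(*M*) in ascending order is the same as sorting by posterior probability in descending order.

```python
import math


def bayesian_score(likelihood, prior):
    """S(M) = -log[p(D|M,I) * p(M|I)]; a lower score means a more probable model."""
    return -(math.log(likelihood) + math.log(prior))


def rank_models(models):
    """Sort model names from best to worst.

    `models` maps a model name to a (likelihood, prior) tuple.
    """
    return sorted(models, key=lambda name: bayesian_score(*models[name]))


if __name__ == "__main__":
    # Hypothetical values for three alternative models.
    candidates = {"A": (0.2, 0.5), "B": (0.6, 0.5), "C": (0.6, 0.1)}
    print(rank_models(candidates))
```

Working with the negative logarithm turns the product of likelihood and prior into a sum, which is numerically more stable when many data points contribute.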

##### Forward Model

An ensemble of CFPs and YFPs that are continuously excited by external radiation can return to the ground state through different independent decay pathways, including fluorescence and energy transfer from excited donors to non-excited acceptors. Following Förster theory (13), the rate of energy transfer between donor *i* and acceptor *j* is conveniently written as

$$k^{ET}_{ij} = \frac{k^{F}_{d}}{Q_{d}} \left(\frac{R_{0}}{R_{ij}}\right)^{6}$$

where *R*_{ij} is the distance between the two fluorophores and *R*_{0} is the Förster radius. The donor fluorescence quantum yield *Q*_{d} is the ratio between the fluorescence rate *k*^{F}_{d} and the total rate of decay and is proportional to the donor brightness. In general, *R*_{0} depends on the orientation factor κ^{2} of the interacting dipoles. We adopt the common assumption that donor and acceptor sample their orientations randomly on the time scale of the measurement (30), so that κ^{2} = 2/3. This assumption is considered particularly valid for fluorescent proteins attached to the targeted proteins by long, flexible linkers that do not adopt a fixed conformation. In addition, the MD simulations described in “Results” showed that the linkers were sufficiently long to allow for orientational averaging during the time of image acquisition.

In the limit of rapid de-excitation and slow excitation rate (SI), the donor and acceptor fluorescence intensities are

$$I^{F}_{d} = Q_{d} \cdot k^{X}_{d} \cdot g(X)$$

and

$$I^{F}_{a} = Q_{a} \cdot \left\{ k^{X}_{a} \cdot [A] + k^{X}_{d} \cdot \left( [D] - g(X) \right) \right\}$$

where

$$g(X) = \sum_{i} \frac{1}{1 + F_{i}}$$

is a sum over donors that quantifies the donor fluorescent intensity in terms of CFP and YFP concentrations and relative proximities. *F*_{i} is computed from the Förster expression that relates the rate of energy transfer and distance *R*_{ij} between the two fluorophores *i* and *j* (13), summed over acceptors:

$$F_{i} = \sum_{j} \left(\frac{R_{0}}{R_{ij}}\right)^{6}$$

[*D*] and [*A*] are the CFP donor and YFP acceptor concentrations, respectively, and *k*^{X}_{d} and *k*^{X}_{a} are their excitation rates. The FRET_{R} forward model (Eq. 3; supplemental Fig. S1*A*) expresses FRET_{R} in terms of *g*(*X*) and two parameters, *I*_{da} and *k*_{da}. *I*_{da} is the ratio of CFP and YFP fluorescence in two FRET images when each fluorescent protein is expressed individually at equal levels in separate cells. This quantity is treated as a free parameter, but its value is restrained by the experimental measurement (*I*_{da}^{exp} and σ_{*I*_{da}^{exp}}). *k*_{da} = *k*_{d}^{X,430}/*k*_{a}^{X,430} is the ratio between donor and acceptor excitation rates at λ_{ex} = 430 nm; it is determined by the ratio between CFP and YFP absorption cross-sections at 430 nm. However, because each fluorescent protein has a different absorption spectrum and the excitation wavelength varies with the filter set, *k*_{da} is treated as a free parameter and is inferred along with the coordinates and the other unknown parameters.
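The quantities defined in this section can be sketched in a few lines of Python. The functions `transfer_factor`, `g`, `donor_intensity`, and `acceptor_intensity` follow the expressions for *F*_{i}, *g*(*X*), and the two intensities directly. The last function, `fretr_forward`, is an assumption: it is the closed form obtained by substituting these intensities into Eq. 1 while assuming that CFP is not excited at 500 nm and that YFP emission does not reach the CFP image. It is meant to show how *I*_{da} and *k*_{da} could enter the model and may differ in form from the published Eq. 3. Fluorophores are treated as points, and all names are hypothetical.

```python
import math

R0_NM = 4.9  # Förster radius (nm) quoted for the KMC simulations in this paper


def transfer_factor(donor, acceptors, r0=R0_NM):
    """F_i = sum over acceptors j of (R0 / R_ij)^6 for one donor position."""
    return sum((r0 / math.dist(donor, acc)) ** 6 for acc in acceptors)


def g(donors, acceptors, r0=R0_NM):
    """g(X) = sum over donors i of 1 / (1 + F_i)."""
    return sum(1.0 / (1.0 + transfer_factor(d, acceptors, r0)) for d in donors)


def donor_intensity(q_d, kx_d, g_x):
    """I_d^F = Q_d * k_d^X * g(X)."""
    return q_d * kx_d * g_x


def acceptor_intensity(q_a, kx_a, kx_d, n_d, n_a, g_x):
    """I_a^F = Q_a * {k_a^X * [A] + k_d^X * ([D] - g(X))}."""
    return q_a * (kx_a * n_a + kx_d * (n_d - g_x))


def fretr_forward(g_x, n_d, n_a, i_da, k_da):
    """ASSUMED closed form of FRET_R (see the lead-in); not taken from the paper."""
    baseline = i_da * g_x + n_a            # spillover-only signal, rescaled
    transfer = k_da * (n_d - g_x)          # extra acceptor emission due to FRET
    return (baseline + transfer) / baseline


if __name__ == "__main__":
    donors = [(0.0, 0.0, 0.0)]
    acceptors = [(6.0, 0.0, 0.0)]          # hypothetical 6-nm separation
    g_x = g(donors, acceptors)
    print(g_x, fretr_forward(g_x, n_d=1, n_a=1, i_da=6.6, k_da=7.7))
```

Under this assumed form, a structure with no energy transfer (*g*(*X*) = [*D*]) gives FRET_{R} = 1, consistent with the definition of FRET_{R} as a fold-increase over the baseline.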

##### Multi-state Forward Model

For FRET measurements of complexes within living cells, the observed FRET_{R} may arise from multiple conformations of the complex. In such a case, FRET_{R} should be expressed in terms of partial contributions resulting from the individual conformations *X*_{k} and proportional to their relative populations *w*_{k}. The single-state forward model (Eq. 3) can be generalized to take into account multiple states (Eq. 4) by using the population-weighted average 〈*g*(*X*)〉 = ∑_{k} *w*_{k} *g*(*X*_{k}).
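The population-weighted average is simple to compute once *g*(*X*_{k}) is known for each state. The sketch below is a minimal Python illustration; the function name `ensemble_g`, the tolerance on the normalization check, and the example values are hypothetical.

```python
def ensemble_g(weights, g_values, tol=1e-9):
    """Return <g(X)> = sum_k w_k * g(X_k) for normalized populations w_k."""
    if len(weights) != len(g_values):
        raise ValueError("need one weight per state")
    if any(w < 0.0 for w in weights) or abs(sum(weights) - 1.0) > tol:
        raise ValueError("populations must be non-negative and sum to 1")
    return sum(w * g_k for w, g_k in zip(weights, g_values))


if __name__ == "__main__":
    # Two hypothetical states: a compact one (strong transfer, small g)
    # and an extended one (weak transfer, larger g).
    print(ensemble_g([0.25, 0.75], [0.2, 0.6]))
```

Because the average is taken over *g*(*X*_{k}) rather than over distances, a minor population of compact conformations can contribute disproportionately to the observed FRET_{R}.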

##### Photobleaching

YFP fluorophores are photochemically destroyed by prolonged exposure to radiation at wavelengths near the CFP absorption peak. For *in vivo* measurements, the observed FRET_{R} is thus averaged over multiple copies of the system in which photobleached fluorophores do not contribute to the signal. Thus, the same multi-state forward model described above (Eq. 4) can be used, except that *w*_{k} corresponds to the proportion of molecules that are both non-photobleached and in state *X*_{k}.

##### Likelihood Function

The likelihood function *p*(*D*|*M*, *I*) for dataset *D* = {*d*_{n}} of *N*_{F} independently measured FRET_{R} values is a product of likelihood functions *p*(*d*_{n}|{*X*_{k}, *w*_{k}}, *I*_{da}, *k*_{da}, σ_{n}) for each data point. Because the observed FRET_{R} values were strictly positive and unbounded, we modeled the uncertainty with a log-normal distribution.

To account for varying levels of noise in the data, each data point has an individual uncertainty σ_{n}.
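A log-normal noise model of this kind can be sketched as follows. The sketch assumes the standard log-normal density with its median placed at the forward-model prediction *f*_{n}; this parametrization, the function name, and the example values are assumptions made for illustration rather than details taken from this section.

```python
import math


def log_likelihood(observed, predicted, sigmas):
    """Sum of log-normal log-densities, one term per FRET_R data point.

    observed  : measured FRET_R values d_n (must be positive)
    predicted : forward-model values f_n (assumed median of the density)
    sigmas    : per-point uncertainties sigma_n (on the log scale)
    """
    total = 0.0
    for d_n, f_n, sigma_n in zip(observed, predicted, sigmas):
        if d_n <= 0.0 or f_n <= 0.0 or sigma_n <= 0.0:
            raise ValueError("data, predictions, and sigmas must be positive")
        z = math.log(d_n / f_n) / sigma_n
        # log of 1 / (sqrt(2*pi) * sigma_n * d_n) * exp(-z^2 / 2)
        total += -0.5 * z * z - math.log(math.sqrt(2.0 * math.pi) * sigma_n * d_n)
    return total


if __name__ == "__main__":
    data = [1.5, 2.0]                      # hypothetical FRET_R measurements
    print(log_likelihood(data, [1.5, 2.0], [0.1, 0.1]))
    print(log_likelihood(data, [1.0, 3.0], [0.1, 0.1]))
```

Independence of the measurements is what allows the per-point terms to be added on the log scale.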

##### Prior

The prior distribution *p*(*M*|*I*) is a product of priors on the state coordinates *X*_{k}, relative populations *w*_{k}, forward model parameters *I*_{da} and *k*_{da}, and uncertainties σ_{n}. The priors on the coordinates *p*(*X*_{k}) include terms to maintain the correct stereochemistry of the system, to avoid steric clashes between components, and to incorporate information other than FRET_{R} data. The priors *p*(*w*_{k}) are uniform distributions over the range from 0 to 1, with the constraint ∑_{k} *w*_{k} = 1. The priors *p*(σ_{n}|σ_{0}) are unimodal distributions (31) in which σ_{0} corresponds to an unknown experimental uncertainty; the heavy tail of the distribution allows for outliers (supplemental Fig. S1*C*). The prior *p*(σ_{0}) is a uniform distribution over the range from 0.001 to 0.01. If all FRET_{R} values are measured with the same filter sets and fluorescent proteins, the same values of *I*_{da} and *k*_{da} can be used for all data points. The prior *p*(*I*_{da}|*I*_{da}^{exp}, σ_{*I*_{da}^{exp}}) is a normal distribution in which *I*_{da}^{exp} and σ_{*I*_{da}^{exp}} are the average and standard error of the experimental measurements. The prior *p*(*k*_{da}) is a uniform distribution over the range from 1 to 15, based on typical ratios of CFP to YFP absorption cross-sections (32).

To facilitate sampling of the posterior distribution, we eliminate its dependence on the uncertainties σ_{n} by integrating the product of the likelihood function and the prior *p*(σ_{n}|σ_{0}) with respect to σ_{n}. The resulting marginal likelihood function (Eq. 7; supplemental Fig. S1*B*) depends on σ_{0} rather than on the individual σ_{n}.

A detailed description is provided in the supplemental material.
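The marginalization over σ_{n} can be checked numerically. The sketch below makes two assumptions that are not taken from this section: the likelihood is the standard log-normal density with its median at the forward-model value *f*, and the heavy-tailed prior has the form *p*(σ|σ_{0}) = 2σ_{0}/(√π σ^{2}) · exp(−σ_{0}^{2}/σ^{2}), a common unimodal choice. Under these assumptions the integral has the closed form √2 σ_{0} / {π *d* [ln^{2}(*d*/*f*) + 2σ_{0}^{2}]}, which the code compares with a trapezoid estimate. All function names are hypothetical.

```python
import math


def lognormal(d, f, sigma):
    """Log-normal density of observation d with median f (assumed noise model)."""
    z = math.log(d / f) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma * d)


def sigma_prior(sigma, sigma0):
    """ASSUMED heavy-tailed unimodal prior p(sigma | sigma0)."""
    return 2.0 * sigma0 / (math.sqrt(math.pi) * sigma ** 2) * math.exp(-(sigma0 / sigma) ** 2)


def marginal_closed_form(d, f, sigma0):
    """Integral of lognormal * sigma_prior over sigma, under the assumed prior."""
    log_ratio = math.log(d / f)
    return math.sqrt(2.0) * sigma0 / (math.pi * d * (log_ratio ** 2 + 2.0 * sigma0 ** 2))


def marginal_numeric(d, f, sigma0, n=20000):
    """Trapezoid estimate, integrating over t = 1/sigma to tame the heavy tail."""
    a = 0.5 * math.log(d / f) ** 2 + sigma0 ** 2
    t_max = 10.0 / math.sqrt(a)            # integrand ~ t * exp(-a t^2) is negligible here
    h = t_max / n
    total = 0.0
    for i in range(1, n):                  # the integrand vanishes at t = 0
        t = i * h
        sigma = 1.0 / t
        total += lognormal(d, f, sigma) * sigma_prior(sigma, sigma0) / (t * t)
    return total * h


if __name__ == "__main__":
    d, f, sigma0 = 2.1, 2.0, 0.01          # hypothetical data point and prediction
    print(marginal_closed_form(d, f, sigma0), marginal_numeric(d, f, sigma0))
```

The closed form decays as an inverse square of ln(*d*/*f*) rather than exponentially, which is how a heavy-tailed prior on σ_{n} makes the score tolerant of outliers.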

##### Kinetic Monte Carlo

KMC simulations (33, 34) were performed on *in silico* models of multiple CFP donors and YFP acceptors (one CFP–one YFP, two CFP–one YFP, and one CFP–two YFP). At each KMC step, one of the following reactions was randomly chosen on the basis of their rates: (a) excitation of a single non-excited YFP (*k*^{X}_{a}) or (b) of a single non-excited CFP (*k*^{X}_{d}); de-excitation of a single excited YFP by (c) fluorescence (*k*^{F}_{a}) or (d) other pathways; or de-excitation of a single excited CFP by (e) fluorescence (*k*^{F}_{d}), (f) energy transfer to a non-excited YFP (*k*^{ET}_{ij}), or (g) other pathways. The rate of decay via pathways other than fluorescence was defined by the CFP and YFP quantum yields of fluorescence *Q*_{d} and *Q*_{a}, which were both set at 0.5. The factor *k*^{ET}_{ij} was equal to

$$k^{ET}_{ij} = \frac{k^{F}_{d}}{Q_{d}} \left(\frac{R_{0}}{R_{ij}}\right)^{6}$$

where the Förster radius *R*_{0} was set at 4.9 nm. *k*^{F}_{d} and *k*^{F}_{a} were set (35) at 0.4 ns^{−1}. Simulations were run for multiple values of *k*_{d}^{X,430} and *k*_{a}^{X,430}, and *k*_{a}^{X,500} was calculated from supplemental Eq. S1. The distance between CFP and YFP was varied between 3 and 10 nm in steps of 0.5 nm. For each choice of the parameters, FRET_{R} was calculated from Eq. 1 based on the results of three 0.1-s KMC runs used to simulate imaging experiments with 0.1-s exposures. The intensities in the CFP, FRET, and YFP images were calculated from the number of reactions of a given type occurring during the simulations. Based on experimental measurements, *S*_{d} and *S*_{a} were set at 0.831 and 0.249, respectively. To account for photobleaching, YFPs were randomly labeled as inactive during the acquisition of the CFP image (with the probability set at 0.3) and then removed from the list of possible reactions. FRET_{R} was thus calculated by averaging quantities over 3200 independent KMC simulations.
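The reaction scheme above can be illustrated with a compact Gillespie-style KMC for the simplest case of one CFP and one YFP. This is a sketch written for this text and is not the simulation code used in the study: it omits photobleaching and the three-image protocol, and the excitation rates and run length in the example are hypothetical. The fluorescence rate (0.4 ns^{−1}), quantum yields (0.5), and Förster radius (4.9 nm) are the values stated above.

```python
import math
import random


def kmc_pair(distance_nm, t_end_ns, kx_d, kx_a, seed=0,
             k_fluo=0.4, q_yield=0.5, r0_nm=4.9):
    """KMC for one donor and one acceptor; rates in ns^-1. Returns event counts."""
    if kx_d <= 0.0:
        raise ValueError("the donor excitation rate must be positive")
    rng = random.Random(seed)
    k_et = (k_fluo / q_yield) * (r0_nm / distance_nm) ** 6  # Förster transfer rate
    k_other = k_fluo * (1.0 / q_yield - 1.0)                # non-fluorescent decay
    donor_excited = acceptor_excited = False
    counts = {"donor_excitations": 0, "donor_fluorescence": 0,
              "acceptor_fluorescence": 0, "transfers": 0}
    t = 0.0
    while True:
        # Enumerate the reactions that are possible in the current state.
        reactions = []
        if donor_excited:
            reactions += [(k_fluo, "d_fluo"), (k_other, "d_other")]
            if not acceptor_excited:
                reactions.append((k_et, "transfer"))
        else:
            reactions.append((kx_d, "d_excite"))
        if acceptor_excited:
            reactions += [(k_fluo, "a_fluo"), (k_other, "a_other")]
        elif kx_a > 0.0:
            reactions.append((kx_a, "a_excite"))
        total = sum(rate for rate, _ in reactions)
        # Exponentially distributed waiting time to the next event.
        t += -math.log(1.0 - rng.random()) / total
        if t > t_end_ns:
            return counts
        # Choose one reaction with probability proportional to its rate.
        threshold = rng.random() * total
        for rate, event in reactions:
            threshold -= rate
            if threshold < 0.0:
                break
        if event == "d_excite":
            donor_excited = True
            counts["donor_excitations"] += 1
        elif event == "d_fluo":
            donor_excited = False
            counts["donor_fluorescence"] += 1
        elif event == "d_other":
            donor_excited = False
        elif event == "transfer":
            donor_excited, acceptor_excited = False, True
            counts["transfers"] += 1
        elif event == "a_excite":
            acceptor_excited = True
        elif event == "a_fluo":
            acceptor_excited = False
            counts["acceptor_fluorescence"] += 1
        else:  # "a_other"
            acceptor_excited = False


if __name__ == "__main__":
    result = kmc_pair(distance_nm=4.9, t_end_ns=5.0e6, kx_d=0.004, kx_a=0.004)
    print(result["donor_fluorescence"] / result["donor_excitations"])
```

With the fluorophores separated by *R*_{0} and excitation much slower than decay, the fraction of donor excitations that end in donor fluorescence should approach *Q*_{d}/(1 + *F*) = 0.25, the value implied by the forward-model expression for the donor intensity.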

##### Molecular Dynamics

MD simulations were performed with GROMACS4 (36) and PLUMED (37, 38), using the AMBER99SB-ILDN (39) all-atom force field. An implicit solvent based on the Generalized Born formalism combined with the Still method (40) for calculating the Born radii was used. Temperature was controlled by the Bussi–Donadio–Parrinello (41) thermostat. A cutoff of 1.5 nm was used for electrostatic and Lennard–Jones interactions. The parallel tempering algorithm (42) was used to accelerate sampling.

##### Parallel Tempering Simulation of GFP and Linker

The crystal structure of recombinant wild-type green fluorescent protein (PDB code 1GFL (43)) was used as a template. Modeler 9v8 (44) was used to model the C-terminal residues (HGMDELYKGA) present in the GFP sequence, but not in the crystal structure, and the GlyAla motif at the N terminus. The first 7 and the last 14 residues were treated as flexible segments based on the fluctuations observed in a preliminary MD run. The positions of the other heavy atoms of the protein were restrained by a harmonic potential, with the spring constant equal to 9 × 10^{3} kJ · mol^{−1} · nm^{−2}. 32 replicas were distributed over a temperature range from 300 to 500 K. Simulations were carried out for an aggregate time of 1 μs.
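A replica temperature ladder for a parallel tempering run such as this one can be generated as follows. The text specifies only the number of replicas and the temperature range; the geometric spacing used here is a common convention and is an assumption, as is the function name.

```python
def temperature_ladder(t_min, t_max, n_replicas):
    """Return n_replicas temperatures from t_min to t_max with geometric spacing.

    Geometric spacing is an assumption made for illustration; the spacing
    actually used in the simulations is not stated in the text.
    """
    if n_replicas < 2 or not 0.0 < t_min < t_max:
        raise ValueError("need at least two replicas and 0 < t_min < t_max")
    ratio = (t_max / t_min) ** (1.0 / (n_replicas - 1))
    return [t_min * ratio ** i for i in range(n_replicas)]


if __name__ == "__main__":
    for temperature in temperature_ladder(300.0, 500.0, 32):
        print(round(temperature, 1))
```

A constant ratio between neighboring temperatures is often chosen because it tends to give similar exchange acceptance rates along the ladder when the heat capacity is roughly constant.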

##### Combined Parallel Tempering and Metadynamics Simulations of Polyprolines

The polyproline constructs YFP–(PRO)_{n}–CFP with *n* = (0, 5, 10, 15, 20) were simulated through a combination of parallel tempering and metadynamics (45–47). 16 to 40 replicas were used to span a temperature range from 300 to 600 K. A collective variable measuring the number of prolines in *cis* and *trans* conformations was used to accelerate proline *cis–trans* isomerization. For an *n*-mer peptide, this collective variable was defined (48) as Ω = ∑_{i=1}^{n−1} cos ω_{i}, where the torsional angle ω formed by the quadruplet Cα–C–N–Cα was equal to 0° for the *cis* isomer and to 180° for the *trans* isomer. The well-tempered (49) variant of metadynamics was used, with a bias factor equal to 30 and an initial deposition rate of 1 kJ · mol^{−1} · ps^{−1}. YFPs and CFPs were not simulated at atomistic resolution; only the residues belonging to the flexible N- and C-terminal fragments defined in the previous paragraph were explicitly modeled. The fluorescent proteins were instead represented as virtual atoms defined in the fixed reference frame of the first and last modeled residues. Restraints on all distances between virtual and other atoms were used to enforce steric repulsion. A reweighting algorithm (50) was applied to obtain the unbiased distribution of distances between the two virtual atoms representing the center of the fluorophores. Simulations were carried out for an aggregate time ranging from 1 to 8 μs.
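The collective variable Ω is straightforward to evaluate from a list of ω torsions. The Python sketch below uses a hypothetical function name and takes the angles in degrees; an *n*-mer contributes *n* − 1 torsions.

```python
import math


def isomerization_cv(omega_degrees):
    """Omega = sum of cos(omega_i) over the n-1 peptide-bond torsions.

    Each cis bond (omega = 0 degrees) contributes +1 and each trans bond
    (omega = 180 degrees) contributes -1.
    """
    return sum(math.cos(math.radians(omega)) for omega in omega_degrees)


if __name__ == "__main__":
    all_trans_pentamer = [180.0] * 4       # four torsions for a 5-mer
    one_cis_pentamer = [180.0, 0.0, 180.0, 180.0]
    print(isomerization_cv(all_trans_pentamer), isomerization_cv(one_cis_pentamer))
```

For ideal angles, Ω equals the number of *cis* bonds minus the number of *trans* bonds, so biasing along Ω pushes the peptide across isomerization barriers one bond at a time.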

##### Parallel Tempering Simulations of Other Proteins

The NMR structures of the THP12-carrier protein from yellow mealworm (PDB code 1C3Y (51)) and the fourth LIM domain of PINCH protein (PDB code 1NYP (52)), as well as the crystal structures of the human TBP-associated factor hTAF(II)28/hTAF(II)18 heterodimer (here abbreviated as TAF28-TAF18) (PDB code 1BH8 (53)) and the ferredoxin:thioredoxin reductase (PDB code 1DJ7 (54)), were used as templates. Modeler was used to model the flexible linkers at the N and C termini. Preliminary short MD simulations at 300 K were carried out to measure the fluctuations in terms of distance-root-mean-square (dRMS) deviation from the native state. A restraint on the dRMS was then used during the parallel tempering simulations to avoid unfolding at high temperatures. The terminal flexible residues were not considered in the dRMS calculation. Multiple replicas (from 16 to 64) were used to span a temperature range from 300 to 600 K. YFPs and CFPs were not simulated explicitly (see previous paragraph).

##### Benchmark

The benchmark was carried out with the open-source IMP (5, 6), version develop-c47408c. The benchmark results and scripts are available online. The method was tested on 11 ternary and 5 quaternary complexes of known structure, selected from 3D Complex (55). For each pair of subunits in the complex, simulated data were generated for all combinations of the N and C termini of the pair, corresponding to 12 and 24 data points for ternary and quaternary complexes, respectively. Low- and high-noise datasets were generated by setting σ_{0} equal to 0.001 and 0.01, respectively. The average of 50 different random draws from the marginal likelihood distribution (Eq. 7) was used to simulate the average from repeated experiments, with the typical standard deviation equal to 0.04 and 0.19 for low- and high-noise data, respectively. The typical standard deviation for *in vivo* data is 0.15. Different percentages (100% and 50%) of the total amount of data were used to assess the role of data sparseness in modeling accuracy. To model linker flexibility, a Gaussian mixture model was fit on a set of 5000 probes of radius equal to 10 Å using 10 Gaussian components. The conformation of each subunit was obtained from the crystal structure of the entire complex; it was represented with Cα atoms for each residue and treated as an independent rigid body. An excluded volume potential was used to avoid steric clashes between subunits. Coordinates, forward model, and likelihood parameters were sampled via a Gibbs sampling scheme combined with a simulated annealing Monte Carlo algorithm. A Monte Carlo move of each rigid subunit consisted of a random rotation and translation of at most 17° and 1.0 Å, respectively. A Monte Carlo move of the forward model parameters *k*_{da}, *I*_{da}, and σ_{0} consisted of a random perturbation of at most 0.3, 0.3, and 0.001, respectively. Temperature was varied between 1.0 and 5.0 *k*_{B}*T*. The initial positions were randomized in a cubic box with dimensions of 100 Å. For each structure and choice of parameters, 20 independent simulated annealing Monte Carlo runs were performed. A total of 2560 tests were conducted, each for a total of 3 × 10^{7} simulated annealing Monte Carlo steps (supplemental Fig. S9).
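The data-point and test counts quoted for the benchmark follow from simple enumeration, as the Python sketch below shows. The decomposition of the 2560 tests into complexes × noise levels × data fractions × linker treatments × runs is a reading of the text (the with/without linker flexibility factor comes from the description of the benchmark results) and should be taken as an assumption; the function names are hypothetical.

```python
from itertools import combinations, product


def n_data_points(n_subunits):
    """Number of synthetic FRET_R values for a complex with n_subunits subunits."""
    subunit_pairs = list(combinations(range(n_subunits), 2))
    terminus_pairs = list(product("NC", repeat=2))   # N-N, N-C, C-N, C-C
    return len(subunit_pairs) * len(terminus_pairs)


def n_benchmark_tests(n_complexes=16, noise_levels=2, data_fractions=2,
                      linker_treatments=2, runs=20):
    """Total number of simulated annealing runs under the assumed factorization."""
    return n_complexes * noise_levels * data_fractions * linker_treatments * runs


if __name__ == "__main__":
    print(n_data_points(3), n_data_points(4), n_benchmark_tests())
```

The enumeration gives 12 and 24 data points for ternary and quaternary complexes and 2560 runs in total, matching the numbers stated above.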

##### In Vivo FRET_{R} Measurements

*Saccharomyces cerevisiae* strains expressing the YFP and CFP tagged proteins were grown and imaged as previously described (25). The fluorescent proteins were linked to the target proteins through unstructured linkers. Exposure times were either 0.08 or 0.1 s for each image, allowing for a prolonged sampling of an ensemble of proteins such that each can adopt different relative orientations of the fluorescent proteins. Expression of all constructs was driven by the strong TEF promoter. Importantly, all constructs were engineered with a nuclear localization signal, resulting in two advantages. First, the uniform nuclear fluorescence was used as an indication of proper protein folding, and second, nuclear localization allowed the cytoplasm to be used to measure a local background in the cell. All constructs were integrated into the host genome to ensure uniform cell-to-cell gene expression. Plasmids used for integrating the constructs are described in supplemental Table S1.

Image analysis was performed with FRETSCAL, an integrated collection of MATLAB scripts with a graphical user interface. FRETSCAL identifies an area of interest (AOI) within the images and calculates FRET_{R} for each AOI. FRETSCAL has user-controlled selection criteria that (i) define the size of the AOI, (ii) set a maximum pixel intensity of the AOI to ensure that selected AOIs are within the linear range of the image acquisition CCD camera, (iii) set a minimum signal-to-background ratio, (iv) set a maximum cutoff value for the width of a Gaussian fit of the intensity values within the AOI, and (v) define other parameters that automate AOI selection and analysis. The software is open source and is available online at the MATLAB Central website.

A single value of FRET_{R} is calculated as a ratio of the mean background subtracted value of the whole nuclear region in the FRET image divided by the projected value if there was no energy transfer. The projected value is calculated from the corresponding nuclei in the YFP and CFP images of the same field. The projected value is the sum of the mean background subtracted value of the whole nuclear region in the YFP image multiplied by the YFP spillover factor plus the mean background subtracted value of the whole nuclear region in the CFP image multiplied by the CFP spillover factor. The spillover factors are determined as described above under the FRET_{R} heading.

All images used in this study are available online from the YRC Public Image Repository. In addition, a composite image is shown that displays the FRETSCAL output. In the online composite image, the nuclei that satisfied the selection criteria used in FRETSCAL are framed in yellow. The corresponding background pixels are shown in gray.

## RESULTS

Our Bayesian approach for determining a macromolecular architecture from *in vivo* FRET data is based on a microscopic interpretation (forward model) of the experimental observable FRET_{R} in terms of structural models and other parameters. It is thus crucial to first assess the validity of the forward model. To do so, we began with computational validation by means of KMC simulations (33, 34) of *in silico* models of multiple CFP donors and YFP acceptors. We then proceeded with comparisons of FRET_{R} predictions from molecular dynamics simulations to *in vivo* experimental data that were collected from yeast cells expressing constructs of CFP and YFP separated by any one of nine defined linkers and protein structures. Finally, the accuracy of structural modeling using synthetic FRET_{R} data and the structures of each individual subunit was assessed via comparison of native molecular architectures of 16 protein complexes with their models computed with our Bayesian approach.

##### Kinetic Monte Carlo Validation of the Forward Model

Based on the physics of fluorescent molecules, we derived master equations that express the excitation and emission of an ensemble of FRET donors and acceptors as visualized with a fluorescent microscope (supplemental Eqs. S2*A* and S2*B*). The FRET_{R} forward model (Eq. 4) is derived from an approximate solution of these master equations in the limit of rapid de-excitation and slow excitation rate. As a validation of this approximation, the value of the FRET_{R} predicted by Eq. 4 was compared with the results of KMC simulations governed by the master equations S2*A* and S2*B*. The KMC simulations described the evolution of an *in silico* model of multiple CFP donors and YFP acceptors and computed FRET_{R} in every excitation/de-excitation regime. For this comparison, we represented CFP and YFP as dimensionless points whose distance and other parameters were varied (“Computational Methods and Experimental Procedures”).

FRET_{R} changed smoothly with the distance between a single CFP and YFP over the range from 3 to 10 nm (Fig. 2). When the CFP excitation rate *k*^{X}_{d} was much smaller than its fluorescence rate *k*^{F}_{d} (*k*^{X}_{d}/*k*^{F}_{d} < 0.05), excellent agreement was found between FRET_{R} from the forward model and KMC simulations, with deviations of less than 1% under all conditions (supplemental Fig. S2*A*).

FRET_{R} was also computed from KMC simulations of systems of two CFPs and one YFP (supplemental Fig. S3*A*) and of one CFP and two YFPs (supplemental Fig. S3*B*). The behavior of FRET_{R} differs in the two cases. When multiple donors surround a single acceptor, adjacent donors compete for non-excited acceptors. In contrast, a relative abundance of acceptors increases the chance of energy transfer. However, the effect on energy transfer is shaped by the relative rates of excitation and emission of the donor and acceptor (supplemental Fig. S3*C*). In the limit of rapid de-excitation and slow excitation rate, the agreement between the forward model and KMC simulations was still excellent in both cases, with deviations of less than 1% under all conditions (supplemental Figs. S2*B* and S2*C*).

In all the KMC simulations mentioned above, we included the effect of YFP photobleaching during the experiment. To examine this effect directly, we investigated a model system of multiple YFP acceptors. As expected, with fewer acceptors available because of photobleaching, energy transfer was attenuated at all CFP–YFP distances (compare the value in supplemental Fig. S4*A* with that in supplemental Fig. S4*B*); again, the FRET_{R} computed by the forward model, which included the effect of YFP photobleaching (supplemental Fig. S4*C*), agreed with that from the KMC simulations that included photobleaching (supplemental Fig. S4*B*).

These comparisons demonstrate that the approximate expression for FRET_{R} given by the forward model (Eq. 4) agrees well with more complex (and far more computationally expensive) simulations based on a more comprehensive physical treatment.

##### In Vivo Experimental Validation of the Forward Model

We further validated the FRET_{R} forward model by comparing the predictions from MD simulations to *in vivo* experimental data that we collected on nine proteins of known structure that were expressed in *S. cerevisiae* (supplemental Table S1). These nine systems included a tandem YFP–CFP; YFP–[Pro]_{n}–CFP in which *n* was equal to 5, 10, 15, or 20 prolines; and four constructs in which CFP and YFP were attached to the N or C termini of proteins of known structure. The latter four constructs were as follows: (i) YFP-THP12-CFP; (ii) YFP-Lim4-CFP; (iii) YFP-TAF28-CFP co-expressed with TAF18; and (iv) FTR117-CFP co-expressed with FTR74-YFP. Finally, a control measurement on the co-expressed but unlinked YFP and CFP pair showed no energy transfer (FRET_{R} = 1.04). In each case hundreds of images of hundreds of cells were acquired. A sample set of images is shown in Fig. 3*A*. All the images used in the dataset are available online at the YRC Public Image Repository. Automated processing of the images was accomplished with the software FRETSCAL. The large number (*n* ≥ 200) of identified AOIs provided a strong statistical foundation for the FRET_{R} measurements used in the Bayesian analysis.

In comparing our forward model against experimental data, we took into account the dependence of the measured FRET_{R} on the presence of multiple conformations in the sample. To do so, we used MD simulations combined with advanced sampling techniques to explore the conformational landscape of the test structures. Although polyproline peptides have often been employed as a spectroscopic ruler, several experimental (56–58) and computational (48, 57) studies have questioned the role of polyproline as a “rigid rod” in a single dominant conformation. Prolyl isomerization from the *trans* to *cis* isomer, whose activation energy is on the order of 10 to 20 kcal/mol (59, 60), converts the left-handed polyproline II helix (PPII) to the more compact right-handed polyproline I helix (PPI). Thus, a heterogeneous population of structures with distinct patterns of *cis* and *trans* isomers of proline is expected to be present in a cell.

The conformational landscape of polyprolines in solution was predicted by all-atom MD simulations in implicit solvent using parallel tempering (42) and metadynamics (45, 46). These techniques allow (i) exhaustive sampling by accelerating proline *trans-cis* isomerization and (ii) estimates of the equilibrium relative populations {*w _{k}*} of the conformers (Eq. 4). The polyproline II helix was favored over the polyproline I helix across all lengths studied (supplemental Fig. S5), in agreement with previous computational (48) and experimental results (61). The conformational landscape of the other constructs was also explored using similar computational approaches. Finally, simulations of the tandem YFP–CFP showed that the linkers at the N and C termini were sufficiently long to allow for orientational averaging of the fluorophores on the time scale of the FRET experiment (supplemental Fig. S6).

To compare the FRET_{R} forward model with experimental data, we calculated the weighted average of *g*(*X*), which depends on the model coordinates (Eq. 4), as the ensemble average over the MD conformations (supplemental Figs. S7*A* and S7*B*). We inferred the forward model parameters *k*_{da} and *I*_{da}, along with the uncertainty σ_{0}, by maximizing the posterior distribution, which was defined based on all nine data points using the mean experimental value *k*^{exp}_{da} = 6.0 and standard error σ^{exp}_{I_{da}} = 2.0. Using the inferred parameters (*k*_{da} = 7.7, *I*_{da} = 6.6, and σ_{0} = 0.05), we found good agreement between the forward model and measured FRET_{R} values (Fig. 3*B*, white and black bars, respectively), except for one outlier, TAF28-TAF18. When the procedure was repeated without the outlier (Fig. 3*B*, gray bars), the inferred parameter values *k*_{da} = 7.5 and *I*_{da} = 6.2 changed minimally, and the data uncertainty σ_{0} dropped from 0.05 to 0.03, as expected upon removal of an outlier data point. Thus, the forward model and associated parameters can effectively account for the influence of components of wide-field fluorescence microscopy, such as installed filter sets and illumination intensity, on the measurement of the efficiency of fluorescence energy transfer. The FRET_{R} forward model can accurately relate FRET_{R} values and fluorophore distances.
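The ensemble average used in this comparison can be sketched as a population-weighted mean of a per-conformer quantity. The snippet below is a minimal illustration and not the IMP implementation: the function name `ensemble_average` and all numerical values are hypothetical, and the per-conformer numbers merely stand in for *g*(*X*) evaluated on each MD conformer.

```python
def ensemble_average(g_values, weights):
    """Population-weighted average of a per-conformer quantity.

    g_values: stand-ins for g(X_k) evaluated on each conformer k.
    weights:  relative populations w_k (normalized here to sum to 1).
    """
    if len(g_values) != len(weights):
        raise ValueError("g_values and weights must have the same length")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return sum(w * g for w, g in zip(weights, g_values)) / total


# Hypothetical example: three conformers with populations 0.6, 0.3, and 0.1.
g_avg = ensemble_average([1.0, 2.0, 4.0], [0.6, 0.3, 0.1])
```

Normalizing inside the function means that unnormalized populations, such as raw frame counts per conformer, can be passed directly.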

Finally, to improve the computational efficiency of the forward model, we fit an efficient Gaussian mixture model to the expensive all-atom MD simulations of the linker (SI), without a significant decrease in the accuracy of the forward model (supplemental Fig. S8).

##### Benchmark of Modeling Accuracy

The accuracy of the molecular architectures modeled based on synthetic FRET_{R} data, given the knowledge of the structure of each subunit, was mapped with the aid of known structures for 16 protein complexes of three and four subunits (55). For this benchmark, we used synthetic FRET_{R} data that were computed by first applying our FRET_{R} forward model (Eq. 4) to all pairs of N and C termini of each subunit in the native structures and then adding noise (Eq. 7). The accuracy was defined as the Cα dRMS deviation between the native structure and the most probable model found by the sampling algorithm in IMP, averaged over 20 independent runs. The use of synthetic data in this benchmark allowed us to map the accuracy of structural modeling from FRET_{R} data as a function of the level of data noise and sparseness, with (supplemental Table S2) and without (supplemental Table S3) taking the linker flexibility into account. A flowchart explaining the different steps of the benchmark is presented in supplemental Fig. S9. It is conceivable, however, that the accuracy of models computed from real FRET_{R} data might be worse than that from the simulated data, despite our effort to include noise in the simulated data. Real FRET_{R} data for the FTR117-FTR74 case were not used as a benchmark case, because the flexibility and the resulting demand on sampling made it difficult to run the benchmark a very large number of times.
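The dRMS accuracy measure can be illustrated with a short sketch. It assumes the standard definition of distance RMS (the root-mean-square difference between corresponding interatomic distances in two structures, taken over all atom pairs); the exact set of Cα pairs used in the benchmark is not specified here, and the function name `drms` and the toy coordinates are hypothetical.

```python
import math
from itertools import combinations


def drms(coords_a, coords_b):
    """Distance RMS between two structures with the same atom ordering.

    Compares every pairwise distance in coords_a with the corresponding
    distance in coords_b, so no superposition of the structures is needed.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("structures must contain the same number of atoms")
    if len(coords_a) < 2:
        raise ValueError("at least two atoms are required")
    squared_diffs = [
        (math.dist(coords_a[i], coords_a[j])
         - math.dist(coords_b[i], coords_b[j])) ** 2
        for i, j in combinations(range(len(coords_a)), 2)
    ]
    return math.sqrt(sum(squared_diffs) / len(squared_diffs))


# Toy coordinates in angstroms (hypothetical, three "atoms" only).
native = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
shifted = [(x + 10.0, y - 5.0, z + 2.0) for x, y, z in native]
stretched = [(0.0, 0.0, 0.0), (6.0, 0.0, 0.0), (0.0, 8.0, 0.0)]

drms_shifted = drms(native, shifted)      # rigid translation of the native
drms_stretched = drms(native, stretched)  # uniformly scaled copy
```

Because only internal distances enter the measure, a rigidly translated or rotated copy of the native structure scores zero, whereas a deformed copy does not.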

When 100% of the data points were used, the accuracy of the predicted structure of the complex was 13.9 Å and 14.8 Å for ternary and quaternary complexes, respectively. This accuracy was marginally reduced to 16.1 Å and 17.4 Å when noisy data were used. The weak dependence on the noise level resulted from the small standard error of FRET_{R} obtained by averaging FRET_{R} over many (∼100) independent experiments. In contrast, the accuracy was strongly dependent on data sparseness. When only 50% of the data points were used, the accuracy decreased to a range from 20.4 Å to 21.5 Å, depending on the number of subunits and the noise level. This result emphasizes the need to compile as much information as possible from *in vivo* measurements.
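The weak dependence on noise is consistent with the usual 1/√n scaling of the standard error of a mean. The sketch below assumes independent experiments that share a common per-experiment standard deviation; the function name `standard_error` and the value 0.3 are hypothetical and are not taken from the measurements.

```python
import math


def standard_error(per_experiment_sd, n_experiments):
    """Standard error of the mean of n independent, equally noisy experiments."""
    if n_experiments < 1:
        raise ValueError("n_experiments must be at least 1")
    return per_experiment_sd / math.sqrt(n_experiments)


# Hypothetical per-experiment spread; averaging shrinks the error as 1/sqrt(n).
hypothetical_sd = 0.3
for n in (1, 10, 100):
    print(n, round(standard_error(hypothetical_sd, n), 3))
```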

Because FRET_{R} data provide information about the distance between the protein termini, we expected much greater accuracy in determining the positions of the terminal residues (dRMS_{T} in supplemental Tables S2 and S3). Indeed, the accuracy was 5.2 Å to 9.3 Å for ternary complexes and 7.1 Å to 11.6 Å for quaternary complexes, depending on the noise level.

Finally, the accuracy was also affected by the linker flexibility (supplemental Table S3). In particular, the positions of the tagged termini were inferred with greater accuracy (〈ΔdRMS_{T}〉 = 2.2 Å) when the simulated data were created and the sampling was performed without the linker flexibility. However, the inclusion of the linker flexibility had a relatively small effect on the accuracy (〈ΔdRMS〉 = 1.1 Å). Thus, the presence of a flexible linker, while allowing orientational averaging of the fluorophores (supplemental Fig. S6), does not dramatically affect the accuracy of our approach.

## DISCUSSION

Many observables have been introduced to quantify *in vivo* FRET (23). Fluorescence lifetime microscopy overcomes many of the problems associated with epifluorescence microscopy, but it is technically challenging and applicable only for complexes with a robust fluorescence signal (21, 22, 62–65). Many FRET indexes have successfully processed steady-state epifluorescence images to yield significant insights into the dynamics of protein associations in live cells (22, 23, 66). However, this work represents the first case in which the supporting theory and structural predictions from a FRET metric have been modeled and tested both *in silico*, with molecular dynamics simulations, and *in vivo*, with benchmark protein complexes.

Although our Bayesian approach could be adapted to incorporate other FRET metrics, or even FRET efficiencies derived from fluorescence lifetime microscopy, we chose the metric FRET_{R}. To our knowledge, this is the only live-cell FRET metric in which structural arrangements predicted from *in vivo* measurements were directly confirmed *in vitro* by means of single particle analysis. FRET_{R} measurements of the γ-tubulin complex in yeast predicted the location of the N and C termini of two proteins, Spc97 and Spc98, in the complex (25). Fluorescent proteins linked to these ends were later directly visualized at the predicted locations via electron microscopy (67). FRET_{R} has also been used to analyze the structure of the yeast spindle pole body (24, 68) and cohesion architecture (69), and more recently the organization of the yeast kinetochore (26). Of course, FRET_{R} also has limitations, and it is most appropriate for experimental conditions in which the proteins in a complex are uniformly tagged with a fluorescent protein, gene expression is tightly regulated and typically driven from native promoters, and free unincorporated proteins do not interfere with the FRET measurements (17, 23–25).

We showed that our FRET_{R} forward model is accurate, first by comparing the predicted value (Eq. 4) with that computed from KMC simulations of an *in silico* model of multiple CFP donors and YFP acceptors. Excellent agreement was found for typical conditions of fluorescence microscopy,^{3} where CFPs and YFPs were not saturated by the incident illumination. In addition, KMC simulations on systems of multiple donors and acceptors (supplemental Fig. S3) illustrated the expected asymmetry of the one CFP–two YFP and two CFP–one YFP experiments and suggested that data from experiments in which the positions of YFP and CFP are swapped provide independent and thus useful information and should not be averaged (24).

We also validated the forward model using experimental data by comparing predicted FRET_{R} to *in vivo* data collected on nine proteins of known structure, including fluorescent proteins separated by polyproline peptides of different lengths (Fig. 3*B*). Accurate modeling of the experimental data required explicit modeling of multiple conformations in the sample (supplemental Figs. S5 and S7). Although in this study the relative populations {*w _{k}*} were predetermined by MD simulations, in general they can be inferred along with the coordinates of the system and other parameters using multi-state Bayesian scoring functions (27–29).

We demonstrated that the Bayesian approach is robust with respect to the presence of outlier data points. Collecting FRET_{R} data in living cells requires tagging a complex with CFP–YFP pairs that might perturb the system and affect its structure. As a result, a data point might not correctly represent the native structure of the complex and thus might be inconsistent with other information, including other FRET_{R} measurements. For example, the FRET_{R} value predicted for TAF28-TAF18 was significantly different from the observed one (Fig. 3*B*). This discrepancy might arise from several other factors besides structural changes due to the insertion of the fluorophores, such as non-converged MD simulations and inaccuracy of the molecular mechanics force field. Importantly, for each data point, an uncertainty parameter is either inferred or marginalized (31), allowing those points that are not consistent with the bulk of the data to be properly down-weighted in the construction of the model.

The results of the benchmark (Fig. 4 and supplemental Table S2) indicated the importance of using multiple data points to model a structure. Synthetic FRET_{R} data between all pairs of subunit N and C termini determined the structure of ternary and quaternary complexes with an accuracy of ∼15 Å (Cα dRMS), whereas using only 50% of the data decreased the accuracy to ∼20 Å. The greatest structural uncertainty is in the orientation between the subunits. The accuracy can thus be improved if further data are collected. Typically, only the protein termini of each subunit are tagged with GFP; the total number of FRET_{R} data points per complex that can be used in structural modeling is thus *N*(2*N* − 1), where *N* is the number of subunits of the complex. However, in principle fluorescent proteins can be inserted at positions other than the protein termini, although such insertions might be more likely to alter the structure of the complex.
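The count *N*(2*N* − 1) follows from choosing unordered pairs among the 2*N* tagged termini. The sketch below enumerates these pairs explicitly; the function name `terminus_pairs` and the subunit labels are hypothetical, and the enumeration includes the N–C pair within each subunit, as the formula implies.

```python
from itertools import combinations


def terminus_pairs(subunits):
    """Enumerate all unordered pairs of tagged termini in a complex.

    Each subunit contributes an N and a C terminus, giving 2N termini and
    2N * (2N - 1) / 2 = N * (2N - 1) pairs in total.
    """
    termini = [(name, end) for name in subunits for end in ("N", "C")]
    return list(combinations(termini, 2))


# Hypothetical subunit labels for a ternary and a quaternary complex.
ternary_pairs = terminus_pairs(["A", "B", "C"])
quaternary_pairs = terminus_pairs(["A", "B", "C", "D"])
print(len(ternary_pairs), len(quaternary_pairs))
```

A ternary complex therefore offers at most 15 such data points and a quaternary complex 28, so using only half of them leaves roughly 7 and 14 measurements, respectively.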

Like any search-based approach, our method requires a sufficiently thorough configurational sampling algorithm. Here, we used advanced sampling techniques, including Gibbs sampler MC with simulated annealing (70) and MD combined with parallel tempering and metadynamics (47). We explicitly assessed whether sampling was sufficiently thorough by demonstrating the convergence of the model as a function of the number of sampled models (supplemental Fig. S7).

Compared with other methods that mostly deal with *in vitro* FRET data (15, 16), our approach treats all noise sources that characterize measurements in living cells, accounts for sample heterogeneity, and is robust to outlier data points. Furthermore, our approach is more general, because it allows the use of *in vivo* data collected in both bulk experiments, where multiple CFPs and YFPs contribute to the measured FRET_{R}, and single-molecule experiments (71), in which a single CFP–YFP pair is present; in the latter application, the observed FRET_{R} is not the ratio of average intensities in the different images (Eq. 4), but the average of FRET_{R} measured on samples in which the YFP is either active or photobleached.

Finally, we implemented our method in IMP, an open-source platform for integrative structural modeling of macromolecular systems (5). Through IMP, FRET_{R} data can be combined with information obtained via other methods, such as electron microscopy, chemical and cysteine cross-linking, small-angle x-ray scattering, proteomics, and other theoretical or statistical analyses, in an integrative or hybrid approach (5, 72). The uncertainty in the orientation of the subunits based on FRET_{R} data alone could thus be resolved by considering additional complementary data, even if sparse and noisy. The Bayesian approach is expected to be even more useful in integrative modeling than in modeling based on FRET_{R} data alone, because data from different experiments can in principle be properly weighted and thus seamlessly integrated.

## Acknowledgments

We are grateful to David Sivak and Charles Asbury for commenting on the manuscript and to Ben Webb for help in setting up the benchmark. We also thank Peter Schurmann for the clone of FTR, Peter L. Davies for the clone of THP12, Christophe Romier for the hTAF clones, and Jun Qin for the clone of the Lim4.

## Footnotes

Author contributions: M.B., T.N.D., E.G.M., and A.S. designed research; M.B. and E.G.M. performed research; M.B., B.A.S., M.R., D.J., R.R., and E.G.M. contributed new reagents or analytic tools; M.B., R.P., S.K., D.R., and E.G.M. analyzed data; M.B., T.N.D., E.G.M., and A.S. wrote the paper.

^{*}This work was funded by NIH Grant Nos. R01 GM083960 (A.S.), U54 RR022220 (A.S.), and P41 GM103533 (E.M. and T.D.) and was supported by SNSF through Grant Nos. PBZHP3-133388 and PA00P3_139727 (R.P.).

This article contains supplemental material.

^{2}Bonomi, M., Pellarin, R., Spill, Y., Nilges, M., DeGrado, W., and Sali, A., in preparation.

^{3}For example, when collecting data for the yeast spindle pole body, 1.5 mW of light from the source illuminates the sample, corresponding to a photon per fluorophore every ∼50 ns. The excitation rate is of course smaller than implied by this photon flux (*k*_{d}^{X}/*k*_{d}^{F} < 0.05), because the YFP and CFP absorption cross-sections are typically much smaller than the fluorophore area.

^{1}The abbreviations used are:

- FRET: Förster resonance energy transfer
- FRET_{R}: index of relative FRET in cells
- IMP: Integrative Modeling Platform
- dRMS: distance-root-mean-square
- GFP: green fluorescent protein
- CFP: cyan fluorescent protein
- YFP: yellow fluorescent protein
- KMC: kinetic Monte Carlo
- MD: molecular dynamics
- AOI: area of interest

- Received May 5, 2014.
- Revision received August 13, 2014.

- © 2014 by The American Society for Biochemistry and Molecular Biology, Inc.
