# Geophysics

- Society of Exploration Geophysicists

## Abstract

Full-waveform inversion (FWI) is a challenging data-fitting procedure based on full-wavefield modeling to extract quantitative information from seismograms. High-resolution imaging at half the propagated wavelength is expected. Recent advances in high-performance computing and multifold/multicomponent wide-aperture and wide-azimuth acquisitions make 3D acoustic FWI feasible today. Key ingredients of FWI are an efficient forward-modeling engine and a local differential approach, in which the gradient and the Hessian operators are efficiently estimated. Local optimization does not, however, prevent convergence of the misfit function toward local minima because of the limited accuracy of the starting model, the lack of low frequencies, the presence of noise, and the approximate modeling of the wave-physics complexity. Different hierarchical multiscale strategies are designed to mitigate the nonlinearity and ill-posedness of FWI by incorporating progressively shorter wavelengths in the parameter space. Synthetic and real-data case studies address reconstructing various parameters, from P- and S-wave velocities to density, anisotropy, and attenuation.

## INTRODUCTION

Seismic waves bring to the surface information gathered on the physical properties of the earth. Since the discovery of modern seismology at the end of the 19th century, the main discoveries have arisen from using traveltime information (Oldham, 1906; Gutenberg, 1914; Lehmann, 1936). Then there was a hiatus until the 1980s for amplitude interpretation, when global seismic networks could provide enough calibrated seismograms to compute accurate synthetic seismograms using normal-mode summation. Differential seismograms estimated through the Born approximation have been used as perturbations for matching long-period seismograms, which can provide high-resolution upper-mantle tomography (Gilbert and Dziewonski, 1975; Woodhouse and Dziewonski, 1984). The sensitivity or Fréchet derivative matrix, i.e., the partial derivative of seismic data with respect to the model parameters, is explicitly estimated before proceeding to inversion of the linearized system. The normal-mode description allows a limited number of parameters to be inverted (a few hundred parameters), which makes the optimization procedure feasible through explicit sensitivity matrix estimation in spite of the high number of seismograms.

Meanwhile, exploration seismology has taken up the challenge of high-resolution imaging of the subsurface by designing dense, multifold acquisition systems. Construction of the sensitivity matrix is prohibitively expensive because the number of parameters exceeds 10,000. Instead, another road has been taken to perform high-resolution imaging. Using the exploding-reflector concept, and after some kinematic corrections, amplitude summation has provided detailed images of the subsurface for reservoir determination and characterization (Claerbout, 1971, 1976). The sum of the traveltimes from a specific point of the interface toward the source and the receiver should coincide with the time of large amplitudes in the seismogram. The reflectivity as an amplitude attribute of related seismic traces at the selected point of the reflector provides the migrated image needed for seismic stratigraphic interpretation. Although migration is more broadly a concept for converting seismic data recorded in the time-space domain into images of physical properties, we often refer to it as the geometric description of the short wavelengths of the subsurface. A velocity macromodel or background model provides the kinematic information required to focus waves inside the medium.

The limited offsets recorded by seismic reflection surveys and the limited-frequency bandwidth of seismic sources make seismic imaging poorly sensitive to intermediate wavelengths (Jannane et al., 1989). This is the motivation behind a two-step workflow: construct the macromodel using kinematic information, and then the amplitude projection using different types of migrations (Claerbout and Doherty, 1972; Gazdag, 1978; Stolt, 1978; Baysal et al., 1983; Yilmaz, 2001; Biondi and Symes, 2004). This procedure is efficient for relatively simple geologic targets in shallow-water environments, although more limited performances have been achieved for imaging complex structures such as salt domes, subbasalt targets, thrust belts, and foothills. In complex geologic environments, building an accurate velocity background model for migration is challenging. Various approaches for iterative updating of the macromodel reconstruction have been proposed (Snieder et al., 1989; Docherty et al., 2003), but they remain limited by the poor sensitivity of the reflection seismic data to the large and intermediate wavelengths of the subsurface.

Simultaneous with the global seismology inversion scheme, Lailly (1983) and Tarantola (1984) recast the migration imaging principle of Claerbout (1971, 1976) as a local optimization problem, the aim of which is least-squares minimization of the misfit between recorded and modeled data. They show that the gradient of the misfit function along which the perturbation model is searched can be built by crosscorrelating the incident wavefield emitted from the source and the back-propagated residual wavefields. The perturbation model obtained after the first iteration of the local optimization looks like a migrated image obtained by reverse-time migration. One difference is that the seismic wavefield recorded at the receiver is back propagated in reverse time migration, whereas the data misfit is back propagated in the waveform inversion of Lailly (1983) and Tarantola (1984). When added to the initial velocity, the velocity perturbations lead to an updated velocity model, which is used as a starting model for the next iteration of minimizing the misfit function. The impressive amount of data included in seismograms (each sample of a time series must be considered) is involved in gradient estimation. This estimation is performed by summation over sources, receivers, and time.

Waveform-fitting imaging was quite computer demanding at that time, even for 2D geometries (Gauthier et al., 1986). However, it has been applied successfully in various studies using forward-modeling techniques such as reflectivity techniques in layered media (Kormendi and Dietrich, 1991), finite-difference techniques (Kolb et al., 1986; Ikelle et al., 1988; Crase et al., 1990; Pica et al., 1990; Djikpéssé and Tarantola, 1999), finite-element methods (Choi et al., 2008), and extended ray theory (Cary and Chapman, 1988; Koren et al., 1991; Sambridge and Drijkoningen, 1992). A less computationally intensive approach is achieved by Jin et al. (1992) and Lambaré et al. (1992), who establish the theoretical connection between ray-based generalized Radon reconstruction techniques (Beylkin, 1985; Bleistein, 1987; Beylkin and Burridge, 1990) and least-squares optimization (Tarantola, 1987). By defining a specific norm in the data space, which varies from one focusing point to the next, they were able to recast the asymptotic Radon transform as an iterative least-squares optimization after diagonalizing the Hessian operator. Applications on 2D synthetic data and real data are provided (Thierry et al., 1999b; Operto et al., 2000) and 3D extension is possible (Thierry et al., 1999a; Operto et al., 2003) because of efficient asymptotic forward modeling (Lucio et al., 1996). Because the Green's functions are computed in smoothed media with the ray theory, the forward problem is linearized with the Born approximation, and the optimization is iterated linearly, which means the background model remains the same over the iterations. These imaging methods are generally called migration/inversion or true-amplitude prestack depth migration (PSDM). The main difference with the waveform-inversion methods we describe is that the smooth background model does not change over iterations and only the single scattered wavefield is modeled by linearizing the forward problem.

Alternatively, the full information content in the seismogram can be considered in the optimization. This leads us to full-waveform inversion (FWI), where full-wave equation modeling is performed at each iteration of the optimization in the final model of the previous iteration. All types of waves are involved in the optimization, including diving waves, supercritical reflections, and multiscattered waves such as multiples. The techniques used for the forward modeling vary and include volumetric methods such as finite-element methods (Marfurt, 1984; Min et al., 2003), finite-difference methods (Virieux, 1986), finite-volume methods (Brossier et al., 2008), and pseudospectral methods (Danecek and Seriani, 2008); boundary integral methods such as reflectivity methods (Kennett, 1983); generalized screen methods (Wu, 2003); discrete wavenumber methods (Bouchon et al., 1989); generalized ray methods such as WKBJ and Maslov seismograms (Chapman, 1985); full-wave theory (de Hoop, 1960); and diffraction theory (Klem-Musatov and Aizenberg, 1985).

FWI has not been recognized as an efficient seismic imaging technique because pioneering applications were restricted to seismic reflection data. For short-offset acquisition, the seismic wavefield is rather insensitive to intermediate wavelengths; therefore, the optimization cannot adequately reconstruct the true velocity structure through iterative updates. Only when a sufficiently accurate initial model is provided can waveform-fitting converge to the velocity structure through such updates. For sampling the initial model, sophisticated investigations with global and semiglobal techniques (Koren et al., 1991; Jin and Madariaga, 1993, 1994; Mosegaard and Tarantola, 1995; Sambridge and Mosegaard, 2002) have been performed. The rather poor performance of these investigations that arises from insensitivity to intermediate wavelengths has led many researchers to believe that this optimization technique is not particularly efficient.

Only with the benefit of long-offset and transmission data to reconstruct the large and intermediate wavelengths of the structure has FWI reached its maturity as highlighted by Mora (1987, 1988), Pratt and Worthington (1990), Pratt et al. (1996), and Pratt (1999). FWI attempts to characterize a broad and continuous wavenumber spectrum at each point of the model, reunifying macromodel building and migration tasks into a single procedure. Historical crosshole and wide-angle surface data examples illustrate the capacity of simultaneous reconstruction of the entire spatial spectrum (e.g., Pratt, 1999; Ravaut et al., 2004; Brenders and Pratt, 2007a). However, robust application of FWI to long-offset data remains challenging because of increasing nonlinearities introduced by wavefields propagated over several tens of wavelengths and various incidence angles (Sirgue, 2006).

Here, we consider the main aspects of FWI. First, we review the forward-modeling problem that underlies FWI. Efficient numerical modeling of the full seismic wavefield is a central issue in FWI, especially for 3D problems.

In the second part, we review the main theoretical aspects of FWI based on a least-squares local optimization approach. We follow the compact matrix formalism for its simplicity (Pratt et al., 1998; Pratt, 1999), which leads to a clear interpretation of the gradient and the Hessian of the objective function. Once the gradient is estimated, we review different optimization algorithms used to compute the perturbation model. We conclude the methodology section with the source-estimation problem in FWI.

In the third part, we review some key features of FWI. First, we highlight the relationships between the experimental setup (source bandwidth, acquisition geometry) and the spatial resolution of FWI. The resolution analysis provides the necessary guidelines to design the multiscale FWI algorithms required to mitigate the nonlinearity of FWI. We discuss the pros and cons of the time and frequency domains for efficient multiscale algorithms. We provide a few words concerning the parallel implementation of FWI techniques because these are computer demanding. Then we review some alternatives to the least-squares criterion and the Born linearization. A key issue of FWI is the initial model from which the local optimization is started. We also discuss several tomographic approaches to building a starting model.

In the fourth part, we review the main case studies of FWI subdivided into three categories of case studies: acoustic, multiparameter, and three dimensional. Finally, we discuss the future challenges raised by the revival of interest in FWI that has been shown by the exploration and the earthquake-seismology communities.

## THE FORWARD PROBLEM

Let us first introduce the notations for the forward problem, namely, modeling the full seismic wavefield. The reader is referred to Robertsson et al. (2007) for an up-to-date series of publications on modern seismic-modeling methods.

We use matrix notations to denote the partial-differential operators of the wave equation (Marfurt, 1984; Carcione et al., 2002). The most popular direct method to discretize the wave equation in the time and frequency domains is the finite-difference method (Virieux, 1986; Levander, 1988; Graves, 1996; Operto et al., 2007), although more sophisticated finite-element or finite-volume approaches can be considered. This is especially true when accurate boundary conditions through unstructured meshes must be implemented (e.g., Komatitsch and Vilotte, 1998; Dumbser and Kaser, 2006).

In the time domain, we have

$$\mathbf{M}(\mathbf{x})\,\frac{\partial^{2}\mathbf{u}(\mathbf{x},t)}{\partial t^{2}} = \mathbf{A}(\mathbf{x})\,\mathbf{u}(\mathbf{x},t) + \mathbf{s}(\mathbf{x},t), \qquad (1)$$

where **M** and **A** are the mass and the stiffness matrices, respectively (Marfurt, 1984). The source term is denoted by **s** and the seismic wavefield by **u**. In the acoustic approximation, **u** generally represents pressure, whereas in the elastic case **u** generally represents horizontal and vertical particle velocities. Time is denoted by *t* and the spatial coordinates by **x**. Equation 1 generally is solved with an explicit time-marching algorithm: the value of the wavefield at time step *n* + 1 is inferred from the values of the wavefield at the two previous time steps *n* and *n* − 1.
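Such an explicit time-marching scheme can be sketched as follows for the 1D constant-density acoustic wave equation with second-order finite differences. This is a minimal toy example (grid size, velocity model, and Ricker source parameters are arbitrary illustrative choices); with a lumped (diagonal) mass matrix, each step reduces to one stencil application.

```python
import numpy as np

# Sketch: explicit second-order time marching for the 1D acoustic wave
# equation u_tt = c^2 u_xx + s.  With a diagonal mass matrix, each step is
# u^{n+1} = 2 u^n - u^{n-1} + dt^2 M^{-1} (A u^n + s^n).

nx, nt = 200, 400
dx = 10.0                       # grid spacing (m), illustrative value
c = np.full(nx, 2000.0)         # P-wave velocity model (m/s), illustrative
dt = 0.4 * dx / c.max()         # satisfies the CFL stability condition

u_prev = np.zeros(nx)
u_curr = np.zeros(nx)
src_pos = nx // 2

for n in range(nt):
    lap = np.zeros(nx)          # discrete Laplacian (stiffness term A u)
    lap[1:-1] = (u_curr[2:] - 2.0 * u_curr[1:-1] + u_curr[:-2]) / dx**2
    t = n * dt                  # Ricker wavelet source term
    a = (np.pi * 15.0 * (t - 0.08))**2
    s = (1.0 - 2.0 * a) * np.exp(-a)
    u_next = 2.0 * u_curr - u_prev + dt**2 * (c**2 * lap)
    u_next[src_pos] += dt**2 * s
    u_prev, u_curr = u_curr, u_next

print(np.abs(u_curr).max())     # finite: the scheme remained stable
```

Because the update at step *n* + 1 involves only the wavefields at steps *n* and *n* − 1, the memory footprint is two wavefield snapshots, which is what makes the time-domain approach attractive for large 3D grids.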

In the frequency domain, the wave equation reduces to a system of linear equations, the right-hand side of which is the source and the solution of which is the seismic wavefield. This system can be written compactly as

$$\mathbf{B}(\omega)\,\mathbf{u}(\mathbf{x},\omega) = \mathbf{s}(\mathbf{x},\omega), \qquad (2)$$

where **B** is the impedance matrix (Marfurt, 1984). The sparse complex-valued matrix **B** has a symmetric pattern, although it is not symmetric because of absorbing boundary conditions (Hustedt et al., 2004; Operto et al., 2007).

Equation 2 can be solved by a decomposition of **B** such as lower and upper (LU) triangular decomposition, leading to direct-solver techniques. The advantage of the direct-solver approach is that once the decomposition is performed, equation 2 is efficiently solved for multiple sources using forward and backward substitutions (Marfurt, 1984). The direct-solver approach is efficient for 2D forward problems (Jo et al., 1996; Stekl and Pratt, 1998; Hustedt et al., 2004). However, the time and memory complexities of LU factorization and its limited scalability on large-scale distributed memory platforms prevent use of the approach for large-scale 3D problems (i.e., problems involving more than 10 million unknowns; Operto et al. 2007).
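The multisource efficiency of the direct-solver approach can be sketched as follows: the sparse matrix is factorized once, and the LU factors are reused for each right-hand side. The matrix below is a toy real-valued 2D Helmholtz operator with Dirichlet edges (no absorbing boundaries, hence it stays symmetric), not a production impedance matrix; the grid size, frequency, and velocity are arbitrary illustrative values.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Sketch: direct-solver strategy for the frequency-domain system B u = s.
# One LU factorization of B is reused across several source terms.
n = 50                                    # grid is n x n
h, omega, c = 10.0, 2 * np.pi * 5.0, 2000.0
I = sp.identity(n)
L1 = sp.diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(n, n)) / h**2
lap = sp.kron(I, L1) + sp.kron(L1, I)     # 2D Laplacian (5-point stencil)
B = (lap + (omega / c)**2 * sp.identity(n * n)).tocsc()

lu = spla.splu(B)                         # factorization: done once

rhs = np.zeros((n * n, 3))                # three point sources
for k, pos in enumerate([n * n // 4, n * n // 2, 3 * n * n // 4]):
    rhs[pos, k] = 1.0
# forward/backward substitutions: cheap, one pair per source
u = np.column_stack([lu.solve(rhs[:, k]) for k in range(3)])

print(np.abs(B @ u - rhs).max())          # residual of all three solves
```

The factorization dominates the cost; each extra source adds only a pair of triangular solves, which is the property exploited by frequency-domain FWI with many shots.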

Iterative solvers provide an alternative approach for solving the time-harmonic wave equation (Riyanti et al., 2006, 2007; Plessix, 2007; Erlangga and Herrmann, 2008). Iterative solvers currently are implemented with Krylov subspace methods (Saad, 2003) that are preconditioned by solving the damped time-harmonic wave equation. The solution of the damped wave equation is computed with one cycle of a multigrid. The main advantage of the iterative approach is its low memory requirement; the main drawback is the difficulty of designing an efficient preconditioner because the impedance matrix is indefinite. To our knowledge, the extension to elastic wave equations still needs to be investigated. As for the time-domain approach, the time complexity of the iterative approach increases linearly with the number of sources, in contrast to the direct-solver approach.

An intermediate approach between the direct and iterative methods consists of a hybrid direct-iterative approach based on a domain decomposition method and the Schur complement system (Saad, 2003; Sourbier et al., 2008). The iterative solver is used to solve the reduced Schur complement system, the solution of which is the wavefield at interface nodes between subdomains. The direct solver is used to factorize local impedance matrices that are assembled on each subdomain. Briefly, the hybrid approach provides a compromise in terms of memory savings and multisource-simulation efficiency between the direct and the iterative approaches.

The last possible approach to compute monochromatic wavefields is to perform the modeling in the time domain and extract the frequency-domain solution either by discrete Fourier transform in the loop over the time steps (Sirgue et al., 2008) or by phase-sensitivity detection once the steady-state regime is reached (Nihei and Li, 2007). One advantage of the approach based on the discrete Fourier transform is that an arbitrary number of frequencies can be extracted within the loop over time steps at minimal extra cost. Second, time windowing can be easily applied, which is not the case when the modeling is performed in the frequency domain. Time windowing allows the extraction of specific arrivals for FWI (early arrivals, reflections, PS converted waves), which is often useful to mitigate the nonlinearity of the inversion by judicious data preconditioning (Sears et al., 2008; Brossier et al., 2009a).
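The on-the-fly extraction of monochromatic wavefields by discrete Fourier transform can be sketched as follows; the random snapshots below stand in for the wavefield produced by a time-domain modeling loop, and the frequencies and step count are arbitrary illustrative values.

```python
import numpy as np

# Sketch: extracting monochromatic wavefields on the fly during time-domain
# modeling.  Inside the loop over time steps, each requested frequency
# accumulates u_hat(w) += u(t_n) * exp(-i w t_n) * dt at negligible cost.

dt, nt = 0.001, 2000
freqs = [5.0, 10.0, 15.0]                  # frequencies to extract (Hz)
npts = 50                                  # "wavefield" size

rng = np.random.default_rng(0)
u_hat = {f: np.zeros(npts, dtype=complex) for f in freqs}
trace = np.empty((nt, npts))

for n in range(nt):                        # stands in for the FD time loop
    u = rng.standard_normal(npts)          # wavefield snapshot at step n
    trace[n] = u
    for f in freqs:
        u_hat[f] += u * np.exp(-2j * np.pi * f * n * dt) * dt

# Check one frequency against a reference DFT computed after the fact
f = freqs[1]
ref = (trace * np.exp(-2j * np.pi * f * dt * np.arange(nt))[:, None]).sum(0) * dt
print(np.abs(u_hat[f] - ref).max())        # agrees to round-off
```

Only the running sums `u_hat` need be stored, one complex array per requested frequency, which is why an arbitrary number of frequencies can be extracted at minimal extra cost.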

Among all of these possible approaches, the iterative-solver approach theoretically has the best time complexity (here, “complexity” denotes how the computational cost of an algorithm grows with the size of the computational domain) if the number of iterations is independent of the frequency (Erlangga and Herrmann, 2008). In practice, the number of iterations generally increases linearly with frequency. In this case, the time complexities of the time-domain and the iterative-solver approach are equivalent (Plessix, 2007).

The reader is referred to Plessix (2007, 2009) and Virieux et al. (2009) for more detailed complexity analyses of seismic modeling based on different numerical approaches. A discussion on the pros and cons of time-domain versus frequency-domain seismic modeling with application to FWI is also provided in Vigh and Starr (2008b) and Warner et al. (2008).

Source implementation is an important issue in FWI. The spatial reciprocity of Green's functions can be exploited in FWI to mitigate the number of forward problems if the number of receivers is significantly smaller than the number of sources (Aki and Richards, 1980). The reciprocity of Green's functions also allows matching data emitted by explosions and recorded by directional sensors, with pressure synthetics computed for directional forces (Operto et al., 2006). Of note, the spatial reciprocity is satisfied theoretically for the unidirectional sensor and the unidirectional impulse source. However, the spatial reciprocity of the Green's functions can also be used for explosive sources by virtue of the superposition principle. Indeed, explosions can be represented by double dipoles or, in other words, by four unidirectional impulse sources.
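A minimal numerical check of spatial reciprocity, using a symmetric stand-in for the impedance matrix (absorbing boundary conditions, which break the symmetry, are omitted here; the matrix and node indices are arbitrary illustrative choices):

```python
import numpy as np

# Sketch: for a symmetric impedance matrix B, Green's functions satisfy
# spatial reciprocity g(r; s) = g(s; r), so the cheaper of "one solve per
# source" and "one solve per receiver" can be chosen in FWI.

rng = np.random.default_rng(1)
n = 40
A = rng.standard_normal((n, n))
B = A + A.T + n * np.eye(n)        # symmetric, well-conditioned stand-in

s, r = 3, 27                       # source and receiver node indices
e_s, e_r = np.eye(n)[s], np.eye(n)[r]
g_sr = np.linalg.solve(B, e_s)[r]  # shoot from s, record at r
g_rs = np.linalg.solve(B, e_r)[s]  # shoot from r, record at s
print(abs(g_sr - g_rs))            # ~0: reciprocity holds
```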

A final comment concerns the relationship between the discretization required to solve the forward problem and that required to reconstruct the physical parameters. Often during FWI, these two discretizations are identical, although it is recommended that the fingerprint of the forward problem be kept minimal in FWI.

The properties of the subsurface that we want to quantify are embedded in the coefficients of the matrices **M**, **A**, or **B** of equations 1 and 2. The relationship between the seismic wavefield and the parameters is nonlinear and can be written compactly through the operator **G**, defined as

$$\mathbf{d} = \mathbf{G}(\mathbf{m}), \qquad (3)$$

where **d** denotes the modeled seismic data and **m** the model parameters.

## FWI AS A LEAST-SQUARES LOCAL OPTIMIZATION

We follow the simplest view of FWI based on the so-called length method (Menke, 1984). For information on probabilistic maximum likelihood or generalized inverse formulations, the reader is referred to Menke (1984), Tarantola (1987), Scales and Smith (1994), and Sen and Stoffa (1995).

We define the misfit vector Δ**d** = **d**_obs − **d**_cal of dimension *N* by the differences at the receiver positions between the recorded seismic data **d**_obs and the modeled seismic data **d**_cal for each source-receiver pair of the seismic survey. The modeled data are related to the modeled wavefield **u** by a detection operator **R** that extracts the wavefield values at the receiver positions: **d**_cal = **Ru**. Here, **m** represents some physical parameters of the subsurface discretized over the computational domain.

In the simplest case corresponding to the monoparameter acoustic approximation, the model parameters are the P-wave velocities defined at each node of the numerical mesh used to discretize the inverse problem. In the extreme case, the model parameters correspond to the 21 elastic moduli that characterize linear triclinic elastic media, the density, and some memory variables that characterize the anelastic behavior of the subsurface (Toksöz and Johnston, 1981). The most common discretization consists of projection of the continuous model of the subsurface on a multidimensional Dirac comb, although a more complex basis can be considered (see Appendix A in Pratt et al. [1998] for a discussion on alternative parameterizations). We define a norm *C*(**m**) of this misfit vector.

### The Born approximation and the linearization of the inverse problem

The least-squares norm is given by

$$C(\mathbf{m}) = \frac{1}{2}\,\Delta\mathbf{d}^{\dagger}\,\Delta\mathbf{d}, \qquad (4)$$

where † denotes the adjoint (transpose conjugate) operator.

In the time domain, the implicit summation in equation 4 is performed over the number of source-channel pairs and the number of time samples in the seismograms, where a channel is one component of a multicomponent sensor. In the frequency domain, the summation over frequencies replaces that over time. In the time domain, the misfit vector is real valued; in the frequency domain, it is complex valued.

The minimum of the misfit function *C*(**m**) is sought in the vicinity of the starting model **m**₀. The model **m** of dimension *M* can be written as the sum of the starting model and a perturbation model: **m** = **m**₀ + Δ**m**. In the following, the model **m** is real valued.

A second-order Taylor-Lagrange development of the misfit function in the vicinity of **m**₀ gives

$$C(\mathbf{m}_0 + \Delta\mathbf{m}) = C(\mathbf{m}_0) + \nabla C(\mathbf{m}_0)^{t}\,\Delta\mathbf{m} + \frac{1}{2}\,\Delta\mathbf{m}^{t}\,\mathbf{H}(\mathbf{m}_0)\,\Delta\mathbf{m} + O\!\left(\lVert\Delta\mathbf{m}\rVert^{3}\right), \qquad (5)$$

where **H** = ∂²*C*/∂**m**² is the Hessian. Setting to zero the derivative of equation 5 with respect to Δ**m**,

$$\nabla C(\mathbf{m}_0) + \mathbf{H}(\mathbf{m}_0)\,\Delta\mathbf{m} = 0, \qquad (6)$$

gives the perturbation model that minimizes the quadratic approximation of the misfit function:

$$\Delta\mathbf{m} = -\mathbf{H}(\mathbf{m}_0)^{-1}\,\nabla C(\mathbf{m}_0). \qquad (7)$$

The perturbation model is searched in the opposite direction of the steepest ascent (i.e., the gradient) of the misfit function at point **m**₀. When the misfit function is quadratic, which is the case for linear forward problems such as **d** = **G**·**m**, the expression of the perturbation model of equation 7 gives the minimum of the misfit function in one iteration. In FWI, the relationship between the data and the model is nonlinear, and the inversion must be iterated several times to converge toward the minimum of the misfit function.
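The one-iteration convergence for a linear forward problem can be verified numerically; the sketch below builds a random linear operator **G** (an illustrative stand-in, not a wave-equation operator) and applies a single Newton step from a zero starting model.

```python
import numpy as np

# Sketch: for a linear forward problem d = G m with a quadratic misfit,
# the Newton update dm = -H^{-1} grad C reaches the minimum in one step.

rng = np.random.default_rng(2)
N, M = 60, 10
G = rng.standard_normal((N, M))
m_true = rng.standard_normal(M)
d_obs = G @ m_true

m0 = np.zeros(M)                       # starting model
residual = d_obs - G @ m0
grad = -G.T @ residual                 # gradient of C = 0.5 ||d_obs - G m||^2
H = G.T @ G                            # Hessian (second-order term vanishes)
dm = -np.linalg.solve(H, grad)

print(np.abs(m0 + dm - m_true).max())  # ~0 after a single iteration
```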

### Normal equations: The Newton, Gauss-Newton, and steepest-descent methods

#### Basic equations

The derivative of the misfit function with respect to the model parameter *m_l* gives

$$\frac{\partial C(\mathbf{m})}{\partial m_l} = -\operatorname{Re}\left\{\sum_{i=1}^{N}\left(\frac{\partial d_{\mathrm{cal}\,i}}{\partial m_l}\right)^{*} \Delta d_i\right\}, \qquad (8)$$

which can be written in matrix form as

$$\nabla C_{\mathbf{m}} = -\operatorname{Re}\{\mathbf{J}^{\dagger}\,\Delta\mathbf{d}\}, \qquad (9)$$

where **J** is the sensitivity or the Fréchet derivative matrix. In equation 9, the dimension of the gradient vector is *M*. Taking the real part ensures a real-valued model update when the data residuals are complex valued, as in the frequency domain.

Differentiation of the gradient expression 8 with respect to the model parameters gives the following expression in matrix form for the Hessian (see Pratt et al. [1998] for details):

$$\mathbf{H} = \frac{\partial^{2} C}{\partial \mathbf{m}^{2}} = \operatorname{Re}\{\mathbf{J}^{\dagger}\mathbf{J}\} + \operatorname{Re}\left\{\frac{\partial \mathbf{J}^{t}}{\partial \mathbf{m}^{t}}\left(\Delta\mathbf{d}^{*}\;\cdots\;\Delta\mathbf{d}^{*}\right)\right\}. \qquad (10)$$

For linear problems (**d** = **G**·**m**), the second term in the Hessian is zero because the second-order derivative of the data with respect to the model parameters is zero. Most of the time, this second-order term is also neglected for nonlinear inverse problems. In the following, the remaining term in the Hessian, i.e., **H**_a = Re{**J**†**J**}, is referred to as the approximate Hessian. Using it in equation 7 leads to the Gauss-Newton method, for which the perturbation model is given by

$$\Delta\mathbf{m} = -\mathbf{H}_a^{-1}\,\nabla C_{\mathbf{m}}. \qquad (11)$$

Alternatively, the inverse of the Hessian in equation 11 can be replaced by a scalar α, the so-called step length, leading to the gradient or steepest-descent method. The step length can be estimated by a line-search method, for which a linear approximation of the forward problem is used (Gauthier et al., 1986; Tarantola, 1987). In the linear approximation framework, the second-order Taylor-Lagrange development of the misfit function gives

$$\alpha = \frac{\nabla C_{\mathbf{m}_0}^{t}\,\nabla C_{\mathbf{m}_0}}{\nabla C_{\mathbf{m}_0}^{t}\,\operatorname{Re}\{\mathbf{J}^{\dagger}\mathbf{J}\}\,\nabla C_{\mathbf{m}_0}}, \qquad (12)$$

where the sensitivity matrix **J** is evaluated in the starting model **m**₀.
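A steepest-descent iteration with such a quadratic line search can be sketched on a toy linear problem (the matrix **J**, the data, and the iteration count below are illustrative stand-ins):

```python
import numpy as np

# Sketch: steepest descent where the step length alpha comes from an exact
# quadratic line search along -grad, alpha = (g.g) / (g.(J^T J).g), i.e.,
# the minimizer along the search direction for a linearized forward problem.

rng = np.random.default_rng(3)
N, M = 80, 12
J = rng.standard_normal((N, M))
m_true = rng.standard_normal(M)
d_obs = J @ m_true

m = np.zeros(M)
for _ in range(500):
    g = -J.T @ (d_obs - J @ m)          # gradient of the least-squares misfit
    alpha = (g @ g) / (g @ (J.T @ (J @ g)))
    m = m - alpha * g

print(np.abs(J @ m - d_obs).max())      # misfit driven toward zero
```

Unlike the one-step Newton solution, the number of steepest-descent iterations needed grows with the condition number of **J**ᵀ**J**, which is why gradient scaling by the diagonal Hessian terms matters in practice.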

Pratt et al. (1998) illustrate how quality and rate of convergence of the inversion depend significantly on whether the Newton, Gauss-Newton, or gradient method is used. Importantly, they show how the gradient method can fail to converge toward an acceptable model regardless of the number of iterations, unlike the Newton and Gauss-Newton methods. They interpret this failure as resulting from the difficulty of estimating a reliable step length. However, gradient methods can be improved significantly by scaling (i.e., dividing) the gradient by the diagonal terms of the approximate Hessian **H**_a.

#### Numerical algorithms: The conjugate-gradient method

Over the last decade, the most popular local optimization algorithm for solving FWI problems has been the conjugate-gradient method (Mora, 1987; Tarantola, 1987; Crase et al., 1990). Here, the model is updated at iteration *n* along a direction **c**⁽ⁿ⁾, a linear combination of the gradient at iteration *n* and the direction of the previous iteration:

$$\mathbf{c}^{(n)} = \nabla C_{\mathbf{m}}^{(n)} + \beta^{(n)}\,\mathbf{c}^{(n-1)}.$$

The scalar β⁽ⁿ⁾ is given, for instance, by the formula of Polak and Ribière,

$$\beta^{(n)} = \frac{\left(\nabla C_{\mathbf{m}}^{(n)} - \nabla C_{\mathbf{m}}^{(n-1)}\right)^{t} \nabla C_{\mathbf{m}}^{(n)}}{\nabla C_{\mathbf{m}}^{(n-1)\,t}\,\nabla C_{\mathbf{m}}^{(n-1)}},$$

which makes successive descent directions conjugate. For a quadratic misfit function with exact line searches, convergence is achieved in at most as many iterations as there are unknowns *M*.
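The conjugate-gradient update can be sketched on a quadratic test misfit (the Hessian and source vector below are random illustrative stand-ins), combining the Polak-Ribière weight with an exact line search:

```python
import numpy as np

# Sketch of nonlinear conjugate gradient: the search direction combines the
# current gradient with the previous direction through the Polak-Ribiere
# weight beta, applied here to a quadratic test misfit C = 0.5 m.H.m - b.m.

rng = np.random.default_rng(4)
M = 20
A = rng.standard_normal((M, M))
H = A @ A.T + M * np.eye(M)             # SPD Hessian of the test function
b = rng.standard_normal(M)

def grad(m):
    return H @ m - b                    # gradient of the test misfit

m = np.zeros(M)
g = grad(m)
c = -g                                  # first direction: steepest descent
for _ in range(100):
    alpha = -(g @ c) / (c @ (H @ c))    # exact line search on a quadratic
    m = m + alpha * c
    g_new = grad(m)
    beta = max(0.0, (g_new @ (g_new - g)) / (g @ g))   # Polak-Ribiere (PR+)
    c = -g_new + beta * c
    g = g_new

print(np.abs(H @ m - b).max())          # ~0: minimum of the misfit reached
```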

#### Numerical algorithms: Quasi-Newton algorithms

Finite approximations of the Hessian and its inverse can be computed using quasi-Newton methods such as the BFGS algorithm (named after its discoverers Broyden, Fletcher, Goldfarb, and Shanno; see Nocedal [1980] for a review). The main idea is to update the approximation of the Hessian or its inverse at each iteration *n* from the gradients computed at the previous iterations. In these approaches, the approximation of the Hessian or its inverse is explicitly formed.

For large-scale problems such as FWI, in which the cost of storing and working with the approximation of the Hessian matrix is prohibitive, a limited-memory variant of the quasi-Newton BFGS method known as the L-BFGS algorithm allows the action of the approximate inverse Hessian on the gradient to be computed recursively from a limited number of the most recent gradient and model updates, without ever forming the matrix explicitly (Nocedal, 1980).
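The L-BFGS two-loop recursion can be sketched on a quadratic test misfit; the memory is limited to the five most recent pairs of model and gradient increments (the memory length, problem size, and matrices are arbitrary illustrative choices):

```python
import numpy as np

# Sketch of the L-BFGS two-loop recursion: the product of the inverse-Hessian
# approximation with the gradient is assembled from a few stored (dm, dg)
# pairs, so no Hessian approximation is ever stored as a matrix.

def lbfgs_direction(g, s_list, y_list):
    """Return -(approx. inverse Hessian) @ g from model steps s, grad steps y."""
    q = g.copy()
    alphas = []
    for s, y in zip(reversed(s_list), reversed(y_list)):   # newest first
        a = (s @ q) / (y @ s)
        q -= a * y
        alphas.append(a)
    if y_list:                                  # initial Hessian scaling
        s, y = s_list[-1], y_list[-1]
        q *= (s @ y) / (y @ y)
    for (s, y), a in zip(zip(s_list, y_list), reversed(alphas)):  # oldest first
        b = (y @ q) / (y @ s)
        q += (a - b) * s
    return -q

rng = np.random.default_rng(5)
M = 30
A = rng.standard_normal((M, M))
H = A @ A.T + M * np.eye(M)                 # SPD Hessian of the test misfit
b = rng.standard_normal(M)

m, s_hist, y_hist = np.zeros(M), [], []
g = H @ m - b
for _ in range(80):
    d = lbfgs_direction(g, s_hist[-5:], y_hist[-5:])   # keep 5 pairs
    alpha = -(g @ d) / (d @ (H @ d))                   # exact line search
    m_new = m + alpha * d
    g_new = H @ m_new - b
    s_hist.append(m_new - m); y_hist.append(g_new - g)
    m, g = m_new, g_new

print(np.abs(H @ m - b).max())                         # ~0 at convergence
```

With an empty history, the recursion falls back to a steepest-descent direction, which is how the first FWI iteration would behave.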

#### Newton and Gauss-Newton algorithms

The more accurate, although more computationally intensive, Gauss-Newton and Newton algorithms are described in Akcelik (2002), Askan et al. (2007), Askan and Bielak (2008), and Epanomeritakis et al. (2008), with an application to a 2D synthetic model of the San Fernando Valley using the SH-wave equation. At each nonlinear FWI iteration, a matrix-free conjugate-gradient method is used to solve the reduced Karush-Kuhn-Tucker (KKT) optimal system, which turns out to be similar to the normal equation system (equation 11). Neither the full Hessian nor the sensitivity matrix is formed explicitly; only the application of the Hessian to a vector needs to be performed at each iteration of the conjugate-gradient algorithm.

Application of the Hessian to a vector requires performing two forward problems per shot for the incident and the adjoint wavefields (Akcelik, 2002). Because these two simulations are performed at each iteration of the conjugate-gradient algorithm, an efficient preconditioner must be used to mitigate the number of iterations of the conjugate-gradient algorithm. Epanomeritakis et al. (2008) use a variant of the L-BFGS method for the preconditioner of the conjugate gradient, in which the curvature of the objective function is updated at each iteration of the conjugate gradient using the Hessian-vector products collected over the iterations.

### Regularization and preconditioning of inversion

As widely stressed, FWI is an ill-posed problem, meaning that an infinite number of models matches the data. Some regularization is conventionally applied to the inversion to make it better posed (Menke, 1984; Tarantola, 1987; Scales et al., 1990). The misfit function can be augmented as follows:

$$C(\mathbf{m}) = \frac{1}{2}\,\Delta\mathbf{d}^{\dagger}\,\Delta\mathbf{d} + \frac{\epsilon}{2}\,(\mathbf{m}-\mathbf{m}_{\mathrm{prior}})^{t}\,\mathbf{D}^{t}\mathbf{D}\,(\mathbf{m}-\mathbf{m}_{\mathrm{prior}}),$$

where ε controls the trade-off between the data misfit and the penalty applied to the model **m**, hence defining the so-called Tikhonov regularization (see Hansen [1998] for a review on regularization methods). The operator **D** is typically the identity or a first- or second-order spatial difference operator, penalizing respectively the distance to the prior model **m**_prior or the roughness of the model.

For linear problems (assuming the second term of the Hessian is neglected), minimization of the weighted misfit function gives the perturbation model

$$\Delta\mathbf{m} = \left(\operatorname{Re}\{\mathbf{J}^{\dagger}\mathbf{J}\} + \epsilon\,\mathbf{D}^{t}\mathbf{D}\right)^{-1} \left(\operatorname{Re}\{\mathbf{J}^{\dagger}\Delta\mathbf{d}\} - \epsilon\,\mathbf{D}^{t}\mathbf{D}\,(\mathbf{m}_0 - \mathbf{m}_{\mathrm{prior}})\right).$$

For the steepest-descent algorithm, the regularized solution for the perturbation model is given by

$$\Delta\mathbf{m} = \alpha\left(\operatorname{Re}\{\mathbf{J}^{\dagger}\Delta\mathbf{d}\} - \epsilon\,\mathbf{D}^{t}\mathbf{D}\,(\mathbf{m}_0 - \mathbf{m}_{\mathrm{prior}})\right).$$

A more complete and rigorous mathematical derivation of these equations is presented in Tarantola (1987).
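The effect of the regularization weight can be sketched on a toy underdetermined linear problem; **D** below is a first-difference roughness operator, the matrices are illustrative stand-ins, and increasing ε yields a smoother perturbation model:

```python
import numpy as np

# Sketch: Tikhonov-regularized Gauss-Newton step for a linearized problem,
# dm = (J^T J + eps D^T D)^{-1} J^T dd, with D a first-difference roughness
# operator; eps trades data fit against model smoothness.

rng = np.random.default_rng(6)
N, M = 15, 40                        # underdetermined: fewer data than unknowns
J = rng.standard_normal((N, M))
m_true = np.sin(np.linspace(0, np.pi, M))   # a smooth "true" model
dd = J @ m_true                      # residuals w.r.t. a zero starting model

D = (np.eye(M) - np.eye(M, k=1))[:-1]       # first-difference operator

def gn_step(eps):
    return np.linalg.solve(J.T @ J + eps * D.T @ D, J.T @ dd)

def rough(m):
    return np.linalg.norm(D @ m)     # roughness measure ||D m||

print(rough(gn_step(1e-3)) > rough(gn_step(10.0)))  # larger eps, smoother model
```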

Alternative regularizations based on minimizing the total variation of the model have been developed mainly by the image-processing and electromagnetic communities. The aim of the total variation or edge-preserving regularization is to preserve both the blocky and the smooth characteristics of the model (Vogel and Oman, 1996; Vogel, 2002). Total variation (TV) regularization is conventionally implemented by minimizing the ℓ₁-norm of the spatial gradient of the model, which penalizes oscillations while preserving sharp contrasts.

### The gradient and Hessian in FWI: Interpretation and computation

A clear interpretation of the gradient and Hessian is given by Pratt et al. (1998) using the compact matrix formalism of frequency-domain FWI. A review is given here. Let us consider the forward-problem equation given by equation 2 for one source and one frequency. In the following, we assume that the model is discretized in a finite-difference sense using a uniform grid of nodes.

Differentiation of equation 2 with respect to the model parameter *m_l* gives

$$\mathbf{B}\,\frac{\partial \mathbf{u}}{\partial m_l} = -\,\frac{\partial \mathbf{B}}{\partial m_l}\,\mathbf{u}. \qquad (21)$$

An analogy between the forward-problem equation 2 and equation 21 shows that the partial-derivative wavefield ∂**u**/∂*m_l* can be computed by solving one forward problem, the source of which is given by **f**⁽ˡ⁾ = −(∂**B**/∂*m_l*)**u**, the so-called virtual secondary source. The matrix ∂**B**/∂*m_l* represents the radiation pattern of the diffractor associated with the parameter *m_l*. Because the coefficients of **B** generally have local support, the matrix ∂**B**/∂*m_l* is sparse, with nonzero coefficients centered on the node of the parameter *m_l*.

The spatial support of the virtual secondary source is therefore centered on the position of the model parameter *m_l*. As a result, the partial-derivative wavefield can be interpreted as the wavefield emitted by the source **s** and scattered by a point diffractor located at the position of *m_l*.

Because the gradient is formed by the zero-lag correlation between the partial-derivative wavefield and the data residual, these two quantities have the same meaning: they represent perturbation wavefields scattered by the heterogeneities that are missing in the starting model **m**₀.

The approximate Hessian is formed by the zero-lag correlation between the partial-derivative wavefields, e.g., equation 10. The diagonal terms of the approximate Hessian contain the zero-lag autocorrelation and therefore represent the square of the amplitude of the partial-derivative wavefield. Scaling the gradient by these diagonal terms removes from the gradient the geometric amplitude of the partial-derivative wavefields and the residuals. In the framework of surface seismic experiments, the effects of the scaling performed by the diagonal Hessian provide a good balance between shallow and deep perturbations. A diagonal Hessian is shown in Ravaut et al. (2004, their Figure 12). The off-diagonal terms of the Hessian are computed by correlation between partial-derivative wavefields associated with different model parameters. For 1D media, the approximate Hessian is a band-diagonal matrix, and the numerical bandwidth decreases as the frequency increases. The off-diagonal elements of the approximate Hessian account for the limited-bandwidth effects that result from the experimental setup. Applying its inverse to the gradient can be interpreted as a deconvolution of the gradient from these limited-bandwidth effects.

An illustration of the scaling and deconvolution effects performed by the diagonal Hessian on one hand and the approximate Hessian on the other hand is provided in Figure 1. A single inclusion in a homogeneous background model (Figure 1a) is reconstructed by one iteration of FWI using a gradient method preconditioned by the diagonal terms of the approximate Hessian (Figure 1b) and by a Gauss-Newton method (Figure 1c). The image of the inclusion is sharper when the Gauss-Newton algorithm is used. The corresponding approximate Hessian and its diagonal elements are shown in Figure 2. An interpretation of the second term of the Hessian (equation 10) is given in Pratt et al. (1998). This term accounts for multiscattering events such as multiples in the reconstruction procedure. Through iterations, we might correct effects caused by this missing term as long as convergence is achieved.

Although equation 21 gives some clear insight into the physical sense of the gradient of the misfit function, it is impractical from a computer-implementation point of view; explicitly forming the sensitivity matrix with equation 21 would require performing as many forward problems as there are model parameters *M*, which is intractable for realistic applications.

Inserting the expression of the partial derivative of the wavefield (equation 21) in the expression of the gradient of equation 9 gives the following expression of the gradient:

$$\nabla C_{m_l} = \operatorname{Re}\left\{\mathbf{u}^{t}\,\frac{\partial \mathbf{B}^{t}}{\partial m_l}\,\mathbf{v}\right\}, \qquad (24)$$

where **v** = **B**⁻¹ᵗ Δ**d*** is the back-propagated residual wavefield. Only two forward problems per source therefore need be solved: one to compute the incident wavefield **u** and one to back propagate the corresponding residuals. The underlying imaging principle is that of reverse-time migration, which relies on the correspondence of the arrival times of the incident wavefield and the back-propagated wavefield at the position of the heterogeneity (Claerbout, 1971; Lailly, 1983; Tarantola, 1984).

The approach that consists of computing the gradient of the misfit function without explicitly building the sensitivity matrix is often referred to as the adjoint-wavefield approach by the geophysical community. The underlying mathematical theory is the adjoint-state method of the optimization theory (Lions, 1972; Chavent, 1974). Interesting links exist between optimization techniques used in FWI and assimilation methods, widely used in fluid mechanics (Talagrand and Courtier, 1987). A detailed review of the adjoint-state method with illustrations from several seismic problems is given in Tromp et al. (2005), Askan (2006), Plessix (2006), and Epanomeritakis et al. (2008). The expression of the gradient of the frequency-domain FWI misfit function (equation 24) is derived from the adjoint-state method and the method of the Lagrange multiplier in Appendix A.

For multiple sources and multiple frequencies, the gradient is formed by summing the single-source, single-frequency contributions over sources and frequencies.

By comparing the expressions of the gradient in equations 9 and 24, we can conclude that one element of the sensitivity matrix is given by the product of the incident wavefield emitted from the shot position *s* and the Green's function computed for an impulse source located at the receiver position *r*. If the sensitivity matrix must be built, one forward problem for the incident wavefield and one forward problem per receiver position must be computed. Therefore, the number of simulations needed to build the sensitivity matrix can be higher than that required by gradient estimation if the number of nonredundant receiver positions significantly exceeds the number of nonredundant shots, or vice versa. Computing each term of the sensitivity matrix is also required to compute the diagonal terms of the approximate Hessian.

Although the matrix-free adjoint approach is widely used in exploration seismology, the earthquake-seismology community tends to favor the scattering-integral method, which is based on the explicit building of the sensitivity matrix (Chen et al., 2007). The linear system relating the model perturbation to the data perturbation is formed and solved with a conjugate-gradient algorithm such as LSQR (Paige and Saunders, 1982a). A comparative complexity analysis of the adjoint approach and the scattering-integral approach is presented in Chen et al. (2007), who conclude that the scattering-integral approach outperforms the adjoint approach for a regional tomographic problem. Indeed, the superiority of one approach over the other is highly dependent on the acquisition geometry (the relative number of sources and receivers) and the number of model parameters.

The formalism in equation 25 has been kept as general as possible and can relate to the acoustic or the elastic wave equation. In the acoustic case, the wavefield is the scalar pressure wavefield; in the elastic case, the wavefield ideally is formed by the components of the particle velocity and the pressure if the sensors have four components. Equation 25 can be translated into the time domain using Parseval's relation. The expression of the gradient in equation 25 can be developed equivalently using a functional analysis (Tarantola, 1984). The partial derivatives of the wavefield with respect to the model parameters are provided by the kernel of the Born integral that relates the model perturbations to the wavefield perturbations. Multiplying the transpose of the resulting operator by the conjugate of the data residuals provides the expression of the gradient. The two formalisms (matrix and functional) give the same expression, provided the discretizations of the partial differential operators are performed consistently in the two approaches. The derivation in the frequency domain of the gradient of the misfit function using the two formalisms is explicitly illustrated by Gelis et al. (2007).

### Source estimation

Source excitation is generally unknown and must be considered as an unknown of the problem (Pratt, 1999). The source wavelet can be estimated by solving a linear inverse problem because the relationship between the seismic wavefield and the source is linear (equation 2). The solution for the source is then obtained in closed form by linear least squares.
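A standard form of this estimate (Pratt, 1999) reduces, at one frequency, to a projection of the data onto the wavefield modeled for a unit source; a minimal sketch (assuming **g** denotes that unit-source wavefield at the receivers) is:

```python
import numpy as np

def estimate_source(g, d):
    """Least-squares estimate of the complex source term at one frequency.

    g : (ntraces,) wavefield modeled for a unit (delta) source
    d : (ntraces,) observed data
    Minimizing ||s*g - d||^2 over the complex scalar s gives
    s = (g^H d) / (g^H g); np.vdot conjugates its first argument.
    """
    return np.vdot(g, d) / np.vdot(g, g)
```

The estimate is exact when the data are the modeled wavefield scaled by a single complex constant, and otherwise gives the best-fitting constant in the least-squares sense.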

Alternatively, new misfit functions have been designed so that the inversion becomes independent of the source function (Lee and Kim, 2003; Zhou and Greenhalgh, 2003). The governing idea is to normalize each seismogram of a shot gather by the sum of all of the seismograms. This removes the dependency of the normalized data on the source and modifies the misfit function. The drawback is that this approach requires an explicit estimate of the sensitivity matrix; the normalized residuals cannot be back propagated because they do not satisfy the wave equation.

## SOME KEY FEATURES OF FWI

### Resolution power of FWI and relationship to the experimental setup

The interpretation of the partial-derivative wavefield as the wavefield scattered by the missing heterogeneities provides some connections between FWI and generalized diffraction tomography (Devaney and Zhang, 1991; Gelius et al., 1991). Diffraction tomography recasts the imaging as an inverse Fourier transform (Devaney, 1982; Wu and Toksoz, 1987; Sirgue and Pratt, 2004; Lecomte et al., 2005). Let us consider a homogeneous background model of velocity c₀.

Equation 29 has the form of a truncated Fourier series in which the integration variable is the scattering wavenumber vector, given by the sum of the wavenumber vectors at the source and at the receiver.

We can express the scattering wavenumber vector as **k** = (2ω/c₀)cos(θ/2)**n**, where θ is the aperture angle between the source and receiver slowness vectors and **n** is a unit vector in the direction of the sum of these slowness vectors.

Several key conclusions can be derived from equation 30. First, one frequency and one aperture in the data space map one wavenumber in the model space. Therefore, frequency and aperture have redundant control of the wavenumber coverage. This redundancy increases with aperture bandwidth. Pratt and Worthington (1990), Sirgue and Pratt (2004), and Brenders and Pratt (2007a) propose decimating this wavenumber-coverage redundancy in frequency-domain FWI by limiting the inversion to a few discrete frequencies. This data reduction leads to computationally efficient frequency-domain FWI and allows managing a compact volume of data, two clear advantages with respect to time-domain FWI. The guideline for selecting the frequencies to be used in the FWI is that the maximum wavenumber imaged by a frequency matches the minimum vertical wavenumber imaged by the next frequency (Sirgue and Pratt, 2004, their Figure 3). According to this guideline, the frequency interval increases with the frequency.
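This selection rule can be sketched as follows (a simplified implementation assuming a homogeneous medium, a single target depth, and a maximum half-offset; the function name and arguments are illustrative):

```python
import math

def select_frequencies(f_min, f_max, depth, half_offset_max):
    """Frequency list for frequency-domain FWI (after Sirgue and Pratt, 2004).

    Up to a constant, the minimum vertical wavenumber imaged by a frequency f
    is alpha_min * f, with alpha_min = z / sqrt(z^2 + h_max^2) set by the
    widest aperture.  Choosing f_{n+1} = f_n / alpha_min makes the minimum
    wavenumber of f_{n+1} match the maximum wavenumber of f_n, so the
    frequency interval grows with frequency.
    """
    alpha_min = depth / math.hypot(depth, half_offset_max)
    freqs = [f_min]
    while freqs[-1] / alpha_min <= f_max:
        freqs.append(freqs[-1] / alpha_min)
    return freqs
```

For example, with a target depth equal to the maximum half-offset, each frequency is √2 times the previous one, so only a handful of frequencies span the band.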

Second, the low frequencies of the data and the wide apertures help resolve the intermediate and large wavelengths of the medium. At the other end of the spectrum, the maximum wavenumber, constrained by θ = 0 and the highest frequency, leads to a maximum resolution of half a wavelength if normal-incidence reflections are recorded. Third, for surface acquisitions, long offsets are helpful for sampling the small horizontal wavenumbers of dipping structures such as flanks of salt domes.

A frequency-domain sensitivity kernel for point sources, referred to as the wavepath by Woodward (1992), is shown in Figure 5. The interference picture shows zones of equal phase over which the residuals are back projected during FWI. The central zone of elliptical shape is the first Fresnel zone, of width on the order of √(λL), where λ is the wavelength and L the source-receiver distance.

To illustrate the relationship between FWI resolution and the experimental setup, we show the FWI reconstruction of an inclusion in a homogeneous background for three acquisition geometries (Figure 6). In the crosshole experiment (Figure 6a), FWI reconstructs a low-pass-filtered (smoothed) version of the vertical section of the inclusion and a band-pass-filtered version of its horizontal section. This anisotropy of the imaging results from the transmission-like reconstruction of the vertical wavenumbers and the reflection-like reconstruction of the horizontal wavenumbers of the inclusion. In the double crosshole experiment (Figure 6b), the vertical and horizontal wavenumber spectra of the inclusion are partly filled because of the combined use of transmission and reflection wavepaths. Of note, both sections still show a deficit of low wavenumbers; because the maximum horizontal source-receiver offset is two times the vertical one (Figure 6b), the aperture illuminations of the horizontal and vertical wavenumbers differ. For the surface acquisition (Figure 6c), the vertical section exhibits a strong deficit of low wavenumbers because of the lack of large-aperture illumination; of note, the peak-to-peak amplitude of the perturbation is not fully recovered in Figure 6c. The horizontal section of the inclusion is poorly recovered because of the poor illumination of the horizontal wavenumbers from the surface.

The ability of the wide apertures to resolve the large wavelengths of the medium has prompted some studies to consider long-offset acquisitions as a promising approach to design well-posed FWI problems (Pratt et al., 1996; Ravaut et al., 2004). For example, equation 30 suggests that the long wavelengths of the medium can be resolved whatever the source bandwidth, provided that wide-aperture data are recorded by the acquisition geometry. However, all of the conclusions derived so far rely on the Born approximation. The Born approximation requires that the starting model allows matching the observed traveltimes with an error less than half the period (Beydoun and Tarantola, 1988). If not, the so-called cycle-skipping artifacts will lead to convergence toward a local minimum (Figure 7).
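The half-period criterion can be checked trace by trace; a minimal illustrative helper (not from the article) is:

```python
def cycle_skipping_safe(traveltime_error, frequency):
    """True if the starting model predicts the observed traveltime to within
    half a period (T/2 = 1/(2f)) at the given frequency, the classic
    condition for local optimization to avoid cycle skipping."""
    return abs(traveltime_error) < 0.5 / frequency

# Lowering the starting frequency relaxes the condition: a 0.1-s traveltime
# error is acceptable at 3 Hz (T/2 = 0.167 s) but not at 7 Hz (T/2 = 0.071 s).
```

This is why inversions start from the lowest usable frequency: the tolerance on the starting-model traveltime error shrinks in direct proportion to the frequency.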

Pratt et al. (2008) translate this condition in terms of a relative time error, which becomes more restrictive as the number of propagated wavelengths increases.

### Multiscale FWI: Time domain versus frequency domain

FWI can be implemented in the time domain or in the frequency domain. FWI was originally developed in the time domain (Tarantola, 1984; Gauthier et al., 1986; Mora, 1987; Crase et al., 1990) whereas the frequency-domain approach was proposed mainly in the 1990s by Pratt and collaborators (Pratt, 1990; Pratt and Worthington, 1990; Pratt and Goulty, 1991), first with application to crosshole data and later with application to wide-aperture surface seismic data (Pratt et al., 1996).

The nonlinearity of FWI has prompted many studies to develop hierarchical multiscale strategies to mitigate this nonlinearity. Apart from computational efficiency, the flexibility offered by the time domain or the frequency domain to implement efficient multiscale strategies is one of the main criteria that favors one domain over the other. The multiscale strategy successively processes data subsets of increasing resolution power to incorporate progressively higher wavenumbers in the tomographic models. In the time domain, Bunks et al. (1995) propose successive inversion of data subsets of increasing high-frequency content because low frequencies are less sensitive to cycle-skipping artifacts. The frequency domain provides a more natural framework for this multiscale approach by performing successive inversions of increasing frequencies. In the frequency domain, a single frequency or multiple frequencies (i.e., a frequency group) can be inverted at a time.

Although a few discrete frequencies theoretically are sufficient to fill the wavenumber spectrum for wide-aperture acquisitions, simultaneous inversion of multiple frequencies improves the signal-to-noise ratio and the robustness of FWI when complex wave phenomena are observed (e.g., guided waves, surface waves, dispersive waves). Therefore, a trade-off between computational efficiency and quality of imaging must be found. When simultaneous multifrequency inversion is performed, the bandwidth of the frequency group ideally must be as large as possible to mitigate the nonlinearity of FWI in terms of the nonuniqueness of the solution, whereas the maximum frequency of the group must be sufficiently low to prevent cycle-skipping artifacts. An illustration of this tuning of FWI is given in Brossier et al. (2009a) in the framework of elastic seismic imaging of complex onshore models from the joint inversion of surface waves and body waves.

The regularization effects introduced by hierarchical inversion of increasing frequencies might not be sufficient to provide reliable FWI results for realistic frequencies and realistic starting models in the case of complex structures. This has prompted some studies to design additional regularization levels in FWI. One of these is to select a subset of arrivals as a function of time. An aim of this time windowing is to remove arrivals that are not predicted by the physics of the wave equation implemented in FWI (for example, PS-converted waves in the frame of acoustic FWI). A second aim is to perform a heuristic selection of aperture angles in the data. Considering a narrow time window centered on the first arrival leads to so-called early-arrival waveform tomography (Sheng et al., 2006). Time windowing the data around the first arrivals is equivalent to selecting the large-aperture components of the data to image the large and intermediate wavelengths of the medium. Alternatively, time windowing can be applied to isolate later-arriving reflections or PS-converted phases to focus on imaging a specific reflector or a specific parameter class, such as the S-wave velocity (Shipp and Singh, 2002; Sears et al., 2008; Brossier et al., 2009a).

The frequency domain is the most appropriate for selecting one or a few frequencies for FWI, but the time domain is the most appropriate for selecting one type of arrival. Indeed, time windowing cannot be applied in frequency-domain modeling, in which only one or a few frequencies are modeled at a time. A last resort is the use of complex-valued frequencies, which is equivalent to an exponential damping of the signal in time (equation 32).
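The equivalence between complex-valued frequencies and exponential time damping can be verified numerically; the following sketch (sign convention exp(-iωt) assumed; all names are illustrative) evaluates the discrete Fourier sum both ways:

```python
import numpy as np

def dft_at(signal, t, omega):
    # Direct evaluation of sum s(t) exp(-i omega t); omega may be complex.
    return np.sum(signal * np.exp(-1j * omega * t))

rng = np.random.default_rng(0)
t = np.linspace(0.0, 4.0, 401)
s = rng.standard_normal(t.size)          # stand-in for a seismogram
tau, omega = 1.5, 2.0 * np.pi * 5.0      # damping constant, 5-Hz angular freq

# Transforming the damped signal s(t)*exp(-t/tau) at the real frequency omega
# is identical to transforming s(t) at the complex frequency omega - i/tau.
damped = dft_at(s * np.exp(-t / tau), t, omega)
complex_freq = dft_at(s, t, omega - 1j / tau)
assert np.allclose(damped, complex_freq)
```

The identity holds term by term in the sum, because exp(-i(ω - i/τ)t) = exp(-iωt) exp(-t/τ).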

A last regularization level can be implemented by layer stripping, in which the imaging proceeds hierarchically from the shallow part to the deep part. Layer stripping in FWI can be applied by combined offset and temporal windowing (Shipp and Singh, 2002; Wang and Rao, 2009).

These three levels of regularization — frequency dependent, time dependent, and offset dependent — can be combined in one integrated multiloop FWI workflow. An example is provided in Shin and Ha (2009) and Brossier et al. (2009a), in which the frequency- and time-dependent regularizations are implemented as two nested loops over frequency groups and time-damping factors. In this approach, the frequencies increase in the outer loop and the damping factors decrease in the inner loop; an illustration is given in Figure 8.
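A minimal sketch of the resulting stage ordering (the function and its arguments are illustrative, not the authors' code):

```python
def multiloop_schedule(frequency_groups, damping_factors):
    """Order of inversion stages for the nested multiscale workflow:
    frequency groups increase in the outer loop while time-damping
    factors decrease in the inner loop, so each frequency band is first
    inverted with heavily damped data (dominated by early arrivals) and
    then with progressively less damped data."""
    return [(group, tau) for group in frequency_groups
            for tau in sorted(damping_factors, reverse=True)]

# Two hypothetical frequency groups (Hz) and three damping factors.
stages = multiloop_schedule([(3, 4), (5, 7)], [0.5, 2.0, 1.0])
```

Each stage uses the model of the previous one as its starting model, so the schedule above fully determines the continuation path of the inversion.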

In summary, the implementation of FWI in the frequency domain allows the easy implementation of multiscale FWI based on the hierarchical inversion of groups of frequencies of arbitrary bandwidth and sampling intervals. Time-domain modeling provides the most flexible framework to apply time windowing of arbitrary geometry. This makes frequency-domain FWI based on time-domain modeling an attractive strategy to design robust FWI algorithms. This is especially true for 3D problems, for which time-domain modeling has several advantages with respect to frequency-domain modeling (Sirgue et al., 2008).

### On the parallel implementation of FWI

FWI algorithms must be implemented in parallel to address large-scale 3D problems. Depending on the numerical technique for solving the forward problem, different parallel strategies can be considered for FWI. If the forward problem is based on numerical methods such as time-domain modeling or iterative solvers, which are not demanding in terms of memory, a coarse-grained parallelism that consists of distributing sources over processors is generally used, and the forward problem is performed sequentially on each processor for each source (Plessix, 2009). If the number of processors significantly exceeds the number of shots, which can be the case if source-encoding techniques are used (Krebs et al., 2009), a second level of parallelism can be introduced by domain decomposition of the physical computational domain. A comprehensive review of different algorithms to efficiently compute the forward and the adjoint wavefields in time-domain FWI is presented by Akcelik (2002).

In contrast, if the forward problem is based on a method that embeds a memory-expensive preprocessing step, such as the LU factorization in the frequency-domain direct-solver approach, parallelism must be based on a domain decomposition of the computational domain. Each processor computes a subdomain of the wavefields for all sources. Examples of such algorithms are described in Ben Hadj Ali et al. (2008a), Brossier et al. (2009a), and Sourbier et al. (2009a, 2009b). A contrast-source inversion (CSI) method is described by Abubakar et al. (2009), which allows a decrease in the number of LU factorizations in frequency-domain FWI at the expense of an increased number of iterations.

### Variants of classic least-squares and Born-approximation FWI

Although the most popular approach of FWI is based on minimizing the least-squares norm of the data misfit on the one hand and on the Born approximation for estimating partial-derivative wavefields on the other, several variants of FWI have been proposed over the last decade. These variants relate to the definition of the minimization criteria, the representation of the data (amplitude, phase, logarithm of the complex-valued data, envelope) in the misfit, or the linearization procedure of the inverse problem.

#### The choice of the minimization criterion

The least-squares norm approach assumes a Gaussian distribution of the misfit (Tarantola, 1987). Poor results can be obtained when this assumption is violated, for example, when large-amplitude outliers affect the data. Therefore, careful quality control of the data must be carried out before least-squares inversion. Crase et al. (1990) investigate several norms, such as the least-squares (ℓ2) norm, the least-absolute-values (ℓ1) norm, and the Cauchy and hyperbolic-secant (sech) criteria.

The ℓ1 norm is less sensitive than the ℓ2 norm to large-amplitude outliers and is therefore more robust when noisy data are inverted.

The Huber norm also combines the ℓ2 and ℓ1 criteria: it behaves as the ℓ2 norm for small residuals and as the ℓ1 norm for large residuals, which makes it robust to outliers while remaining differentiable near zero (Guitton and Symes, 2003).
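As an illustrative implementation (the threshold `eps` and the function name are assumptions, not from the cited works), the Huber criterion can be written as:

```python
import numpy as np

def huber_misfit(residuals, eps):
    """Huber criterion: quadratic (l2-like) for residuals below eps,
    linear (l1-like) above it, so outliers are down-weighted while the
    function stays smooth and differentiable at zero."""
    r = np.abs(residuals)
    quad = 0.5 * r ** 2                 # l2 branch for small residuals
    lin = eps * (r - 0.5 * eps)         # l1 branch; matches quad at r = eps
    return np.sum(np.where(r <= eps, quad, lin))
```

For a residual vector containing one outlier, the Huber misfit grows only linearly with the outlier amplitude, whereas the least-squares misfit grows quadratically.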

#### The choice of the linearization method

The sensitivity matrix is generally computed with the Born approximation, which assumes a linear tangent relationship between the model and wavefield perturbations (Woodward, 1992). This linear relationship between the perturbations can be inferred from the assumption that the wavefield computed in the updated model is the wavefield computed in the starting model plus the perturbation wavefield.

The Rytov approach considers the generalized phase as the wavefield (Woodward, 1992). The Rytov approximation provides a linear relationship between the complex-phase perturbation and the model perturbation by assuming that the wavefield computed in the updated model is related to the wavefield computed in the starting model through u(**x**,ω) = u₀(**x**,ω)exp(δφ(**x**,ω)), where δφ(**x**,ω) denotes the complex-phase perturbation. The sensitivity of the Rytov kernel is zero on the Fermat raypath because the traveltime is stationary along this path. A linear relationship between the model perturbations and the logarithm of the amplitude ratio between the observed and modeled wavefields can be derived in the same framework.
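Side by side, the two linearizations read (a sketch in the text's notation, not the article's numbered equations):

```latex
% Born: additive perturbation of the wavefield
u(\mathbf{x},\omega) = u_0(\mathbf{x},\omega) + \delta u(\mathbf{x},\omega)

% Rytov: multiplicative (complex-phase) perturbation
u(\mathbf{x},\omega) = u_0(\mathbf{x},\omega)\,
                       e^{\delta\phi(\mathbf{x},\omega)},
\qquad
\delta\phi(\mathbf{x},\omega) \approx
\frac{\delta u(\mathbf{x},\omega)}{u_0(\mathbf{x},\omega)}
\quad \text{(first order)}
```

To first order the two agree, since exp(δφ) ≈ 1 + δφ; they differ in how higher-order effects accumulate, which is what distinguishes their domains of validity.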

The Born approximation is valid in the case of weak and small perturbations, but the Rytov approximation is supposed to be valid for large-aperture angles and a small amount of scattering per wavelength, i.e., smooth perturbations or smooth variation in the phase-perturbation gradient (Beydoun and Tarantola, 1988). Although several analyses of the Rytov approximation have been carried out, it remains unclear to what extent its domain of validity differs significantly from that of the Born approximation. A comparison between the Born approximation and the Rytov approximation in the framework of elastic frequency-domain FWI is presented in Gelis et al. (2007). The main advantage of the Rytov approximation might be to provide a natural separation between phase and amplitude (e.g., Woodward, 1992). This separation allows the implementation of phase and amplitude inversions (Bednar et al., 2007; Pyun et al., 2007) from a frequency-domain FWI code using a logarithmic norm (Shin and Min, 2006; Shin et al., 2007), the use of which leads to the Rytov approximation in the framework of local optimization problems.

Another approach in the time-frequency domain is developed by Fichtner et al. (2008) for continental- and global-scale FWI, in which the misfit of the phase and the misfit of the envelopes are minimized in a least-squares sense. The expected benefit from this approach is to mitigate the nonlinearity of FWI by separating the phase and amplitude in the inversion and by inverting the envelope instead of the amplitudes, the former being more linearly related to the data.

### Building starting models for FWI

The ultimate goal in seismic imaging is to be able to apply FWI reliably from scratch, i.e., without the need for sophisticated a priori information. Unfortunately, because multidimensional FWI at present can only be attacked through local optimization approaches, building an accurate starting model for FWI remains one of the most topical issues because very low frequencies (<1 Hz) still cannot be recorded in the framework of controlled-source experiments.

A starting model for FWI can be built by reflection tomography and migration-based velocity analysis such as those used in oil and gas exploration. A review of the tomographic workflow is given in Woodward et al. (2008). Other possible approaches for building accurate starting models, which should tend toward a more automatic procedure and might be more closely related to FWI, are first-arrival traveltime tomography (FATT), stereotomography, and Laplace-domain inversion.

FATT performs nonlinear inversions of first-arrival traveltimes to produce smooth models of the subsurface (e.g., Nolet, 1987; Hole, 1992; Zelt and Barton, 1998). Traveltime residuals are back projected along the raypaths to compute the sensitivity matrix. The tomographic system, augmented with smoothing regularization, generally is solved with a conjugate-gradient algorithm such as LSQR (Paige and Saunders, 1982b). Alternatively, the adjoint-state method can be applied to FATT, which avoids the explicit building of the sensitivity matrix, just as in FWI (Taillandier et al., 2009). The spatial resolution of FATT is estimated to be the width of the first Fresnel zone (Williamson, 1991; Figure 5).

Examples of applications of FWI to real data using a starting model built by FATT are shown, for example, in Ravaut et al. (2004), Operto et al. (2006), Jaiswal et al. (2008, 2009), and Malinowsky and Operto (2008) for surface acquisitions; in Dessa and Pascal (2003) in the framework of ultrasonic experimental data; in Pratt and Goulty (1991) for crosshole data; and in Gao et al. (2006b) for VSP data.

Several blind tests that correspond to surface acquisitions have been tackled by joint FATT and FWI. Results at the oil-exploration scale and at the lithospheric scale are presented in Brenders and Pratt (2007a, 2007b, 2007c) and suggest that very low frequencies and very large offsets are required to obtain reliable FWI results when the starting model is built by FATT. For example, only the upper part of the BP benchmark model was imaged successfully by Brenders and Pratt (2007c) using a starting frequency as small as 0.5 Hz and a maximum offset of 16 km. Another drawback of FATT is that the method is not suitable when low-velocity zones exist because these low-velocity zones create shadow zones.

Reliable picking of first-arrival times is also a difficult issue when low-velocity zones exist. Fitting first-arrival traveltimes does not guarantee that computed traveltimes of later-arriving phases, such as reflections, will match the true reflection traveltimes with an error that does not exceed half a period, especially if anisotropy affects the wavefield. We should stress that FATT can be recast as a phase inversion of the first arrival using a frequency-domain waveform-inversion algorithm within which complex-valued frequencies are implemented (Min and Shin, 2006; Ellefsen, 2009). Compared to FATT based on the high-frequency approximation, this approach helps account for the finite-frequency effect of the data in the sensitivity kernel of the tomography. Judicious selection of the real and imaginary parts of the frequency allows extraction of the phase of the first arrival. The principles and some applications of the method are presented in Min and Shin (2006) and Ellefsen (2009) for near-surface applications. This is strongly related to finite-frequency tomography (Montelli et al., 2004).

Traveltime tomography methods that can manage refraction and reflection traveltimes should provide more consistent starting models for FWI. Among these methods, stereotomography is probably one of the most promising approaches because it exploits the slope of locally coherent events and a reliable semiautomatic picking procedure has been developed (Lambaré, 2008). Applications of stereotomography to synthetic and real data sets are presented in Billette and Lambaré (1998), Alerini et al. (2002), Billette et al. (2003), Lambaré and Alérini (2005), and Dummong et al. (2008).

To illustrate the sensitivity of FWI to the accuracy of different starting models, Figure 10 shows FWI reconstructions of the synthetic Valhall model when the starting model is built by FATT and reflection stereotomography (Prieux et al., 2009). In the case of stereotomography, the maximum offset is 9 km and only the reflection traveltimes are used (Lambaré and Alérini, 2005), whereas the maximum offset is 32 km for FATT (Prieux et al., 2009). Stereotomography successfully reconstructs the large wavelength within the gas cloud down to a maximum depth of 2.5 km; FATT fails to reconstruct the large wavelengths of the low-velocity zone associated with the gas cloud. However, the FWI model inferred from the FATT starting model shows an accurate reconstruction of the shallow part of the model. These results suggest that joint inversion of refraction and reflection traveltimes by stereotomography can provide a promising framework to build starting models for FWI.

A third approach to building a starting model for FWI can be provided by Laplace-domain and Laplace-Fourier-domain inversions (Shin and Cha, 2008, 2009; Shin and Ha, 2008). The Laplace-domain inversion can be viewed as a frequency-domain waveform inversion using complex-valued frequencies (see equation 32), the real part of which is zero and the imaginary part of which controls the time damping of the seismic wavefield. In other words, the principle is the inversion of the DC component of damped seismograms, where the Laplace variable *s* corresponds to the damping constant of the decaying exponential applied to the seismograms.

Inversion in the Laplace-Fourier domain is equivalent to performing an inversion of seismograms damped in time. The results shown in Shin and Cha (2009) suggest that this method can be applied to frequencies smaller than the minimum frequency of the source bandwidth. The ability of the method to use frequencies smaller than those effectively propagated by the seismic source is called a mirage resurrection of the low frequencies by Shin and Cha (2009). An application to real data from the Gulf of Mexico is shown in Shin and Cha (2009). For the real-data application, frequencies between 0 and 2 Hz in combination with nine Laplace damping constants are used for the Laplace-Fourier-domain inversion, the final model of which is used as the starting model for standard frequency-domain FWI.

Joint application of Laplace-domain, Laplace-Fourier-domain and Fourier-domain FWI to the BP benchmark model is illustrated in Figure 11 (Shin and Cha, 2009). The starting model is a simple velocity-gradient model (Figure 11b). A first velocity model of the large wavelengths is obtained by Laplace-domain inversion (Figure 11c), which is subsequently used as a starting model for inversion in the Laplace-Fourier-domain inversion, the final model of which is shown in Figure 11d. During this stage, the starting frequency used in the inversion of the damped data is as low as 0.01 Hz. The final model obtained after frequency-domain FWI is shown in Figure 11e. All of the structures were successfully imaged, beginning with a very crude starting model.

## CASE STUDIES

FWI has been applied essentially to synthetic examples, for which high-resolution images have been constructed. Because FWI is an attractive approach, the number of real-data applications has increased quite rapidly, from monoparameter reconstructions of the P-wave velocity to multiparameter reconstructions.

### Monoparameter acoustic FWI

Most of the recent real-data case studies of FWI at different scales and for different acquisition geometries have been performed in the acoustic isotropic approximation, considering only the P-wave velocity.

Specific waveform-inversion data processing generally is designed to account for the amplitude errors introduced by acoustic modeling (Pratt, 1999; Ravaut et al., 2004; Brenders and Pratt, 2007b). The amplitude discrepancies in the P-wavefield result from incorrectly modeling the amplitude-variation-with-offset (AVO) effects and incorrectly modeling the directivity of the source and receiver (Brenders and Pratt, 2007b). Acoustic-wave modeling generally is based on resolving the acoustic-wave equation in pressure; therefore, the particle-velocity synthetic wavefields might not be computed (Hustedt et al., 2004). If the receivers are geophones, the physical measurements collected in the field (particle velocities) are not the same as those computed by the seismic modeling engine (pressure). A match between the vertical geophone data and the pressure synthetics can, however, be performed by exploiting the reciprocity of the Green's functions if the sources are explosions (Operto et al., 2006).

In contrast, if the sources and receivers are directional, the pressure wavefield cannot account for the directivity of the sources and receivers, and heuristic amplitude corrections must be applied before inversion. Brenders and Pratt (2007b) propose optimizing an empirical correction law for the decay of the rms amplitudes with offset. Applying this correction law to the modeled data matches the main AVO trend of the observed data before FWI. Using this data preprocessing, Brenders and Pratt (2007b) successfully image the onshore lithospheric model of the CCSS blind test (Zelt et al., 2005) by acoustic FWI of synthetic elastic data. This strategy is also used successfully by Jaiswal et al. (2008, 2009) in the framework of acoustic FWI of real onshore data in the Naga thrust and fold belt in India.

Successful application of acoustic FWI to synthetic elastic data computed in the marine Valhall model from an OBC acquisition is presented by Brossier et al. (2009b). An application of acoustic FWI to real onshore long-offset data recorded in the southern Apennines thrust belt is illustrated in Figure 12 (Ravaut et al., 2004). The velocity model is validated locally by comparison with a VSP log. Application of PSDM using the final FWI model as a starting model contributes to the validation of the relevance of the velocity structure reconstructed by FWI (Operto et al., 2004). Some guidelines based on numerical examples of the domain of validity of acoustic FWI applied to elastic data are also provided in Barnes and Charara (2008).

### Multiparameter FWI

Because FWI accounts for the full wavefield, the seismic modeling embedded in the FWI algorithm theoretically should honor as far as possible all of the physics of wave propagation. This is especially required by FWI of wide-aperture data, in which significant AVO and azimuthal anisotropic effects should be observed in the data. The requirement of realistic seismic modeling has prompted some studies to extend monoparameter acoustic FWI to account for parameter classes other than the P-wave velocity, such as density, attenuation, shear-wave velocity or related parameters, and anisotropy. Taking additional parameter classes into account in FWI increases the ill-posedness of the inverse problem because more degrees of freedom are considered in the parameterization and because the sensitivity of the inversion can change significantly from one parameter class to the next.

Different parameter classes can be more or less coupled as a function of the aperture angle. This coupling can be assessed by plotting the radiation pattern of each parameter class using asymptotic analyses. Diffraction patterns of different combinations of parameters for the acoustic, elastic, and viscoelastic wave equations are shown in Wu and Aki (1985), Tarantola (1986), Ribodetti and Virieux (1996), and Forgues and Lambaré (1997). An alternative is to plot the sensitivity kernel, i.e., the kernel obtained by summing all of the rows of the sensitivity matrix for the full acquisition and for different combinations of parameters, and to qualitatively assess which combination provides the best image (Luo et al., 2009). Hierarchical strategies that successively operate on different parameter classes should be designed to mitigate the ill-posedness of FWI (Tarantola, 1986; Kamei and Pratt, 2008; Sears et al., 2008; Brossier et al., 2009b).

#### Density

Density is difficult to reconstruct (Forgues and Lambaré, 1997). As an illustration, acoustic radiation patterns for different combinations of parameters are shown in Figure 13.
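These radiation patterns can be sketched with a few lines of code. The following is a minimal illustration (ours, not the paper's Figure 13) of the classical acoustic Born patterns, in which the bulk-modulus term is isotropic and the density term varies with the aperture angle (Wu and Aki, 1985); the change of variables δκ/κ = 2δVP/VP + δρ/ρ gives the patterns in the (VP, ρ) parameterization:

```python
import numpy as np

def radiation_patterns(theta):
    """Acoustic Born radiation patterns versus aperture angle theta
    (theta = 0: backscattering / zero offset; theta = pi: forward
    scattering / transmission).

    In the (kappa, rho) parameterization, the bulk-modulus term is
    isotropic and the density term varies as cos(theta), with signs
    chosen so that impedance-type perturbations scatter backward and
    velocity-type perturbations forward (Wu and Aki, 1985).
    Substituting d(kappa)/kappa = 2 dVp/Vp + d(rho)/rho yields the
    (Vp, rho) patterns.
    """
    kappa = np.ones_like(theta)       # bulk modulus: isotropic
    rho_kr = np.cos(theta)            # density in (kappa, rho)
    vp = 2.0 * np.ones_like(theta)    # velocity in (Vp, rho): isotropic
    rho_vr = 1.0 + np.cos(theta)      # density in (Vp, rho)
    return kappa, rho_kr, vp, rho_vr

theta = np.linspace(0.0, np.pi, 181)
kappa, rho_kr, vp, rho_vr = radiation_patterns(theta)

# At short apertures (theta near 0), the velocity and density
# patterns in (Vp, rho) are both nonzero and nearly proportional,
# so the two parameters are coupled; at wide apertures the density
# pattern 1 + cos(theta) vanishes, which is one way to see why
# density is poorly constrained.
```

The design choice here is to tabulate the patterns as functions of a single aperture angle, which is exactly the asymptotic analysis the text refers to for assessing parameter coupling.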

A successful reconstruction of the density parameter for the Marmousi case study is presented by Choi et al. (2008). However, the use of an unrealistically low frequency (0.125 Hz) brings into question the practical implications of these results.

#### Attenuation

The attenuation reconstruction can be implemented in frequency-domain seismic-wave modeling using complex velocities (Toksöz and Johnston, 1981). The most commonly applied attenuation/dispersion model is referred to as the Kolsky-Futterman model (Kolsky, 1956; Futterman, 1962). This model has linear frequency dependence of the attenuation coefficient, whereas the deviation from constant-phase velocity is accounted for through a term that varies as the logarithm of the frequency. This model is obtained by imposing causality on the wave pulse and assuming the absorption coefficient is strictly proportional to the frequency over a restricted range of frequencies.

Using the Kolsky-Futterman model, the complex velocity can be written, to first order in 1/*Q*, as c̄(ω) = *c*(1 + ln(ω/ω_r)/(π*Q*))(1 + *i*/(2*Q*)), where *c* is the phase velocity at the reference frequency ω_r, *Q* is the attenuation factor, and the sign of the imaginary part depends on the chosen Fourier-transform convention.
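As an illustration, the Kolsky-Futterman complex velocity can be evaluated numerically. The sketch below (ours) uses the common first-order form c̄(ω) = c(1 + ln(ω/ω_r)/(πQ))(1 + i/(2Q)), where c is the phase velocity at the reference frequency ω_r; the sign of the imaginary part depends on the chosen Fourier-transform convention:

```python
import numpy as np

def kolsky_futterman_velocity(omega, c_ref, Q, omega_ref):
    """Complex velocity of the Kolsky-Futterman attenuation model.

    To first order in 1/Q (for one Fourier sign convention), the
    phase velocity disperses logarithmically with frequency and the
    imaginary part, which controls amplitude decay, is proportional
    to 1/(2Q):

        c_bar(w) = c_ref * (1 + ln(w/w_ref)/(pi*Q)) * (1 + i/(2Q))

    c_ref is the phase velocity at the reference frequency omega_ref.
    """
    dispersion = 1.0 + np.log(omega / omega_ref) / (np.pi * Q)
    return c_ref * dispersion * (1.0 + 1j / (2.0 * Q))

# Example: 2000 m/s at the 10-Hz reference frequency, with Q = 50.
w_ref = 2.0 * np.pi * 10.0
w = 2.0 * np.pi * np.array([5.0, 10.0, 20.0])
c = kolsky_futterman_velocity(w, 2000.0, 50.0, w_ref)

# The real part increases weakly with frequency (causal dispersion),
# and the ratio Im(c)/Re(c) equals 1/(2Q) at every frequency.
```

In frequency-domain FWI, the real and imaginary parts returned here are exactly the two real quantities that can be treated as independent model parameters.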

In frequency-domain FWI, the real and imaginary parts of the velocity can be processed as two independent real model parameters without any particular implementation difficulty. However, the reconstruction of attenuation is an ill-posed problem. Ribodetti et al. (2000) show that in the framework of ray + Born inversion, P-wave velocity and *Q* are uncoupled (i.e., the Hessian is nonsingular) only if a reflector is illuminated from both sides (with a circular acquisition as in medical imaging; see Figure 5b of Ribodetti et al., 2000). In contrast, the Hessian becomes singular for a surface acquisition. Mulder and Hak (2009) also conclude that velocity and *Q* cannot be imaged simultaneously from short-aperture data because the Hilbert transform with respect to depth of the complex-valued perturbation model for *Q* produces, in the framework of the Born approximation, the same wavefield perturbation as the original perturbation model. Despite this theoretical limitation, preconditioning of the Hessian is investigated by Hak and Mulder (2008) to improve the convergence of the joint inversion for velocity and *Q*.

Assessment and application of viscoacoustic frequency-domain FWI is presented at various scales by Liao and McMechan (1995, 1996), Song et al. (1995), Hicks and Pratt (2001), Pratt et al. (2005), Malinowsky et al. (2007), and Smithyman et al. (2008). Kamei and Pratt (2008) recommend inversion for *Q* in a second step because the reliability of the attenuation reconstruction strongly depends on the accuracy of the starting *Q* so that the inversion can discriminate the intrinsic attenuation from the extrinsic attenuation.

#### Elastic parameters

A limited number of applications of elastic FWI have been proposed, because accounting for elastic effects substantially increases both the computational cost of seismic modeling and the ill-posedness of the multiparameter inversion.

In contrast, Brossier et al. (2009a) conclude that joint inversion for the P- and S-wave velocities can be viable, provided that hierarchical multiscale strategies are used to mitigate the nonlinearity of the inversion.

A recent application of elastic FWI to a gas field in China is presented by Shi et al. (2007). They invert for the Lamé parameters and unambiguously image Poisson's ratio anomalies associated with the presence of gas. They accelerate the convergence of the inversion by computing an efficient step length using an adaptive controller based on the theory of model-reference nonlinear control. Several logs available along the profile confirm the reliability of this gas-layer reconstruction.

#### Anisotropy

Reconstruction of anisotropic parameters by FWI is probably one of the most undeveloped and challenging fields of investigation. Vertically transverse isotropy (VTI) or tilted transversely isotropic (TTI) media are generally considered a realistic representation of geologic media in oil and gas exploration, although fractured media require an orthorhombic description (Tsvankin, 2001). P-wave time processing in VTI media depends on only two parameters: the NMO velocity for a horizontal reflector and the anellipticity parameter η.
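For reference, these two parameters are commonly expressed in terms of the vertical P-wave velocity and the Thomsen parameters. The sketch below (ours) implements the standard relations V_NMO = V_P0·sqrt(1 + 2δ) and η = (ε − δ)/(1 + 2δ):

```python
import math

def vti_time_processing_params(vp0, epsilon, delta):
    """NMO velocity and anellipticity parameter of a VTI medium.

    Standard relations, in terms of the vertical P-wave velocity vp0
    and the Thomsen parameters epsilon and delta:

        V_nmo = vp0 * sqrt(1 + 2*delta)
        eta   = (epsilon - delta) / (1 + 2*delta)
    """
    v_nmo = vp0 * math.sqrt(1.0 + 2.0 * delta)
    eta = (epsilon - delta) / (1.0 + 2.0 * delta)
    return v_nmo, eta

# Hypothetical values for illustration only.
v_nmo, eta = vti_time_processing_params(3000.0, 0.2, 0.1)

# Elliptical anisotropy (epsilon == delta) gives eta = 0, in which
# case short-spread NMO alone cannot reveal the anisotropy.
v_ell, eta_ell = vti_time_processing_params(3000.0, 0.1, 0.1)
```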

A feasibility study of FWI in VTI media for crosswell acquisitions is presented in Barnes et al. (2008). They invert for five parameter classes describing the VTI medium.

Further investigations of anisotropic FWI in the case of surface seismic data are required. In particular, the benefit of wide apertures in resolving as many anisotropic parameters as possible needs to be investigated (Jones et al., 1999).

### Three-dimensional FWI

Because of the continuous increase in computational power and the evolution of acquisition systems toward wide-aperture and wide-azimuth acquisition, 3D acoustic FWI is feasible today. In three dimensions, the computational burden of multisource seismic modeling is one of the main issues. The pros and cons of time-domain versus frequency-domain seismic modeling for FWI have been discussed. Another issue is assessing the impact of azimuth coverage on FWI. Sirgue et al. (2007) show the footprint of the azimuth coverage in 3D surveys on the velocity model reconstructed by FWI. Their imaging confirms the importance of wide-azimuth surveys for FWI of coarse acquisitions such as node surveys.

Most 3D FWI applications have been limited to low frequencies (<7 Hz). At these frequencies, FWI can be seen as a tool for velocity-model-building rather than a self-contained seismic-imaging tool that continuously proceeds from the velocity-model-building task to the migration task through the continuous sampling of wavenumbers (Ben Hadj Ali et al., 2008b).

Ben Hadj Ali et al. (2008a) apply a frequency-domain FWI algorithm to a series of synthetic data sets. The forward problem is solved in the frequency domain with a massively parallel direct solver. Plessix (2009) presents an application to real ocean-bottom-seismometer (OBS) data. Seismic modeling is performed in the frequency domain with an iterative solver. The inverted frequencies range between 2 and 5 Hz. The inversion converges to a similar velocity model down to the top of a salt body using two different starting velocity models, with one a simple velocity-gradient model. An application to a real 3D streamer data set is presented by Warner et al. (2008) for the imaging of a shallow channel. They perform seismic modeling in the frequency domain using an iterative solver (Warner et al., 2007).

Three-dimensional time-domain FWI is developed by Vigh and Starr (2008a), in which the input data are plane-wave gathers rather than shot gathers. The main motivation behind the use of plane-wave gathers is to mitigate the computational burden by decimating the volume of data. The computational cost is reduced by one order of magnitude for 2D applications and by a factor of 3 for 3D applications when the plane-wave-based approach is used instead of the shot-based approach.
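Plane-wave gathers are built by delayed superposition of the point-source shots. The following minimal sketch (ours, with a hypothetical acquisition geometry and ray parameter) shows the frequency-domain source weights, where the linear delay τ = p·x becomes a phase shift:

```python
import numpy as np

def plane_wave_source(shot_x, ray_parameter, omega):
    """Frequency-domain weights assembling point sources into one
    plane-wave source of horizontal slowness p (ray_parameter, s/m):
    each shot at position x is delayed by tau = p * x, i.e.,
    multiplied by exp(-i * omega * p * x).
    """
    return np.exp(-1j * omega * ray_parameter * shot_x)

# Hypothetical geometry: 200 shots at 25-m spacing, 6-Hz component.
shot_x = 25.0 * np.arange(200)
omega = 2.0 * np.pi * 6.0
w = plane_wave_source(shot_x, ray_parameter=2e-4, omega=omega)

# Because modeling is linear in the source, ONE simulation driven by
# the weighted source sum_k w[k] * s_k yields the plane-wave gather
# that would otherwise require 200 point-source simulations; a few
# tens of ray parameters can then replace hundreds of shots, which
# is where the data decimation and cost reduction come from.
```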

Sirgue et al. (2009) apply 3D frequency-domain FWI to the hydrophone component of OBC data from the shallow-water Valhall field. Frequencies between 3.5 and 7 Hz are inverted successively, using a starting model built by ray-based reflection tomography. They successfully image a complex network of shallow channels at 150 m depth and a gas cloud between 1000 and 2500 m depth. Although the spacing between the cables is as high as 300 m, a limited footprint of the acquisition is visible in the reconstructed models. Comparisons of depth-migrated sections computed from the reflection tomography model and the FWI velocity model show the improvements provided by FWI, both in the shallow structure and at the reservoir level below the gas cloud. The step-change improvement in the quality of the depth-migrated image results from the high-resolution nature of the velocity model from FWI and the accounting of the intrabed multiples by the two-way wave-equation modeling engine. Comparisons between depth slices across the channels and the gas cloud extracted from the reflection tomography and the FWI models highlight the significant resolution improvement provided by FWI (Figure 14).

Solving large-scale 3D elastic problems is probably beyond our current tools because of the computational burden of 3D elastic modeling for many sources. This has prompted several studies to develop strategies to mitigate the number of forward simulations required during migration or FWI of large data sets. One of these approaches stacks the seismic sources before modeling (Capdeville et al., 2005). Because the relationship between the seismic wavefield and the source is linear, stacking sources is equivalent to emitting each source simultaneously instead of independently. This assemblage generates artifacts in the imaging that arise from the undesired correlation between each independent source wavefield with the back-propagated residual wavefields associated with the other sources of the stack. Applying some random-phase encoding to each source before assemblage mitigates these artifacts. The method originally was applied to migration algorithms (Romero et al., 2000). Promising applications to time- and frequency-domain FWI are presented in Krebs et al. (2009) and Ben Hadj Ali et al. (2009a, 2009b). Alternatively, Herrmann et al. (2009) propose to recover the source-separated wavefields from the simultaneous simulation before FWI.
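The cross-talk argument can be illustrated with a toy numerical experiment (ours, not taken from the cited papers): with random ±1 source codes, the gradient computed from one encoded supershot equals the true stacked gradient plus zero-mean cross terms, which averaging over independent encodings suppresses.

```python
import numpy as np

rng = np.random.default_rng(0)
ns, n = 50, 400                    # number of sources, model size
u = rng.standard_normal((ns, n))   # toy incident wavefields, one per source
v = rng.standard_normal((ns, n))   # toy back-propagated residual wavefields

# Sequential-source gradient: correlate each source's fields, then stack.
g_true = (u * v).sum(axis=0)

def encoded_gradient(rng):
    """Gradient from ONE simulation of an encoded supershot.

    With codes a_k in {-1, +1}, the assembled fields are
    U = sum_k a_k u_k and V = sum_k a_k v_k, and their correlation is
    U*V = sum_i a_i**2 u_i v_i + sum_{i!=j} a_i a_j u_i v_j
        = g_true + cross-talk,
    since a_i**2 = 1 and E[a_i a_j] = 0 for i != j.
    """
    a = rng.choice([-1.0, 1.0], size=ns)
    U = (a[:, None] * u).sum(axis=0)
    V = (a[:, None] * v).sum(axis=0)
    return U * V

# Redrawing the codes (e.g., at each iteration) and averaging
# suppresses the zero-mean cross-talk.
g_avg = np.mean([encoded_gradient(rng) for _ in range(200)], axis=0)
err_one = np.linalg.norm(encoded_gradient(rng) - g_true)
err_avg = np.linalg.norm(g_avg - g_true)
```

The cross-talk in `err_avg` decays roughly as the inverse square root of the number of encodings, which is the statistical mechanism behind redrawing the phase codes at each FWI iteration.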

An illustration of the source-encoding technique is provided in Figure 15, in which a dip section of the overthrust model is built three ways: by conventional frequency-domain FWI (i.e., without source assemblage; Figure 15b), by FWI with source assemblage but without phase encoding (Figure 15c), and by FWI with source assemblage and phase encoding (Figure 15d). The models were obtained by successive inversions of four groups of two frequencies ranging between 3.5 and 20 Hz. The number of shots was 200, and no noise was added to the data. The number of iterations per frequency group to obtain the final FWI models without and with the source assemblage was 15 and 200, respectively. The time to compute the model of Figure 15d was seven times less than the time to build the model of Figure 15b. More details can be found in Ben Hadj Ali et al. (2009). If the phase-encoding technique proves sufficiently robust, especially in the presence of noise, 3D elastic FWI with sophisticated modeling engines can be envisioned in the near future.

## DISCUSSION

Most of the FWI methods presented and assessed in the literature are based on the local least-squares optimization formulation, in which the misfit between the observed and modeled seismograms is minimized in the time domain or in the frequency domain. Without very low frequencies (<1 Hz), it remains very difficult to obtain reliable results from these approaches when considering real data, especially at high frequencies. Clearly, new formulations of FWI are needed to proceed further.

Recent improvements relate to Hessian estimation, which has been shown to be quite important for better convergence toward the solution. A systematic strategy since the beginning of FWI investigations has been the progressive introduction of the entire content of seismograms through multiscale approaches to partially prevent the convergence toward secondary minima. Transforming the data at the different stages of the local inversion procedure (particle displacement, particle velocity, logarithms of these quantities, divergence) can also provide benefits for global convergence of the optimization procedure.

Do we need to mitigate the nonlinearity of the inverse problem through more sophisticated search strategies such as semiglobal approaches or even global sampling of the model space? Because this search is quite computationally intensive, should we reduce the dimensionality of the parameter models? To perform such global searches, should we speed up the forward model at the expense of its precision? If so, can we consider these procedures as intermediate strategies for approaching the global minimum?

A pragmatic workflow, from low to high spatial-frequency content, should be expected to move hierarchically from imaging the large medium wavelengths to the short medium wavelengths. A step-by-step procedure for extracting the information, starting from traveltimes and proceeding to true amplitudes, might include many intermediate steps for interpreting new observables (polarization, apparent velocity, envelope).

Another aspect is the quality control of a reconstruction. An objective analysis for uncertainty estimation is important and might rely on semiglobal investigations once the global minimum has been found. Various strategies can be considered that would be manageable as statistical procedures (bootstrapping or jackknifing techniques) or as local nondifferential approaches (simplex, simulated annealing, genetic algorithms).

Finally, solving large-scale 3D elastic problems remains beyond our present tools, although one must be aware that such capabilities are right around the corner. Because massive data acquisition for 3D reconstruction has been achieved, we expect improvements in our data-crunching capacity for high-resolution imaging.

An appealing reconstruction is 4D imaging, which is based on time-lapse evolution of targets inside the earth. Differential data are available, providing us with new information for tracking the evolution of the subsurface parameters. Thus, fluid tracking and variations in solid-rock matrices are possible challenges for FWI in the near future.

## CONCLUSION

FWI is the last-course procedure for extracting information from seismograms. We have shown the conceptual efforts that have been carried out over the last 30 years to provide FWI as a possible tool for high-resolution imaging. These efforts have been focused on development of large-scale numerical optimization techniques, efficient resolution of the two-way wave equation, judicious model parameterization for multiparameter reconstruction, multiscale strategies to mitigate the ill-posedness of FWI, and specific waveform-inversion data preprocessing.

FWI is mature enough today for prototype application to 3D real data sets. Although applications to 3D real data have shown promising results at low frequencies (<7 Hz), it is still unclear to what extent FWI can be applied efficiently at higher frequencies. To answer this question, a more quantitative understanding of FWI sensitivity to the accuracy of the starting model, to the noise, and to the amplitude accuracies is probably required.

If FWI remains limited to low frequencies, it will remain a tool to build background models for migration. In the opposite case, FWI will tend toward a self-contained processing workflow that can reunify macromodel building and migration tasks.

The present is exciting because realistic applications are becoming possible right now. However, new strategies must be found to make this technique as attractive as the scientific issues require. Fields of investigation should address speeding up the forward problem by means of new hardware (GPUs) and software (compressive sensing), defining new minimization criteria in the model and data spaces, and incorporating more sophisticated wave phenomena (attenuation, elasticity, anisotropy) in modeling and inversion.

## Acknowledgments

We thank L. Sirgue and O. I. Barkved from BP for providing Figure 14 and for permission to present it. L. Sirgue is acknowledged for editing part of the manuscript. We thank C. Shin (Seoul National University) and Y. H. Cha (Seoul National University) for providing Figure 11. We thank A. Ribodetti (Géosciences Azur-IRD) for discussion on attenuation reconstruction. We are grateful to R. Brossier (Géosciences Azur-UNSA) and H. Ben Hadj Ali (Géosciences Azur-UNSA) for providing Figures 8, 9, and 15 and for fruitful discussions on many aspects of FWI discussed in this study, in particular multiparameter FWI, norms, Rytov approximation, and source encoding. We would like to thank R. E. Plessix (Shell) for his explanations of the adjoint-state method. J. Virieux thanks E. Canet (LGIT) and A. Sieminski (LGIT) for an interesting discussion on the inverse problem and its relationship with assimilation techniques. We would like to thank S. Buske (University of Berlin), T. Nemeth (Chevron), I. Lecomte (NORSAR), and V. Sallares (CMIMA-CSIC Barcelona) for their encouragement during the preparation of the manuscript. Finally, we thank I. Lecomte and T. Nemeth for the time they dedicated to the review of the manuscript. We thank the sponsors of the SEISCOPE consortium (BP, CGGVeritas, ExxonMobil, Shell, and Total) for their support.

## APPENDIX A

### APPLICATION OF THE ADJOINT-STATE METHOD TO FWI

In this appendix, we provide some guidelines for the derivation of the gradient of the misfit function (equation 24) with the adjoint-state method and Lagrange multipliers. The reader is referred to Nocedal and Wright (1999) for a review of constrained optimization and to Plessix (2006) for a review of the adjoint-state method and its application to FWI.

First, we introduce the Lagrangian function *L* augmented with equality constraints:

*L*(**u**, λ, **m**) = *C*(**u**) − ⟨λ, **B**(**m**)**u** − **s**⟩, (A-1)

where the solution **u** of the state equation **B**(**m**)**u** = **s** is the so-called state variable. In equation A-1, we introduce the variable **u** as an independent unknown: **u** is not assumed to satisfy the state equation, which is instead enforced through the constraint.

The vector λ, the dimension of which is that of the wavefield **u**, is the Lagrange multiplier; it corresponds to the adjoint-state variable. In the framework of the theory of constrained optimization, the first-order necessary optimality conditions, known as the Karush-Kuhn-Tucker (KKT) conditions, state that the solution of the optimization problem is obtained at the stationary points of *L*:

∂*L*/∂λ = 0,  ∂*L*/∂**u** = 0,  ∂*L*/∂**m** = 0. (A-2)

The first condition, ∂*L*/∂λ = 0, restores the state equation **B**(**m**)**u** = **s**. Resolving the state equation for the current model **m** provides the incident wavefield **u**.

The second condition, ∂*L*/∂**u** = 0, gives the adjoint-state equation **B**(**m**)^† λ = ∂*C*/∂**u**, the right-hand side of which is built from the data residuals; its solution λ is the back-propagated residual wavefield.

The third condition, ∂*L*/∂**m** = 0, provides the gradient of the misfit function *C*. At a point (**u**, λ) satisfying the state and adjoint-state equations, *L* coincides with *C* as a function of the state variable and adjoint-state variable; differentiating *L* with respect to **m** at this point therefore gives

∇_**m** *C* = −Re{λ^† (∂**B**(**m**)/∂**m**) **u**}, (A-3)

which is the gradient of the misfit function *C* sought for in FWI.
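The three KKT conditions can be verified on a toy discrete problem. The sketch below (our illustration; the matrices are arbitrary) takes a least-squares misfit C(m) = ½‖Ru − d‖², a real-valued state equation B(m)u = s with B depending linearly on two parameters, solves the adjoint equation Bᵀλ = Rᵀ(Ru − d), forms the gradient components −λᵀ(∂B/∂m_k)u, and checks them against finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 6, 2
B0 = 5.0 * np.eye(n) + 0.1 * rng.standard_normal((n, n))   # keeps B(m) invertible
dB = [rng.standard_normal((n, n)) for _ in range(p)]       # dB/dm_k
R = rng.standard_normal((3, n))    # observation (sampling) operator
s = rng.standard_normal(n)         # source term
d = rng.standard_normal(3)         # "observed" data

def B(m):
    return B0 + sum(mk * dBk for mk, dBk in zip(m, dB))

def misfit(m):
    u = np.linalg.solve(B(m), s)          # state equation B(m)u = s
    r = R @ u - d
    return 0.5 * r @ r

def adjoint_gradient(m):
    A = B(m)
    u = np.linalg.solve(A, s)                          # first KKT condition
    lam = np.linalg.solve(A.T, R.T @ (R @ u - d))      # second KKT condition
    # third KKT condition: dC/dm_k = -lam^T (dB/dm_k) u
    return np.array([-(lam @ (dBk @ u)) for dBk in dB])

m = np.array([0.3, -0.2])
g = adjoint_gradient(m)

# Finite-difference reference gradient.
eps = 1e-6
g_fd = np.array([(misfit(m + eps * e) - misfit(m - eps * e)) / (2 * eps)
                 for e in np.eye(p)])
```

Two linear solves per model update, regardless of the number of model parameters, is exactly the economy that makes the adjoint-state method attractive for FWI.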

- Received June 7, 2009.
- Revision requested July 8, 2009.
- Revision received December 17, 2009.