5.4. Numerical Methods and FFT Convolution

Objectives: Describe the numerical distribution representation and FFT convolution algorithm that underlie all computations in aggregate.

Audience: User who wants a detailed understanding of the computational algorithms, potential errors, options, and parameters.

Prerequisites: Probability theory and aggregate distributions; complex numbers and matrix multiplication; numerical analysis, especially numerical integration; basics of Fourier transforms and series helpful.

See also: User Guides.

Contents:

5.4.1. Helpful References

Actuarial and operational risk books and papers

  • Gerber [1982]

  • Bühlmann [1984]

  • Embrechts et al. [1993]

  • Wang [1998]

  • Grübel and Hermesmeier [1999]

  • Mildenhall [2005]

  • Schaller and Temnov [2008]

  • Kaas et al. [2008]

  • Embrechts and Frei [2009]

  • Shevchenko [2010]

Books on probability covering characteristic functions, \(t\mapsto \mathsf E[e^{itX}]\)

Books on Fourier analysis and Fourier transforms, \(t\mapsto \mathsf E[e^{-2\pi itX}]\), the same concept with slightly different notation. Malliavin is a sophisticated treatment of both Fourier analysis and probability.

5.4.2. Overview

5.4.2.1. A Trilemma

Numerical analysts face a trilemma: they can pick two of fast, accurate, or flexible. Simulation is flexible, but trades speed against accuracy: fewer simulations are faster but less accurate; more simulations improve accuracy at the cost of speed. aggregate takes the third trilemma option, using a fast Fourier transform (FFT) convolution algorithm that delivers speed and accuracy but with less flexibility than simulation.

5.4.2.2. The aggregate Convolution Algorithm

Complaint It is the most natural thing in the world to decompose an oscillating electrical signal into sines and cosines of various frequencies via the Fourier transform - but probabilities, no. Basically, these are positive numbers adding up to 1, and what have sines and cosines to do with that? Indeed, in many applications done first by Fourier, a simpler, more understandable proof may emerge upon taking Fourier away. Still, the Fourier transform is a very effective, sometimes indispensable technical tool in much of our business here. [McKean, 2014]

This section describes the core convolution algorithm implemented in aggregate. It is an application where the caveat in McKean’s complaint holds true. The use of Fourier transform methods is unnatural but very effective. As usual, let \(A=X_1 + \cdots + X_N\) where the severities \(X_i\) are iid and independent of the claim count \(N\). We want to explicitly compute the distribution

\[\begin{split}\Pr(A < a) &= \sum_n \Pr(A < a \mid N=n)\Pr(N=n) \\ &= \sum_n \Pr(X^{*n} < a)\Pr(N=n).\end{split}\]

Here, \(X^{*n}\) denotes the sum of \(n\) independent copies of \(X\). This problem is usually hard to solve analytically. For example, if \(X\) is lognormal then there is no closed form expression for the distribution of \(X^{*2} \sim X_1 + X_2\). However, things are more promising in the Fourier domain. The characteristic function of \(A\) can be written, using independence, as

\[\begin{split}\phi_A(t) :&= \mathsf E[e^{itA}] \\ &= \mathsf E[\mathsf E[ e^{itA} \mid N]] \\ &= \mathsf E[\mathsf E[ e^{itX}]^N] \\ &= \mathscr P_N(\phi_X(t))\end{split}\]

where \(\mathscr P_N(z) = \mathsf E[z^N]\) is the probability generating function. The pgfs of most common frequency distributions are known. For example, if \(N\) is Poisson with mean \(\lambda\) then it is easy to see that \(\mathscr P_N(z) = \exp(\lambda(z-1))\).

Note

The algorithm assumes the pgf can be written as an explicit function. That, in turn, implies that frequency is thin tailed (it is not subexponential). This is generally not a problem because of how insurance events are defined. Losses from a catastrophe event, which can produce a large number of claims, are combined into a single occurrence. As a result, we usually model a small, thin-tailed number of events with a thick-tailed severity distribution. For example, there are approximately 1.75 US landfalling hurricanes per year, and the distribution of events since 1850 is well-fit by a (thin-tailed) Poisson distribution.

Knowing the characteristic function is useful because it can be inverted to determine the density of \(A\). Subject to certain terms and conditions, described below, these arguments can be carried out in a finite discrete setting, which is helpful for two reasons. First, it makes the problem tractable for a digital computer, and second, many \(A\) that arise in insurance problems are discrete or mixed, because policy limits introduce mass points. For these reasons, we model a discrete approximation to \(A\), see also Digital Representation of Distributions.

In a discrete approximation, the function \(\phi_X(t)\) is replaced by a vector sample computed either by taking the discrete Fourier transform (DFT) of a discretized approximation to the distribution of \(X\) or by directly sampling \(\phi_X\). We usually do the former, computing the DFT using the Fast Fourier transform (FFT) algorithm. However, certain \(X\), such as stable distributions, have known characteristic functions but no closed-form distribution or density function. In that case, a sample of the characteristic function can be used directly. The pgf is then applied element-by-element to the characteristic function sample vector, and the inverse FFT routine is used to obtain a discrete approximation to the aggregate distribution. The exact rationale for this process is discussed in Fourier Transform Convolution Algorithm. The errors it introduces are discussed there and in num error analysis.

In summary, the FFT algorithm is simply:

  1. Discretize the severity cdf.

  2. Apply the FFT to discrete severity.

  3. Apply the frequency pgf to the FFT.

  4. Apply the inverse FFT to create a discretized approximation to the aggregate distribution.

This algorithm has appeared numerous times in the literature, see Related Actuarial Literature. The details are laid out in The aggregate Convolution Algorithm: Details. A plain Python implementation is presented in Aggregate Algorithm in Detail.
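To make the four steps concrete, here is a minimal numpy sketch for a Poisson frequency and a lognormal severity. The parameter values, and the direct use of scipy.stats and numpy.fft, are illustrative choices, not the aggregate implementation.

    import numpy as np
    import scipy.stats as ss

    en = 4            # expected claim count (illustrative)
    bs = 1 / 64       # bucket size (illustrative)
    n = 1 << 16       # number of buckets, log2 = 16

    # 1. discretize the severity cdf using the rounding method, p_0 = F(b/2)
    sev = ss.lognorm(0.75)
    xs = np.arange(n) * bs
    p = np.diff(sev.cdf(xs + bs / 2), prepend=0)

    # 2. apply the FFT to the discrete severity
    sev_ft = np.fft.rfft(p)

    # 3. apply the Poisson pgf, P(z) = exp(lambda (z - 1)), elementwise
    agg_ft = np.exp(en * (sev_ft - 1))

    # 4. inverse FFT gives the discrete aggregate distribution
    agg = np.fft.irfft(agg_ft, n)

    # sanity checks: probabilities sum to about 1; mean is about en * E[severity]
    print(agg.sum(), (xs * agg).sum(), en * sev.mean())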

5.4.2.3. Strengths and Weaknesses

I’ve been using the FFT algorithm since Glenn Meyers explained it to me at a COTOR meeting around 1996, and I still find it miraculous. It is very fast and its speed is largely independent of the expected claim count—in contrast to simulations. The algorithm is also very accurate, both in absolute and relative terms. It is essentially exact in many cases and eight-plus digit precision is often easy to obtain. The algorithm works well in almost all situations and for many use-cases it is unbeatable, including computing:

  • The distribution of aggregate losses from a portfolio with a complex limit and attachment profile, and a mixed severity.

  • Ceded or net outcome distributions for an occurrence reinsurance program.

  • Ceded or net outcome distributions for reinsurance contracts with variable features such as sliding commissions, swing rated programs, profit commissions, aggregate limits, see Reinsurance Pricing.

  • The distribution of retained losses net of specific and aggregate insurance, as found in a risk-retention group, see Individual Risk Pricing, including exact computation of Table L and Table M charges in US worker compensation ratemaking.

  • The distribution of the sum of independent distributions, e.g., aggregating units such as line of business, business unit, geographic unit etc.

  • The distribution of the sum of dependent units, where the dependence structure is driven by common frequency mixing variables.

The algorithm is particularly well-suited to compute aggregates with low claim counts and a thick-tailed severity and where accuracy is important, such as catastrophe risk PML, AEP, and OEP points. Outcomes with low expected loss rates are hard to simulate accurately.

The FFT algorithm is not a panacea. On the downside, its mysterious Fourier-nature presents the user with a choice of trusting in magic or a steep learning curve to understand the theory. It relies on hard-to-select parameters and can fail spectacularly and without warning if they are not chosen judiciously. A big contribution of aggregate is to provide the user with sensible default parameters and a test of model validity, see num error analysis. It does not work well for a high mean, thick-tailed frequency combined with a severity that has an intricate, thick-tailed distribution (an unusual situation that stresses any numerical method). However, when either frequency or severity is thin-tailed, it excels. Finally, the aggregate implementation is limited to tracking one variable at a time. It cannot model joint distributions, such as ceded and net loss, or derived quantities such as the total cession to a specific and aggregate cover, or the cession to an occurrence program with limited reinstatements. Both of these require a bivariate distribution. It can model the net position after specific and aggregate cessions, and ceded losses to an occurrence program with an aggregate limit. See num extensions for an approach to bivariate extensions.

5.4.2.5. Other Applications

The FFT algorithm is applied to model operational risk in Schaller and Temnov [2008], Temnov and Warnung [2008], Luo and Shevchenko [2009], Luo and Shevchenko [2011], and Shevchenko [2010]. These applications mirror the actuarial approach, using either padding or exponential tilting (exponential window) to control aliasing error. They are interesting because they include modeling with a very high expected claim count and a very thick tailed severity (no mean). See Truncation Example.

In finance, FFTs are used in option pricing, Carr and Madan [1999]. These applications can use distributions derived from the \(\alpha\)-stable and Lévy process families, which have a closed-form characteristic function but no analytic density. Duan et al. [2012] describe more recent innovations. FFTs are also used as a general purpose convolution routine, Černý [2004].

Wilson and Keich [2016] describe an interesting approach to accurate pairwise convolution that splits each component to limit the ratio of its most and least likely (non-zero) outcomes. It provides helpful estimates for assessing relative error and determining when an FFT estimate can be trusted.

5.4.2.6. Conditional Expectations (Kappa)

The function \(\kappa_i(x):=\mathsf E[X_i \mid X=x]\) is the basis for many of the computations in Portfolio. It can be computed using Fourier transforms because it is a convolution. There is no loss in generality assuming \(X=X_1 + X_2\). For simplicity suppose \((X_1, X_2)\) have a bivariate density \(f\). Then

\[\begin{split}\mathsf E[X_1 \mid X=x] &= \int_0^x t\frac{f(t, x-t)}{f(x)}\, dt \\ &= \frac{1}{f(x)} \int_0^x tf_1(t) f_2(x-t)\, dt\end{split}\]

can be computed from the convolution of \(tf_1(t)\) and \(f_2\). The convolution can be computed using Fourier transforms in the usual way: transform, product, inverse transform. Using FFTs and relying on the discretized version of \(X_i\), the algorithm becomes:

  1. Compute the discrete approximation to \(X_{\hat i}\), the sum of all \(X_j\), \(j\not=i\), identifying distributions with their discrete approximation vectors.

  2. Compute FFTs of \(X_{i}\) and \(X_{\hat i}\), with optional padding.

  3. Take the elementwise product of the FFTs.

  4. Apply the inverse FFT and unpad if necessary.
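The following is a minimal numpy sketch of this calculation for two independent units, including the \(x\)-weighting from the integral above and the final division by the density of \(X\). The function name and the padding choice are illustrative, not the Portfolio implementation.

    import numpy as np

    def kappa_1(p1, p2, bs):
        """E[X_1 | X = kb] where X = X_1 + X_2 and p1, p2 are discrete densities on kb."""
        n = len(p1)
        xs = np.arange(n) * bs
        # pad to length 2n so the convolutions do not wrap
        t_xf1 = np.fft.rfft(xs * p1, 2 * n)          # transform of x f_1(x)
        t_f1 = np.fft.rfft(p1, 2 * n)                # transform of f_1(x)
        t_f2 = np.fft.rfft(p2, 2 * n)                # transform of f_2(x)
        num = np.fft.irfft(t_xf1 * t_f2, 2 * n)[:n]  # convolution of x f_1 and f_2
        den = np.fft.irfft(t_f1 * t_f2, 2 * n)[:n]   # density of X = X_1 + X_2
        return np.divide(num, den, out=np.zeros_like(num), where=den > 0)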

A variable with density \(xf(x) / \mathsf E[X]\) is called the size-bias of \(X\). Size-biased variables have lots of interesting applications, see Arratia et al. [2019].

The aggregate implementation computes \(X_{\hat i}\) by dividing out the distribution of \(X_{i}\) from the overall sum (deconvolution), where that is possible, saving computing time.

5.4.3. Digital Representation of Distributions

“We come now to reality. The truth is that the digital computer has totally defeated the analog computer. The input is a sequence of numbers and not a continuous function. The output is another sequence of numbers.” [Strang, 1986]

5.4.3.1. How aggregate Represents a Distribution

aggregate aims to deliver the speed and accuracy of parametric distributions to aggregate probability distributions and make them as easy to use as the lognormal. To achieve that, it needs a representation of the underlying distribution amenable to computation.

There is no analytic expression for the cdf of most aggregate distributions. For example, there is no closed form expression for the distribution of the sum of two lognormals [Milevsky and Posner, 1998]. Therefore we must use a numerical approximation to the exact cdf. There are two obvious ways to construct a numerical approximation to a cdf:

  1. As a discrete (arithmetic, lattice) distribution supported on a discrete set of points.

  2. As a continuous random variable with a piecewise linear distribution function.

The next two figures illustrate the two approaches. First, a discrete approximation, which results in a step-function, piecewise constant cdf, shown left, with the corresponding quantile function, right. The cdf is continuous from the right and the (lower) quantile function is continuous from the left. The distribution does not have a density function (pdf); it only has a probability mass function (pmf).

In [1]: from aggregate.extensions.pir_figures import fig_4_5, fig_4_6

In [2]: fig_4_5()
../_images/num_discrete_approx.png

Second, a piecewise linear continuous approximation, which results in a step-function pdf (not shown).

In [3]: fig_4_6()
../_images/num_cts_approx.png

The second approach assumes the aggregate has a continuous distribution, which is often not the case. For example, the Tweedie and all other compound Poisson distributions are mixed (they have a mass at zero). An aggregate whose severity has a limit will have a mass at multiples of the limit caused by the non-zero probability of limit-only claims. When \(X\) is mixed it is impossible to distinguish the jump and continuous parts using a numerical approximation. The large jumps may be obvious but the small ones are not.

There are three other arguments in favor of discrete models. First, we live in a discrete world. Monetary amounts are multiples of a smallest unit: the penny, cent, yen, satoshi. Computers are inherently discrete. Second, probability theory is based on measure theory, which approximates distributions using simple functions that are piecewise constant. Third, the continuous model introduces unnecessary complexities in use, without any guaranteed gain in accuracy across all cases. See the complicated calculations in Robertson [1992], for example.

For all of these reasons, we use a discrete approximation. Further, we assume that the distribution is known at integer multiples of a fixed bandwidth or bucket size. This assumption is forced by the use of FFTs and has some undesirable consequences. Ideally, we would use a stratified approach, sampling more points where the distribution changes shape and using larger gaps to capture the tail. However, the computational efficiency of FFTs make this a good trade-off.

Based on the above considerations, saying we have computed an aggregate means that we have a discrete approximation to its distribution function concentrated on integer multiples of a fixed bucket size \(b\). This specifies the approximating aggregate as

  1. the value \(b\) and

  2. a vector of probabilities \((p_0,p_1,\dots, p_{n-1})\)

with the interpretation

\[\Pr(X=kb)=p_k.\]

All subsequent computations assume that the aggregate is approximated in this way. There are several important consequences.

  • The cdf is a step function with a jump of size \(p_k\) at \(kb\).

  • The cdf is continuous from the right (it jumps up at \(kb\)).

  • The cdf can be computed from the cumulative sum of \((p_0,p_1,\dots, p_{n-1})\).

  • The approximation has moments given by

    \[\sum_k (kb)^r p_k.\]
  • The limited expected value (the integral of the survival function) can be computed at the points \(kb\) as \(b\) times the cumulative sum of the survival function.

  • The pdf, if it exists, can be approximated by \(p_k/b\).

All of these calculations are more straightforward than assuming a piecewise linear cdf.
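These consequences translate directly into short vectorized computations. A sketch, using a hypothetical bucket size and probability vector:

    import numpy as np

    bs = 0.5                                       # bucket size (illustrative)
    p = np.array([0.1, 0.2, 0.4, 0.2, 0.1])        # probability vector (illustrative)
    xs = np.arange(len(p)) * bs                    # support 0, b, 2b, ...

    F = p.cumsum()                # step-function cdf at kb
    S = 1 - F                     # survival function at kb
    mean = (xs * p).sum()         # first moment, sum_k (kb) p_k
    second = (xs ** 2 * p).sum()  # second moment
    lev = bs * S.cumsum()         # lev[k] approximates E[min(X, (k+1) bs)]
    pdf = p / bs                  # density approximation p_k / b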

5.4.3.2. Discretizing the Severity Distribution

This section discusses ways to approximate a severity distribution with a discrete distribution. Severity distributions used by aggregate are supported on the non-negative real numbers; we allow a loss of zero, but not negative losses. However, the discretization process allows severity to be derived from a distribution supported on the whole real line—see the note below.

Let \(F\) be a distribution function and \(q\) the corresponding lower quantile function. It is convenient to be able to refer to a random variable with distribution \(F\), so let \(X=q(U)\) where \(U(\omega)=\omega\) is the standard uniform variable on the sample space \(\Omega=[0,1]\). \(X\) has distribution \(F\) [Föllmer and Schied, 2016].

We want to approximate \(F\) with a finite, purely discrete distribution supported at points \(x_k=kb\), \(k=0,1,\dots, m\), where \(b\) is called the bucket size or the bandwidth. Split this problem into two: first create an infinite discretization on \(k=0,1,\dots\), and then truncate it.

The calculations described in this section are performed in Aggregate.discretize().

5.4.3.2.1. Infinite Discretization

There are four common methods to create an infinite discretization.

  1. The rounding method assigns probability to the \(k\) th bucket equal to

    \[\begin{split}p_k &= \Pr((k - 1/2)b < X \le (k + 1/2)b) \\ &= F((k + 1/2)b) - F((k - 1/2)b) \\ p_0 &= F(b/2).\end{split}\]
  2. The forward difference method assigns

    \[\begin{split}p_k &= \Pr(kb < X \le (k+1)b ) \\ &= F((k + 1)b) - F(kb) \\ p_0 &= F(b).\end{split}\]
  3. The backward difference method assigns

    \[\begin{split}p_k &= \Pr((k-1)b < X \le kb ) \\ &= F(kb) - F((k - 1)b) \\ p_0 &= F(0).\end{split}\]
  4. The moment difference method [Klugman et al., 2019] assigns

    \[\begin{split}p_k &= \frac{2\mathsf E[X \wedge kb] - \mathsf E[X \wedge (k-1)b] - \mathsf E[X \wedge (k+1)b]}{b} \\ p_0 &= 1 - \frac{\mathsf E[X \wedge b]}{b}.\end{split}\]

    The moment difference ensures the discretized distribution has the same first moment as the original distribution. This method can be extended to match more moments, but the resulting weights are not guaranteed to be positive.

Note

Setting the first bucket to \(F(b/2)\) for the rounding method (resp. \(F(b)\), \(F(0)\) for the forward and backward methods) allows the use of random variables with negative support. Any values \(\le 0\) are included in the zero bucket. This behavior is useful because it allows the normal, Cauchy, and other similar distributions to be used as the basis for a severity.
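In code, the rounding, forward, and backward methods reduce to differencing the cdf on shifted grids. A sketch, assuming fz is a scipy.stats frozen severity, with illustrative values of b and n:

    import numpy as np
    import scipy.stats as ss

    fz = ss.lognorm(1.25)        # illustrative severity
    b, n = 0.5, 1 << 10          # illustrative bucket size and number of buckets
    xs = np.arange(n) * b

    # rounding: p_k = F((k + 1/2)b) - F((k - 1/2)b), with p_0 = F(b/2)
    p_round = np.diff(fz.cdf(xs + b / 2), prepend=0)

    # forward difference: p_k = F((k + 1)b) - F(kb), with p_0 = F(b)
    p_forward = np.diff(fz.cdf(xs + b), prepend=0)

    # backward difference: p_k = F(kb) - F((k - 1)b), with p_0 = F(0)
    p_backward = np.diff(fz.cdf(xs), prepend=0)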

Each of these methods produces a sequence \(p_k\ge 0\) of probabilities summing to 1 that can be interpreted as the pmf, with distribution function \(F_b^{(d)}\), of a discrete approximating random variable \(X_b^{(d)}\)

\[\begin{split}\Pr(X_b^{(d)} = kb) = p_k \\ F_b^{(d)}(kb) = \sum_{i \le k} p_i\end{split}\]

where superscript \(d=r,\ f,\ b,\ m\) describes the discretization method and subscript \(b\) the bucket size.

There is a disconnect between how the rounding method is defined and how it is interpreted. By definition, it corresponds to a distribution with jumps at \((k+1/2)b\), not \(kb\). However, the approximation assumes the jumps are at \(kb\) to simplify and harmonize subsequent calculations across the three discretization methods.

It is clear that [Embrechts and Frei, 2009]

\[\begin{split}& F_b^{(b)} \le F \le F_b^{(f)} \\ & F_b^{(b)} \le F_b^{(r)} \le F_b^{(f)} \\ & X_b^{(b)} \ge X \ge X_b^{(f)} \\ & X_b^{(b)} \ge X_b^{(r)} \ge X_b^{(f)} \\ & X_b^{(b)} \uparrow X \ \text{as}\ b\downarrow 0 \\ & X_b^{(f)} \downarrow X \ \text{as}\ b\downarrow 0\end{split}\]

\(X_b^{(b)}\), \(X_b^{(r)}\), and \(X_b^{(f)}\) converge weakly (in \(L^1\)) to \(X\) and the same holds for a compound distribution with severity \(X\). These inequalities are illustrated in the example below.

5.4.3.2.2. Rounding Method Used by Default

aggregate uses the rounding method by default and offers the forward and backward methods to compute explicit bounds on the distribution approximation if required. We found that the rounding method performs well across all examples we have run. These options are available in update() through the sev_calc argument, which can take the values round, forward, and backward. This decision is based in part on the following observations about the moment method in Embrechts and Frei [2009] (emphasis added):

that both the forward/backward differences and the rounding method do not conserve any moments of the original distribution. In this light Gerber [1982] suggests a procedure that locally matches the first \(k\) moments. Practically interesting is only the case \(k = 1\); for \(k \ge 2\) the procedure is not well defined, potentially leading to negative probability mass on certain lattice points. The moment matching method is much more involved than the rounding method in terms of implementation; we need to calculate limited expected values. Apart from that, the gain is rather modest; moment matching only pays off for large bandwidths, and after all, the rounding method is to be preferred. This is further reinforced by the work of Grübel and Hermesmeier [1999]: if the severity distribution is absolutely continuous with a sufficiently smooth density, the quantity \(f_{b,j} / b\), an approximation for the compound density, can be quadratically extrapolated.

Klugman et al. [2019] report that Panjer and Lutek [1983] found two moments were usually sufficient and that adding a third moment requirement adds only marginally to the accuracy. Furthermore, they report that the rounding method and the first-moment method had similar errors, while the second-moment method provided significant improvement but at the cost of no longer guaranteeing that the resulting probabilities are nonnegative.

5.4.3.2.3. Approximating the Density

The pdf at \(kb\) can be approximated as \(p_k / b\). This suggests another approach to discretization. Using the rounding method

\[\begin{split}p_k &= F((k+1/2) b) - F((k- 1/2)b) \\ &= \int_{(k-1/2)b}^{(k+1/2)b} f(x)dx \\ &\approx f(kb) b.\end{split}\]

Therefore we could rescale the vector \((f(0), f(b), f(2b), \dots)\) to have sum 1. This method works well for continuous distributions, but does not apply to mixed ones, e.g., when a policy limit applies.

5.4.3.2.4. Discretization Example

This example illustrates the impact of different discretization methods on the severity and aggregate distributions. The example uses a severity that can take negative values. aggregate treats any negative values as a mass at zero. This approach allows for the use of the normal and other distributions supported on the whole real line. The severity has finite support, so truncation is not an issue, and it is discrete so it is easy to check the calculations are correct. The severity is shown first, discretized using bs=1, 1/2, 1/4, 1/8. As expected, the rounding method (orange) lies between the forward (blue) and backward (green) methods.

In [7]: from aggregate import build, qd

In [8]: import matplotlib.pyplot as plt

In [9]: from matplotlib import ticker

In [10]: dsev = [-1, 0, .25, .5, .75, 1, 1.5 + 1 / 16, 2, 2 + 1/4, 3]

In [11]: a01 = build(f'agg Num:01 1 claim dsev {dsev} fixed', update=False)

In [12]: fig, axs = plt.subplots(2, 2, figsize=(2 * 3.5, 2 * 2.45 + 0.1),
   ....:     constrained_layout=True)
   ....: 

In [13]: for bs, ax in zip([1, 1/2, 1/4, 1/8], axs.flat):
   ....:     for k in ['forward', 'round', 'backward']:
   ....:         a01.update(log2=10, bs=bs, sev_calc=k)
   ....:         a01.density_df.p_total.cumsum().\
   ....:             plot(xlim=[-.25, 3.25], lw=2 if  k=='round' else 1,
   ....:             drawstyle='steps-post', ls='--', label=k, ax=ax)
   ....:         ax.legend(loc='lower right')
   ....:         ax.set(title=f'Bucket size bs={bs}')
   ....: 

In [14]: axs[0,0].set(ylabel='distribution');

In [15]: axs[1,0].set(ylabel='distribution');

In [16]: fig.suptitle('Severity by discretization method for different bucket sizes');
../_images/num_ex1a.png

Next, create aggregate distributions with a Poisson frequency, mean 4 claims, shown for the same values of bs.

In [17]: a02 = build(f'agg Num:02 4 claims dsev {dsev} poisson', update=False)

In [18]: fig, axs = plt.subplots(2, 2, figsize=(2 * 3.5, 2 * 2.45 + 0.1),
   ....:     constrained_layout=True)
   ....: 

In [19]: for bs, ax in zip([1, 1/2, 1/4, 1/8], axs.flat):
   ....:     for k in ['forward', 'round', 'backward']:
   ....:         a02.update(log2=10, bs=bs, sev_calc=k)
   ....:         a02.density_df.p_total.cumsum().\
   ....:             plot(xlim=[-2, 27], lw=2 if  k=='round' else 1,
   ....:            drawstyle='steps-post', label=k, ax=ax)
   ....:         ax.legend(loc='lower right')
   ....:         ax.set(title=f'Bucket size bs={bs}')
   ....: 

In [20]: fig.suptitle('Aggregates by discretization method');
../_images/num_ex1b.png

Note

Setting drawstyle='steps-post' joins dots with a step function that jumps on the right (post=afterwards), making the result continuous from the right, appropriate for a distribution. Quantile functions are continuous from the left and should be rendered using drawstyle='steps-pre' (before), which puts the jump on the left.

5.4.3.2.5. Exact Calculation

The differences \(p_k=F((k + 1/2)b) - F((k - 1/2)b)\) can be computed in three different ways, controlled by the discretization_calc option. The options are:

  1. discretization_calc='distribution' takes differences of the sequence \(F((k + 1/2)b)\). This results in a potential loss of accuracy in the right tail where the distribution function increases to 1. The resulting probabilities can be no smaller than the smallest difference between 1 and a float. numpy reports this as numpy.finfo(float).epsneg; it is of the order 1e-16.

  2. discretization_calc='survival' takes the negative difference of the sequence \(S((k + 1/2)b)\) of survival function values. This results in a potential loss of accuracy in the left tail, where the survival function is close to 1. However, it provides better resolution in the right tail.

  3. discretization_calc='both' attempts to make the best of both worlds, computing:

    np.maximum(np.diff(fz.cdf(adj_xs)), -np.diff(fz.sf(adj_xs)))
    

This does double the work and is marginally slower.

The update default is survival. The calculation method does not generally impact the aggregate distribution when FFTs are used because they compute to accuracy about 1e-16 (there is a 1 in each row and column of \(\mathsf F\), see Fast Fourier Transforms). However, the option can be helpful to create a pleasing graph of severity log density.
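The difference between the two calculations is easy to see directly. In the far right tail, differencing cdf values that are all within about 1e-16 of 1 leaves only a few significant digits (or zero), while differencing survival function values retains full relative precision. An illustrative check with scipy.stats:

    import numpy as np
    import scipy.stats as ss

    fz = ss.lognorm(2.5)                     # thick-ish tail (illustrative)
    xs = np.array([1e7, 2e7, 4e7, 8e7])      # far right tail points

    p_from_cdf = np.diff(fz.cdf(xs))         # differences of numbers very close to 1
    p_from_sf = -np.diff(fz.sf(xs))          # differences of small, well-resolved numbers

    print(p_from_cdf)                        # granular, limited to multiples of ~1e-16
    print(p_from_sf)                         # smooth, full relative precision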

5.4.3.3. Truncation and Normalization

The discrete probabilities \(p_k\) must be truncated into a finite-length vector to use in calculations. The number of buckets used is set by the log2 variable, which specifies its base 2 logarithm. The default is log2=16, corresponding to 65,536 buckets. There are two truncation options, controlled by the normalize variable.

  1. normalize=False simply truncates, possibly resulting in a vector of probabilities that sums to less than 1.

  2. normalize=True truncates and then normalizes, dividing the truncated vector by its sum, resulting in a vector of probabilities that does sum to 1 (approximately, see floats).

The default is normalize=True.

It is obviously desirable for the discrete probabilities to sum to 1. A third option, putting a mass at the maximum loss, does not produce intuitive results: since the underlying distributions generally do not have such a mass, the graphs look wrong.

In general, it is best to use normalize=True in cases where the truncation error is immaterial, for example with a thin tailed severity. It is numerically cleaner and avoids issues with quantiles close to 1. When there will be an unavoidable truncation error, it is best to use normalize=False. The user needs to be aware that the extreme right tail is understated. The bucket size and number of buckets should be selected so that the tail is accurate where it is being relied upon. See num error analysis for more.

Warning

Avoid using normalize=True for thick-tailed severities. It results in an unreliable and hard-to-interpret estimated mean severity.

5.4.3.3.1. Truncation Example

Schaller and Temnov [2008] consider a Poisson-generalized Pareto model for operational risk. They assume an expected claim count equal to 18 and a generalized Pareto with shape 1, scale 12000 and location 7000. This distribution does not have a mean. They want to model the 90th percentile point. They compare using exponential tilting [Grübel and Hermesmeier, 1999] with padding, using up to 1 million (log2=20) buckets. They use a right-hand endpoint of 1 million on the severity. This example illustrates the impact of normalization and shows that padding and tilting have a similar effect.

Set up the base distribution without updating it. Note the infinite severity.

In [21]: a = build('agg Schaller:Temnov '
   ....:           '18 claims '
   ....:           'sev 12000 * genpareto 1 + 7000 '
   ....:           'poisson'
   ....:           , update=False)
   ....: 

In [22]: qd(a)

      E[X]  CV(X) Skew(X)
X                        
Freq    18 0.2357  0.2357
Sev    inf               
Agg    inf               
log2 = 0, bandwidth = na, validation: n/a, not updated.

Execute a variety of updates and assemble the answers. Compare Schaller and Temnov [2008], Example 4.3.2, p. 197. They estimate the 90th percentile as 3,132,643. In this case, normalizing severity has a material impact; it acts to decrease the tail thickness and hence the estimated percentiles.

In [23]: import time

In [24]: import pandas as pd

In [25]: updates = {
   ....:     'a': dict(log2=17, bs=100, normalize=True, padding=0 , tilt_vector=None),
   ....:     'b': dict(log2=17, bs=100, normalize=False, padding=0, tilt_vector=None),
   ....:     'c': dict(log2=17, bs=100, normalize=False, padding=1, tilt_vector=None),
   ....:     'd': dict(log2=17, bs=100, normalize=False, padding=2, tilt_vector=None),
   ....:     'e': dict(log2=20, bs=25, normalize=True, padding=1 , tilt_vector=None),
   ....:     'f': dict(log2=20, bs=25, normalize=False, padding=1 , tilt_vector=None),
   ....:     'g': dict(log2=17, bs=100, normalize=False, padding=0, tilt_vector=20 / (1<<17))
   ....:     }
   ....: 

In [26]: ans = {}

In [27]: for k, v in updates.items():
   ....:     start_time_ns = time.time_ns()
   ....:     a.update(**v)
   ....:     end_time_ns = time.time_ns()
   ....:     ans[k] = [a.q(0.9), a.q(0.95), a.q(0.99),  (-start_time_ns + end_time_ns) / 1e6]
   ....: 

In [28]: df = pd.DataFrame(ans.values(), index=ans.keys(), columns=[.9, .95, .99, 'millisec'])

In [29]: for k, v in updates['a'].items():
   ....:     df[k] = [v[k] for v in updates.values()]
   ....: 

In [30]: df = df.replace(np.nan, 'None')

In [31]: df = df.set_index(['log2', 'bs', 'normalize', 'padding', 'tilt_vector'])

In [32]: df.columns.name = 'percentile'

In [33]: qd(df, float_format=lambda x: f'{x:12,.0f}', sparsify=False, col_space=4)

percentile                                  900.000m     950.000m     990.000m     millisec
log2 bs   normalize padding tilt_vector                                                    
17   100  True      0       None           2,789,700    4,280,300    8,936,500           20
17   100  False     0       None           3,090,100    5,294,500   13,107,100           20
17   100  False     1       None           3,132,600    5,466,900   13,107,100           25
17   100  False     2       None           3,132,700    5,467,100   13,107,100           38
20   25   True      1       None           2,966,775    4,851,225   13,306,850          255
20   25   False     1       None           3,132,675    5,467,125   23,117,100          200
17   100  False     0       152.588u       3,132,700    5,467,100   13,107,100           25

5.4.3.4. Estimating the Bucket Size

Warning

Estimating the bucket size correctly is critical to obtaining accurate results from the FFT algorithm. This section is very important!

The bucket size is estimated as the \(p\)-percentile of a moment matched fit to the aggregate. By default \(p=0.999\), but the user can select another value using the recommend_p argument to update.

On creation, Aggregate automatically computes the theoretic mean, CV, and skewness \(\gamma\) of the requested distribution. Using those values and \(p\) the bucket size is estimated as follows.

  1. If the CV is infinite the user must input \(b\). A ValueError is thrown if no value is provided. Without a standard deviation there is no way to gauge the scale of the distribution. Note that the CV is automatically infinite if the mean does not exist.

  2. Else if the CV is finite and \(\gamma < 0\), fit a normal approximation (matching two moments). Most insurance applications have positive skewness.

  3. Else if the CV is finite and \(0 < \gamma < \infty\), fit shifted lognormal and gamma distributions (matching three moments), and a normal distribution.

  4. Else if the CV is finite but skewness is infinite, fit lognormal, gamma, and normal distributions (two moments).

  5. Compute \(b'\) as the greatest \(p\)-percentile of the fitted distributions (usually the lognormal's).

  6. If all severity components are limited, compute the maximum limit, \(m\), otherwise set \(m=0\).

  7. Take \(b=\max(b', m)/n\), where \(n\) is the number of buckets.

  8. If \(b \ge 1\) round up to 1, 2, 5, 10, 20, 100, 200, etc., and return. Else if \(b<1\) return the smallest power of 2 greater than \(b\) (e.g., 0.2 rounds to 0.25, 0.1 to 0.125).

Step 8 ensures that \(b\) is a reasonable looking round number when \(b \ge 1\) and an exact binary float when \(b < 1\). The algorithm performs well in practice, though it can under-estimate \(b\) for thick-tailed severities. The user should always look at the diagnostics, see Aggregate Quick Diagnostics.
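A simplified sketch of the heart of this estimate, using only the moment-matched lognormal and omitting the limit adjustment and the rounding in step 8 (the function name and defaults are illustrative, not the library routine):

    import numpy as np
    import scipy.stats as ss

    def rough_bucket_size(mean, cv, log2=16, p=0.999):
        """Rough bucket size: lognormal matched to mean and cv, p-percentile over n buckets."""
        n = 1 << log2
        sigma = np.sqrt(np.log(1 + cv ** 2))       # lognormal sigma from the cv
        mu = np.log(mean) - sigma ** 2 / 2         # lognormal mu from the mean
        b_prime = ss.lognorm(sigma, scale=np.exp(mu)).ppf(p)
        return b_prime / n

    print(rough_bucket_size(mean=100e6, cv=1.5))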

5.4.3.5. Occurrence Reinsurance and Loss Picks

If specific layer loss picks are selected, the adjustment occurs immediately after the gross severity is computed in Step 2.

Occurrence reinsurance is applied after loss pick adjustment and before step 3.

Aggregate reinsurance is applied at the very end of the algorithm.

5.4.4. Fourier Transform Convolution Algorithm

We come now to reality. The truth is that the digital computer has totally defeated the analog computer. The input is a sequence of numbers and not a continuous function. The output is another sequence of numbers, whether it comes from a digital filter or a finite element stress analysis or an image processor. The question is whether the special ideas of Fourier analysis still have a part to play, and the answer is absolutely yes. [Strang, 1986]

The previous section quoted Strang in support of discrete models. Here we complete his quote in support of using Fourier analysis, born in application to continuous functions, in a discrete setting.

5.4.4.1. The aggregate Convolution Algorithm: Details

This section expands on each step of the FFT algorithm presented at the end of The aggregate Convolution Algorithm. Recall the four steps:

  1. Discretize the severity cdf.

  2. Apply the FFT to discrete severity.

  3. Apply the frequency pgf to the FFT.

  4. Apply the inverse FFT to create a discretized approximation to the aggregate distribution.

5.4.4.1.1. Algorithm Objective

Compute a discrete approximation to the aggregate distribution

\[A = X_1 + \cdots + X_N,\]

under the assumption that \(X_i\) are iid like \(X\) and \(N\) is independent of \(X_i\).

5.4.4.1.2. Algorithm Inputs

  1. Severity distribution (cdf, sf, and moments), optionally including loss pick and occurrence reinsurance adjustments.

  2. Frequency distribution (probability generating function \(\mathscr P(z):= \mathsf E[z^N]\), and moments).

  3. Number of buckets, expressed as log base 2, \(n=2^\mathit{log2}\).

  4. Bucket size, \(b\).

  5. Severity calculation method: round, forward, or backward.

  6. Discretization calculation method: survival, distribution, or both.

  7. Normalization parameter, True or False.

  8. Padding parameter, an integer \(d \ge 0\).

  9. Tilt parameter, a real number \(\theta \ge 0\).

  10. Remove “fuzz” parameter, True or False.

Usually either tilting or padding is applied to manage aliasing, but not both. When both are requested, tilting is applied first and the result is zero padded.

5.4.4.1.3. Default and Reasonable Parameter Values

The severity and frequency distributions are required. Defaults and reasonable ranges for the other parameters are as follows. Set \(x_{max}=bn\) to be the range of the discretized output.

  1. \(\mathit{log2}=16\), with a reasonable range \(3\le\mathit{log2}\le 28-d\) (on a 64-bit computer with 32GB RAM).

  2. Estimating the bucket size is quite involved and is described in Estimating the Bucket Size.

  3. Severity calculation round, see Rounding Method Used by Default.

  4. Discretization survival, see Exact Calculation.

  5. Normalization defaults to True. It should only be used if it is immaterial, in the sense that \(1-F(x_{max})\) is small. See Truncation and Normalization.

  6. Padding equals 1, meaning severity is doubled in size and padded with zeros. Padding equals to 2, which quadruples the size of the severity vector, is sometimes necessary. Schaller and Temnov [2008] report requiring padding equal to 2 in empirical tests with very high frequency and thick tailed severity.

  7. Tilt equals 0 (no tilting applied). Embrechts and Frei [2009] recommend \(\theta n \le 20\). Schaller and Temnov [2008] section 4.2 discuss how to select \(\theta\) to balance the decrease in aliasing error against the impact on numerical precision. Tilting is an effective way to manage aliasing, but in most circumstances no more so than padding. We prefer the simpler padding approach.

  8. Remove fuzz is True.

5.4.4.1.4. Algorithm Steps

The default steps are shown next, followed by further explanation.

  1. If frequency is identically zero, then \((1,0,\dots)\) is returned with no further calculation.

  2. If frequency is identically one, then the discretized severity is returned with no further calculation.

  3. If needed, estimate the bucket size, Estimating the Bucket Size.

  4. Discretize severity into a vector \(\mathsf p=(p_0,p_1,\dots,p_{n-1})\), see Discretizing the Severity Distribution. This step may include normalization.

  5. Tilt severity, \(p_k\gets p_k e^{-k\theta}\).

  6. Zero pad the vector \(\mathsf p\) to length \(2^{\mathit{log2} + d}\) by appending zeros, to produce \(\mathsf x\).

  7. Compute \(\mathsf z:=\mathsf{FFT}(\mathsf x)\).

  8. Compute \(\mathsf f:=\mathscr P(\mathsf z)\).

  9. Compute the inverse FFT, \(\mathsf y:=\mathsf{IFFT}(\mathsf f)\).

  10. Take the first \(n\) entries in \(\mathsf y\) to obtain \(\mathsf a:=\mathsf y[0:n]\).

  11. Aggregate reinsurance is applied to \(\mathsf a\) if applicable.

5.4.4.2. Theory: Why the Algorithm Works

This section explains why the output \(\mathsf a=(a_0,\dots,a_{n-1})\) has \(a_k\) very close to \(\Pr(A=kb)\).

Fourier transforms provide an alternative way to represent a distribution function. The Wikipedia article on the Fourier transform (https://en.wikipedia.org/wiki/Fourier_transform) says:

The Fourier transform of a function is a complex-valued function representing the complex sinusoids that comprise the original function. For each frequency, the magnitude (absolute value) of the complex value represents the amplitude of a constituent complex sinusoid with that frequency, and the argument of the complex value represents that complex sinusoid’s phase offset. If a frequency is not present, the transform has a value of 0 for that frequency. The Fourier inversion theorem provides a synthesis process that recreates the original function from its frequency domain representation.

Functions that are localized in the time domain have Fourier transforms that are spread out across the frequency domain and vice versa, a phenomenon known as the uncertainty principle. The critical case for this principle is the Gaussian function: the Fourier transform of a Gaussian function is another Gaussian function.

Generalizations include the discrete-time Fourier transform (DTFT, group \(\mathbb Z\)), the discrete Fourier transform (DFT, group \(\mathbb Z \bmod N\)) and the Fourier series or circular Fourier transform (group = \(S^1\), the unit circle being a closed finite interval with endpoints identified). The latter is routinely employed to handle periodic functions. The fast Fourier transform (FFT) is an algorithm for computing the DFT.

The Fourier transform (FT) of a distribution function \(F\) is usually written \(\hat F\). The FT contains the same information as the distribution and there is a dictionary back and forth between the two, using the inverse FT. Some computations with distributions are easier to perform using their FT, which is what makes them useful. The FT is like exponentiation for distributions. The exponential and log functions turn (difficult) multiplication into (easy) addition

\[e^a \times e^b = e^{a+b}.\]

FTs turn difficult convolution of distributions (addition of the corresponding random variables) into easy multiplication of Fourier transforms. If \(X_i\) are random variables, \(X=X_1+X_2\), and \(F_X\) is the distribution of \(X\), then

\[\widehat{F_{X_1+X_2}}(t) = \widehat{F_{X_1}}(t) \times \widehat{F_{X_2}}(t),\]

where the righthand side is a product of functions. Computing the distribution of a sum of random variables is complicated because you have to consider all different ways an outcome can be split, but it is easy using FTs. Of course, this depends on it being easy to compute the FT and its inverse—which is where FFTs come in.
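The identity is easy to verify numerically for two small discrete distributions; an illustrative check, not part of aggregate:

    import numpy as np

    p = np.array([0.5, 0.3, 0.2])      # X1 on {0, 1, 2}
    q = np.array([0.6, 0.4])           # X2 on {0, 1}

    direct = np.convolve(p, q)         # distribution of X1 + X2 on {0, 1, 2, 3}

    m = len(p) + len(q) - 1            # pad so the circular convolution does not wrap
    via_fft = np.fft.irfft(np.fft.rfft(p, m) * np.fft.rfft(q, m), m)

    print(np.allclose(direct, via_fft))   # True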

There are three things going on here:

  1. Fourier transform of a function,

  2. Discrete Fourier transform of an infinite sequence, and

  3. Fast Fourier transform of a finite vector.

Discrete Fourier transforms are a discrete approximation to continuous FTs, formed by sampling at evenly spaced points. The DFT is a sequence, rather than a function. It retains the convolution property of FTs. They are sometimes called discrete cosine transforms (DCT).

The Fast Fourier transform refers to a very fast way to compute finite discrete FTs, which are applied to finite samples of FTs. General usage blurs the distinction between discrete FTs and their computation, and uses FFT as a catchall for both.

Thus, there are four steps from the continuous to the finite discrete computational strategy (notation explained below):

  1. Analytic domain:

    \[f \rightarrow \hat f \rightarrow \mathscr P\circ \hat f \rightarrow \widehat{\mathscr P\circ \hat f} =: g\]
  2. Discrete approximation:

    \[f \rightarrow f_b \rightarrow \hat{f_b} \rightarrow \mathscr P\circ \hat{f_b} \rightarrow \widehat{\mathscr P\circ \hat{f_b}} =: g_b\]
  3. Finite discrete approximation:

    \[f \rightarrow f_{b, n} \rightarrow \hat{f_{b,n}} \rightarrow \mathscr P\circ \hat{f_{b,n}} \rightarrow \widehat{\mathscr P\circ \hat{f_{b,n}}} =: g_{b, n}\]
  4. Finite discrete approximation, periodic inversion:

    \[f \rightarrow f_{b, n} \rightarrow \hat{f_{b,n}} \rightarrow \mathscr P_m\circ \hat{f_{b,n}} \rightarrow \widehat{\mathscr P_m\circ \hat{f_{b,n}}} =: g_{b,n,m}\]

Here is the rationale for each step.

  • Step 1 to 2: discretize \(f\) because we are working in a digital world, not an analog/analytic one (Strang quote) and because the answers are often not continuous distributions. Discretize at multiples of a sampling interval \(b\). The sampling rate is \(1/b\) samples per unit. The sampled distribution (which no longer has a density) is

    \[f_b = \sum_k p_k\delta_{kb}.\]

    \(f_b\) has Fourier transform

    \[\hat{f_b}(t) =\sum_k p_ke^{-2\pi i kb t}.\]

    If \(\hat f\) is known analytically it can be sampled directly, see the stable example below. However, many relevant \(f\) do not have analytic FTs, e.g., lognormal. At this point, \(f_b\) is still defined on \(\mathbb R\).

  • Step 2 to 3: truncate and take a finite discretization because we are working on a digital computer with finite memory.

    \[f_{b, n} = \sum_{k=0}^{n-1} p_k\delta_{kb}.\]

    Let \(P=nb\). Now \(f_{b,n}\) is non-zero only on \([0, P)\). Finite discretization combined with an assumption of \(P\) periodicity enables the use of FFTs to compute \(f_{b, n} \rightarrow \hat{f_{b,n}}\). (In order for a Fourier series to be \(P\)-periodic, it can only weight frequencies that are a multiple of \(1/P\) since \(\exp(2\pi i (x+kP)t)=\exp(2\pi i xt)\) for all integers \(k\) iff \(\exp(2\pi i kPt)=1\) iff \(Pt\) is an integer. Take the integer to be 1; higher values correspond to aliasing. Hence Shannon-Nyquist and bandwidth limited functions etc.) Sampling \(\widehat{f_{b, n}}(t)\) at \(t=0,1/P,\dots,(n-1)/P\) requires calculating

    \[\hat{f_{b,n}}(\tfrac{l}{P}) = \sum_k p_ke^{-2\pi i kb \tfrac{l}{P}} = \sum_k p_ke^{-\tfrac{2\pi i}{n} kl}\]

    which is exactly what FFTs compute very quickly.

  • Step 3 to 4: finite convolution, \(\mathscr P_m\) is computed with a sample of length \(m\ge n\), i.e., padding, to control aliasing. We can also use exponential tilting (which must be done in the \(f\) domain). \(\mathscr P_m\circ \hat{f_{b,n}}\) is the application of a function to a vector, element-by-element and is easy to compute. \(\mathscr P_m\circ \hat{f_{b,n}} \rightarrow \widehat{\mathscr P_m\circ \hat{f_{b,n}}}\) can be computed using FFTs, whereas inverting \(\mathscr P\circ \hat{f_{b,n}}\) would usually be very difficult because it usually has infinite support. The price for using FFTs is assuming \(g\) is \(P\)-periodic, i.e., introducing aliasing error. For simplicity, assume \(m=n\) by padding the samples in Step 2.

    Now we can use the inverse DFT to recover \(g\) at the values \(kb\):

    \[\begin{split}\begin{align} g(kb) &= \sum_l \hat g(\tfrac{l}{P}) e^{2 \pi i kb \tfrac{l}{P}} \\ &= \sum_l \hat g(\tfrac{l}{P}) e^{\tfrac{2 \pi i}{n} kl}. \end{align}\end{split}\]

    However, this is an infinite sum (step 3), and we are working with computers, so it needs to be truncated (step 4). What is

    \[\sum_{l=0}^{n-1} \hat g(\tfrac{l}{P}) e^{\tfrac{2 \pi i}{n} kl}?\]

    It is an inverse DFT, that FFTs compute with alacrity. What does it equal?

    Define \(g_P(x) = \sum_k g(x + kP)\) to be the \(P\)-periodic version of \(g\). If \(g\) has finite support contained in \([0, P)\) then \(g_P=g\). If that is not the case there will be wrapping spill-over, see PICTURE.

    Now

    \[\begin{split}\begin{align} \hat g(\tfrac{l}{P}) :&= \int_\mathbb{R} g(x)e^{-2 \pi i x \tfrac{l}{P}}dx \\ &= \sum_k \int_{kP}^{(k+1)P} g(x)e^{-2 \pi i x \tfrac{l}{P}}dx \\ &= \sum_k \int_{0}^{P} g(x+kP)e^{-2 \pi i (x+kP) \tfrac{l}{P}}dx \\ &= \int_{0}^{P} \sum_k g(x+kP)e^{-2 \pi i x \tfrac{l}{P}}dx \\ &= \int_{0}^{P} g_P(x)e^{-2 \pi i x \tfrac{l}{P}}dx \\ &= \hat{g_P}(\tfrac{l}{P}) \end{align}\end{split}\]

    and therefore, arguing backwards and assuming that \(\hat g\) is quickly decreasing, for large enough \(n\),

    \[\begin{split}\begin{align} \sum_{l=0}^{n-1} \hat{g}(\tfrac{l}{P}) e^{\tfrac{2 \pi i}{n} kl} &\approx\sum_l \hat{g}(\tfrac{l}{P}) e^{\tfrac{2 \pi i}{n} kl} \\ &= \sum_l \hat{g_P}(\tfrac{l}{P}) e^{\tfrac{2 \pi i}{n} kl} \\ &= g_P(kb) \end{align}\end{split}\]

    Thus the partial sum we can easily compute on the left approximates \(g_P\) and in favorable circumstances it is close to \(g\).

There are four sources of error in the FFT algorithm. They can be controlled by different parameters:

  1. Discretization error \(f \leftrightarrow f_b\) (really \(\hat f \leftrightarrow \hat{f_b}\)): replacing the original distribution with a discretized approximation, controlled by decreasing the bucket size.

  2. Truncation error \(\hat{f_b} \leftrightarrow \hat{f_{b, n}}\): shrinking the support of the severity distribution by right truncation, controlled by increasing the bucket size and/or increasing the number of buckets.

  3. Aliasing error \(\widehat{\mathscr P\circ \hat{f_{b,n}}} \leftrightarrow \widehat{\mathscr P_m\circ \hat{f_{b,n}}}\): expect \(g_k\), get \(\sum_l g_{k+ln}\): working with only finitely many frequencies in the Fourier domain, which results in visible wrapping of the aggregate, controlled by padding or tilting severity.

  4. FFT algorithm: floating point issues, underflow and (rarely) overflow, hidden by removing numerical “fuzz” after the algorithm has run.

To summarize:

  • If we know \(\hat f\) analytically, we can use this method to estimate a discrete approximation to \(f\). We are estimating \(f_P(kb)\) not \(f(kb)\), so there is always aliasing error, unless \(f\) actually has finite support.

  • If \(f\) is actually discrete, the only error comes from truncating the Fourier series. We can make this as small as we please by taking enough terms in the series. This case is illustrated for the Poisson distribution. This method is also applied by aggregate: the “analytic” chf is \(\mathscr P_N(M_X(t))\), where \(M_X(t)\) is the sum of exponentials given above.

  • When \(\hat f\) is known we have a choice between discretizing in the space (loss) or time domain.

  • If \(f\) is not discrete, there is a discretization and potentially aliasing error. We can control the former with high frequency (small \(b\)) sampling. We control the latter with large \(P=nb\), arguing for large \(n\) or large \(b\) (in conflict to managing discretization error).

5.4.4.3. Using FFT to Invert Characteristic Functions

The use of FFTs to recover the aggregate at the end of Step 4 is entirely generic. It can be used to invert any characteristic function. In this section we provide some examples.

Invert a gamma distribution from a sample of its characteristic function and compare with the true density. These plots show the inversion is extremely accurate over a very wide range. The top right plot compares the log density, highlighting differences only in the extreme tails.

In [1]: from aggregate.extensions import ft_invert

In [2]: import scipy.stats as ss

In [3]: import matplotlib.pyplot  as plt

In [4]: df = ft_invert(
   ...:          log2=6,
   ...:          chf=lambda alpha, t: (1 - 1j * t) ** -alpha,
   ...:          frz_generator=ss.gamma,
   ...:          params=[30],
   ...:          loc=0,
   ...:          scale=1,
   ...:          xmax=0,
   ...:          xshift=0,
   ...:          suptitle='Gamma distribution')
   ...: 
../_images/numfft01.png

Invert a Poisson distribution with a very high mean. This is an interesting case, because we do not need space for the whole answer, just the effective range of the answer. We can use periodicity to “move” the answer to the right \(x\) range. This example reproduces a Poisson with mean 10,000. The standard deviation is only 100 and so the effective range of the distribution (using the normal approximation) will be about 9500 to 10500. Thus a satisfactory approximation can be obtained with only \(2^{10}=1024\) buckets.

In [5]: import aggregate.extensions.ft as ft

In [6]: df = ft.ft_invert(
   ...:          log2=10,
   ...:          chf=lambda en, t: np.exp(en * (np.exp(1j * t) - 1)),
   ...:          frz_generator=ss.poisson,
   ...:          params=[10000],
   ...:          loc=0,
   ...:          scale=None, # for freq dists, scaling does not apply
   ...:          xmax=None,  # for freq dists want bs = 1, so xmax=1<<log2
   ...:          xshift=9500,
   ...:          suptitle='Poisson distribution, large mean computed in small space.')
   ...: 
../_images/numfft02.png

Invert a stable distribution. Here there is more aliasing error because the distribution is so thick tailed. There is also more aliasing on the left than the right because of the asymmetric beta parameter.

In [7]: def levy_chf(alpha, beta, t):
   ...:     Φ = np.tan(np.pi * alpha / 2) if alpha != 1 else -2 / np.pi * np.log(np.abs(t))
   ...:     return np.exp(- np.abs(t) ** alpha * (1 - 1j * beta * np.sign(t) * Φ))
   ...: 

In [8]: df = ft.ft_invert(
   ...:          log2=12,
   ...:          chf=levy_chf,
   ...:          frz_generator=ss.levy_stable,
   ...:          params=[1.75, 0.3],  # alpha, beta
   ...:          loc=0,
   ...:          scale=1.,
   ...:          xmax=1<<8,
   ...:          xshift=-(1<<7),
   ...:          suptitle='Stable Levy exponent $\\alpha=7/4$, '
   ...:          'slightly skewed')
   ...: 

In [9]: f = plt.gcf()

In [10]: ax = f.axes[1]

In [11]: ax.grid(lw=.25, c='w');
../_images/numfft03.png

5.4.4.4. Fast Fourier Transforms

The trick with FFTs is how they are computed. What they compute is very straightforward and given by a simple matrix multiplication.

The FFT of the \(m\times 1\) vector \(\mathsf{x}=(x_0,\dots,x_{m-1})\) is just another \(m\times 1\) vector \(\hat{\mathsf{x}}\) whose \(j\)th component is

\[\hat{x}_j = \sum_{k=0}^{m-1} x_k\exp(-2\pi ijk/m),\]

where \(i=\sqrt{-1}\). The coefficients of \(\hat{\mathsf{x}}\) are complex numbers. It is easy to see that \(\hat{\mathsf{x}}=\mathsf{F}\mathsf{x}\) where

\[\begin{split}\mathsf{F}= \begin{pmatrix} 1 & 1 & \dots & 1 \\ 1 & w & \dots & w^{m-1} \\ 1 & w^2 & \dots & w^{2(m-1)} \\ \vdots & & & \vdots \\ 1 & w^{m-1} & \dots & w^{(m-1)^2} \end{pmatrix}\end{split}\]

is a matrix of complex roots of unity and \(w=\exp(-2\pi i/m)\). This shows there is nothing inherently mysterious about an FFT. The trick is that there is a very efficient algorithm for computing the matrix multiplication [Press et al., 1992]. Rather than taking time proportional to \(m^2\), as one would expect, it can be computed in time proportional to \(m\log(m)\). For large values of \(m\), the difference between \(m\log(m)\) and \(m^2\) time is the difference between practically possible and practically impossible.

The inverse FFT recovers \(\mathsf{x}\) from its transform \(\hat{\mathsf{x}}\). The inverse FFT is computed using the same equation as the FFT with \(\mathsf F^{-1}\) (matrix inverse) in place of \(\mathsf F\). It is easy to see that the inverse equals

\[\begin{split}\mathsf{F}^{-1}= \frac{1}{m} \begin{pmatrix} 1 & 1 & \dots & 1 \\ 1 & w^{-1} & \dots & w^{-(m-1)} \\ 1 & w^{-2} & \dots & w^{-2(m-1)} \\ \vdots & & & \vdots \\ 1 & w^{-(m-1)} & \dots & w^{-(m-1)^2} \end{pmatrix}.\end{split}\]

The \((j,j)\) element of \(m\mathsf{F}\mathsf{F}^{-1}\) is

\[\sum_g w^{jg}w^{-jg}= \sum_g 1 = m\]

and the \((j,k),\ j\not=k\) element is

\[\sum_g w^{jg} w^{-gk} = \sum_g w^{g(j-k)} = 0.\]

The inversion process can also be computed in \(m\log(m)\) time because the matrix equation is the same.
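Both facts are easy to verify numerically for a small \(m\); an illustrative check:

    import numpy as np

    m = 8
    w = np.exp(-2j * np.pi / m)
    F = w ** np.outer(np.arange(m), np.arange(m))      # F[j, k] = w^(jk)

    x = np.random.default_rng(0).random(m)
    print(np.allclose(F @ x, np.fft.fft(x)))           # matrix product equals the FFT
    print(np.allclose(np.linalg.inv(F), F.conj() / m)) # inverse is the conjugate matrix over m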

How does the FFT compute convolutions? Given two probability vectors for outcomes \(k=0,1,\dots,n-1\), say \(\mathsf p=(p_0,\dots, p_{n-1})\) and \(\mathsf q=(q_0,\dots, q_{n-1})\), the product of the \(k\) th elements of the FFTs equals

\[\begin{split}\left(\sum_g p_g w^{gk} \right) \left(\sum_h q_h w^{hk} \right) = \sum_{m=0}^{n-1} \left( \sum_{\substack{g, h\\ g+h\equiv m(n)}} p_g q_h \right) w^{km}\end{split}\]

which is the \(k\) th element of the FFT of the wrapped convolution of \(\mathsf p\) and \(\mathsf q\). For example, if \(n=4\) and \(m=0\), the inner sum on the right equals

\[p_0 q_0 + p_1 q_3 + p_2 q_2 + p_3 q_1\]

which can be interpreted as

\[p_0 q_0 + p_1 q_{-1} + p_2 q_{-2} + p_3 q_{-3}\]

in arithmetic modulo \(n\).

In the convolution algorithm, the product of functions \(\widehat{F_{X_1}} \times \widehat{F_{X_2}}\) is replaced by the component-by-component product of two vectors, which is easy to compute. Thus, to convolve two discrete distributions, represented as \(\mathsf p=(p_0,\dots,p_{m-1})\) and \(\mathsf q=(q_0,\dots,q_{m-1})\) simply

  • Take the FFT of each vector, \(\hat{\mathsf p}=\mathsf F\mathsf p\) and \(\hat{\mathsf q}=\mathsf F\mathsf q\)

  • Compute the component-by-component product \(\mathsf z = \hat{\mathsf p}\hat{\mathsf q}\)

  • Compute the inverse FFT \(\mathsf F^{-1}\mathsf z\).

The answer is the exact convolution of the two input distributions, except that sum values wrap around: the extreme right tail re-appears as probabilities around 0. This problem is called aliasing (the same as the wagon-wheel effect in old Westerns), but it can be addressed by padding the input vectors.
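Here is a small numerical illustration of wrapping and of how padding fixes it (not the library implementation):

    import numpy as np

    p = np.array([0.25, 0.25, 0.25, 0.25])     # uniform on {0, 1, 2, 3}

    # circular convolution in 4 buckets: the true support of X1 + X2 is {0, ..., 6}
    wrapped = np.real(np.fft.ifft(np.fft.fft(p) ** 2))

    # pad to 8 buckets so the whole answer fits, then compare with direct convolution
    padded = np.real(np.fft.ifft(np.fft.fft(p, 8) ** 2))

    print(wrapped)              # tail mass has wrapped around onto buckets 0, 1, 2
    print(padded)               # exact convolution on {0, ..., 6} plus a trailing zero
    print(np.convolve(p, p))    # direct convolution for comparison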

Here is a simple example of wrapping, using a compound Poisson distribution with an expected claim count of 10 and severity taking the values 0, 1, 2, 3, or 4 equally often. The aggregate has a mean of 20 and is computed using only 32 buckets. This is not enough space, and the right hand part of the distribution wraps around. The components are shown in the middle and how they combine on the right.

In [12]: ft.fft_wrapping_illustration(ez=10, en=2)
[Figure: FFT wrapping illustration (numfft04.png)]

The next figure illustrates more extreme FFT wrapping. It shows an attempt to model a compound Poisson distribution with a mean of 80 using only 32 buckets. The result is the straight line on the left. The middle plot shows the true distribution and the vertical slices of width 32 that are combined to get the total. These are shown shifted on the left. The right plot zooms into the range 0:32 and shows how the wrapped components sum to the result on the left. This is a good example of how FFT methods can fail and appear to give inexplicable results.

In [13]: ft.fft_wrapping_illustration(ez=10, en=8)
[Figure: more extreme FFT wrapping illustration (numfft05.png)]

It is not necessary to understand the details of FTs to use aggregate, although they are fascinating; see, for example, Körner [2022]. In probability, the moment generating function and the characteristic function are based on FTs. They are discussed in any serious probability text.

5.4.4.5. FFT Routines

Computer systems offer a range of FFT routines. aggregate uses two functions from scipy.fft, scipy.fft.rfft() and scipy.fft.irfft(); numpy.fft contains similar functions. They are tailored to taking FFTs of vectors of real numbers (as opposed to complex numbers). The routines handle zero padding of the input vector to the requested transform length. The inverse transform returns real numbers only, so there is no need to take the real part to remove noise-level imaginary parts. It is astonishing that the whole aggregate library pivots on a single line of code:

agg_density = irfft(pgf(rfft(p)))

where p is the discretized severity vector and pgf applies the frequency probability generating function \(\mathscr P_N\) elementwise. Obviously, a lot of work is done to marshal the input, but this line is where the magic occurs.
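For concreteness, here is a minimal sketch of that line for a Poisson frequency, where \(\mathscr P_N(z)=\exp(\lambda(z-1))\). It is not aggregate's actual code; the bucket count and discretized severity are made up for illustration:

import numpy as np
from scipy.fft import rfft, irfft

lam = 2.0                          # Poisson expected claim count
m = 64                             # number of buckets, a power of two
p = np.zeros(m)                    # discretized severity on buckets 0, 1, ..., m-1
p[[1, 2, 3]] = [0.5, 0.3, 0.2]

# the pivotal line: apply the frequency pgf to the severity transform
agg_density = irfft(np.exp(lam * (rfft(p) - 1)), m)

print(agg_density.sum())           # total probability: 1.0 up to machine noise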

The FFT routines are accurate up to machine noise, of order 1e-16. The noise can be positive or negative; the latter is highly undesirable in probabilities. It appears random and does not accumulate undesirably in practical applications. It is best to strip out the noise, setting to zero all values with absolute value less than machine epsilon (numpy.finfo(float).eps). The remove_fuzz option controls this behavior and is set True by default.
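The noise-stripping step amounts to a single masked assignment. A standalone illustration (the vector and its noise values are invented for the example):

import numpy as np

density = np.array([0.25, -3e-17, 0.5, 2e-16, 0.25])   # illustrative vector with FFT noise
eps = np.finfo(float).eps                               # about 2.2e-16
density[np.abs(density) < eps] = 0.0                    # zero out noise-level values
print(density)                                          # [0.25 0.   0.5  0.   0.25]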

5.4.5. Floating Point Arithmetic and Rounding Errors

The internal workings of computer floating point arithmetic can cause unexpected problems. You need read no further in this section if you promise to obey the following warning.

Warning

Only use a bucket size \(b\) with an exact floating point representation: \(b\) must be expressible as a fraction \(a/2^k\) with integer numerator \(a\) and a power of two denominator.

1/3, 1/5 and 1/10 are not binary floats.
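You can check whether a candidate bucket size is an exact binary float with as_integer_ratio; an exact value reports a power of two denominator:

# safe: 1/8 is an exact binary fraction
print((1/8).as_integer_ratio())    # (1, 8)

# not safe: 1/10 is stored as the nearest double, not the exact fraction
print((1/10).as_integer_ratio())   # (3602879701896397, 36028797018963968)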

For those who choose to continue, this section presents a random selection of results about floats that tripped me up as I wrote aggregate.

Floating point arithmetic is not associative!

In [1]: x = .1 + (0.6 + 0.3)

In [2]: y = (0.1 + 0.6) + 0.3

In [3]: x, x.as_integer_ratio(), y, y.as_integer_ratio()
Out[3]: (0.9999999999999999, (9007199254740991, 9007199254740992), 1.0, (1, 1))

This fact can be used to create sequences with nasty accumulating errors.
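For instance, repeatedly adding 0.1 (a simple illustration, not taken from aggregate) drifts away from the exact answer:

total = 0.0
for _ in range(10):
    total += 0.1             # each addition rounds to the nearest double

print(total, total == 1.0)   # 0.9999999999999999 False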

Exercise Redux

Recall the exercise to compute quantiles of a dice roll. aggregate produces consistent results, provided we look carefully and account for the foibles of floating point numbers. The case \(p=0.1\) is easy. But the case \(p=1/6\) appears wrong. There are two ways to model the throw of a die: with frequency 1 to 6 and fixed severity 1, or as fixed frequency 1 and severity 1 to 6. They give different answers. The lower quantile is wrong in the first case (it equals 1) and the upper quantile in the second (it equals 2).

In [4]: from aggregate import build, qd

In [5]: import pandas as pd

In [6]: d = build('agg Dice dfreq [1:6] dsev [1]')

In [7]: print(d.q(0.1, 'lower'), d.q(0.1, 'upper'))
1.0 1.0

In [8]: print(d.q(1/6, 'lower'), d.q(1/6, 'upper'))
1.0 2.0

In [9]: d2 = build('agg Dice2 dfreq [1] dsev [1:6]')

In [10]: print(d2.q(1/6, 'lower'), d2.q(1/6, 'upper'))
1.0 2.0

These differences are irritating! The short answer is to adhere to the warning above.

Here’s the long answer, if you want to know. The source code shows that the quantile function is implemented as a previous or next lookup on a dataframe of distinct values of the cumulative distribution function. The dataframes for the two dice models are:

In [11]: ff = lambda x: f'{x:.25g}'

In [12]: qd(d.density_df.query('p_total > 0')[['p', 'F']], float_format=ff)

                                p                           F
loss                                                         
1.000 0.1666666666666666574148081 0.1666666666666666574148081
2.000 0.1666666666666666296592325  0.333333333333333259318465
3.000 0.1666666666666666296592325 0.4999999999999998889776975
4.000 0.1666666666666666296592325   0.66666666666666651863693
5.000 0.1666666666666666574148081 0.8333333333333331482961626
6.000 0.1666666666666666574148081 0.9999999999999997779553951

In [13]: qd(d2.density_df.query('p_total > 0')[['p', 'F']], float_format=ff)

                                p                           F
loss                                                         
1.000  0.166666666666666740681535  0.166666666666666740681535
2.000 0.1666666666666666296592325 0.3333333333333333703407675
3.000 0.1666666666666666296592325                         0.5
4.000  0.166666666666666740681535  0.666666666666666740681535
5.000  0.166666666666666740681535   0.83333333333333348136307
6.000   0.16666666666666651863693                           1

In [14]: print(f'\n{d.cdf(1):.25f} < {1/6:.25f} < 1/6 < {d2.cdf(1):.25f}')

0.1666666666666666574148081 < 0.1666666666666666574148081 < 1/6 < 0.1666666666666667406815350

Based on these numbers, the reported quantiles are correct. \(p=1/6\) is strictly greater than d.cdf(1) and strictly less than d2.cdf(1), as shown in the last row! d and d2 are different because the former runs through the FFT routine to convolve the trivial severity, whereas the latter does not.

Exercise

\(X\) is a random variable defined on a sample space with ten equally likely events. The event outcomes are \(0, 1, 1, 1, 2, 3, 4, 8, 12, 25\). Compute \(\mathsf{VaR}_p(X)\) for all \(p\).

In [15]: a = build('agg Ex.50 dfreq [1] '
   ....:           'dsev [0 1 2 3 4 8 12 25] [.1 .3 .1 .1 .1 .1 .1 .1]')
   ....: 

In [16]: a.plot()

In [17]: print(a.q(0.05), a.q(0.1), a.q(0.2), a.q(0.4),
   ....:    a.q(0.4, 'upper'), a.q(0.41), a.q(0.5))
   ....: 
0.0 1.0 1.0 1.0 2.0 2.0 2.0

In [18]: qd(a.density_df.query('p_total > 0')[['p', 'F']],
   ....:     float_format=ff)
   ....: 

                                  p                            F
loss                                                            
0.000  0.09999999999999997779553951 0.09999999999999997779553951
1.000    0.300000000000000044408921  0.4000000000000000222044605
2.000  0.09999999999999997779553951                          0.5
3.000  0.09999999999999997779553951  0.5999999999999999777955395
4.000  0.09999999999999997779553951   0.699999999999999955591079
8.000   0.0999999999999999639177517  0.7999999999999999333866185
12.000 0.09999999999999997779553951   0.899999999999999911182158
25.000   0.100000000000000088817842                            1
[Figure: output of a.plot() for Ex.50 (quantile_a.png)]

Solution. On the graph, fill in the vertical segments of the distribution function. Draw a horizontal line at height \(p\) and find its intersection with the completed graph. There is a unique solution for all \(p\) except \(0.1, 0.4, 0.5,\dots, 0.9\). Consider \(p=0.4\). Any \(x\) satisfying \(\mathsf{Pr}(X < x) \le 0.4 \le \mathsf{Pr}(X\le x)\) is a \(0.4\)-quantile. By inspection the solutions are \(1\le x \le 2\). VaR is defined as the lower quantile, \(x=1\). The \(0.41\) quantile is \(x=2\). VaRs are not interpolated in this problem specification. The loss 25 is the \(p\)-VaR for any \(p>0.9\). The apparently errant number from aggregate (the quantile at 0.1 equals 1 rather than 0) is explained by the float representations: the computed F(0) equals 0.09999999999999997779553951, strictly less than the float representation of 0.1, so the lower-quantile lookup moves on to the next loss. The float representation of 0.4 is 3602879701896397/9007199254740992, which actually equals 0.4000000000000000222044605 and matches the computed F(1) exactly, so the 0.4 quantiles agree with the exact answers.