
Chapter 16 Estimation using covariates

Up to this point, our estimators have relied primarily on direct measurements of the variable of interest, \(y\), collected from sampled units. In many settings, however, we also have access to auxiliary information—additional variables that are already available or inexpensive to obtain. When such auxiliary data are related to \(y\), they can be used to improve estimation and increase efficiency. A covariate \(x\) is one form of auxiliary information: it’s a variable that is related to, but distinct from, \(y\). When \(x\) is related to \(y\), incorporating it into estimation can reduce standard errors and, in some settings, lower field measurement costs.

We have already seen covariates at work indirectly. For example, in stratified sampling (Chapter 15), a stratifying variable improved estimation by grouping similar units together, which reduced within-stratum variability and led to more precise estimates. Here, we treat \(x\) as an explicit part of the estimator.

In this chapter, we’ll consider several different estimators. The choice of estimator depends on the nature of the relationship between \(y\) and \(x\).

  • Regression estimation is preferred when the relationship is approximately linear with a nonzero intercept and the variability of \(y\) remains roughly constant across values of \(x\) (Figure 16.1(a)).

  • Ratio estimation is preferred when the relationship is approximately proportional—passing through (or near) the origin—and the variability of \(y\) increases with \(x\) (Figure 16.1(b,c)). We’ll detail two common ratio estimators, referred to as the ratio-of-means and mean-of-ratios. The ratio-of-means works well when variability in \(y\) increases roughly in proportion to \(x\) (Figure 16.1(b)), while the mean-of-ratios is useful when variability in \(y\) increases with \(x^2\) (Figure 16.1(c)) or when unit-level ratios are of direct interest.


FIGURE 16.1: Three straight-line relationship patterns. (a) Nonzero intercept with roughly constant variance. (b) Passes through the origin with variance increasing in proportion to \(x\). (c) Passes through the origin with variance increasing in proportion to \(x^2\).

In Sections 16.1 and 16.2, we detail regression and ratio estimators for settings where the population mean of \(x\) (i.e., \(\mu_x\)) is known and used to improve estimation of the population mean of \(y\) (i.e., \(\mu_y\)). Then, in Section 16.3, we relax the assumption that \(\mu_x\) is known and instead rely on its estimate \(\bar{x}\) to improve estimation of \(\mu_y\).

16.1 Regression estimation

A common way to describe the relationship between two variables is with a straight line. Suppose we have a variable of interest \(y\) and a covariate \(x\). The general equation for a line is \[\begin{equation} y = b_0 + b_1x, \tag{16.1} \end{equation}\] where \(b_0\) is the point where the line crosses the \(y\)-axis (the intercept) and \(b_1\) is the slope, which tells us how much \(y\) changes for a one-unit increase in \(x\).

Suppose we have a population of paired values \((x_i, y_i)\) for \(i = 1, \ldots, N\). From this population we select a sample of \(n\) pairs, which could produce a scatter plot like Figure 16.1(a). With these data we can fit a straight line using ordinary least squares regression. This method finds the values of \(b_0\) and \(b_1\) that minimize the squared vertical distances between the observed points and the line. In R, this is done with lm(), which returns estimates for \(b_0\) and \(b_1\). This is simple linear regression in a nutshell.
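For example, a minimal sketch of such a fit might look like the following, where the short x and y vectors are made up purely for illustration and coef() returns the estimated intercept and slope.

# Hypothetical paired sample values (for illustration only).
x <- c(12, 15, 18, 22, 25, 30)
y <- c(31, 36, 44, 50, 57, 66)

# Fit the least squares line y = b0 + b1 * x.
fit <- lm(y ~ x)

# Extract the intercept (b0) and slope (b1).
b_0 <- coef(fit)[1]
b_1 <- coef(fit)[2]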

Our goal here isn’t to build a predictive regression model or test hypotheses about regression coefficients. Instead, we’re after a practical estimator for \(\mu_y\) that borrows strength from a linear relationship with \(x\). We treat the regression equation as a stepping stone: by fitting a line through the sample data, we derive an estimator that incorporates known information about the population mean of \(x\). Let’s walk through how this works.

To see how regression leads to an estimator, note that in ordinary least squares the fitted regression line always passes through the point formed by the sample means, \((\bar{x}, \bar{y})\). Substituting these values into (16.1) gives \[\begin{equation} \bar{y} = b_0 + b_1 \bar{x}. \tag{16.2} \end{equation}\]

This identity can be rearranged to express the intercept in terms of sample quantities \[\begin{equation} b_0 = \bar{y} - b_1 \bar{x}. \tag{16.3} \end{equation}\]

Now suppose the population mean of the covariate, \(\mu_x\), is known. If the fitted line is a reasonable approximation to the population relationship between \(y\) and \(x\), then evaluating the line at \(x = \mu_x\) provides a natural estimator for the population mean of \(y\). We denote this regression estimator by \(\bar{y}_{\text{R}}\), to distinguish it from the usual sample mean \(\bar{y}\), and write \[\begin{equation} \bar{y}_{\text{R}} = b_0 + b_1 \mu_x. \tag{16.4} \end{equation}\]

Substituting \(b_0\) from (16.3) into (16.4) gives \[\begin{equation} \bar{y}_{\text{R}} = (\bar{y} - b_1 \bar{x}) + b_1 \mu_x, \end{equation}\] which simplifies to the regression estimator \[\begin{equation} \bar{y}_{\text{R}} = \bar{y} + b_1(\mu_x - \bar{x}). \tag{16.5} \end{equation}\]

From (16.5) we see the logic of regression estimation. The usual sample mean \(\bar{y}\) serves as the baseline. The adjustment term \(b_1(\mu_x - \bar{x})\) shifts this baseline according to how far the sample mean of \(x\) is from its population mean, scaled by the slope. If \(\bar{x}\) is less than \(\mu_x\), the estimate is adjusted upward; if greater, it’s adjusted downward. In this way, the regression estimator incorporates auxiliary information from \(x\) to improve our estimate of \(\mu_y\). The extent of improvement depends on how strong and linear the relationship is between \(x\) and \(y\).

The slope in (16.5) is computed from the sample as \[\begin{equation} b_1 = \frac{S_{xy}}{S_{xx}}, \tag{16.6} \end{equation}\] where \(S\) denotes a sum of squares (or cross-products). In particular, \[\begin{equation} S_{xy} = \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) \end{equation}\] and \[\begin{equation} S_{xx} = \sum_{i=1}^n (x_i - \bar{x})^2. \end{equation}\]

The standard error of \(\bar{y}_{\text{R}}\) can be estimated as \[\begin{equation} s_{\bar{y}_{\text{R}}} = \sqrt{s^2_{y|x} \left(\frac{1}{n} + \frac{(\mu_x - \bar{x})^2}{S_{xx}}\right)\left(1-\frac{n}{N}\right)}, \tag{16.7} \end{equation}\] where \(s^2_{y|x}\) is the residual variance of \(y\) about the fitted regression line, \[\begin{equation} s^2_{y|x} = \frac{S_{yy} - \frac{S_{xy}^2}{S_{xx}}}{n - 2} \tag{16.8} \end{equation}\] and \[\begin{equation} S_{yy} = \sum_{i=1}^n (y_i - \bar{y})^2. \end{equation}\]

The FPC term should be removed from (16.7) when appropriate; see Section 12.2.1.

To estimate the population total we scale by \(N\) \[\begin{equation} \hat{t}_{\text{R}} = N \bar{y}_{\text{R}} \tag{16.9} \end{equation}\] and its standard error is \[\begin{equation} s_{\hat{t}_{\text{R}}} = N s_{\bar{y}_{\text{R}}}. \tag{16.10} \end{equation}\]

Confidence intervals for the mean and total follow the usual form \[\begin{align*} \bar{y}_{\text{R}} \pm \tval_{\text{df}, 1 - \frac{\alpha}{2}} s_{\bar{y}_{\text{R}}},\\ \hat{t}_{\text{R}} \pm \tval_{\text{df}, 1 - \frac{\alpha}{2}} s_{\hat{t}_{\text{R}}}, \end{align*}\] with degrees of freedom \(\text{df} = n - 2\).
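To make the computations concrete, the sketch below strings together (16.5) through (16.10) and the confidence intervals above. It assumes the paired sample values are held in vectors x and y, and that N and mu_x are known; all object names are illustrative.

# Assumed inputs: paired sample vectors x and y, population size N,
# and known covariate mean mu_x (object names are illustrative).
n <- length(y)
x_bar <- mean(x)
y_bar <- mean(y)

# Sums of squares and cross-products used in (16.6) and (16.8).
S_xx <- sum((x - x_bar)^2)
S_yy <- sum((y - y_bar)^2)
S_xy <- sum((x - x_bar) * (y - y_bar))

# Slope (16.6) and regression estimator of the mean (16.5).
b_1 <- S_xy / S_xx
y_bar_R <- y_bar + b_1 * (mu_x - x_bar)

# Residual variance (16.8) and standard error of the mean (16.7).
s_sq_y_given_x <- (S_yy - S_xy^2 / S_xx) / (n - 2)
s_y_bar_R <- sqrt(s_sq_y_given_x *
                  (1 / n + (mu_x - x_bar)^2 / S_xx) *
                  (1 - n / N))

# Total (16.9), its standard error (16.10), and 95% CIs.
t_hat_R <- N * y_bar_R
s_t_hat_R <- N * s_y_bar_R

t <- qt(p = 1 - 0.05 / 2, df = n - 2)
c(y_bar_R - t * s_y_bar_R, y_bar_R + t * s_y_bar_R)
c(t_hat_R - t * s_t_hat_R, t_hat_R + t * s_t_hat_R)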

Like regression itself, regression estimation works best when two assumptions hold: (1) the mean of \(y\) varies approximately linearly with \(x\), and (2) the variability of \(y\) around the fitted line is approximately constant across the range of \(x\). If the first assumption fails, a transformation of \(x\) may help. If the second fails and variance increases with \(x\), a variance-stabilizing transformation may be considered, or ratio estimation (Section 16.2) might be more appropriate.

16.2 Ratio estimation

Ratio estimation assumes a proportional relationship between \(y\) and \(x\), so that \(y \approx R x\) for some constant multiplier \(R\) (where \(\approx\) means “approximately equal”). This implies the mean relationship between \(y\) and \(x\) passes through (or near) the origin. We estimate \(R\) from the sample as \(\hat{R}\) and then use it to estimate the population mean as \(\bar{y}_{\text{ratio}} = \hat{R}\mu_x\), and the population total as \(\hat{t}_{\text{ratio}} = \hat{R}\tau_x\).

There are two common ways to define \(\hat{R}\), leading to two distinct ratio estimators. To keep things clear, we use the subscript RM for the ratio-of-means estimator and MR for the mean-of-ratios estimator in the expressions that follow.

16.2.1 Ratio-of-means

The estimator of the population mean is \[\begin{equation} \bar{y}_{\text{RM}} = \hat{R}\,\mu_x, \tag{16.11} \end{equation}\] where \[\begin{equation} \hat{R} = \frac{\bar{y}}{\bar{x}}. \end{equation}\] Equivalently, this ratio can be written \[\begin{equation} \hat{R} = \frac{\sum_{i=1}^n y_i}{\sum_{i=1}^n x_i}, \tag{16.12} \end{equation}\] since the factors of \(n\) in the sample means cancel out.163

The standard error of \(\bar{y}_{\text{RM}}\) can be estimated as \[\begin{equation} s_{\bar{y}_{\text{RM}}} = \sqrt{\left(\frac{s^2_y + \hat{R}^2 s_x^2 - 2\hat{R}s_{xy}}{n}\right)\left(1-\frac{n}{N}\right)}, \tag{16.13} \end{equation}\] where \(s^2_y\) and \(s^2_x\) are the sample variances of \(y\) and \(x\), and \[\begin{equation} s_{xy} = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) \end{equation}\] is the sample covariance between \(x\) and \(y\).164 The FPC term should be removed from (16.13) when appropriate, see Section 12.2.1.

The population total is estimated using \[\begin{equation} \hat{t}_{\text{RM}} = N \bar{y}_{\text{RM}} \tag{16.14} \end{equation}\] and its standard error is \[\begin{equation} s_{\hat{t}_{\text{RM}}} = N s_{\bar{y}_{\text{RM}}}. \tag{16.15} \end{equation}\]

Confidence intervals for the mean and total follow the usual form \[\begin{align*} \bar{y}_{\text{RM}} \pm \tval_{\text{df}, 1 - \frac{\alpha}{2}} s_{\bar{y}_{\text{RM}}},\\ \hat{t}_{\text{RM}} \pm \tval_{\text{df}, 1 - \frac{\alpha}{2}} s_{\hat{t}_{\text{RM}}}, \end{align*}\] with degrees of freedom \(\text{df} = n - 1\).

16.2.2 Mean-of-ratios

The estimator of the population mean is \[\begin{equation} \bar{y}_{\text{MR}} = \hat{R}\mu_x, \tag{16.16} \end{equation}\] where \[\begin{equation} \hat{R} = \frac{1}{n}\sum_{i=1}^n r_i, \tag{16.17} \end{equation}\] and \(r_i = y_i/x_i\) is the unit ratio.

The standard error of \(\bar{y}_{\text{MR}}\) can be estimated as \[\begin{equation} s_{\bar{y}_{\text{MR}}} = \mu_x \sqrt{\frac{s_r^2}{n}\left(1 - \frac{n}{N}\right)}, \tag{16.18} \end{equation}\] where \[\begin{equation} s_r^2 = \frac{1}{n-1}\sum_{i=1}^n \left(r_i - \hat{R}\right)^2. \tag{16.19} \end{equation}\] The FPC term should be removed from (16.18) when appropriate, see Section 12.2.1.

The population total is estimated using \[\begin{equation} \hat{t}_{\text{MR}} = N \bar{y}_{\text{MR}} \tag{16.20} \end{equation}\] and its standard error is \[\begin{equation} s_{\hat{t}_{\text{MR}}} = N s_{\bar{y}_{\text{MR}}}. \tag{16.21} \end{equation}\]

Confidence intervals for the mean and total follow the usual form \[\begin{align*} \bar{y}_{\text{MR}} \pm \tval_{\text{df}, 1 - \frac{\alpha}{2}} s_{\bar{y}_{\text{MR}}},\\ \hat{t}_{\text{MR}} \pm \tval_{\text{df}, 1 - \frac{\alpha}{2}} s_{\hat{t}_{\text{MR}}}, \end{align*}\] with degrees of freedom \(\text{df} = n - 1\).
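As a quick illustration, the following sketch implements (16.16) through (16.21). It assumes paired sample vectors x and y, a known population size N, and a known covariate mean mu_x; object names are illustrative.

# Assumed inputs: paired sample vectors x and y, population size N,
# and known covariate mean mu_x (object names are illustrative).
n <- length(y)

# Unit ratios and their mean (16.17).
r <- y / x
R_hat <- mean(r)

# Mean estimate (16.16) and its standard error (16.18)-(16.19).
y_bar_MR <- R_hat * mu_x
s_y_bar_MR <- mu_x * sqrt(var(r) / n * (1 - n / N))

# Total (16.20), its standard error (16.21), and a 95% CI for the total.
t_hat_MR <- N * y_bar_MR
s_t_hat_MR <- N * s_y_bar_MR

t <- qt(p = 1 - 0.05 / 2, df = n - 1)
c(t_hat_MR - t * s_t_hat_MR, t_hat_MR + t * s_t_hat_MR)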

16.3 Double sampling (Two-phase)

Double sampling is useful when the covariate \(x\) is inexpensive to measure but its population mean \(\mu_x\) is unknown. The design proceeds in two phases. In phase 1, we draw a large sample of size \(n_1\) and measure only \(x\). In phase 2, we draw a nested subsample of size \(n_2\) from the phase-1 units and measure \(y\). As a result, we have a large sample of \(x\) values and a smaller set of paired observations \((x, y)\).

The two phases play complementary roles. Phase 1 provides a precise estimate of the population mean of \(x\), which anchors the estimation. Phase 2 concentrates effort on measuring \(y\), allowing us to estimate how \(y\) varies with \(x\)—for example, through a fitted slope \(b_1\) or a ratio \(\hat{R}\).

As in the regression and ratio estimators introduced in the preceding sections, the idea is to start with the mean of \(y\) from the smaller sample and then adjust it based on the relationship between \(x\) and \(y\) and how far the phase-2 sample mean \(\bar{x}_2\) is from the phase-1 mean \(\bar{x}_1\). In essence, phase 1 tells us where \(x\) is centered, and phase 2 tells us how changes in \(x\) translate into changes in \(y\). When \(n_1\) is large, most of the uncertainty in the estimator arises from residual variation around the fitted relationship, rather than from estimating \(\bar{x}_1\).

16.3.1 Regression estimation

The regression estimator of the population mean under double sampling is \[\begin{equation} \bar{y}_{\text{Rd}} = \bar{y}_2 + b_1\left(\bar{x}_1 - \bar{x}_2\right), \tag{16.22} \end{equation}\] where the slope \(b_1\) is defined in (16.6) and is computed here using only the \(n_2\) phase-2 observations.

The standard error of \(\bar{y}_{\text{Rd}}\) can be estimated as \[\begin{equation} s_{\bar{y}_{\text{Rd}}} = \sqrt{s^2_{y|x} \left(\frac{1}{n_2} + \frac{(\bar{x}_1 - \bar{x}_2)^2}{S_{xx}}\right)\left(1 - \frac{n_2}{n_1} \right) + \frac{s^2_y}{n_1}\left(1 - \frac{n_1}{N}\right)}, \tag{16.23} \end{equation}\] where \(s^2_{y|x}\) is the residual variance defined in (16.8), again computed using the phase-2 data.

Note that (16.23) includes two FPC terms: one for the phase-2 sample nested within phase 1, and one for the phase-1 sample drawn from the population. As before, the second FPC is removed under an areal frame (see Section 12.2.1), but we always retain the first FPC since the phase-1 sample is finite.

The population total is estimated as \[\begin{equation} \hat{t}_{\text{Rd}} = N \bar{y}_{\text{Rd}}, \tag{16.24} \end{equation}\] with standard error \[\begin{equation} s_{\hat{t}_{\text{Rd}}} = N s_{\bar{y}_{\text{Rd}}}. \tag{16.25} \end{equation}\]

Confidence intervals follow the usual form \[\begin{align*} \bar{y}_{\text{Rd}} \pm \tval_{\text{df}, 1 - \frac{\alpha}{2}} s_{\bar{y}_{\text{Rd}}},\\ \hat{t}_{\text{Rd}} \pm \tval_{\text{df}, 1 - \frac{\alpha}{2}} s_{\hat{t}_{\text{Rd}}}, \end{align*}\] with degrees of freedom \(\text{df} = n_2 - 2\).

16.3.2 Ratio-of-means

The estimator of the population mean is \[\begin{equation} \bar{y}_{\text{RMd}} = \hat{R}\,\bar{x}_1, \tag{16.26} \end{equation}\] where \[\begin{equation} \hat{R} = \frac{\bar{y}_2}{\bar{x}_2}. \tag{16.27} \end{equation}\]

The standard error of \(\bar{y}_{\text{RMd}}\) can be estimated as \[\begin{equation} s_{\bar{y}_{\text{RMd}}} = \sqrt{\left(1 - \frac{n_2}{n_1}\right)\left(\frac{\bar{x}_1}{\bar{x}_2}\right)^2\left(\frac{s_y^2 + \hat{R}^2 s_x^2 - 2\hat{R}s_{xy}}{n_2}\right) + \frac{s^2_y}{n_1}\left(1 - \frac{n_1}{N}\right)}. \tag{16.28} \end{equation}\] If you’re using an areal sampling frame, it’s appropriate to drop the second FPC term (see Section 12.2.1 for discussion).

The population total is estimated using \[\begin{equation} \hat{t}_{\text{RMd}} = N \bar{y}_{\text{RMd}} \tag{16.29} \end{equation}\] and its standard error is \[\begin{equation} s_{\hat{t}_{\text{RMd}}} = N s_{\bar{y}_{\text{RMd}}}. \tag{16.30} \end{equation}\]

Confidence intervals for the mean and total follow the usual form \[\begin{align*} \bar{y}_{\text{RMd}} \pm \tval_{\text{df}, 1 - \frac{\alpha}{2}} s_{\bar{y}_{\text{RMd}}},\\ \hat{t}_{\text{RMd}} \pm \tval_{\text{df}, 1 - \frac{\alpha}{2}} s_{\hat{t}_{\text{RMd}}}, \end{align*}\] with degrees of freedom \(\text{df} = n_2 - 1\).

16.3.3 Mean-of-ratios

The estimator of the population mean is \[\begin{equation} \bar{y}_{\text{MRd}} = \hat{R}\,\bar{x}_1, \tag{16.31} \end{equation}\] where \[\begin{equation} \hat{R} = \frac{1}{n_2}\sum_{i=1}^{n_2} r_i, \end{equation}\] and \(r_i = y_i/x_i\) is the unit ratio.

The standard error of \(\bar{y}_{\text{MRd}}\) can be estimated as \[\begin{equation} s_{\bar{y}_{\text{MRd}}} = \sqrt{\bar{x}_1^2\left(\frac{s_r^2}{n_2}\right)\left(1 - \frac{n_2}{n_1}\right) + \frac{s_y^2}{n_1}\left(1 - \frac{n_1}{N}\right)}, \tag{16.32} \end{equation}\] where \(s_r^2\) was defined in (16.19) and computed using the phase-2 data. If you’re using an areal sampling frame, it’s appropriate to drop the second FPC term (see Section 12.2.1 for discussion).

The population total is estimated using \[\begin{equation} \hat{t}_{\text{MRd}} = N \bar{y}_{\text{MRd}} \tag{16.33} \end{equation}\] and its standard error is \[\begin{equation} s_{\hat{t}_{\text{MRd}}} = N s_{\bar{y}_{\text{MRd}}}. \tag{16.34} \end{equation}\]

Confidence intervals for the mean and total follow the usual form \[\begin{align*} \bar{y}_{\text{MRd}} \pm \tval_{\text{df}, 1 - \frac{\alpha}{2}}\, s_{\bar{y}_{\text{MRd}}},\\ \hat{t}_{\text{MRd}} \pm \tval_{\text{df}, 1 - \frac{\alpha}{2}}\, s_{\hat{t}_{\text{MRd}}}, \end{align*}\] with degrees of freedom \(\text{df} = n_2 - 1\).
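A minimal sketch of the mean-of-ratios estimator under double sampling, (16.31) through (16.34), is given below. It assumes x_1 holds the phase-1 covariate values, x_2 and y_2 hold the paired phase-2 observations, and N is known; object names are illustrative.

# Assumed inputs: phase-1 covariate values x_1, paired phase-2 values
# x_2 and y_2, and population size N (object names are illustrative).
n_1 <- length(x_1)
n_2 <- length(y_2)
x_bar_1 <- mean(x_1)

# Unit ratios, their mean, and the mean estimate (16.31).
r <- y_2 / x_2
R_hat_MRd <- mean(r)
y_bar_MRd <- R_hat_MRd * x_bar_1

# Standard error (16.32); drop the second FPC under an areal frame.
s_y_bar_MRd <- sqrt(x_bar_1^2 * (var(r) / n_2) * (1 - n_2 / n_1) +
                    (var(y_2) / n_1) * (1 - n_1 / N))

# Total (16.33) and its standard error (16.34).
t_hat_MRd <- N * y_bar_MRd
s_t_hat_MRd <- N * s_y_bar_MRd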

16.4 Illustrative analysis and extensions

The next few sections illustrate how the estimators we’ve introduced are applied in forest inventory settings. Each example highlights how auxiliary information in the form of a covariate can improve efficiency, and how the choice of estimator depends on available data, the relationship between variables, sampling designs, and field constraints.

We begin in Section 16.4.1 with a timber sale cruise, where ratio estimation offers a simple and efficient approach to estimating total volume. Section 16.4.2 turns to the challenge of quantifying insect damage using remotely sensed data, where double sampling proves useful. Finally, Sections 16.5 and 16.6 explore more advanced inventory designs built around ratio and regression estimators.

16.4.1 Using ratio estimators to estimate timber volume

This illustration is adapted from a short essay by Ken Desmarais, who at the time was the Assistant District Ranger in the White Mountain National Forest, New Hampshire. The piece, entitled Marcy and Ted Take a Crack at Ratio Estimators, is both fun and informative, and we encourage you to read the original (included in the datasets/WMNF directory). Here we present the key elements, focusing on the methods and analysis.

16.4.1.1 Setting

In this example, the crew needed to mark every merchantable tree and estimate its volume for a spruce-fir timber sale in the White Mountain National Forest. This is a demanding task if DBH and height must be measured for every stem and then entered into a volume equation to calculate volume.

Rather than measure every tree, Marcy, the heroine of the story, suggests using ratio estimation. Her idea is to guess DBH and height for each tree, yielding quick but coarse volume estimates, and then measure DBH and height for a random subset of trees to get accurate volumes. The guesses were made by ocular assessment only—no instruments—just a quick guess at the tree’s DBH class and number of 16-foot logs (height), which were then run through a volume table to produce coarse volume estimates. The ratio estimator is used to correct these guesses, producing an accurate and precise estimate of the total sale volume.

Mapping this back to our regression sampling methods, the coarse (guessed) tree volumes play the role of \(x\). Since every tree is guessed, the population mean \(\mu_x\) is known. The \(y\) values come from trees whose DBH and height were measured and then run through a volume equation. The parameter of interest is \(\mu_y\), the true mean tree volume, which can then be scaled by \(N\) to estimate the total sale volume \(\tau_y = N \mu_y\).

16.4.1.2 Sampling design and implementation

  1. Each tree in the sale was marked and assigned an ocular estimate of 2-inch DBH class and number of 16-foot logs. These coarse values were run through a balsam-fir volume table to yield a guessed volume (in board feet).

  2. A subsample of trees was selected for measurement. This subsample had to be chosen using a random mechanism (e.g., SRS or SYS). To simplify implementation in the field, a SYS design was used. After a random start, every 200-th tree received a DBH and height measurement. The crew figured they needed about 20 observations to estimate the relationship between guessed and estimated volume, i.e., \(n = 20\) \((x, y)\) pairs should give a good estimate of \(R\). They estimated about 4,000 trees in the sale, so a SYS sampling interval of 200 should yield the desired sample size \(n\).

  3. To start the systematic sample, a random start between 1 and 200 was chosen; the draw produced 97. Starting from the first tree, each was assigned a guessed volume. When the crew reached the 97-th tree, they guessed as usual but also took a measurement. The same was done for the 297-th tree, the 497-th, and so on, until all trees were marked.

16.4.1.3 Data and analysis

In the end, \(N\) = 4,181 trees were marked and included in the sale. Because the total population size is known and finite (4,181 marked trees) we will apply estimators for a finite population. Table 16.1 shows the \(n = 20\) trees that received both a guess and a measurement. Guessed volumes are coarse values from a volume table; estimated volumes are based on DBH and height measurements entered into a volume equation, providing more accurate estimates.165 The mean guessed volume across all \(N\) trees was \(\mu_x = 62.1\) board feet per tree.

TABLE 16.1: Paired guessed and measured tree volumes for the subsample of 20 spruce-fir trees in the White Mountain National Forest.
| Tree | Species | Guessed DBH (in) | Guessed Logs (16-ft) | Guessed Vol. (bd ft) | Measured DBH (in) | Measured Height (ft) | Measured Vol. (bd ft) |
|------|---------|------------------|----------------------|----------------------|-------------------|----------------------|-----------------------|
| 1 | rs | 10 | 1.5 | 48 | 10.4 | 27 | 57.9 |
| 2 | bf | 8 | 1.0 | 24 | 8.6 | 18 | 27.4 |
| 3 | rs | 12 | 1.5 | 74 | 11.8 | 28 | 79.9 |
| 4 | rs | 10 | 1.5 | 48 | 10.2 | 25 | 52.4 |
| 5 | rs | 12 | 2.0 | 92 | 12.4 | 36 | 107.0 |
| 6 | rs | 12 | 2.0 | 92 | 12.1 | 38 | 104.9 |
| 7 | bf | 10 | 1.5 | 48 | 10.7 | 24 | 57.0 |
| 8 | rs | 14 | 2.5 | 153 | 14.8 | 43 | 182.7 |
| 9 | rs | 12 | 2.0 | 92 | 11.5 | 31 | 80.8 |
| 10 | rs | 10 | 1.5 | 48 | 10.7 | 28 | 63.6 |
| 11 | rs | 8 | 1.0 | 24 | 7.7 | 18 | 20.6 |
| 12 | bf | 8 | 1.0 | 24 | 8.6 | 19 | 28.5 |
| 13 | rs | 8 | 1.0 | 24 | 8.2 | 16 | 22.2 |
| 14 | rs | 10 | 1.0 | 36 | 10.8 | 18 | 47.6 |
| 15 | rs | 8 | 1.0 | 24 | 8.1 | 14 | 19.5 |
| 16 | bf | 12 | 1.5 | 74 | 12.2 | 27 | 84.2 |
| 17 | rs | 14 | 1.5 | 105 | 14.8 | 28 | 134.6 |
| 18 | bf | 12 | 2.0 | 92 | 11.7 | 30 | 82.3 |
| 19 | rs | 10 | 1.0 | 36 | 10.3 | 18 | 42.5 |
| 20 | rs | 10 | 1.5 | 48 | 9.7 | 25 | 46.5 |

The paired data in Table 16.1 are available in datasets/WMNF/sf_xy.csv. We read them into sf_xy and use a scatter plot (Figure 16.2) to characterize the relationship between guessed and measured volumes, helping inform our choice of estimator.

sf_xy <- read_csv("datasets/WMNF/sf_xy.csv")
sf_xy %>% glimpse()
#> Rows: 20
#> Columns: 8
#> $ tree_id      <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1…
#> $ species      <chr> "rs", "bf", "rs", "rs", "rs", "r…
#> $ dbh_guess    <dbl> 10, 8, 12, 10, 12, 12, 10, 14, 1…
#> $ logs_guess   <dbl> 1.5, 1.0, 1.5, 1.5, 2.0, 2.0, 1.…
#> $ vol_guess    <dbl> 48, 24, 74, 48, 92, 92, 48, 153,…
#> $ dbh_measure  <dbl> 10.4, 8.6, 11.8, 10.2, 12.4, 12.…
#> $ ht_measure   <dbl> 27, 18, 28, 25, 36, 38, 24, 43, …
#> $ vol_measured <dbl> 57.9, 27.4, 79.9, 52.4, 107.0, 1…

FIGURE 16.2: Guessed (\(x\)) versus measured (\(y\)) tree volume for the paired trees listed in Table 16.1. A simple linear regression line is drawn for exploratory analysis.

The scatter in Figure 16.2 shows a strong positive relationship between guessed (\(x\)) and measured (\(y\)) volumes, and the fitted line crosses the \(y\)-axis near the origin. With only 20 pairs it’s hard to assess variance patterns, but the spread looks roughly constant. This supports using the ratio-of-means estimator defined in Section 16.2.1.

The code below computes the ratio-of-means estimates for mean volume (16.11) and total volume (16.14), along with their respective standard errors (16.13) and (16.15), and a 95% confidence interval for the total.

# Some known values.
N <- 4181 # Trees.
mu_x <- 62.1 # (bd ft/tree).
n <- nrow(sf_xy)

# Pull out variables.
x <- sf_xy %>% pull(vol_guess)
y <- sf_xy %>% pull(vol_measured)

# Ratio estimate.
R_hat <- sum(y) / sum(x)

# Mean estimate (bd ft/tree).
y_bar <- R_hat * mu_x

# Total estimate (bd ft).
t_hat <- N * y_bar

# Standard error of the mean.
s_y_bar <- sqrt((var(y) +
                 R_hat^2 * var(x) -
                 2 * R_hat * cov(x, y)) / n *
                (1 - n / N))

# Standard error of the total.
s_t_hat <- N * s_y_bar

# 95% CI
t <- qt(df = n - 1, p = 1 - 0.05 / 2)
ci_lower <- t_hat - t * s_t_hat
ci_upper <- t_hat + t * s_t_hat

From the code above, after rounding, the computed \(\hat{R}\) is 1.11. This multiplier converts the population mean of \(x\) into an estimate of the population mean of \(y\). The estimated total sale volume is 288,941 board feet, with a 95% confidence interval of (270,274, 307,608).

These results come from just 20 measured trees. Guessing takes time, but it’s a fraction of what it would take to measure DBH and height on every stem. That’s the payoff of a good design: many quick, inexpensive values adjusted by a small set of careful measurements. We’ll return to variations on this theme in the sections that follow and in the chapter exercises.

16.4.2 Double sampling for dead-tree totals

In the previous example, the auxiliary variable \(x\) was observed for every tree in the sale. In practice, we often can’t afford to collect \(x\) for the entire population. As described in Section 16.3, a common workaround is double sampling, where a relatively inexpensive measurement is taken on a large phase-1 sample, and a more accurate but costly measurement is taken on a smaller, nested phase-2 sample. Remote sensing is often used in phase 1 to provide broad, inexpensive coverage, while phase 2 relies on targeted field data collection for more detailed and accurate measurements.

This example involves a forest manager who wants to estimate the total number of dead trees in a 400-acre area under heavy insect infestation. The area is divided into \(N = 200\) equal-area tiles. Aerial imagery-based mortality counts are collected for a phase-1 sample of \(n_1 = 100\) tiles. From those, a nested phase-2 subsample of \(n_2 = 10\) tiles is visited for a field-based mortality survey. The aerial counts (\(x\)) are fairly accurate, but the manager is concerned about missing subcanopy mortality and other potential biases due to image interpretation. The field-based counts (\(y\)) are more accurate.

Our goal is to estimate total mortality across the full area using a double sampling estimator. For comparison, we’ll also estimate total mortality using only the phase-2 field measurements, ignoring the phase-1 data.

16.4.2.1 Data and analysis

Here, the phase-1 and phase-2 data are stored in a single file, with one row per sampled tile.

Column names are:

  • tile_id unique identifier for each sampled tile (\(n_1\) rows);
  • x aerial image count (available for all \(n_1\) tiles);
  • y field-based mortality count (available for \(n_2\) tiles; NA otherwise).

We begin by reading the data into mort. From this, we extract three vectors: \(x_1\) contains all phase-1 counts; \(x_2\) is the subset of aerial counts paired with field measurements; and \(y_2\) holds the phase-2 field measurements. These align with the notation used in our estimators.

mort <- read_csv("datasets/insects/mortality_two_phase.csv")

# Extract phase-1 sample.
x_1 <- mort %>% pull(x)

# Extract phase-2 sample.
x_2 <- mort %>% filter(!is.na(y)) %>% pull(x)
y_2 <- mort %>% filter(!is.na(y)) %>% pull(y)

FIGURE 16.3: Paired aerial (\(x_2\)) and field (\(y_2\)) mortality counts for ten tiles. The fitted regression line is used only to characterize the relationship and help choose an appropriate estimator. The gray dashed line marks the 1:1 reference.

Figure 16.3 shows the paired \((x_2, y_2)\) values for the ten tiles with both aerial and field-based counts. The plot affirms the manager’s concerns: assuming the field survey is more accurate, the image-based survey tends to undercount smaller values and overcount larger ones. But that’s okay—the important thing is the strong linear relationship between \(y\) and \(x\). The fitted line has a \(y\)-intercept well above the origin, the relationship appears approximately linear, and the scatter of points about that line has fairly constant variance across values of \(x\). Given these observations, the regression estimator is more appropriate than the ratio estimator. However, for comparison, we’ll apply both the regression and ratio-of-means estimators, and compare their results to a baseline SRS estimate based on the \(n_2\) field measurements alone.

As noted in Section 11.3.7, the relative standard error (RSE) is a useful way to express estimation uncertainty as a percentage of the estimate itself. It shares the same mathematical form as the general relative variation measure in (11.38), with the standard error of the estimator in the numerator and the estimate itself in the denominator.

In this case, since we’re estimating the total number of dead trees, the numerator is the standard error of the total, and the denominator is the estimated total. All else equal, estimators with lower RSE are preferred because they reflect greater precision relative to the size of the estimate.

The code below applies the three estimators and collects their results into a summary table.

# Known values.
N <- 200
n_1 <- length(x_1)
n_2 <- length(x_2)

# Means, variances, and covariance.
x_bar_1 <- mean(x_1)
x_bar_2 <- mean(x_2)
y_bar_2 <- mean(y_2)
s_sq_x_2 <- var(x_2)
s_sq_y_2 <- var(y_2)
s_xy <- cov(x_2, y_2)

# Ratio-of-means (double sampling).
R_hat_RMd <- y_bar_2 / x_bar_2
y_bar_RMd <- R_hat_RMd * x_bar_1
s_y_bar_RMd <- sqrt((1 - n_2 / n_1) *
  ((x_bar_1 / x_bar_2)^2 *
    ((s_sq_y_2 +
      R_hat_RMd^2 * s_sq_x_2 -
      2 * R_hat_RMd * s_xy) / n_2)) +
  (s_sq_y_2 / n_1) * (1 - n_1 / N))

t_hat_RMd <- N * y_bar_RMd
s_t_hat_RMd <- N * s_y_bar_RMd

RSE_t_hat_RMd <- 100 * s_t_hat_RMd / t_hat_RMd

# Regression (double sampling).
S_xx <- sum((x_2 - x_bar_2)^2)
S_yy <- sum((y_2 - y_bar_2)^2)
S_xy <- sum((x_2 - x_bar_2) * (y_2 - y_bar_2))
b1 <- S_xy / S_xx
s_sq_y_given_x <- (S_yy - S_xy^2 / S_xx) / (n_2 - 2)

y_bar_Rd <- y_bar_2 + b1 * (x_bar_1 - x_bar_2)

s_y_bar_Rd <- sqrt(s_sq_y_given_x *
  (1 / n_2 + (x_bar_1 - x_bar_2)^2 / S_xx) *
  (1 - n_2 / n_1) +
  (s_sq_y_2 / n_1) *
    (1 - n_1 / N))

t_hat_Rd <- N * y_bar_Rd
s_t_hat_Rd <- N * s_y_bar_Rd

RSE_t_hat_Rd <- 100 * s_t_hat_Rd / t_hat_Rd

# Phase-2 only, SRS.
y_bar_SRS <- y_bar_2
s_y_bar_SRS <- sqrt((1 - n_2 / N) * (s_sq_y_2 / n_2))

t_hat_SRS <- N * y_bar_SRS
s_t_hat_SRS <- N * s_y_bar_SRS

RSE_t_hat_SRS <- 100 * s_t_hat_SRS / t_hat_SRS

# Output summary.
tibble(
  method = c("SRS (Phase-2)", "Ratio (RMd)", "Regression (Rd)"),
  y_bar = c(y_bar_SRS, y_bar_RMd, y_bar_Rd),
  s_y_bar = c(s_y_bar_SRS, s_y_bar_RMd, s_y_bar_Rd),
  t_hat = c(t_hat_SRS, t_hat_RMd, t_hat_Rd),
  s_t_hat = c(s_t_hat_SRS, s_t_hat_RMd, s_t_hat_Rd),
  RSE_t_hat = c(RSE_t_hat_SRS, RSE_t_hat_RMd, RSE_t_hat_Rd)
)
#> # A tibble: 3 Ă— 6
#>   method          y_bar s_y_bar t_hat s_t_hat RSE_t_hat
#>   <chr>           <dbl>   <dbl> <dbl>   <dbl>     <dbl>
#> 1 SRS (Phase-2)    24.5    2.99 4904.    598.     12.2 
#> 2 Ratio (RMd)      23.9    2.21 4788.    442.      9.24
#> 3 Regression (Rd)  24.2    1.06 4836.    212.      4.39

The results in the table above highlight the benefit of incorporating auxiliary information from the phase-1 aerial counts. Both the regression and ratio estimators yield substantially smaller RSEs compared with the phase-2-only SRS estimator.

Because these are simulated data, we know the true population values: \(\mu_y = 24.75\) dead trees per tile and \(\tau_y = 4,951\) dead trees across the full 400 acres.

As suggested by the exploratory plot in Figure 16.3, the regression estimator is more appropriate than the ratio estimator due to the nonzero intercept. Ratio estimation assumes a proportional relationship between \(y\) and \(x\) that passes through the origin. When that assumption is violated—in particular, when the relationship has a nonzero intercept—the ratio estimator can be biased. The relatively low ratio estimate hints at this bias, but a full repeated sampling analysis—like the one in Section 15.7.1—would be needed to quantify it. Still, based on the evidence from Figure 16.3 and the markedly lower RSE, it’s clear that the regression estimator is both more appropriate and more efficient for these data.

This example underscores the importance of EDA when working with auxiliary variables. A simple scatterplot with a fitted line can reveal whether the relationship between \(x\) and \(y\) departs from the zero-intercept assumption underlying ratio estimation. In this case, visual inspection of the \((x, y)\) relationship provided an early and reliable indication that regression was the better choice.

16.5 Big BAF sampling

This section continues our focus on combining inexpensive measurements with a targeted subset of more costly measurements to improve parameter estimates in terms of both statistical and operational efficiency. Big BAF sampling fits squarely within this theme. It builds on the principles of horizontal point sampling (Section 12.4) and introduces an elegant way to obtain per-unit-area and total estimates by splitting the estimation process into two parts: a count-based estimate of basal area (BA) per unit area, and a tree-based estimate of volume-to-basal-area ratio (VBAR). While the method was originally devised to estimate volume, it can be applied more broadly to other parameters of interest. As initially introduced in Section 12.4.6, think of the “V” in VBAR more generically as the variable of interest.

The beauty of the Big BAF method lies in its simplicity. The basic idea is straightforward: at each sampling location, conduct a tree count using a standard BAF (e.g., one that yields about 4 to 10 “in” trees). These tree counts provide an estimate of BA per unit area using point-level summaries, as described in Section 12.4. Then, using a second sweep at the same sampling location, apply a larger BAF to identify a subset of trees to measure. These measurement trees are used to calculate tree-level VBAR values by measuring DBH (used to compute BA) and height, which together are used to estimate volume (or another variable of interest). The average of these tree-level VBAR values is computed across all measurement trees—that is, pooled across sampling locations. This average VBAR is then multiplied by the BA estimate from the tree counts to yield an estimate of volume per unit area.

Counting trees with a standard-size BAF is fast and inexpensive, which makes it feasible to visit many sample locations. Measuring DBH and other tree variables to calculate VBAR, by contrast, is more time-consuming and costly. The Big BAF method balances this tradeoff by using a big BAF to identify a smaller subset of trees to measure at each location. For example, if a standard-size BAF yields an average of 8 count trees per point, a big BAF that is four times larger would result in about 2 measurement trees on average—just one-fourth as many.

Studies and personal observations have shown that BA per-unit-area estimates are virtually always more variable than VBAR estimates, so it’s generally more efficient to invest in a larger number of sampling locations (for BA counts) and fewer tree measurements (for VBAR). The Big BAF design supports this strategy naturally and allows us to concentrate effort where it’s most needed.

Kim Iles conceived the Big BAF method, building on earlier related ideas and methods developed by John F. Bell and others, as well as foundational work by Lewis R. Grosenbaugh in the 1940s and 1950s. A thorough treatment of the topic—including historical background, estimator behavior, variations, and sample size versus cost considerations—can be found in Iles (2003), Marshall, Iles, and Bell (2004), Gove et al. (2020), and references therein. These sources provide practical guidance on selecting both the number of sampling locations and the number of measurement trees, along with deeper insights into estimator performance and design tradeoffs.

Big BAF is closely related to double sampling and ratio estimation, although it’s not like the double sampling introduced in Section 16.3, where phase-1 and phase-2 are a sample and subsample of sampling units (e.g., plots or points). Under Big BAF, the phases occur at the tree level: phase-1 can be seen as all count trees identified using the standard-size BAF and used to estimate BA per unit area through point-level summaries; phase-2 consists of the subset of trees measured at each sampling location using the big BAF and used to estimate VBAR.166 As we’ll see in the next section, the estimator follows the same spirit as other ratio estimators—correcting a coarse, inexpensive estimate using a more accurate, but costly, measurement.

16.5.1 Estimation using Big BAF

As described above, the Big BAF estimator comprises two components: (1) the mean basal area per unit area (\(\overline{\text{BA}}\)), based on point-level estimates using a standard-size BAF, and (2) the mean volume-to-basal-area ratio (\(\overline{\text{VBAR}}\)), based on tree-level measurements using a big BAF. Throughout, we’ll use superscripts \(s\) and \(b\) to distinguish quantities associated with the small (standard-size) and big BAFs, respectively.

Following the methods in Section 12.4, the mean basal area per unit area is estimated using \[\begin{equation} \overline{\text{BA}} = \frac{1}{n} \sum_{i=1}^{n} \text{BA}_i, \tag{16.35} \end{equation}\] where \(n\) is the number of sampling locations and \(\text{BA}_i = m^s_i \cdot \text{BAF}^s\) is the observed basal area per unit area at the \(i\)-th location, with \(m^s_i\) being the number of count trees selected using the small BAF (BAF\(^s\)).

We define VBAR for the \(j\)-th measurement tree at the \(i\)-th sampling location as
\[\begin{equation} \text{VBAR}_{ij} = \frac{v_{ij}}{\text{BA}_{ij}} \end{equation}\]
and compute the mean across all measurement trees as
\[\begin{equation} \overline{\text{VBAR}} = \frac{1}{m^b} \sum_{i=1}^n \sum_{j=1}^{m^b_i} \text{VBAR}_{ij}, \tag{16.36} \end{equation}\]
where \(m^b_i\) is the number of measurement trees selected using the big BAF (BAF\(^b\)) at the \(i\)-th sampling location, \(n\) is the number of sampling locations, and \(m^b = \sum_{i=1}^n m^b_i\) is the total number of measurement trees across all locations. Note that, unlike the point-level summary of \(\text{BA}\), which includes locations with zero count trees, the mean \(\overline{\text{VBAR}}\) is defined only over measurement trees, so only locations with at least one measurement tree contribute to the sum (i.e., the \(i\)-th sampling location could have \(m^b_i = 0\) and hence doesn’t contribute to the sum).

The Big BAF estimator for volume per unit area is then given by \[\begin{equation} \hat{v} = \overline{\text{BA}} \cdot \overline{\text{VBAR}}. \tag{16.37} \end{equation}\]

It’s useful to verify how the units work out when applying (16.37). For example, suppose volume is measured in cubic meters and basal area is reported in square meters per hectare. Then (16.37) with units becomes \[\begin{align} \hat{v}\left(\frac{\text{m}^3}{\text{ha}}\right) &= \overline{\text{BA}}\left(\frac{\text{m}^2}{\text{ha}}\right) \cdot \overline{\text{VBAR}} \left(\frac{\text{m}^3}{\text{m}^2}\right) \nonumber\\ &= \overline{\text{BA}}\left(\frac{\cancel{\text{m}^2}}{\text{ha}}\right) \cdot \overline{\text{VBAR}} \left(\frac{\text{m}^3}{\cancel{\text{m}^2}}\right). \end{align}\] The \(\text{m}^2\) units cancel, yielding the desired volume per unit area.

Because \(\hat{v}\) is the product of two sample means—each estimated from a different sample—its standard error must account for the variability in both components. A common and practical approach is to use Bruce’s formula, which assumes that \(\overline{\text{BA}}\) and \(\overline{\text{VBAR}}\) are uncorrelated and combines their uncertainty in quadrature \[\begin{equation} s_{\hat{v}} \approx \sqrt{ s_{\overline{\text{BA}}}^2 + s_{\overline{\text{VBAR}}}^2 }. \tag{16.38} \end{equation}\] This approach is well established in forestry sampling (Bell and Alexander 1957; Donald Bruce 1961) and is explained in the context of Big BAF estimation by Gove et al. (2020).

In practice, Bruce’s formula is implemented by combining uncertainty on a common relative scale, rather than by directly summing standard errors expressed in different units.

Each standard error is estimated from its respective sample data using (11.21).

The standard error for \(\overline{\text{BA}}\) is
\[\begin{equation} s_{\overline{\text{BA}}} = \frac{s_{\text{BA}}}{\sqrt{n}}, \tag{16.39} \end{equation}\] where \[\begin{equation} s_{\text{BA}} = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} \left( \text{BA}_i - \overline{\text{BA}} \right)^2}. \end{equation}\]

The standard error for \(\overline{\text{VBAR}}\) is
\[\begin{equation} s_{\overline{\text{VBAR}}} = \frac{s_{\text{VBAR}}}{\sqrt{m^b}}, \tag{16.40} \end{equation}\] where \[\begin{equation} s_{\text{VBAR}} = \sqrt{ \frac{1}{m^b - 1} \sum_{i=1}^{n} \sum_{j=1}^{m^b_i} \left( \text{VBAR}_{ij} - \overline{\text{VBAR}} \right)^2 }. \end{equation}\]

Because \(\overline{\text{BA}}\) and \(\overline{\text{VBAR}}\) are on wholly different scales (e.g., \(\text{m}^2/\text{ha}\) and \(\text{m}^3/\text{m}^2\), respectively), we apply (16.38) using percent standard errors. This avoids scale incompatibility and offers a clearer sense of relative precision. Rewriting (16.38) using percent standard errors gives
\[\begin{equation} s_{\hat{v}\%} \approx \sqrt{s_{\overline{\text{BA}}\%}^2 + s_{\overline{\text{VBAR}}\%}^2}, \tag{16.41} \end{equation}\]
where the percent standard errors for \(\overline{\text{BA}}\) and \(\overline{\text{VBAR}}\) are defined as
\[\begin{equation} s_{\overline{\text{BA}}\%} = 100 \cdot \frac{s_{\overline{\text{BA}}}}{\overline{\text{BA}}} \tag{16.42} \end{equation}\]
and
\[\begin{equation} s_{\overline{\text{VBAR}}\%} = 100 \cdot \frac{s_{\overline{\text{VBAR}}}}{\overline{\text{VBAR}}}. \tag{16.43} \end{equation}\]

In addition to aligning the two error components on a common scale, expressing the standard errors as percents in (16.41) highlights how each component contributes to the overall uncertainty in \(\hat{v}\). It also provides a clear framework for evaluating tradeoffs between count and measurement effort; see, e.g., Iles (2003) and Marshall, Iles, and Bell (2004).

To return the standard error to the original units of \(\hat{v}\), simply multiply the percent standard error by the estimate and divide by 100 \[\begin{equation} s_{\hat{v}} = \hat{v} \cdot \frac{s_{\hat{v}\%}}{100}. \tag{16.44} \end{equation}\]

If a confidence interval is desired, a \(\tval\)-value can be computed using the Welch–Satterthwaite approximation for degrees of freedom (Welch 1947). This approach is appropriate when combining uncertainty from two independent sample means with unequal variances, and it aligns with our application of Bruce’s formula to estimate the standard error. The approximate degrees of freedom are given by
\[\begin{equation} \text{df} \approx \frac{ s_{\overline{\text{BA}}\%}^2 + s_{\overline{\text{VBAR}}\%}^2 }{ \frac{s_{\overline{\text{BA}}\%}^4}{n - 1} + \frac{s_{\overline{\text{VBAR}}\%}^4}{m^b - 1} }. \tag{16.45} \end{equation}\]

With this degrees of freedom approximation, the confidence interval for \(\hat{v}\) is \[\begin{equation} \hat{v} \pm \tval_{\text{df}, 1 - \frac{\alpha}{2}} s_{\hat{v}}. \tag{16.46} \end{equation}\]

To estimate totals, simply multiply the mean estimate \(\hat{v}\) and its standard error \(s_{\hat{v}}\) by the population area.

Some general observations about Big BAF. As mentioned earlier, the estimates for BA and VBAR need not come from the same locations or sampling designs; it’s simply convenient to pair them. When sampling locations follow a SYS design, combining the two sampling efforts at the same points helps ensure good spatial coverage and representation of the population.

The method offers great flexibility in how BA and VBAR are sampled, and at what intensity. Working with percent standard errors makes optimal allocation of sampling resources a straightforward exercise, assuming costs can be assigned to sampling locations (i.e., cost per unit of \(n\)), tree counting (for BA), and tree measurement (for VBAR), and that reasonable percent standard error estimates are available (e.g., from prior inventories).

The standard error estimators presented here for \(\overline{\text{VBAR}}\) and \(\hat{v}\) rely on two distinct approximations. First, the standard error for \(\overline{\text{VBAR}}\) treats the sampled tree-level ratios \(\text{VBAR}_{ij} = v_{ij}/\text{BA}_{ij}\) as the observed values and applies the usual sample mean and variance formulas to those ratios. Although each \(\text{VBAR}_{ij}\) is computed from two measurements taken on the same tree, the ratio itself is the unit of analysis for estimation. As noted by Gregoire and Valentine (2008) (p. 259), a more exact variance estimator could be constructed that explicitly accounts for the ratio structure, but simulation results in Gove et al. (2020) show that the simpler approach yields nearly identical inferences in practice.

Second, Bruce’s formula for the standard error of \(\hat{v}\) assumes that the two summary estimators being combined—the mean basal area per unit area, \(\overline{\text{BA}}\), and the mean volume-to-basal-area ratio, \(\overline{\text{VBAR}}\)—have negligible covariance. This assumption concerns the relationship between the estimators themselves, not the relationship between volume and basal area at the tree level, which are necessarily related by construction.

While this negligible-covariance assumption may not hold exactly in all settings, it’s often a reasonable approximation in forest inventory applications, particularly when basal area is estimated from point-level counts and VBAR is estimated from a separate set of measurement trees. Even when some correlation is present, empirical studies suggest that omitting the covariance term typically has little practical impact on inference (see, e.g., Gove et al. (2020)). For these reasons, Bruce’s formula remains a widely used and operationally effective approach for estimating uncertainty under Big BAF sampling.

As noted by Marshall, Iles, and Bell (2004), a useful feature of Big BAF is that it provides an internal check on basal area estimates. Because the count and measurement data rely on different BAFs, they yield separate estimates of BA. If those estimates differ substantially, it can indicate potential issues with field protocols that warrant closer review.

If estimates are to be broken down by species, product type, strata, or any other category requiring separate VBAR estimates, care must be taken to ensure enough measurement trees are sampled to support reliable estimation within each group. The more categories considered, the more measurement trees are needed to inform the VBAR estimates—which has cost implications.

16.5.2 Applying Big BAF

To illustrate Big BAF, we return once more to the Harvard Forest dataset. As in earlier examples, this is a simplified setting, but it allows us to walk through the full workflow using real data and compare sample-based estimates to known population values.

The objective is to estimate mean AGB per hectare for all stems with DBH \(\ge\) 12.7 cm across the 35 ha forest.

We again use the sys_grid() function, introduced in Chapter 14, to generate a systematic grid of sampling locations across the forest. For this example, the spacing is 125-by-125 meters, yielding \(n = 24\) sampling locations (Figure 16.4).


FIGURE 16.4: Systematic sample locations for the Harvard Forest Big BAF example.

We’ll use metric BAF\(^s\) = 5 for count trees and BAF\(^b\) = 20.25 for measurement trees.167 This setup implies about one measurement tree for every four count trees (since 5/20.25 \(\approx\) 0.25).

For both count and measurement trees, the walkthrough method was used to address boundary slopover bias. Trees with walkthrough points falling outside the boundary received a count of two; see Section 12.3.3 for details. As in previous sections, sampling locations with no trees were assigned a tree count of zero.

In the analysis that follows, we’ll first apply the Big BAF mean and standard error estimators outlined in the previous section. Then, for comparison, we’ll repeat the analysis assuming measurements were taken on all count trees (i.e., all “in” trees under BAF\(^s\)), representing a more traditional cruise.

The data are located in the "datasets/BigBAF" directory. "HF_BAF_s_count.csv" contains the count data collected using BAF\(^s\) and "HF_BAF_b_measure.csv" contains the measurement data collected using BAF\(^b\). We begin by reading these files and inspecting their structure.

count_trees <- read_csv("datasets/BigBAF/HF_BAF_s_count.csv")
measurement_trees <- read_csv("datasets/BigBAF/HF_BAF_b_measure.csv")

glimpse(count_trees)
#> Rows: 194
#> Columns: 2
#> $ point_id   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3…
#> $ tree_count <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1…
glimpse(measurement_trees)
#> Rows: 54
#> Columns: 4
#> $ point_id   <dbl> 1, 1, 2, 3, 4, 4, 4, 5, 6, 6, 6, 7…
#> $ dbh_cm     <dbl> 48.0, 48.6, 22.6, 0.0, 15.9, 33.6,…
#> $ agb_Mg     <dbl> 2.22710, 2.29694, 0.44771, 0.00000…
#> $ tree_count <dbl> 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1…

Let’s get a sense of the number of count (\(m^s_i\)) and measurement trees (\(m^b_i\)), based on their respective BAFs, across sampling locations (which are uniquely identified using point_id).

count_trees %>%
  group_by(point_id) %>%
  summarize(m_s_i = sum(tree_count)) %>%
  summarize(min = min(m_s_i), max = max(m_s_i), 
            mean = mean(m_s_i), total = sum(m_s_i))
#> # A tibble: 1 Ă— 4
#>     min   max  mean total
#>   <dbl> <dbl> <dbl> <dbl>
#> 1     1    18  8.12   195
measurement_trees %>%
  group_by(point_id) %>%
  summarize(m_b_i = sum(tree_count)) %>%
  summarize(min = min(m_b_i), max = max(m_b_i), 
            mean = mean(m_b_i), total = sum(m_b_i),
            n_zero = sum(m_b_i == 0))
#> # A tibble: 1 Ă— 5
#>     min   max  mean total n_zero
#>   <dbl> <dbl> <dbl> <dbl>  <int>
#> 1     0     5  2.08    50      4

Given these summaries, as expected, there are roughly four times as many count trees as measurement trees (195 vs. 50 total). The number of count trees per sampling location ranged from 1 to 18, with an average of about 8. This wide range in the number of “in” trees reflects the spatial variability in basal area across the forest. Measurement trees ranged from 0 to 5 per location, with an average of about 2. As indicated by n_zero, 4 sampling locations had no measurement trees.

Next, using the count trees, we estimate mean basal area (BA, in \(\text{m}^2/\text{ha}\)) using (16.35), standard error of the mean using (16.39), and standard error expressed as a percent using (16.42).

# Small (count) BAF, metric (m^2/ha per "in" tree).
BAF_small <- 5

BA <- count_trees %>%
    group_by(point_id) %>%
    summarize(m = sum(tree_count), # (trees per point).
              BA = m * BAF_small) %>% # (m^2/ha per point).
    summarize(n = n(), # Number of points.
              mean = mean(BA), # (m^2/ha).
              cv = 100 * sd(BA)/mean, # Coefficient of variation.
              se = sd(BA) / sqrt(n), # (m^2/ha).
              se_percent = 100 * se / mean) # Std. err. as % of mean.

BA
#> # A tibble: 1 Ă— 5
#>       n  mean    cv    se se_percent
#>   <int> <dbl> <dbl> <dbl>      <dbl>
#> 1    24  40.6  47.1  3.90       9.61

Now we estimate mean \(\text{VBAR}\) (\(\text{Mg}/\text{m}^2\)) using (16.36), standard error using (16.40), and standard error expressed as a percent of the mean using (16.43).

Recall that in our cruise data, we included a tree_count of zero to track which sampling locations had no measurement trees. However, as noted below (16.36), these zeros should not be included in the calculation of \(\overline{\text{VBAR}}\), so we filter them out using filter(tree_count > 0) in the subsequent piped sequence below.

After removing those rows, tree_count takes on values of 1 or 2. A value of 2 indicates the tree was double-counted based on the walkthrough method. Because we’re computing the mean across trees (not across sampling points), the cleanest way to handle this is to duplicate the row for any tree with tree_count == 2. We do this using uncount(tree_count) from the tidyr package, which replicates rows according to a count column.168

# Basal area conversion factor.
c <- pi / (10000 * 4)

VBAR <- measurement_trees %>%
  filter(tree_count > 0) %>%
    uncount(tree_count) %>% 
    mutate(BA = c * dbh_cm^2, # (m^2).
           VBAR = agb_Mg / BA) %>% # (Mg/m^2 per tree).
    summarize(m = n(), # Number of trees.
              mean = mean(VBAR), # (Mg/m^2).
              cv = 100 * sd(VBAR) / mean, # Coefficient of variation.
              se = sd(VBAR) / sqrt(m), # (Mg/m^2).
              se_percent = 100 * se / mean) # Std. err. as % of mean.

VBAR
#> # A tibble: 1 Ă— 5
#>       m  mean    cv    se se_percent
#>   <int> <dbl> <dbl> <dbl>      <dbl>
#> 1    50  8.93  40.3 0.508       5.69

Lastly, given BA and VBAR, we compute the Big BAF estimate of AGB per unit area. The code below implements (16.37) and (16.41), and also converts the percent standard error into the standard error in AGB units using (16.44).

# Mean estimate (Mg/ha).
agb_Mg_ha <- VBAR$mean * BA$mean
agb_Mg_ha
#> [1] 362.69
# Standard error as percent of biomass.
se_agb_Mg_percent <- sqrt(BA$se_percent^2 + VBAR$se_percent^2)

se_agb_Mg_percent
#> [1] 11.172
# Standard error (Mg/ha).
se_agb_Mg <- agb_Mg_ha * se_agb_Mg_percent/100

se_agb_Mg
#> [1] 40.519

As discussed earlier, it’s expected for VBAR to be less variable than BA. This pattern holds for these data as well, with \(\text{CV}_\text{VBAR} = 40.27\)% and \(\text{CV}_\text{BA} = 47.09\)% reflecting the relative variability observed in the sample data, and \(s_{\overline{\text{VBAR}}\%} = 5.69\) and \(s_{\overline{\text{BA}}\%} = 9.61\) reflecting the relative variability of the sampling estimates. This suggests that greater precision could be achieved by adding more sample locations to improve the estimate of BA, rather than by selecting a smaller BAF\(^b\) to increase the number of measurement trees.

We can then construct a confidence interval for the mean AGB estimate using the approximate degrees of freedom formula in (16.45) and the interval expression in (16.46).

BA_df <- BA$n - 1
VBAR_df <- VBAR$m - 1

df <- (BA$se_percent^2 + VBAR$se_percent^2)^2 /
  (BA$se_percent^4 / BA_df + VBAR$se_percent^4 / VBAR_df)

t <- qt(p = 1 - 0.05 / 2, df = df)

c(agb_Mg_ha - t * se_agb_Mg, agb_Mg_ha + t * se_agb_Mg)
#> [1] 280.78 444.60

We see that Big BAF yields a mean of 362.69 (\(\text{Mg}/\text{ha}\)) with a 95% confidence interval from 280.78 to 444.6, which covers the true biomass of 341.17.

The code below shows what we would get using a traditional cruise that measures every “in” tree using BAF\(^s\). These data are held in "datasets/BigBAF/HF_BAF_s_measure.csv".

measurement_trees <- read_csv("datasets/BigBAF/HF_BAF_s_measure.csv")

measure_all <- measurement_trees %>%
  mutate(TF = ifelse(tree_count != 0, BAF_small / (c * dbh_cm^2), 0)) %>%
  group_by(point_id) %>%
  summarize(agb_Mg = sum(TF * tree_count * agb_Mg)) %>% # (Mg/ha per point).
  summarize(
    n = n(), # Number of points.
    mean = mean(agb_Mg), # (Mg/ha).
    se = sd(agb_Mg) / sqrt(n), # (Mg/ha).
    se_percent = 100 * se / mean
  ) # Std. err. as % of mean.

agb_Mg_ha <- measure_all$mean
se_agb_Mg <- measure_all$se
se_agb_Mg_percent <- measure_all$se_percent

agb_Mg_ha
#> [1] 349.77
se_agb_Mg
#> [1] 29.091
se_agb_Mg_percent
#> [1] 8.3171

The Big BAF approach offers a clear tradeoff: by measuring 145 fewer trees (the difference between the number of trees counted with BAF\(^s\) and BAF\(^b\)), we reduce field effort and associated costs. In exchange, we accept a modest increase in uncertainty, with the percent standard error rising from 8.32 to 11.17 and the standard error in AGB units rising from 29.09 to 40.52 \(\text{Mg}/\text{ha}\). If that increase in uncertainty is unacceptable, you can add sampling locations, which improves the BA estimate at very low cost because the additional trees are only counted, reinvesting part of the time saved by not measuring those 145 trees (see Iles (2003) for a detailed discussion of Big BAF cruise planning and associated software).

16.6 3P sampling

Probability proportional to prediction (3P) sampling directs detailed measurements toward the most influential units in a population. It begins by making a quick, inexpensive prediction for every unit—something easy to obtain, like an ocular estimate or a simple field measurement. Units with larger predictions are more likely to be selected for full measurement, focusing effort on those expected to have the greatest influence on population parameter estimates. Predictions and measurements are combined using a mean-of-ratios estimator, which is appropriate for the tree-level selection mechanism in 3P sampling and accounts for unequal selection probabilities. The resulting design focuses measurement effort on influential units while still producing accurate, stable estimates under reasonable conditions.

Classical 3P sampling was introduced by Lewis R. Grosenbaugh in the 1960s (see, e.g., Grosenbaugh (1965)) and remains the most widely applied version of the method. While originally developed for estimating timber volume, 3P is useful in any setting where the variable of interest is costly to measure but can be predicted with reasonable accuracy. We’ll use total volume to motivate the discussion, but the same logic applies to other forest parameters. Classical 3P sampling involves the following steps.

  1. Predict. Visit every tree and record a quick prediction \(x\) that scales with the variable of interest \(y\). Here, \(x\) is the covariate. This might be an ocular estimate of volume or a simple dimensional measure like DBH or height. The goal isn’t precision, but consistency—\(x\) should be roughly proportional (or transformable to be proportional) to the measurement \(y\).

  2. Select and measure. Use the predictions to assign each tree a selection probability proportional to \(x\). Select a subsample accordingly, and for each selected tree, collect the measurement \(y\).

  3. Estimate. Combine the paired \(x\) and \(y\) values using a mean-of-ratios estimator. This yields an estimate of the population-level ratio between the measurement and the prediction. Scaling this ratio by the known mean or total of \(x\) (recorded for all units) gives an estimate of the mean or total of \(y\). The estimator naturally accounts for the unequal selection probabilities.

Classical 3P is most effective when the variable of interest is expensive to measure, predictions are available for all trees, and the measurement-to-prediction ratio is relatively stable—even if the variable itself is highly variable.

The design concentrates measurement effort on the most influential trees while still visiting every tree, making it especially well suited for small areas or situations where all trees must be assessed (e.g., for marking).

In these settings, 3P offers several advantages:

  • straightforward implementation with minimal training;
  • every tree is visited, avoiding boundary complications;
  • high efficiency, often yielding small standard errors relative to effort;
  • detailed data that support stand and stock tables by species and diameter class;
  • better estimation for rare species;
  • no need to survey tract area, which can save considerable time in some settings.

Although our focus is on classical 3P, several useful variants exist. One example is point-3P sampling, which extends the idea of classical 3P to point sampling. At each point, trees determined to be “in” using the usual angle method (e.g., prism, angle gauge, or relascope) each receive a prediction \(x\) (e.g., DBH or height), which then determines the probability they’re measured for \(y\). It has all the advantages of classical 3P, but doesn’t require every tree in the population to receive a prediction. An overview of point-3P and related methods is provided by West (2011).

For a deeper treatment of 3P sampling, Iles (2003) provides a comprehensive coverage, including historical context, practical guidance on implementation, and common pitfalls with solutions. For a more theoretical perspective, see Gregoire and Valentine (2008), which situates 3P within the broader class of unequal-probability sampling designs.

The next section further develops the classical 3P design, detailing its implementation step-by-step.

16.6.1 Sampling design and implementation

General steps in implementing a 3P cruise are as follows.

  1. Determine the sample size and selection threshold.
  • The sample size \(n\) refers to the number of trees that will receive detailed measurements of \(y\). As with SRS, \(n\) is typically chosen based on available resources or an allowable error. The SRS sample size formula from Section 11.3.6 can be used, with the variance (or CV) reflecting anticipated variability in the measurement-to-prediction ratio \(y / x\).
  • To implement the selection rule, define a threshold \(\text{max}_z = \hat{t}_x / n\), where \(\hat{t}_x\) is a pre-cruise estimate of the population total of \(x\). The \(i\)-th tree is selected to measure \(y_i\) with probability \(x_i / \text{max}_z\).169 Choosing \(\text{max}_z\) this way yields an expected sample size of \(n\) under the 3P selection rule described below.
  2. Traverse the population and apply the 3P selection rule.
  • Visit each tree in turn:
    • record a quick prediction \(x_i\);
    • draw a random number \(z_i\) from a uniform distribution \(U(0, \text{max}_z)\);
    • if \(x_i \ge z_i\), select the tree for measurement and record \(y_i\); otherwise, retain only \(x_i\).
  • Continue until all trees have been visited.
  • The result is a complete set of predictions \(\{x_1, \dots, x_N\}\) and a subsample of about \(n\) trees with paired values \((x_i, y_i)\).

Estimation procedures for totals, means, and standard errors based on the 3P sample are presented in Section 16.6.2.

The actual sample size achieved after the cruise depends on your initial estimate \(\hat{t}_x\). If this estimate is larger than the true total \(\tau_x = \sum^N_{i=1}x_i\), you’ll likely end up with fewer than \(n\) sampled trees; if it’s smaller, you’ll end up with more. A poor guess at \(\tau_x\) won’t bias your estimates; it only affects the final sample size.
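
To make this concrete, here's a small simulation sketch of the selection rule; the predictions and the pre-cruise guess below are purely hypothetical.

# Hypothetical predictions for a population of N = 1,000 trees.
set.seed(1)
x <- runif(1000, min = 0.5, max = 5) # Predicted volumes (m^3).

n_target <- 25
t_hat_x <- 0.8 * sum(x) # A pre-cruise guess that underestimates tau_x.
max_z <- t_hat_x / n_target

# Apply the 3P selection rule to every tree.
z <- runif(length(x), min = 0, max = max_z)
sum(x >= z) # Realized sample size; expect it to exceed n_target.

Because \(\hat{t}_x\) is set below the realized \(\tau_x\), the realized sample size tends to come in above the target of 25; setting \(\hat{t}_x\) above \(\tau_x\) has the opposite effect.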

Iles (2003) outlines a practical approach for ensuring that you hit the desired sample size. The idea is to deliberately set the selection threshold so that too many trees are selected, for example by halving \(\hat{t}_x\) (or, equivalently, doubling the target \(n\)) when computing \(\text{max}_z\), which roughly doubles the expected number of selected trees. As you work through the population, use a random mechanism to measure only half of the selected trees, and flag the others (which he calls “insurance trees”). At the end of the cruise, if your actual sample size is below target, you can return to the insurance trees and randomly select enough of them for measurement to meet your target \(n\).

As noted earlier, \(x\) should be proportional to \(y\), meaning we assume a relationship of the form \(y = Rx\), where \(R\) is a scalar to be estimated. This implies a linear relationship through the origin—when \(x = 0\), then \(y = 0\). For example, if \(y\) is tree volume in m\(^3\), a natural choice for \(x\) is also volume, possibly in different units, with any mismatch absorbed into the estimate of \(R\). More generally, \(x\) can be any quantity that scales proportionally with volume.

A good choice for \(x\) is DBH, for several reasons. First, it’s often easier to give a quick and reasonably accurate guess of a tree’s DBH than its volume. Second, if DBH is recorded for every tree, we can produce detailed stand and stock tables by diameter class (and species, if recorded). This adds value beyond simple estimation of totals and means, especially in operational settings.

While DBH is a convenient and useful choice, it’s not itself proportional to volume, so it must be transformed prior to use as the 3P \(x\) variable.

One option for this transformation is to use a DBH-volume table to get a volume value to use for \(x\). For example, you might use the one in Section 16.4.1, which takes DBH and height as inputs. If height isn’t practical to estimate (or even guess) for every tree, a DBH-only volume equation can be used instead. A simple form is \(\text{volume} = \alpha_0 \text{DBH}^{\alpha_1}\), where \(\alpha_0\) and \(\alpha_1\) are parameters taken from the literature or estimated from prior data. The exponent \(\alpha_1\) ensures that \(x\) and \(y\) are proportional, and \(\alpha_0\) scales the ratio so that \(R\) is approximately one. While this scaling isn’t necessary for valid estimation, it can make it easier to choose \(\text{max}_z\), since we usually have better intuition about total volume than about the total of DBH raised to a power.
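
As a small illustration of this transformation (the parameter values below are invented, not taken from the chapter's data), predictions might be computed as follows.

# Hypothetical DBH-only volume proxy: volume = alpha_0 * DBH^alpha_1.
alpha_0 <- 0.0004 # Invented scale parameter.
alpha_1 <- 2.4    # Invented exponent.

dbh_cm <- c(20, 35, 50)       # Quick ocular DBH guesses (cm).
x <- alpha_0 * dbh_cm^alpha_1 # Predicted volumes used as the 3P covariate.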

If you read other references on 3P, you may come across the term “sure-to-be-measured” trees. In our notation, these are trees with predictions \(x_i\) exceeding the threshold \(\text{max}_z\). Given how we define \(\text{max}_z\), such trees are unlikely unless your pre-cruise estimate \(\hat{t}_x\) is far too low and/or the target sample size \(n\) is very large. When \(x_i > \text{max}_z\), the tree must be selected for measurement and flagged in the cruise data accordingly. These trees are treated as a separate stratum in which all units are measured (i.e., a census), so the standard error for that stratum is zero.170

In some cases—such as for high-value trees or rare species—you might choose to measure all occurrences and treat them as their own stratum. This ensures complete accounting for critical trees while still allowing 3P to be applied to the remainder of the population.

If you anticipate wanting to summarize results by category—such as species, diameter class, or product type—be sure to record those variables during the cruise. These additional data allow estimates to be broken down by group in addition to providing overall totals.

Once all trees have been visited and predictions and measurements collected, the next step is to estimate the population mean, total, and associated standard errors, as outlined in the next section.

16.6.2 Estimation using 3P

Once the cruise data are in hand—that is, a complete set of predictions for all trees and detailed measurements for a subsample—the goal is to estimate the population mean and total of \(y\). Because trees with larger predictions are more likely to be selected for measurement, the observed measurements don’t form a simple random sample, so we can’t just average them. A design-consistent estimator is therefore needed to account for the unequal selection probabilities induced by 3P sampling. Under the working assumption that \(y\) is proportional to \(x\), the measurement-to-prediction ratio is approximately stable, motivating a mean-of-ratios estimator.

Following the notation from Section 16.2.2, we use the mean-of-ratios estimator, expressed as \[\begin{equation*} \hat{R} = \frac{1}{n}\sum_{i=1}^n r_i, \end{equation*}\] where \(r_i = y_i/x_i\) is the unit ratio.

Then, following (16.16) and (16.20), we multiply the estimated ratio \(\hat{R}\) by the known population mean of predictions, \(\mu_x = \tau_x / N\), to estimate the population mean and total \[\begin{equation*} \bar{y} = \hat{R} \mu_x, \end{equation*}\] \[\begin{equation*} \hat{t} = N \bar{y}. \end{equation*}\]

Standard errors, confidence intervals, and interpretation follow directly from the general mean-of-ratios case in Section 16.2.2. Because the sample is drawn from a known, finite population of size \(N\), the standard error formula includes the FPC. In most 3P applications, however, the sample fraction is small enough that the FPC has little practical effect on the resulting standard errors.
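
Written out with the FPC, the standard errors take the form \[\begin{equation*} s_{\bar{y}} = \mu_x \sqrt{ \frac{s_r^2}{n} \left( 1 - \frac{n}{N} \right) }, \qquad s_{\hat{t}} = N s_{\bar{y}}, \end{equation*}\] where \(s_r^2\) is the sample variance of the unit ratios \(r_i\); these are the group-level expressions in Section 16.6.3 with the group subscript dropped.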

Notice that once we have \(\hat{R}\), we could go directly to estimating the total from the total of predictions, i.e., \(\hat{t} = \hat{R} \tau_x\), which is perfectly valid. However, we’re following the estimators in Section 16.2.2, which work up the total from the mean \(\mu_x\) instead.

16.6.3 Estimation using 3P by group

In many cases, you’ll want to summarize estimates by one or more categorical variables, such as species, DBH class, and/or stand. Even when you’re not specifically interested in stratified summaries, it may be that the relationship between predictions and measured values differs systematically across categories. In these cases, estimation accuracy and precision can often be improved by applying group-specific ratios.

To support such estimation, the relevant categorical variable(s) must be recorded for all trees. For example, if you want estimates by species, then species must be recorded for every tree in the population. The same applies for DBH class, stand, or any other grouping variable of interest.

Let the population be divided into \(H\) groups (e.g., species), with group sizes \(N_h\) and mean predicted values \(\mu_{x_h}\), both known once the cruise is complete. Within each group, a subsample of \(n_h\) trees is measured under the 3P selection rule, yielding paired \((x_{hi}, y_{hi})\) values. For each group \(h = 1, \dots, H\), we compute the sample mean of unit ratios \[\begin{equation} \hat{R}_h = \frac{1}{n_h} \sum_{i=1}^{n_h} r_{hi}, \end{equation}\] with \(r_{hi} = y_{hi} / x_{hi}\).

The estimated mean for group \(h\) is \[\begin{equation} \bar{y}_h = \hat{R}_h \mu_{x_h} \end{equation}\] and the corresponding total is \[\begin{equation} \hat{t}_h = N_h \bar{y}_h. \end{equation}\]

The sample variance of unit ratios within each group is \[\begin{equation} s_{r_h}^2 = \frac{1}{n_h - 1} \sum_{i=1}^{n_h} \left( r_{hi} - \hat{R}_h \right)^2. \end{equation}\]

The standard error of \(\bar{y}_h\) is estimated as \[\begin{equation} s_{\bar{y}_h} = \mu_{x_h} \sqrt{ \frac{s_{r_h}^2}{n_h} \left( 1 - \frac{n_h}{N_h} \right) } \end{equation}\] and the standard error for \(\hat{t}_h\) is \[\begin{equation} s_{\hat{t}_h} = N_h s_{\bar{y}_h}. \end{equation}\]

To obtain overall estimates, we apply the standard stratified estimators \[\begin{equation} \bar{y} = \frac{1}{N} \sum_{h=1}^H N_h \bar{y}_h, \end{equation}\] \[\begin{equation} \hat{t} = N \bar{y} = \sum_{h=1}^H \hat{t}_h. \end{equation}\]

The standard error of the overall mean is \[\begin{equation} s_{\bar{y}} = \sqrt{ \frac{1}{N^2} \sum_{h=1}^H N_h^2 s_{\bar{y}_h}^2 } \end{equation}\] and the standard error of the total is \[\begin{equation} s_{\hat{t}} = N s_{\bar{y}}. \end{equation}\]

This approach allows 3P sampling to accommodate known groupings in the population, deliver category-specific summaries, and potentially improve estimates of population parameters.

16.6.4 Applying 3P

We’ll demonstrate 3P using a simulated dataset. Although synthetic, the tree values are derived from allometric equations and reflect the type of data encountered in practice. A key advantage of using simulated data is that the true population parameters are known, allowing us to assess the estimator’s performance directly.

Our goal is to estimate the total volume of a mixed-species stand in coastal British Columbia, Canada. We also want species-specific estimates, as well as stand and stock tables by species and DBH class. Although unknown at the outset, the stand contains \(N = 1{,}000\) trees and two species.

For this analysis, the 3P cruise uses volume predicted from an ocular DBH estimate as the \(x\) covariate, with \(y\) also being volume. Both volume variables are expressed in cubic meters.

We begin by fitting a local DBH–volume equation using data from a mix of the anticipated species. In practice, these data might come from a pilot cruise of the stand or from a previous cruise in a similar stand. Once the DBH–volume equation is in place, we follow the 3P steps outlined in the previous section.

Our local DBH–volume equation takes the form \[\begin{equation} y = \alpha_0 \, x^{\alpha_1}, \tag{16.47} \end{equation}\] where \(y\) is volume (\(\text{m}^3\)), \(x\) is DBH (cm), and the parameters are estimated to be \(\alpha_0 = 0.00034\) and \(\alpha_1 = 2.3\).171

Based on similar stands and prior experience, our pre-cruise estimate of total volume is \(\hat{t}_x\) = 2,000 \(\text{m}^3\). We’ll aim to collect \(n = 25\) measurements to estimate \(R\). Given this information, the selection threshold is \(\text{max}_z = 2000 / 25 = 80\).

We follow the 3P cruise steps.

  1. Each tree in the population is visited, its species is recorded, and an ocular estimate of DBH (cm) is obtained. This DBH is used to compute a predicted volume using (16.47), which is also recorded.

  2. A tree is selected for measurement if its predicted volume exceeds a random number drawn from \(U(0, 80)\). If selected, its volume is accurately determined and recorded.

Both the predicted volume (computed from the ocular DBH estimate using (16.47)) and the random number can be generated on a tablet or smartphone as each tree is visited, as sketched below.
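
For instance, the per-tree decision might look like the following sketch, using (16.47) and the threshold \(\text{max}_z = 80\) from above.

# Per-tree 3P decision using the fitted DBH-volume equation (16.47).
alpha_0 <- 0.00034
alpha_1 <- 2.3
max_z <- 80

dbh_cm <- 42.5                      # Ocular DBH estimate for this tree.
x <- alpha_0 * dbh_cm^alpha_1       # Predicted volume (m^3), recorded.
z <- runif(1, min = 0, max = max_z) # Random draw from U(0, max_z).
x >= z # TRUE: measure the tree's volume; FALSE: record only the prediction.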

At the end of the cruise, we know \(N\)= 1,000 and have predicted DBH and volume for all trees. We also have a subsample of measured trees. These data are stored in "datasets/BC_3P/predicted.csv" and "datasets/BC_3P/measured.csv", respectively. Shared columns in both files include tree_id, species, dbh_cm_predicted, and vol_m3_predicted. The measured file also contains vol_m3_measured. These files are read below, and the 3P estimator is applied under the assumption of a common \(R\) across species.

predicted <- read_csv("datasets/BC_3P/predicted.csv")
measured <- read_csv("datasets/BC_3P/measured.csv")

predicted %>% glimpse()
#> Rows: 1,000
#> Columns: 4
#> $ tree_id          <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 1…
#> $ species          <chr> "A", "A", "B", "A", "A", "B"…
#> $ dbh_cm_predicted <dbl> 52.5, 37.5, 37.5, 47.5, 52.5…
#> $ vol_m3_predicted <dbl> 2.7064, 1.2609, 1.2609, 2.15…
measured %>% glimpse()
#> Rows: 34
#> Columns: 5
#> $ tree_id          <dbl> 4, 43, 54, 57, 66, 81, 93, 9…
#> $ species          <chr> "A", "B", "B", "A", "B", "A"…
#> $ dbh_cm_predicted <dbl> 47.5, 42.5, 47.5, 57.5, 52.5…
#> $ vol_m3_predicted <dbl> 2.1564, 1.6752, 2.1564, 3.32…
#> $ vol_m3_measured  <dbl> 2.1970, 2.2316, 2.7591, 3.25…

As you can see above, the measured file has 34 rows, which is our actual sample size \(n\) at the end of the cruise (a bit above our target of 25). This might be our first clue that the pre-cruise guess of the total predicted volume, \(\hat{t}_x\), was a little low. But that’s okay: it doesn’t bias our estimates, we just ended up measuring a few more trees than anticipated, and our reward is a slightly larger sample to inform the estimates worked up in the code below.

We first compute the population mean of predicted volume, \(\mu_x\). Then, within summarize(), we compute the mean-of-ratios estimate \(\hat{R}\) (16.17), the estimated mean volume \(\bar{y}\) (16.16), the sample variance of the ratios \(s^2_r\) (16.19), the standard error of the mean \(s_{\bar{y}}\) (16.18), and the associated estimates of the total volume \(\hat{t}\) and its standard error \(s_{\hat{t}}\) using (16.20) and (16.21), respectively.

# Known from cruise.
N <- nrow(predicted)

# Compute mu_x (and tau_x for fun).
mu_x <- mean(predicted$vol_m3_predicted)
tau_x <- sum(predicted$vol_m3_predicted)

# Estimates for volume.
measured %>%
  summarize(
    n = n(),
    R_hat = mean(vol_m3_measured / vol_m3_predicted),
    y_bar = R_hat * mu_x,
    s_sq_r = var(vol_m3_measured / vol_m3_predicted),
    s_y_bar = mu_x * sqrt(s_sq_r / n * (1 - n / N)),
    t_hat = N * y_bar,
    s_t_hat = N * s_y_bar,
    t = qt(df = n - 1, p = 1 - 0.05 / 2),
    t_ci_lower = t_hat - s_t_hat * t,
    t_ci_upper = t_hat + s_t_hat * t,
  ) %>% select(n, R_hat, t_hat, s_t_hat, t_ci_lower, t_ci_upper)
#> # A tibble: 1 × 6
#>       n R_hat t_hat s_t_hat t_ci_lower t_ci_upper
#>   <int> <dbl> <dbl>   <dbl>      <dbl>      <dbl>
#> 1    34  1.05 2653.    62.5      2526.      2780.

So, how’d we do? Our estimated total, with associated 95% confidence interval, is 2,653 (2,526, 2,780) cubic meters. With only \(n = 34\) measurements, we achieved a relative standard error (RSE; see (11.38)) of just 2.4% for the estimated total!

Because these are simulated data, we know the true total volume is 2,590 cubic meters. For this sample, our 95% confidence interval covers the true value, and the estimated total is remarkably close to the actual volume.

As for the difference between our target and actual sample size \(n\), we had initially guessed that \(\tau_x\) was 2,000. After the cruise, we see that this was an underestimate—the actual population total of \(x\) is 2,525 (from the tau_x value in the code above). Again, this isn’t a problem; it just means we might end up with a few more measurement trees than intended, as we did here.
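
As a quick check, the expected sample size implied by the selection rule is \(n \, \tau_x / \hat{t}_x\), which we can compute from the quantities above.

# Expected number of measured trees given the realized total of predictions.
25 * tau_x / 2000

This works out to roughly 31.6 trees, in the same ballpark as the 34 we ended up measuring.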

Now let’s work up species-specific estimates following the methods in Section 16.6.3. We simulated these data with two species, labeled A and B. Figure 16.5 shows the relationship between predicted and measured volume by species for the \(n = 34\) trees used to estimate ratios. For exploratory analysis, we added best-fit regression lines by species. The plot suggests that species A and B have different ratios, which in turn suggests we might improve estimation by using species-specific ratios.172


FIGURE 16.5: Predicted versus measured volume by species for the simulated 3P dataset. Lines are fit using least squares regression constrained to pass through the origin and included for visual exploration.

We’ll implement the estimator in steps, starting with species-specific summaries of the predictions, measurements, and estimates. We’ll then apply the stratified estimator to combine these summaries and produce stand-level estimates of the mean and total.

pred_summary <- predicted %>%
  group_by(species) %>%
  summarize(
    N_h = n(),
    mu_xh = mean(vol_m3_predicted),
    tau_xh = sum(vol_m3_predicted)
  )

pred_summary
#> # A tibble: 2 × 4
#>   species   N_h mu_xh tau_xh
#>   <chr>   <int> <dbl>  <dbl>
#> 1 A         734  2.62  1920.
#> 2 B         266  2.27   605.

From the prediction summary, we see that species A dominates the stand, with roughly three times as many stems (\(N_h\)) and three times the volume (\(\tau_{xh}\)) as species B. We’ll need \(N_h\) and \(\mu_{x_h}\) in pred_summary for the subsequent estimation after computing species-specific ratios and variances below.

measure_summary <- measured %>%
  group_by(species) %>%
  summarize(
    n_h = n(),
    R_hat_h = mean(vol_m3_measured / vol_m3_predicted),
    s_sq_rh = var(vol_m3_measured / vol_m3_predicted),
  )

measure_summary
#> # A tibble: 2 × 4
#>   species   n_h R_hat_h s_sq_rh
#>   <chr>   <int>   <dbl>   <dbl>
#> 1 A          25   0.975 0.00336
#> 2 B           9   1.26  0.0103

Estimates for species-specific \(\hat{R}_h\) suggest the relationship between predicted and measured volume varies by species—our predictions tend to overestimate volume for species A and underestimate it for species B. Again, this is okay as long as the over- or underestimation is proportional across the observed pairs of \(x\) and \(y\) values—and, by assumption, across the full population.

For example, \(\hat{R}\) for species B is 1.26, so if we predict volumes of 1, 2, 3, and 4 cubic meters, we expect measured values of about 1.26, 2.53, 3.79, and 5.05 cubic meters, respectively (scaling by the unrounded ratio). In other words, \(y\) and \(x\) are proportional. These proportional relationships are evident for species A and B in the exploratory plot shown in Figure 16.5, supporting our assumption of a linear, through-the-origin relationship between predicted and measured volumes.

Below are the species-specific estimates, followed by stand-level means and totals obtained by combining across strata. We begin by joining the species population information in pred_summary to measure_summary.

measure_summary <- left_join(measure_summary,
  pred_summary,
  by = join_by(species)
)

# Species-specific estimates.
measure_summary %>%
  mutate(
    y_bar_h = R_hat_h * mu_xh,
    s_y_bar_h = mu_xh * sqrt(s_sq_rh / n_h * (1 - n_h / N_h)),
    t_hat_h = N_h * y_bar_h,
    s_t_hat_h = N_h * s_y_bar_h,
    t_h = qt(df = n_h - 1, p = 1 - 0.05 / 2),
    t_ci_lower_h = t_hat_h - s_t_hat_h * t_h,
    t_ci_upper_h = t_hat_h + s_t_hat_h * t_h,
  ) %>%
  select(n_h, t_hat_h, s_t_hat_h, t_ci_lower_h, t_ci_upper_h)
#> # A tibble: 2 × 5
#>     n_h t_hat_h s_t_hat_h t_ci_lower_h t_ci_upper_h
#>   <int>   <dbl>     <dbl>        <dbl>        <dbl>
#> 1    25   1871.      21.9        1826.        1916.
#> 2     9    764.      20.1         717.         810.

With sample sizes of 25 and 9 measurements for species A and B, respectively, we achieved RSE values of 1.2% and 2.6% for the corresponding species totals. As defined in (11.38), the RSE is computed as \(100 \cdot s_{\hat{t}_h} / \hat{t}_h\).

Our estimated totals, with associated 95% confidence intervals, are 1,871 (1,826, 1,916) and 764 (717, 810) cubic meters for species A and B, respectively. The true volume totals are A = 1,851 and B = 740 cubic meters.

Now we’ll estimate the stand-level mean and total by combining the species-specific results using their population weights (again following estimators in Section 16.6.3).

measure_summary %>%
  mutate(
    y_bar_h = R_hat_h * mu_xh,
    s_y_bar_h = mu_xh * sqrt(s_sq_rh / n_h * (1 - n_h / N_h)),
  ) %>%
  summarize(
    N = sum(N_h),
    y_bar = 1 / N * sum(N_h * y_bar_h),
    s_y_bar = sqrt(1 / N^2 * sum(N_h^2 * s_y_bar_h^2)),
    t_hat = N * y_bar,
    s_t_hat = N * s_y_bar,
    t = qt(df = sum(n_h) - n(), p = 1 - 0.05 / 2),
    t_ci_lower = t_hat - s_t_hat * t,
    t_ci_upper = t_hat + s_t_hat * t
  ) %>%
  select(t_hat, s_t_hat, t_ci_lower, t_ci_upper)
#> # A tibble: 1 × 4
#>   t_hat s_t_hat t_ci_lower t_ci_upper
#>   <dbl>   <dbl>      <dbl>      <dbl>
#> 1 2635.    29.7      2575.      2696.

The species-specific stratified estimator produced an RSE of 1.1% for the total volume. The estimated total volume, with a 95% confidence interval, is 2,635 (2,575, 2,696) cubic meters. Recall, again, that the true total volume is 2,590 cubic meters.

In the first analysis, we ignored potential species differences and estimated a single pooled ratio, which yielded an RSE of 2.4%. We then allowed for species-specific ratios and applied the stratified estimator, which substantially reduced the RSE to 1.1%—less than half the pooled-ratio value. This demonstrates the value of using category-specific ratios when there are indications of differences (e.g., as seen in exploratory analysis) and enough observations to support separate estimation.

16.7 Advantages and disadvantages of estimation with covariates

Estimation with covariates improves efficiency by incorporating auxiliary information—such as remotely sensed data, ocular estimates, or other inexpensive field measures—into the estimation process. When the covariate is informative, this approach can substantially improve precision or reduce field effort. Below, we summarize the main advantages and limitations of covariate-based estimation.

16.7.1 Advantages

  • Improved precision: When the covariate is strongly related to the parameter of interest, ratio and regression estimators can yield substantially smaller standard errors than estimators that ignore auxiliary information—often achieving higher precision with the same or even smaller sample size.

  • Statistical and practical efficiency: Using inexpensive covariates allows effort to be shifted toward a smaller number of careful measurements, improving cost efficiency without sacrificing accuracy.

  • Flexible and widely applicable: Covariate-based estimators can be paired naturally with common probability sampling designs and are especially useful in settings where full measurement of the variable of interest is impractical.

  • Supports small-area and category-specific inference: When covariates are available for the full population or a large initial sample, estimates can be extended to subregions, species groups, or other categories without requiring additional measurements.

  • Multiple estimator choices: Ratio and regression estimators provide flexible options that can be matched to the form and strength of the relationship between the covariate and the parameter of interest.

16.7.2 Disadvantages

  • Dependence on the strength of the relationship: Covariate-based estimators are most effective when a reasonably linear or proportional relationship exists between the covariate and the variable of interest. When this relationship is weak, gains in precision may be modest.

  • Small-sample behavior: Regression and ratio estimators can exhibit a small bias when sample sizes are limited, although this typically decreases as the number of sampled units increases. Departures from the assumed relationship mainly reduce gains in precision rather than leading to systematically misleading estimates.

  • Greater analytical effort: Selecting an appropriate estimator, exploring the relationship between variables, and assessing whether assumptions such as linearity or proportionality are reasonable requires more judgment and diagnostic work than simpler estimators that rely only on direct measurements.

  • Choice among estimators requires expertise: Different covariate-based estimators are suited to different relationship structures and sampling implementations. Choosing among regression, ratio-of-means, and mean-of-ratios estimators requires understanding how variability in the variable of interest changes with the covariate and how the covariate is used in the sampling process. In designs such as 3P sampling, inappropriate estimator choices can lead not only to reduced efficiency but also to biased estimates.

  • Risk of extrapolation: Care is needed when applying covariate-based estimators beyond the range of observed data. Estimates can become unstable and highly variable if the relationship between the covariate and the parameter of interest is poorly supported in parts of the population.

16.8 Exercises

16.8.1 Estimating forest carbon using ALS and inventory data

A manager has wall-to-wall (i.e., for every location in the forest) LiDAR-derived canopy height measurements covering a 1,000 ha forest and field-based measurements of forest carbon for a set of inventory plots collected using systematic sampling. The LiDAR data were obtained using a drone-based aerial laser scanning (ALS) system. The goal is to estimate the mean and total forest carbon (Mg C/ha) across the forest using the covariate-based estimators developed in this chapter.

In this setting, LiDAR-derived canopy height is the covariate \(x\), and field-measured forest carbon is the variable of interest \(y\). For each inventory plot, the manager observes both the LiDAR-derived canopy height over the plot footprint and the corresponding field-based carbon measurement.

Because canopy height is available wall-to-wall, the population mean canopy height \(\mu_x\) is known and equals 23.99 m. This makes the setting appropriate for regression or ratio estimation with a known covariate mean.

The code below reads in the paired \((x, y)\) values for the \(n\) field plots and shows the file structure.

c_xy <- read_csv("datasets/xy_canopy_height_carbon.csv")
#> Rows: 25 Columns: 3
#> ── Column specification ────────────────────────────────
#> Delimiter: ","
#> dbl (3): plot_id, x, y
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
c_xy %>% glimpse()
#> Rows: 25
#> Columns: 3
#> $ plot_id <dbl> 359, 705, 363, 427, 114, 253, 97, 734…
#> $ x       <dbl> 34.821, 28.144, 33.448, 12.389, 22.08…
#> $ y       <dbl> 262.594, 232.289, 243.618, 88.522, 15…

Columns in c_xy are

  • plot_id: plot identifier;
  • x: LiDAR-derived canopy height (m);
  • y: field-measured forest carbon (Mg C/ha).

Next, to assess the relationship between \(x\) and \(y\), the code below plots the paired observations along with a best-fit line.

ggplot(c_xy, aes(x = x, y = y)) +
  geom_point(size = 2) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, fullrange = TRUE) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_vline(xintercept = 0, linetype = "dashed") +
  theme_bw() +
  labs(x = "LiDAR-derived canopy height (m)",
       y = "Field-measured forest carbon (Mg C/ha)")

FIGURE 16.6: LiDAR-derived canopy height (\(x\)) versus field-measured forest carbon (\(y\)) for the paired plots.


  1. Based on Figure 16.6, explain in one sentence why the ratio-of-means estimator is appropriate in this setting.

  2. Using the ratio-of-means estimator in Section 16.2.1, compute the population mean forest carbon \(\bar{y}_{\text{RM}}\), its standard error \(s_{\bar{y}_{\text{RM}}}\) (with no FPC), and a 95% confidence interval for \(\bar{y}_{\text{RM}}\).

  3. Using your estimate of the mean forest carbon and forest area \(A\) = 1,000 ha, estimate the total forest carbon \(\hat{t}_{\text{RM}}\) and its 95% confidence interval.

  4. Suppose the manager ignored the LiDAR canopy height data entirely and relied only on the \(n\) field-measured forest carbon values, treating them as a SRS from the population. Using these data, estimate the population mean forest carbon \(\bar{y}\), its standard error \(s_{\bar{y}}\) (with no FPC), and a 95% confidence interval for \(\bar{y}\). Then, using (11.39), quantify the percent improvement in precision of the ratio-of-means estimator relative to the SRS estimator. When applying (11.39), take estimator A to be the SRS estimator and estimator B to be the ratio-of-means estimator, so that \(s_A = s_{\bar{y}}\) and \(s_B = s_{\bar{y}_{\text{RM}}}\). Based on this metric, what percent precision is gained by incorporating the LiDAR covariate in this example?

To check that you’re on the right track, you should get something close to the following.

#> # A tibble: 2 × 5
#>   Estimator                Mean    SE CI_lower CI_upper
#>   <chr>                   <dbl> <dbl>    <dbl>    <dbl>
#> 1 SRS (ignores LiDAR)      191. 12.4      166.     217.
#> 2 Ratio-of-means (uses L…  184.  3.08     177.     190.
#> # A tibble: 1 × 2
#>   Metric                                       Value
#>   <chr>                                        <dbl>
#> 1 Percent improvement in precision (RM vs SRS)  93.8

Your ratio-of-means estimate of the total \(\hat{t}_{\text{RM}}\) should be approximately 183,692 Mg C,
with a 95% confidence interval of (177,335, 190,050) Mg C.

16.8.2 Estimating growing stock volume using PLS and inventory data

An agency is evaluating an inventory design that couples measurements collected with handheld personal laser scanning (PLS) devices with traditional field measurements. Such PLS data are increasingly used to augment or, in some cases, replace traditional tree measurement methods (see, e.g., Gollob, Ritter, and Nothdurft (2020) and Nothdurft et al. (2025b)). The goal is to produce high-quality growing stock volume (GSV) estimates for management units while reducing field costs (i.e., crew time). PLS-based measurements can be collected much more quickly than traditional tree measurements, but the resulting PLS-based GSV estimates are coarser and are best viewed as approximations to GSV derived from high-quality field measurements. In this setting, the efficient but lower-fidelity PLS-based GSV serves as the covariate \(x\), and traditionally measured GSV serves as the variable of interest \(y\).

Under the proposed design, the field crew collects PLS data on a dense SYS plot grid across a management unit, yielding a large phase-1 sample of PLS-derived GSV values. From this phase-1 sample, they then select a nested random subsample of plots for traditional field measurement, producing paired observations of PLS-based and field-based GSV for use in estimation.

The two-phase data are stored in "datasets/tls_two_phase_gsv.csv". This file has columns:

  • plot_id: sampling location identifier;
  • x: PLS-derived GSV (m\(^3\)/ha);
  • y: traditional field-measured GSV (m\(^3\)/ha), observed for the phase-2 subsample only (NA otherwise).

The code below reads the dataset and shows the file structure.

gsv_xy <- read_csv("datasets/tls_two_phase_gsv.csv")
#> Rows: 100 Columns: 3
#> ── Column specification ────────────────────────────────
#> Delimiter: ","
#> dbl (3): plot_id, x, y
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
gsv_xy %>% glimpse()
#> Rows: 100
#> Columns: 3
#> $ plot_id <dbl> 121, 164, 285, 292, 327, 329, 353, 68…
#> $ x       <dbl> 1290.37, 1040.38, 176.59, 275.07, 107…
#> $ y       <dbl> NA, NA, NA, NA, 1049.60, 506.62, NA, …

Next, to assess the relationship between the phase-2 data \(x_2\) and \(y_2\), the code below plots the paired observations along with a best-fit line.
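
One way to produce this plot, mirroring the code used for Figure 16.6 and keeping only the phase-2 plots where \(y\) is observed, is the sketch below (the axis labels are illustrative).

gsv_xy %>%
  filter(!is.na(y)) %>% # Keep the phase-2 plots with field-measured GSV.
  ggplot(aes(x = x, y = y)) +
  geom_point(size = 2) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, fullrange = TRUE) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_vline(xintercept = 0, linetype = "dashed") +
  theme_bw() +
  labs(x = "PLS-derived GSV (m^3/ha)",
       y = "Field-measured GSV (m^3/ha)")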


FIGURE 16.7: PLS-derived proxy (\(x_2\)) versus field-measured growing stock volume (\(y_2\)) for the paired phase-2 plots.


  1. Using Figure 16.7 of \((x_2, y_2)\), assess whether the fitted line passes near the origin and whether variance appears roughly constant or increases with \(x\). Based on this assessment, which estimator would you expect to be more appropriate: the regression estimator or the ratio-of-means estimator?

  2. Using the double sampling estimators in Sections 16.3.1 and 16.3.2, compute:

    • the regression estimate of the population mean, \(\bar{y}_{\text{Rd}}\), its standard error \(s_{\bar{y}_{\text{Rd}}}\), and a 95% confidence interval;
    • the ratio-of-means estimate of the population mean, \(\bar{y}_{\text{RMd}}\), its standard error \(s_{\bar{y}_{\text{RMd}}}\), and a 95% confidence interval.

Note that because an areal sampling frame is used, the second FPC term in (16.23) and (16.28) is omitted.

  3. The management unit has area \(A\) = 10,000 ha. Scale each mean estimate to obtain the total GSV, i.e., \(\hat{t}_{\text{Rd}}\) and \(\hat{t}_{\text{RMd}}\), along with their respective 95% confidence intervals.

  4. Using the percent improvement in precision metric defined in (11.39), quantify the percent improvement in precision of the regression estimator relative to the ratio-of-means estimator. When applying (11.39), take estimator A to be the ratio-of-means estimator and estimator B to be the regression estimator, so that \(s_A = s_{\bar{y}_{\text{RMd}}}\) and \(s_B = s_{\bar{y}_{\text{Rd}}}\). Based on this metric, which estimator is preferred in this setting? Does this result agree with your initial assessment based on Figure 16.7?

  5. Suppose the agency ignored the PLS data entirely and collected only traditional field measurements of GSV, yielding a sample of size \(n_2\) of \(y\) values. Using these data, estimate the population mean \(\bar{y}\), its standard error, and a 95% confidence interval. Then, using (11.39), quantify the percent improvement in precision of the regression estimator relative to this SRS estimator (again treating the SRS estimator as A and the regression estimator as B). How much precision is gained by incorporating the PLS covariate in this example?

To check that you’re on the right track, you should get something close to the following.

  • Mean estimates (m\(^3\)/ha)
    • Regression estimator: \(\bar{y}_{\text{Rd}}\) = 665.94,
      \(s_{\bar{y}_{\text{Rd}}}\) = 37.11,
      95% CI (589.18, 742.71)
    • Ratio-of-means estimator: \(\bar{y}_{\text{RMd}}\) = 666.64,
      \(s_{\bar{y}_{\text{RMd}}}\) = 38.88,
      95% CI (586.4, 746.88)
  • Total estimates over \(A\) = 10,000 ha (m\(^3\))
    • Regression estimator: \(\hat{t}_{\text{Rd}}\) = 6,659,433,
      95% CI (5,891,759, 7,427,107)
    • Ratio-of-means estimator: \(\hat{t}_{\text{RMd}}\) = 6,666,410,
      95% CI (5,863,975, 7,468,846)

16.8.3 Northern California 3P cruise

Suppose you conducted a 3P cruise on a conifer stand in northern California with the goal of estimating the total volume (ft\(^3\)) with associated 95% confidence intervals. Because the landowner wanted all high-value trees fully accounted for, you defined any tree with a predicted volume exceeding 85 ft\(^3\) as sure-to-be-measured. These trees are always selected for measurement.

Following the 3P sampling steps in Section 16.6.1, this 85 ft\(^3\) cutoff also serves as the upper bound of the uniform distribution used to generate the random number \(z_i\). As each tree was encountered in the field, you recorded its predicted volume \(x_i\) and drew a corresponding \(z_i \sim U(0, 85)\). A tree was selected for measurement when \(x_i \ge z_i\), in which case its volume \(y_i\) was determined from tree measurements (e.g., DBH, height) and a volume table. Trees with \(x_i < z_i\) were not measured and retain their prediction only.

This procedure produces three categories of trees:

  • census of predicted volume \(x_i\) for trees with \(x_i \le 85\) ft\(^3\) (these are used to compute \(\mu_x\) and \(\tau_x\)),
  • sample of measured volume \(y_i\) with paired \((x_i, y_i)\) values (used to estimate \(\hat{R}\) and other parameters),
  • census of sure-to-be-measured volume for trees with \(x_i > 85\) ft\(^3\) (added directly to the estimated total and confidence interval bounds).

As discussed in Section 16.6.2, sure-to-be-measured trees form a census stratum with zero sampling variance. Their volumes are added directly to the estimated total, and they contribute no sampling uncertainty.

Table 16.2 lists the 30 trees in the population along with their predicted values, random draws, and (when selected) measured volumes. For trees in the sure-to-be-measured stratum, only the measured volume is shown to avoid confusion with the unused prediction. As you inspect the table, note that under the 3P design, whenever \(x_i \ge z_i\), the observed volume \(y_i\) is recorded.

TABLE 16.2: Field data from a 3P cruise in northern California conifers. All values are in cubic foot volume; “Sure” marks sure-to-be-measured trees, for which only the measured volume is shown.

Tree   x_i   z_i   y_i  Sure     Tree   x_i   z_i   y_i  Sure
   1    54    57                   16    48    48    54
   2    48    48    51             17    60    51
   3                99   Yes       18    54    48
   4    48    54                   19    63    57    60
   5    57    54    63             20    51    66
   6    51    60                   21                96   Yes
   7    60    69                   22    66    57    69
   8    51    45    54             23    57    66
   9                93   Yes       24    69    75
  10    45    51                   25    51    63
  11    63    72                   26    57    48    60
  12    60    54    66             27    66    66    69
  13    72    75                   28    51    63
  14    57    63                   29               102   Yes
  15                93   Yes       30    66    57    63

We now turn to estimation. The first step is to read the three files that store the data in Table 16.2. After reading them in, run glimpse() on each to verify their column names and values, noting that the files use descriptive column names while the table uses the notation introduced for 3P sampling.

predicted <- read_csv("datasets/CA_3P/predicted.csv")
measured <- read_csv("datasets/CA_3P/measured.csv")
sure <- read_csv("datasets/CA_3P/sure_to_be_measured.csv")

Exercise 16.1 In this first exercise, ignore the sure-to-be-measured trees and apply the 3P estimators detailed in Section 16.6.2. Here are the steps.

  1. Compute \(N\), \(\mu_x\), and \(\tau_x\) from the \(x_i\) in predicted.
  2. Compute \(n\), \(\hat{R}\) (16.17), and \(s^2_r\) (16.19) from the paired \((x_i, y_i)\) in measured.
  3. Given values and estimates from Steps 1 and 2, compute \(\bar{y}\) (16.16), \(s_{\bar{y}}\) (16.18), \(\hat{t} = N\bar{y}\) (16.20), and \(s_{\hat{t}}\) (16.21).
  4. Construct a 95% confidence interval for \(\hat{t}\) using \(s_{\hat{t}}\) and a \(t\)-value with \(n-1\) degrees of freedom.

Your sample size should be \(n\) = 10. After rounding, you should get an \(\hat{R}\) of 1.05, \(s^2_r\) of 0.003, and \(\hat{t}\) of 1496.54 ft\(^3\) with a standard error of 20.19 ft\(^3\). Your 95% confidence interval for the total should be (1450.88, 1542.21) ft\(^3\).

Exercise 16.2 Now modify your code to add in the sure-to-be-measured tree totals using the following steps.

  1. Compute \(t_{\text{sure}}\) by summing the measured volumes in sure.
  2. Compute \(\hat{t}_{\text{adj}} = \hat{t} + t_{\text{sure}}\).
  3. Shift both confidence interval endpoints by adding \(t_{\text{sure}}\).

As discussed in Section 16.6.2, the standard error \(s_{\hat{t}}\) doesn’t change because the sure-to-be-measured trees are a census and hence contribute no sampling variance.

Your sure-to-be-measured total \(t_{\text{sure}}\) should be 483. Your adjusted total \(\hat{t}_{\text{adj}}\) should be 1979.54 ft\(^3\), with 95% confidence interval (1933.88, 2025.21) ft\(^3\).

References

Bell, John F., and Lucien B. Alexander. 1957. “Application of the Variable Plot Method of Sampling Forest Stands.” Oregon State Board of Forestry.
Bruce, Donald. 1961. Prism Cruising in the Western United States and Volume Tables for Use Therewith. Mason, Bruce & Girard.
Gollob, Christoph, Tim Ritter, and Arne Nothdurft. 2020. “Forest Inventory with Long Range and High-Speed Personal Laser Scanning (PLS) and Simultaneous Localization and Mapping (SLAM) Technology.” Remote Sensing 12 (9): 1509. https://doi.org/10.3390/rs12091509.
Gove, Jeffrey H., Timothy G. Gregoire, Mark J. Ducey, and Thomas B. Lynch. 2020. “A Note on the Estimation of Variance for Big BAF Sampling.” Forest Ecosystems 7 (1): 62.
Gregoire, Timothy G., and Harry T. Valentine. 2008. Sampling Strategies for Natural Resources and the Environment. Chapman & Hall/CRC Applied Environmental Statistics. Taylor & Francis.
Grosenbaugh, Lewis R. 1965. Three-Pee Sampling Theory and Program ’THRP’ for Computer Generation of Selection Criteria. U.S. Forest Service Research Paper PSW. Pacific Southwest Forest and Range Experiment Station, Forest Service, U.S. Department of Agriculture.
Iles, Kim. 2003. A Sampler of Inventory Topics: A Practical Discussion for Resource Samplers, Concentrating on Forest Inventory Techniques. Kim Iles & Associates.
Marshall, David D., Kim Iles, and John F. Bell. 2004. “Using a Large-Angle Gauge to Select Trees for Measurement in Variable Plot Sampling.” Canadian Journal of Forest Research 34 (4): 840–45.
Mesavage, Clement, and James W. Girard. 1956. Tables for Estimating Board-Foot Volume of Timber. U.S. Government Printing Office.
Nothdurft, Arne, et al. 2025b. “Small Area Estimation of Growing Stock Timber Volume, Basal Area, Mean Stem Diameter, and Stem Density for Mountain Forests in Austria.” Canadian Journal of Forest Research 55: 1–20. https://doi.org/10.1139/cjfr-2024-0302.
Welch, B. L. 1947. “The Generalization of Student’s Problem When Several Different Population Variances Are Involved.” Biometrika 34 (1-2): 28–35.
West, Philip. 2011. “Potential for Wider Application of 3P Sampling in Forest Inventory.” Canadian Journal of Forest Research 41 (July): 1500–1508.
Wiant, Harry V. 1986. “Computer Corner: Formulas for Mesavage and Girard’s Volume Tables.” Northern Journal of Applied Forestry 3 (3): 124. https://doi.org/10.1093/njaf/3.3.124.

  1. Looking at (16.12), the estimator might arguably be better described as a ratio-of-totals, since it’s equivalent to the ratio of summed values.↩︎

  2. In R, the sample variance and covariance can be computed using var() and cov(), respectively.↩︎

  3. Volume for guessed trees is based on Girard Form Class 78, International 1/4-inch rule in Table 21 of Mesavage and Girard (1956) (8-inch DBH volumes interpolated from the published 10-inch values). For example, looking at the first row in Table 16.1, a 10-inch DBH tree with 1.5 16-foot logs has a volume of 48 board feet according to Table 21. Volumes for measured trees use the corresponding Form Class 78, International 1/4-inch formula published in Wiant (1986). This formula was used to create R functions in Exercise 5.3. For instance, to calculate the measured volume of a 10.4-inch DBH tree with 27 feet of merchantable height (i.e., 27/16=1.6875 16-foot logs), use vol_I(10.4, 27/16), which returns 57.9 board feet after rounding.↩︎

  4. Technically, the VBAR estimate could come from a separate set of sampling locations or even a different sampling design; it’s just convenient to take spatially paired measurements.↩︎

  5. The wide-scale relascope includes larger BAF increments, such as 20.25 \(\text{m}^2/\text{ha}\), suitable for Big BAF sampling. To convert from metric to English units (\(\text{ft}^2/\text{ac}\)), multiply by 4.356; the chosen BAFs correspond approximately to English BAFs of 21.8 (small) and 88.2 (big).↩︎

  6. We didn’t cover the tidyr::count() and uncount() functions, but do check them out—they’re very handy.↩︎

  7. Trees with \(x_i > \text{max}_z\) are always selected. See the discussion of “sure-to-be-measured” trees later in this section.↩︎

  8. As we’ll see later, the 3P estimator includes an FPC, so this census stratum’s standard error will automatically be zero.↩︎

  9. You can reproduce the estimates for \(\alpha_0\) and \(\alpha_1\) using nls(vol_m3 ~ alpha_0 * dbh_cm^alpha_1, start = list(alpha_0 = 0.001, alpha_1 = 2.5), data = dbh_vol) with dbh_vol read from "datasets/BC_3P/dbh_vol.csv".↩︎

  10. Since these are simulated data, we know they were generated with slightly different ratios.↩︎
