Chapter 13 Sampling designs, implementation, and analysis
In Chapters 11 and 12, we focused on inference using sample data collected under an SRS design. In this chapter, we introduce several additional probability sampling designs commonly used in forestry. Recall that a sampling design is a set of rules for selecting units from a population and using the resulting data to generate statistically valid estimates for parameters of interest. We focus on five designs: 1) systematic, 2) stratified, 3) regression, 4) cluster, and 5) multistage sampling. These designs differ in how they configure and select observations and in the estimators they prescribe. In most cases, a design is chosen to improve statistical efficiency, logistical efficiency, or both.
As discussed in Section 12.1, forest inventory typically relies on an areal sampling frame, where sampling locations are identified by randomly placing points within the frame’s spatial extent. Under plot sampling, the location might mark the center of a circular plot, a corner of a rectangular plot, or an anchor point for more complex plot shapes. Under point sampling, a location might serve as the point from which a forester projects a critical angle to identify measurement trees. Unless stated otherwise, the sampling designs described in this chapter can be used with either plot or point sampling—that is, the design governs the selection of sampling locations, while a separate set of rules determines how trees are selected around each location.
As in other mensuration texts (Kershaw et al. 2016; Burkhart, Avery, and Bullock 2018), we present estimators associated with each design. The text by Iles (K. Iles 2003) is especially valuable for readers seeking practical guidance: it expands on these methods, discusses their implementation in the field, and provides real-world perspectives and advancements. Our added contribution is to show how these estimators can be implemented using tidyverse
tools and other R packages. Emphasis is placed on building computing workflows for efficient and repeatable analysis.
13.1 A gentle introduction to sf
Because sampling in forestry is inherently spatial—we’re selecting locations within an areal sampling frame—it’s helpful to start working with spatially aware data structures. In R, the primary tool for handling spatial vector data is the sf
package.
If you’ve used GIS software before, many of the concepts will feel familiar. Spatial vector data come in three main types: points (e.g., sampling locations), lines (e.g., transects or travel paths), and polygons (e.g., stands, management units, or forest extents). The sf
package represents each of these as a special kind of data frame with an attached spatial geometry column.
You can work with sf
objects much like regular data frames, using familiar tidyverse
tools such as filter()
, select()
, and mutate()
. We’ll introduce key sf
data structures and functions as needed throughout the chapter. For more background, see the sf
documentation at https://r-spatial.github.io/sf or the package vignette via vignette("sf1")
.
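To make this concrete, here's a minimal sketch showing how a few sampling locations become an sf point object and how it responds to the usual tidyverse verbs. The coordinates, IDs, and coordinate reference system (EPSG 32619) are made up for illustration.

library(sf)
library(tidyverse)

# Three hypothetical sampling locations with projected coordinates (m).
pts <- tibble(point_id = 1:3,
              x = c(0, 100, 200),
              y = c(0, 50, 100)) %>%
  st_as_sf(coords = c("x", "y"), crs = 32619)

# sf objects behave like data frames with an attached geometry column.
pts %>% filter(point_id > 1)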
13.2 Systematic sampling
Systematic sampling (SYS) is the most widely used probability sampling design in forest inventory. In SYS, sampling locations are arranged in a fixed pattern across the population. To introduce randomness and ensure unbiased estimation of means and totals, the pattern—typically a rectangular grid—is positioned using a randomly chosen starting point. As illustrated in Figure 1.2, sampling locations are usually aligned along straight lines. The distance between locations on a line, known as the sampling interval, and the distance between lines can differ, creating a rectangular grid. This is commonly referred to as line-plot sampling, which is our focus here.148
SYS designs are popular in forestry for several reasons. Compared to SRS, their advantages include the following.
Sampling locations are evenly distributed across the population, improving the reliability of mean and total estimates. In contrast, SRS locations may, by chance, miss important spatial features.
Sampling locations are easier and faster to find, reducing cost. The cruiser can follow straight lines between locations, which minimizes travel time and simplifies relocation during remeasurement.
The grid is typically positioned using GIS or spatial software (e.g., R), and the resulting points are uploaded to a GPS device for navigation. Alternatively, the cruiser may locate the first sampling point in the field and lay out the remainder using the sampling interval and grid orientation.
Key considerations when implementing a SYS design include:
the pattern and orientation of the grid;
the method used to randomly position the grid.
These points are covered in the next few sections.
13.2.1 Grid pattern, number of grid points, and orientation
In practice, either a rectangular or square grid is used. A rectangular grid defines the spacing between sampling locations along a line, denoted \(D_p\) (with the subscript \(p\) reminding us this spacing is for “plots” or “points”), and the spacing between lines, \(D_l\). A square grid is a special case where \(D_{\text{sq}} = D_p = D_l\). Unless there’s a compelling reason to use unequal spacing, a square grid is a reasonable and efficient default.
Given a sample size \(n\) and total area \(A\), the spacing between locations in a square grid is given by
\[\begin{equation} D_{\text{sq}} = \sqrt{\frac{cA}{n}}, \end{equation}\]
where \(c\) is a conversion factor to express area in square distance units. For example, use \(c = 10,\!000\ \text{m}^2/\text{ha}\) if \(A\) is in hectares and \(D_{\text{sq}}\) is in meters, or \(c = 43,\!560\ \text{ft}^2/\text{ac}\) if \(A\) is in acres and \(D_{\text{sq}}\) is in feet.
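For example, a quick calculation of the square-grid spacing for a hypothetical 100 ac tract and a target of 30 sampling locations (values made up for illustration) looks like this.

A <- 100          # Tract area (ac).
n <- 30           # Target number of sampling locations.
c_conv <- 43560   # Conversion factor (ft^2/ac).
D_sq <- sqrt(c_conv * A / n)
D_sq              # Spacing in feet, about 381 ft (roughly 5.8 chains).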
As introduced in Section 11.3.6, sample size \(n\) should be based on the level of precision needed to meet the survey’s objectives. However, in practice, it’s common for experienced foresters to adopt a sampling intensity that works well for their forest type and inferential objectives. For example, the forester who conducted the Elk County cruise described in Section 1.2.2 typically uses a seven-by-seven chain SYS grid.
If you do use a sample size calculation, as recommended in Section 11.3.6, it’s likely that the number of grid points won’t exactly match the computed \(n\) due to how the grid falls within the boundary. In such cases, we agree with the following sage words.
My advice is simply to take the number of points that are indicated [by the grid], and live with the slight variability. Sample sizes that are “needed” … are just estimated anyway.
— Kim Iles
If you really must have a specific sample size, then the best approach is to specify a denser grid than needed and randomly or systematically thin points until the target sample size is reached (see, e.g., K. Iles (2003)).
The grid orientation can be adjusted to improve field logistics. For example, aligning lines with a forest boundary may simplify navigation. This was the approach used in the Elk County cruise (Figure 1.2).
Although rare, it’s possible for a population to exhibit a systematic pattern that aligns with the grid’s spacing and orientation, which could introduce bias. For example, a grid aligned with plantation rows might consistently place sampling locations between rows, leading to an unrepresentative sample. In such cases, rotating the grid to avoid alignment with periodic features can mitigate bias.
If a known gradient exists in the population—for example, productivity increases with elevation—orienting the grid so that its lines run perpendicular to the gradient can improve sample representativeness.
13.2.2 Positioning the grid
To avoid bias, the grid must be randomly placed. This is typically done by selecting a random point start within the areal sampling frame and extending the grid in each direction from that point. If rotation is applied, it should be performed about this random point start.
The random point start can be one of the sampling locations or used as an anchor point from which to extend the grid. In the development that follows, we treat the starting point as a location on the sampling grid.
In the next section, we introduce a function that generates systematic sampling grids in R. This function allows users to specify grid spacing, orientation, and how sampling locations are ordered. The code and examples also help highlight some practical considerations when laying out systematic designs.
13.2.3 Creating a systematic sampling grid with sys_grid()
This section is presented in the style of a condensed R help page to document the sys_grid()
function. For a deeper understanding, you’re encouraged to read through the function source code in "scripts/utils.R"
(see Section 5.4). The code comments walk through each step, and many of the elements should already be familiar.
Description
Creates a systematic grid of points within a polygonal boundary using a random start. The grid is defined by a fixed spacing between lines (D_l
) and between points along each line (D_p
). The grid can be rotated and the points ordered from a specified origin to facilitate field navigation.
Usage
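The usage line didn't survive rendering here; the sketch below is inferred from the arguments that follow, and the argument order is an assumption.

sys_grid(boundary, D_l, D_p, angle = 0, origin = "se")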
Arguments
- boundary: An sf polygon object defining the sampling boundary. Must be projected with linear units.
- D_l: Distance between sampling lines (x-direction) in the same units as boundary.
- D_p: Distance between points along a line (y-direction) in the same units as boundary.
- angle: Optional rotation angle (in degrees) applied to the grid. Default is 0.
- origin: Optional character string indicating the origin corner used to order points. One of "se", "sw", "ne", or "nw" (default is "se").
Details
The sys_grid()
function creates a rectangular grid of points with spacing determined by D_l
and D_p
, starting from a randomly placed point within the boundary polygon. Here’s how it works.
- A random point start is placed within the boundary.
- A full grid is generated around that point, extended enough to accommodate possible rotation.
- Points are ordered in a snaking pattern from the specified origin.
- If specified, the grid is rotated by angle degrees about the random point start.
- The grid is clipped to the polygon, and the resulting points are labeled and returned.

Ordering by origin can help plan field navigation, especially when following lines. The choice of origin affects how line_id, point_id_in_line, and point_id are assigned, but not the design itself.
Value
Returns a list with the following components.149
- grid: An sf point object with columns:
  - point_id: unique ID, ordered from the origin;
  - line_id: identifier for each line;
  - point_id_in_line: position along each line.
  Columns in boundary are appended.
- path: An sf LINESTRING connecting the points in sampling order.
- random_point_start: The initial random point start (an sf point object).
- boundary: The input polygon(s) used to clip the grid.
- type: A string used internally to verify the object when plotting.
Use plot_sys_grid()
to visualize the resulting grid and path.
13.2.4 Systematic sampling grid illustration with sys_grid()
Let’s apply sys_grid()
to set up a SYS design for the Elk County cruise described in Section 1.2.2. In that cruise, the forester used a square grid with sampling locations spaced every seven chains (recall that one chain equals 66 feet).
The sys_grid()
function expects the boundary polygon to be in a projection that uses the same distance units as D_l
and D_p
. The Elk County property is stored in the "datasets/PA/Elk_bounds"
ESRI shapefile (Environmental Systems Research Institute 1998), which uses a geographic coordinate system (i.e., decimal degrees of latitude and longitude). Decimal degrees aren’t an intuitive measurement system for field work, so instead we want the boundary in a projected coordinate system that allows us to specify D_l
and D_p
in feet or meters.
After reading the shapefile using st_read()
, the code below reprojects the boundary to a projection appropriate for northern Pennsylvania and consistent with the cruise protocol. This is done using st_transform()
.150
# Read in the bounding polygon.
bounds <- st_read("datasets/PA", "Elk_bounds", quiet = TRUE)
bounds <- st_transform(bounds, crs = st_crs(2271))
Once reprojected, bounds
has units in feet, so we specify D_l
and D_p
in feet when calling sys_grid()
. Because we don’t specify values for angle
and origin
, the defaults of 0
and "se"
, respectively, are used. To ensure the same random point start is generated each time the code is run, we set the random seed prior to calling the function.
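The call itself isn't shown in the rendered output. A sketch consistent with the seven-by-seven chain protocol and the variations shown later (and assuming sys_grid() has been sourced from "scripts/utils.R") is given below.

set.seed(1)
sys <- sys_grid(bounds, D_l = 7 * 66, D_p = 7 * 66)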
We can visualize the resulting SYS design by passing the sys
object to the plot_sys_grid()
function, defined in the "scripts/utils.R"
script. The resulting map is shown in Figure 13.1. We encourage you to explore the plot_sys_grid()
function to see how it builds the graphic using ggplot()
. Compared to the simpler examples in Chapter 9, this plot is a bit more involved, with a customized legend and multiple spatial layers.151
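The plotting call is a one-liner, assuming sys holds the grid created above.

plot_sys_grid(sys)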
FIGURE 13.1: Systematic sampling design for an Elk County property cruise.
Let’s create a few variations on the design shown in Figure 13.1 by rotating the grid and choosing a different origin—while keeping the same random point start. This isolates the effect of orientation and ordering. To reduce code clutter, we don’t show the calls to plot_sys_grid()
or the patchwork
functions used to produce the composite graphic shown in Figure 13.2.
# (a) Rotate grid 45 degrees.
set.seed(1)
sys_a <- sys_grid(bounds,
D_l = 7 * 66, D_p = 7 * 66, angle = 45)
# (b) Northwest corner start.
set.seed(1)
sys_b <- sys_grid(bounds,
D_l = 7 * 66, D_p = 7 * 66, origin = "nw")
# (c) Rotation and northwest corner start.
set.seed(1)
sys_c <- sys_grid(bounds,
D_l = 7 * 66, D_p = 7 * 66,
angle = 45, origin = "nw")
FIGURE 13.2: Systematic sampling design for an Elk County cruise under different grid rotations and starting origins.
Notice in Figures 13.1 and 13.2 that the only random component of systematic sampling is the initial point start. Once that point is chosen, all other sampling locations are fixed by the grid’s spacing and rotation. Thus, the layout looks highly structured, but it still retains a probabilistic foundation because the sample depends on a randomly chosen start. This single source of randomness is what allows SYS to be used for statistical inference, subject to the assumptions and caveats outlined in the next section.
With the sampling grid in place, we now turn to estimation using data collected under systematic sampling.
13.2.5 Estimation using SYS
The SYS estimators for the population mean \(\mu\) (11.1) and total \(\tau\) (11.6) are the same sample mean \(\bar{y}\) (11.15) and total \(\hat{t}\) (11.22) used under SRS (Section 11.3); both remain unbiased under SYS.
There is, however, no valid estimator for the sample variance (and thus no estimator for the standard error of the mean) under SYS. The common workaround is to apply the SRS estimators for variance (11.17) and standard error (11.21).
In theory, these SRS-based estimators can either overstate or understate variability, and it’s not straightforward to know which case applies without additional information about the population. In practice, however, they usually perform well enough to be useful and are widely adopted.
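As a concrete sketch of this workaround, the SRS formulas from Section 11.3 are simply applied to the SYS plot values; the AGB values below are hypothetical.

y <- c(210, 185, 240, 205, 230, 195, 220, 215) # Plot-level AGB (Mg/ha).
n <- length(y)
y_bar <- mean(y)            # Estimate of the mean.
s_y_bar <- sd(y) / sqrt(n)  # SRS standard error, used as an approximation.
c(y_bar, s_y_bar)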
The reason SYS lacks a valid variance estimator is fairly intuitive. Recall from Section 11.3 that the sample variance estimator for SRS assumes the \(n\) observations are drawn independently and at random from the population. This independence is critical: it justifies the use of \(n - 1\) in the denominator of the variance estimator (11.17) (the subtraction of one reflects that one piece of information has already been used to estimate the sample mean).
By contrast, under SYS a single random start point determines the entire sample. Once the start is fixed, all subsequent locations are determined by the grid layout, so the observations are not independent. From this perspective, SYS effectively behaves like a sample of size one. Plugging \(n = 1\) into the SRS variance estimator (11.17) leads to division by zero, which is undefined. This highlights the core issue: SYS does not yield a set of independent observations, so the usual SRS variance estimators do not apply.
Despite these limitations, the SRS variance estimator is commonly used as a practical approximation for SYS. In most applications it provides a reasonable estimate of sample variance and, by extension, the standard error of the mean. Still, interpretation requires care: how the population is structured, and how that structure interacts with the systematic grid, strongly influences the reliability of variance estimates. See Kershaw et al. (2016) and K. Iles (2003) for further discussion, including cautions and justification for why the SRS variance estimator is often considered acceptable for SYS data.
13.2.6 Advantages and disadvantages of systematic sampling
Systematic sampling offers several practical benefits and is widely used in forestry, but like all designs, it involves trade-offs. Here we summarize its main strengths and limitations, particularly in comparison to SRS.
Advantages
Efficient field layout: SYS designs are easy to implement in the field. Sampling locations fall on a predictable grid, allowing cruisers to follow straight lines, reducing travel time and simplifying navigation.
Improved spatial coverage: Grid-based sampling ensures sampling locations are evenly distributed across the population, helping to capture spatial heterogeneity and reduce the risk of missing important features—a common concern with SRS.
Simplifies remeasurement: Because sampling locations follow a fixed, repeatable pattern, they’re easier to relocate during future inventory cycles.
Cost-effective: The combination of reduced travel time and simplified logistics typically results in lower per-unit sampling cost compared to SRS.
Good performance in many populations: SYS often yields more precise estimates than SRS for the same sample size, particularly when the population lacks strong periodicity that aligns with the grid. When a broad spatial trend is present, careful grid orientation can actually improve precision—for example, by laying the grid across the direction of the trend to ensure full coverage.
Disadvantages
No valid variance estimator: SYS lacks a design-based estimator for the sample variance and standard error. While the SRS estimator is commonly used as an approximation, it may over- or underestimate uncertainty, depending on the population structure.
Potential bias from periodicity: If the population exhibits spatial patterns that align with the grid spacing or orientation (e.g., plantation rows, windrow lines, soil type bands), SYS can produce biased estimates.
Requires spatial data infrastructure: Implementing SYS effectively typically relies on GIS tools and spatial data, which may be a barrier in some settings.
Despite these drawbacks, SYS remains a go-to design in operational forestry due to its simplicity, spatial balance, and low implementation cost. When used thoughtfully—with attention to population structure and proper randomization of the starting point—SYS can yield highly effective inventory data.
13.3 Stratified sampling
In forest inventory, it’s common to estimate population parameters not only for the entire area of interest but also for smaller subareas, or strata. Stratification refers to dividing the population into non-overlapping subareas, either to improve statistical efficiency or to ensure estimates are available for specific subareas of interest. Ideally, strata are defined using characteristics expected to influence the parameter being estimated, making them internally homogeneous but meaningfully different from one another. Common stratification variables include species, age, productivity, and management unit.
Stratified sampling involves drawing a probability sample independently within each stratum. While sampling within strata is most often done using SRS or SYS, other probability sampling methods can be used. The main benefits of stratification are potential gains in statistical efficiency, i.e., reduced standard errors, and the ability to produce both overall and stratum-specific estimates.
Stratification is often implemented in one of two ways.
Pre-stratification: Strata are delineated before sampling begins, letting us allocate stratum-specific sample sizes based on stratum size, expected variability, and sampling costs. This gives full control over the number of observations placed in each stratum. In this case, the area of each stratum must be known ahead of time.
Post-stratification: Here, sampling proceeds without regard to strata, and strata are assigned after the fact based on information collected during or after sampling. This still supports stratum-specific estimation, but the per-stratum sample sizes aren’t under our control. The area of each stratum must be determined after sampling.
Both approaches are valid. The main advantage of pre-stratification is the ability to apply optimum allocation—assigning more observations to strata that are large or exhibit greater variability, and fewer to those that are small, more homogeneous, or costly to sample. This improves statistical efficiency while controlling sampling costs.
There are substantial statistical and cost-saving benefits to stratification. Even a poor choice of strata yields unbiased estimates and, in almost all practical cases, performs no worse than unstratified sampling. At worst, you’ll end up with about the same standard error as you’d get without stratification; at best, you can dramatically improve statistical and cost efficiency.
Stratified sampling proceeds in two steps.
Assignment: Before any measurements are taken (in the case of pre-stratification), areas of the population—such as stands or management units—are assigned to strata using available information.
Measurement: A sample is then drawn within each stratum. At least two observations per stratum are needed to compute within-stratum variance.
13.3.1 Choosing stratum membership
Stratification begins by subdividing the population into distinct strata based on available information. Assignment can be guided by prior inventory data (e.g., species, age, height, volume, or biomass), expert judgment, or ancillary information like topography, canopy structure, or remote sensing-based spectral or structural measurements. Strata may also align with management units or areas where a separate estimate is desired.
The key requirement is that strata don’t overlap. For example, if strata are defined using polygons—where one or more polygons comprise a stratum—those polygons must be spatially distinct, i.e., their boundaries don’t overlap. Each location in the population must belong to one and only one stratum.
Useful strata group together areas that are similar to each other and different from the rest. Ideally, variance within a stratum is low, and variance between strata is large. These characteristics improve statistical efficiency.
A stratum can consist of spatially contiguous or disjoint areas. For example, a stratum may comprise spatially adjacent stands or stands scattered across the forest. It’s the internal consistency of each stratum that counts—not whether the areas are next to each other on the map.
In the next section, we introduce the estimators used in stratified sampling.
13.3.2 Estimation using stratified sampling
Throughout this section, and in the subsequent illustration, we assume that sampling within each stratum is done using SRS or SYS, and that SRS estimators are used for the stratum means and variances. Other within-stratum designs are possible but may require different estimators.
We introduce the following notation:
- \(H\) number of strata in the population;
- \(N_h\) population size in the \(h\)-th stratum;
- \(N = \sum_{h = 1}^H N_h\) total population size across all strata;
- \(n_h\) sample size in the \(h\)-th stratum;
- \(n = \sum_{h = 1}^H n_h\) total sample size across all strata.
If we’re working from an areal sampling frame, then \(N\) and \(N_h\) are expressed in area units, e.g., acres or hectares.
The stratum population mean \(\mu_h\) is estimated using the sample mean \[\begin{equation} \bar{y}_h = \frac{\sum_{i = 1}^{n_h} y_{hi}}{n_h}, \tag{13.1} \end{equation}\] where \(y_{hi}\) is the \(i\)-th observation in the \(h\)-th stratum.
The overall population mean \(\mu\) (i.e., across all strata) is estimated using \[\begin{equation} \bar{y} = \frac{\sum_{h = 1}^H N_h \bar{y}_h}{N}. \tag{13.2} \end{equation}\] This is a weighted sum of the stratum-specific sample means, with weights given by each stratum’s proportion of the total population size (or area), i.e., \(N_h / N\).
The population total \(\tau\) is estimated using \[\begin{equation} \hat{t} = \sum_{h = 1}^H N_h \bar{y}_h = N \bar{y}. \tag{13.3} \end{equation}\]
The sample variance within each stratum, \(s^2_h\), is estimated using the SRS variance estimator (11.17). If standard errors and confidence intervals are desired for individual strata, use (11.20) or (11.21) and (11.31), respectively.
Assuming a finite population, the standard error of the overall mean \(\bar{y}\) is \[\begin{equation} s_{\bar{y}} = \sqrt{ \frac{1}{N^2} \sum_{h = 1}^H \left[ \frac{N_h^2 s_h^2}{n_h} \left( \frac{N_h - n_h}{N_h} \right) \right] }. \tag{13.4} \end{equation}\] If we’re working from an areal sampling frame (i.e., an infinite population), the FPC is omitted (see Section 12.2), giving \[\begin{equation} s_{\bar{y}} = \sqrt{ \frac{1}{N^2} \sum_{h = 1}^H \frac{N_h^2 s_h^2}{n_h} }. \tag{13.5} \end{equation}\]
The standard error for the total estimate \(\hat{t}\) is \[\begin{equation} s_{\hat{t}} = N s_{\bar{y}}. \tag{13.6} \end{equation}\]
Confidence intervals for the mean follow the usual form \[\begin{equation} \bar{y} \pm t_{df, 1 - \frac{\alpha}{2}} s_{\bar{y}}, \tag{13.7} \end{equation}\] with degrees of freedom computed as \[\begin{equation} df = \sum_{h = 1}^H (n_h - 1) = n - H. \tag{13.8} \end{equation}\]
Confidence intervals for the total \(\hat{t}\) follow the same form using \(s_{\hat{t}}\).
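Here's a compact numeric sketch of these estimators under an areal sampling frame; all stratum summaries are hypothetical.

N_h <- c(10, 30, 60)          # Stratum areas (ha).
n_h <- c(4, 8, 12)            # Plots per stratum.
y_bar_h <- c(150, 220, 310)   # Stratum means (Mg/ha).
s_h <- c(60, 90, 110)         # Stratum standard deviations (Mg/ha).
N <- sum(N_h)
y_bar <- sum(N_h * y_bar_h) / N                         # (13.2).
t_hat <- N * y_bar                                      # (13.3).
s_y_bar <- sqrt(sum(N_h^2 * s_h^2 / n_h) / N^2)         # (13.5).
s_t_hat <- N * s_y_bar                                  # (13.6).
t_val <- qt(1 - 0.05 / 2, df = sum(n_h) - length(N_h))  # df = n - H (13.8).
c(y_bar - t_val * s_y_bar, y_bar + t_val * s_y_bar)     # 95% CI (13.7).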
13.3.2.1 Sampling in stratum and independence
In the current setup, we assume \(n_h\) observations within each stratum are drawn independently and at random from the stratum population. In practice, however, SYS is more commonly used. For example, you might place a separate systematic grid within each stratum using a random point start, as described in Section 13.2.2. The grids may differ in spacing or rotation across strata.
The usual concern with SYS remains: with only one random start per stratum, the \(n_h\) observations are not independent. This has implications for inference. For example, the degrees of freedom in (13.8) assume independent observations within each stratum. If a single systematic grid is used to sample the stratum population, that assumption is violated. In such cases, (13.8) may be overstated, and resulting confidence intervals may be misleading.
It’s also common to use a single SYS grid that spans all strata. This is often convenient because the cruiser can follow continuous straight sampling lines that cross stratum boundaries. In this case, only one random point start is used to position the SYS grid over the entire population. Here, we’re even further from the assumption of \(n\) independent units of information.
SYS is widely used in practice and is adopted in our illustration in Section 13.3.6. As discussed in Section 13.2, it’s important to rotate the grid to avoid alignment with periodic features. In most applications, such alignment is rare, and the risk of bias or distortion of standard errors is negligible.
13.3.3 Sample size determination
After identifying strata (in the case of pre-stratification) and selecting a sampling design, the next step is to determine the total sample size \(n\). This sample size can either be fixed in advance based on available resources or determined based on a desired level of precision for estimating a population mean or total.
Stratified sampling supports two common allocation strategies: proportional allocation and optimum allocation. Proportional allocation assigns sample size to each stratum based solely on its relative area. Optimum allocation also considers variability, and if cost is taken into account, balances that against the relative cost of sampling. The goal is to allocate effort where it provides the greatest reduction in the variance of the estimated mean (and hence its standard error) per unit cost.
When the total sample size \(n\) is unknown, it can be estimated using equations similar to those for SRS (see Section 11.3.6). These equations have the same circularity issue as before: \(n\) appears on both sides of the equation. As with SRS, this can be resolved using the iterative algorithm described in Section 11.3.6.
Importantly, these sample size estimates are intended to meet allowable error targets for the overall population, not for individual strata.
13.3.3.1 Total sample size with proportional allocation
As introduced in Section 11.3.6, we can estimate sample size given an allowable error \(E\) that is in the same units as \(\bar{y}\) and an estimate for each stratum’s sample variance \(s^2_h\). Recall, because we’ve not taken the sample, the value for \(s^2_h\) must come from somewhere else, e.g., from prior experience or inventory data from a similar population.
The total sample size required under proportional allocation for a finite population is given by \[\begin{equation} n = \frac{N \sum_{h=1}^H N_h s_h^2}{\frac{N^2 E^2}{t^2} + \sum_{h=1}^H N_h s_h^2}, \tag{13.9} \end{equation}\] where \(t\) is the \(t\)-value corresponding to the desired confidence level (\(1 - \alpha\)).
If the population is assumed infinite (i.e., under an areal sampling frame), the simplified formula is \[\begin{equation} n = \frac{t^2 \sum_{h=1}^H N_h s_h^2}{N E^2}. \tag{13.10} \end{equation}\]
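As a quick sketch, (13.10) can be evaluated directly once stratum areas and anticipated standard deviations are in hand. The values below are hypothetical, and in practice the t-value would be refined iteratively as in Section 11.3.6.

N_h <- c(10, 30, 60)   # Stratum areas (ha).
s_h <- c(80, 100, 120) # Anticipated stratum standard deviations (Mg/ha).
N <- sum(N_h)
E <- 25                # Allowable error (Mg/ha).
t <- 2                 # Rough t-value for ~95% confidence.
ceiling(t^2 * sum(N_h * s_h^2) / (N * E^2))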
13.3.3.2 Total sample size with optimum allocation
Optimum allocation minimizes the variance of the stratified mean by accounting for both stratum size and variability. The total sample size for a finite population is \[\begin{equation} n = \frac{\left( \sum_{h=1}^H N_h s_h \right)^2}{\frac{N^2 E^2}{t^2} + \sum_{h=1}^H N_h s_h^2}. \tag{13.11} \end{equation}\]
For an infinite population, this becomes \[\begin{equation} n = \frac{t^2 \left( \sum_{h=1}^H N_h s_h \right)^2}{N^2 E^2}. \tag{13.12} \end{equation}\]
13.3.4 Allocation of samples to strata
Given the total sample size \(n\), the next step is to allocate observations among the strata using equations specific to proportional or optimum allocation. Whichever approach was used to determine \(n\) should also be used to allocate samples to the strata.
13.3.4.1 Proportional allocation
In proportional allocation, each stratum receives a share of the total sample based solely on its size (e.g., area). The sample size in stratum \(h\) is given by \[\begin{equation} n_h = n \frac{N_h}{N}. \tag{13.13} \end{equation}\] This approach is easy to apply but ignores differences in variability across strata.
13.3.4.2 Optimum allocation
In optimum allocation, sample sizes are chosen to minimize the variance of the stratified mean. The number of samples in stratum \(h\) is computed as \[\begin{equation} n_h = n \frac{N_h s_h}{\sum_{h=1}^H N_h s_h}. \tag{13.14} \end{equation}\] This method improves efficiency by assigning more samples to strata that are larger or more variable.
13.3.4.3 Optimum allocation with cost constraints
If sampling costs differ between strata, we can modify the optimum allocation formula to incorporate cost per sample. Let \(c_h\) denote the cost of sampling a unit in stratum \(h\). The cost-adjusted optimum allocation is \[\begin{equation} n_h = n \frac{N_h s_h / \sqrt{c_h}}{\sum_{h=1}^H N_h s_h / \sqrt{c_h}}. \tag{13.15} \end{equation}\]
This formula reduces sample sizes in strata that are more expensive to survey, while still accounting for size and variability. It’s particularly useful in operational forest inventories where travel cost, terrain, or accessibility varies across the landscape.
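The three allocation rules (13.13) through (13.15) differ only in their weights, as the following sketch shows for a hypothetical design; the areas, standard deviations, and relative costs are made up.

n <- 60                  # Total sample size.
N_h <- c(10, 30, 60)     # Stratum areas (ha).
s_h <- c(80, 100, 120)   # Anticipated stratum standard deviations.
c_h <- c(1, 1.5, 3)      # Relative cost per plot.
n * N_h / sum(N_h)                                       # Proportional (13.13).
n * N_h * s_h / sum(N_h * s_h)                           # Optimum (13.14).
n * (N_h * s_h / sqrt(c_h)) / sum(N_h * s_h / sqrt(c_h)) # Optimum with cost (13.15).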
13.3.5 Creating a stratified systematic sampling grid
Once strata have been defined and sample sizes allocated (either proportionally or optimally), the next step is to lay out a systematic sampling grid within each stratum. This can be done using functions that extend the sys_grid()
approach to stratified designs.
One option is to build stratum-specific grids with the sys_grid()
function described in Section 13.2.3. This gives full flexibility in grid spacing, orientation, and pattern, and can be useful when strata are large or widely separated.
In many applications, however, a separate grid for each stratum complicates field logistics. When strata are small or adjacent, it’s often easier for field crews to work from a single systematic grid that covers the entire population. The two functions below support this approach by generating a unified grid across all strata under either proportional or optimum allocation. As in Section 13.2.3, the documentation is presented in the style of condensed R help pages.
13.3.5.1 Grid generation for proportional allocation strat_prop_sys_grid()
Description
Generates a systematic sampling grid using fixed spacing between lines (D_l
) and points (D_p
) across the entire area. Points are assigned to strata based on spatial intersection with the input boundary. This is designed for proportional allocation, where grid density is uniform across strata. Built on sys_grid()
with added logic to track stratum membership and within-stratum ordering.
Usage
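As before, the usage line is inferred from the arguments that follow; the argument order is an assumption.

strat_prop_sys_grid(boundary, D_l, D_p, angle = 0, origin = "se",
                    strat_id_col)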
Arguments
- boundary: An sf polygon object defining the stratum boundaries. Must be projected with linear units.
- D_l: Distance between sampling lines (x-direction) in the same units as boundary.
- D_p: Distance between points along a line (y-direction) in the same units as boundary.
- angle: Optional rotation angle (in degrees). Default is 0.
- origin: Corner from which to order lines and points. One of "se", "sw", "ne", or "nw". Default is "se".
- strat_id_col: Name of the column in boundary that identifies strata.
Details
This function builds a systematic grid across the union of stratum polygons using sys_grid()
, then assigns each point to a stratum via spatial intersection. Each sampling location is tagged with its unique point ID as well as its position within each line and stratum. This approach simplifies logistics when a single, uniform grid is acceptable for all strata.
Value
Returns a list with the following components.
- grid: An sf object of all points, with columns for unique point_id, stratum_id, and other useful indexing IDs. Columns in boundary are appended.
- path: An sf LINESTRING connecting grid points in order.
- random_point_start: The initial random point start.
- boundary: The input polygon with stratum attributes.
- type: A string used internally for validation when plotting.
Use plot_sys_grid()
to visualize the resulting grid and path.
13.3.5.2 Grid generation for optimum allocation strat_opt_sys_grid()
Description
Creates a stratified systematic sampling grid that supports optimum allocation, assigning different numbers of grid points to each stratum. Builds a dense grid using sys_grid()
, then intersects it with stratum polygons and thins the points in each stratum to approximate the target sample sizes. The final number of points per stratum may not match the targets exactly, but the function tries to get as close as possible. Supports both random and sequential thinning.
Usage
strat_opt_sys_grid(boundary, D_l, angle = 0, origin = "se",
                   n_col, strat_id_col, thinning_method = "random",
                   D_p = NULL)
Arguments
- boundary: An sf polygon object defining the stratum boundaries. Must be projected with linear units.
- D_l: Distance between sampling lines (x-direction) in the same units as boundary.
- angle: Optional rotation angle (in degrees). Default is 0.
- origin: Corner from which to order lines and points. One of "se", "sw", "ne", or "nw". Default is "se".
- n_col: Name of the column in boundary containing target sample sizes per stratum.
- strat_id_col: Name of the column in boundary that identifies strata.
- thinning_method: Method used to thin grid points within each stratum. Either "random" or "sequential". Default is "random".
- D_p: Optional distance between points along a line. If not supplied, it's computed automatically from the stratum with the densest sampling, using the same units as boundary.
Details
This function is designed for optimum allocation, where sample size varies across strata. It uses sys_grid()
to build an initial grid, assigns each point to a stratum, then selects a subset of points from each stratum using the specified thinning method. Additional columns in the output identify both the full grid and the subset of points selected for measurement.
Value
Returns a list with the following components.
- grid: An sf object of all points, with a logical measure column and columns for unique point_id, stratum_id, and other useful indexing IDs. Columns in boundary are appended.
- measure_path: A LINESTRING showing the path through the selected (measure = TRUE) points.
- path: The full path through all grid points.
- random_point_start: The initial random point start.
- boundary: The original input polygon.
- type: A string used internally for validation when plotting.
Use plot_strat_opt_sys_grid()
to visualize the sampled grid and path.
13.3.6 Illustrative analysis
We use the Harvard Forest dataset introduced in Section 1.2.5 to illustrate the application of stratified estimators. While this is a small and somewhat toy-like example, it’s useful because a complete census of trees is available—allowing us to compare our estimates directly to the true population values.
Figure 13.3 shows the extent of the 35 ha forest, partitioned into four strata. Strata 1 and 4 are mixed conifer-hardwood forest, Stratum 2 is conifer-dominated, and Stratum 3 is a sparsely forested wetland.
FIGURE 13.3: Harvard Forest population extent partitioned into four strata.
Our goal is to estimate average AGB per hectare and total AGB for each stratum and for the entire forest. The population of interest includes all stems 12.7 cm DBH and larger. We’ll also break out estimates by species groups of interest.
The cruise will use fixed-area circular plots with a radius of 9 meters. At each selected sampling location, all trees meeting the DBH threshold will be measured. See Section 12.3 for additional details on plot-based sampling.
We’d like our AGB estimates to be within \(\pm\) 35 Mg/ha of the population value with 95% confidence. This allowable error corresponds to roughly 10% of the expected AGB.
The cruise will use a stratified systematic sampling design, with sample sizes determined by optimum allocation as described in the preceding section.
TABLE 13.1: Stratum areas (\(N_h\)) and anticipated AGB standard deviations (\(s_h\)) used in the sample size calculations.

| Stratum ID | \(N_h\) (ha) | \(s_h\) (Mg/ha) |
|---|---|---|
| 1 | 1.24 | 100 |
| 2 | 8.93 | 100 |
| 3 | 2.25 | 75 |
| 4 | 22.58 | 125 |
We begin by computing the total sample size using the stratum-specific information in Table 13.1 and the optimum allocation formula (13.12). As discussed in Section 11.3.6, this formula involves a circular dependency: we’re solving for \(n\) on the left, but also need it to determine the appropriate \(t\)-value on the right, which depends on the degrees of freedom.
To resolve this circular dependency, we use the iterative algorithm described in Section 11.3.6. The code below applies that method using a for
loop, as introduced in Section 5.2.3, to iteratively solve for \(n\) until it converges.152
H <- 4 # Number of strata.
N_h <- c(1.24, 8.93, 2.25, 22.58) # Stratum area (ha).
N <- sum(N_h) # Total area (ha).
E <- 35 # Allowable error for AGB (Mg/ha).
alpha <- 0.05 # For 95% confidence level.
s_h <- c(100, 100, 75, 125) # Stratum standard deviation (Mg/ha).
n_start <- 25 # Initial guess for total sample size.
n_prev <- n_start
for (i in 1:3) {
  t <- qt(p = 1 - alpha / 2, df = n_prev - H) # df = n - H
  n_next <- ceiling((t / (N * E))^2 * sum(N_h * s_h)^2)
  cat("Iteration", i, ": n =", n_next, "\n")
  n_prev <- n_next
}
#> Iteration 1 : n = 47
#> Iteration 2 : n = 44
#> Iteration 3 : n = 44
The algorithm converged quickly to \(n\) = 44. In the code below, we allocate the total sample size among strata using the optimum allocation formula (13.14).
n <- n_prev # Total sample size from the iterative calculation above.
n_h <- n * (N_h * s_h) / sum(N_h * s_h)
n_h <- ceiling(n_h) # Round up to the nearest whole number.
n_h
#> [1] 2 10 2 31
Note that due to rounding, we end up with \(n\) = 45 plots in the final design.
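A quick check confirms the realized total.

sum(n_h)
#> [1] 45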
Now, let’s build our sampling grid using strat_opt_sys_grid()
. As described in Section 13.3.5, this function requires an sf
polygon object that delineates the stratum boundaries. As briefly noted in Section 13.1, sf
objects are data frames with a spatial geometry column and can be manipulated using tidyverse
functions.
The code below reads in the Harvard Forest stratum boundaries as an sf
polygon object, adds a new column for the target stratum-specific sample sizes, and prints the data frame portion of the object so we can inspect its columns. The st_drop_geometry()
function removes the spatial geometry column to simplify printed output.
bounds <- st_read("datasets/HF", "HF_stratified", quiet = TRUE)
bounds <- bounds %>% mutate(n_h = n_h, .after = stratum_id)
bounds %>% st_drop_geometry()
#> stratum_id n_h ha
#> 1 1 2 1.2361
#> 2 2 10 8.9315
#> 3 3 2 2.2494
#> 4 4 31 22.5830
With a unique identifier column for each stratum (stratum_id
) and the target sample sizes (n_h
), we’re ready to pass bounds
to strat_opt_sys_grid()
. This and other sampling functions are defined in "scripts/utils.R"
.
Before calling strat_opt_sys_grid()
in the code below, we set a random seed to ensure reproducible results.
source("scripts/utils.R")
set.seed(1)
strat_sys <- strat_opt_sys_grid(bounds, D_l = 50, n_col = "n_h",
strat_id_col = "stratum_id",
thinning_method = "random")
The call to strat_opt_sys_grid()
above sets the distance between lines to 50 meters and allows the function to compute the sampling interval along lines (D_p
) automatically.153
FIGURE 13.4: Systematic sampling with stratification for the Harvard Forest example. Illustrates random removal of sampling locations to approximate the target optimum sample size.
Figure 13.4, created using plot_strat_opt_sys_grid(strat_sys, path = "lines")
, shows the full systematic grid, the selected sampling locations, and those removed from the grid to meet the target sample sizes (labeled as “Removed sampling location” in the figure legend).
If you only want to visualize the final sampling locations, set the path
argument in plot_strat_opt_sys_grid()
to "points"
as shown below in Figure 13.5.
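Using the strat_sys object created above, that call looks like this.

plot_strat_opt_sys_grid(strat_sys, path = "points")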
FIGURE 13.5: Systematic sampling with stratification for the Harvard Forest example. Illustrates the path between sampling locations.
We can confirm the final sample size by stratum using the code below, which shows the target sample sizes were achieved. This won’t always be the case, depending on how the sampling grid intersects with each stratum.
strat_sys$grid %>%
filter(measure) %>%
group_by(stratum_id) %>%
summarize(realized_n_h = n()) %>%
st_drop_geometry()
#> # A tibble: 4 × 2
#> stratum_id realized_n_h
#> * <int> <int>
#> 1 1 2
#> 2 2 10
#> 3 3 2
#> 4 4 31
Given the sampling locations, we’re ready to head to the field (virtually, in this case). For our virtual cruise, we’ll use a 9 meter radius plot centered on the sampling locations. These plot areas are shown as gray circles in Figure 13.6.
As always, if any portion of a plot extends beyond a boundary, then a boundary correction method must be applied—this includes stratum boundaries. Figure 13.6 highlights the three plots with boundary slopover and applies the mirage method described in Section 12.3.3.
FIGURE 13.6: Systematic sampling with stratification for the Harvard Forest example. Fixed-area plots are shown at each sampling location, including those that extend beyond stratum boundaries and require boundary slopover correction. The mirage method is used for these plots; see Section 12.3.3 for details.
The cruise tree data are stored in "datasets/HF/HF_strat_sys_n45.csv"
and are read into plot_trees
below. Each row represents one or more trees. The dataset includes a plot_id
column that uniquely identifies plots across all strata, and a tree_id
column that identifies individual trees within each plot. The stratum_id
column indicates the stratum in which the plot falls. Tree counts are recorded in the tree_count
column, which equals 2 when a tree is double-counted using mirage correction. Additional columns include agb_Mg
(aboveground biomass in metric tons), dbh_cm
(diameter at breast height in centimeters), and species
(tree species).
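The read itself isn't shown in the rendered text; a sketch of the likely calls that produced the output below (assuming glimpse() was used to print the columns) is:

plot_trees <- read_csv("datasets/HF/HF_strat_sys_n45.csv")
glimpse(plot_trees)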
#> Rows: 797
#> Columns: 7
#> $ plot_id <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2…
#> $ tree_id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, …
#> $ stratum_id <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4…
#> $ tree_count <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ agb_Mg <dbl> 0.086166, 1.853968, 2.227097, 1.27…
#> $ dbh_cm <dbl> 14.4, 44.6, 48.0, 38.4, 27.0, 41.8…
#> $ species <chr> "Acer rubrum", "Quercus rubra", "Q…
Given these tree data, and following the steps laid out in Section 12.6.1, our first task is to expand tree-level agb_Mg
measurements to a per-hectare basis using the appropriate tree factor, then summarize these values at the plot level. These plot-level values correspond to the \(y_{hi}\) terms in (13.1), i.e., the \(i\)-th plot measurement in the \(h\)-th stratum.
To compute these plot-level AGB values, we group by plot_id
and use summarize()
to calculate y_hi
, the plot-level AGB in Mg/ha. We also retain each plot’s stratum_id
using first(stratum_id)
.154
# Add tree factor for the given plot area.
r <- 9 # Plot radius in meters.
plot_trees <- plot_trees %>%
mutate(TF = 10000 / (pi * r^2))
# Compute plot-level AGB estimates.
plot_summary <- plot_trees %>%
group_by(plot_id) %>%
summarize(y_hi = sum(TF * tree_count * agb_Mg),
stratum_id = first(stratum_id))
plot_summary
#> # A tibble: 45 × 3
#> plot_id y_hi stratum_id
#> <dbl> <dbl> <dbl>
#> 1 1 468. 4
#> 2 2 197. 4
#> 3 3 253. 4
#> 4 4 419. 4
#> 5 5 174. 4
#> 6 6 388. 4
#> 7 7 579. 4
#> 8 8 346. 4
#> 9 9 364. 4
#> 10 10 158. 4
#> # ℹ 35 more rows
Next, we compute the stratum-level mean \(\bar{y}_h\) and standard deviation \(s_h\) by grouping plot_summary
by stratum_id
. We also compute the number of plots within each stratum for convenient reference in the next step.
stratum_summary <- plot_summary %>%
group_by(stratum_id) %>%
summarize(
y_bar_h = mean(y_hi),
s_h = sd(y_hi),
n_h = n_distinct(plot_id)
)
stratum_summary
#> # A tibble: 4 × 4
#> stratum_id y_bar_h s_h n_h
#> <dbl> <dbl> <dbl> <int>
#> 1 1 311. 38.3 2
#> 2 2 474. 140. 10
#> 3 3 63.0 51.0 2
#> 4 4 348. 108. 31
These stratum-level estimates are useful when reporting means, totals, and associated uncertainty for individual strata. For now, though, we’re focused on estimating the overall population across strata.
The estimators for the stratified population mean and its standard error are given in (13.2) and (13.5), respectively. Both require \(n_h\), \(N_h\), and the stratum-level summaries. To streamline the workflow, we join the stratum areas to stratum_summary
before applying the estimators and computing 95% confidence intervals.
# Join stratum areas.
stratum_area <- tibble(
stratum_id = 1:4,
N_h = c(1.24, 8.93, 2.25, 22.58)
)
stratum_summary <- left_join(stratum_summary, stratum_area,
by = "stratum_id")
# Estimate population mean and 95% CI.
stratum_summary %>%
summarize(
N = sum(N_h),
y_bar = sum(N_h * y_bar_h) / N,
s_y_bar = sqrt(1 / N^2 * sum((N_h^2 * s_h^2) / n_h)),
t = qt(df = sum(n_h - 1), p = 1 - 0.05 / 2),
ci_lower = y_bar - t * s_y_bar,
ci_upper = y_bar + t * s_y_bar
) %>%
select(y_bar, ci_lower, ci_upper)
#> # A tibble: 1 × 3
#> y_bar ci_lower ci_upper
#> <dbl> <dbl> <dbl>
#> 1 360. 326. 395.
To estimate the population total and its confidence interval in the code below, we adapt the code above using the total estimator (13.3) and its standard error (13.6).
stratum_summary %>%
summarize(
N = sum(N_h),
t_hat = N * sum(N_h * y_bar_h) / N, # Or just sum(N_h * y_bar_h).
s_t_hat = N * sqrt(1 / N^2 * sum((N_h^2 * s_h^2) / n_h)),
t = qt(df = sum(n_h - 1), p = 1 - 0.05 / 2),
ci_lower = t_hat - t * s_t_hat,
ci_upper = t_hat + t * s_t_hat
) %>%
select(t_hat, ci_lower, ci_upper)
#> # A tibble: 1 × 3
#> t_hat ci_lower ci_upper
#> <dbl> <dbl> <dbl>
#> 1 12615. 11411. 13819.
13.3.6.1 Checking coverage and allowable error
How well did our estimates perform in terms of uncertainty quantification? The census-based AGB for the Harvard Forest is 341.17 Mg/ha, with a forest-wide total of 11,941 Mg. Our estimated mean and total, with 95% confidence intervals, were 360.44 (326.03, 394.84) Mg/ha and 12,615 (11,411, 13,819) Mg, respectively. Based on this sample, both confidence intervals successfully captured the true values.
If the theory holds, we’d expect about 95% of confidence intervals from repeated samples to cover the true mean. Because we have a census, we can perform repeated sampling to empirically check that the coverage rate matches this expectation. We can also evaluate whether our optimum allocation strategy meets the target allowable error of \(E = 35\) Mg/ha with 95% confidence.
Using the estimated \(n\) and \(n_h\), we drew 1,000 samples from the population using stratified SYS. For each, we computed the mean AGB and its 95% confidence interval. These 1,000 replicate estimates are stored in "datasets/HF/strat_sys_agb_ests_1000reps.csv"
. The code below reads in these estimates and computes two key empirical rates: the proportion of intervals that covered the true mean, and the proportion with absolute error less than the allowable error \(E = 35\) Mg/ha. Both rates meet or slightly exceed 95%, which is good evidence that the design-based theory, paired here with the SRS variance approximation, is holding up in this example!
agb_ests_reps <- read_csv("datasets/HF/strat_sys_agb_ests_1000reps.csv")
# Compute the true mean AGB (Mg/ha) from the census.
y_true <- read_csv("datasets/HF/HF_trees.csv") %>%
filter(dbh_cm > 12.7) %>%
summarize(y_true = sum(agb_Mg) / N) %>%
pull()
# Percent of confidence intervals that cover the true mean.
mean(y_true > agb_ests_reps$ci_lower_95 &
y_true < agb_ests_reps$ci_upper_95) * 100
#> [1] 98.1
# Percent of estimates within the allowable error E
mean(abs(agb_ests_reps$y_bar - y_true) < E) * 100
#> [1] 97.3
13.3.6.2 Efficiency gains from stratification
Stratified sampling is especially effective when variation within strata is smaller than variation between them, allowing for more precise estimates than an unstratified sample of the same size. In our Harvard Forest example, we defined four strata. With the exception of the small Stratum 3, the others were fairly similar, so we might not expect dramatic efficiency gains. As a reminder, efficiency here means achieving greater precision—that is, a smaller standard error—for the same number of plots compared to an unstratified design.
To compare estimator performance (i.e., unstratified vs. stratified), we repeated the procedure from Section 13.3.6.1, drawing 1,000 independent samples from the same population. For each sample, we computed the mean AGB and its 95% confidence interval using both unstratified and stratified estimators.
The unstratified replicate estimates are stored in "datasets/HF/sys_agb_ests_1000reps.csv"
. The stratified replicates are the same as those used in Section 13.3.6.1. For each estimator, we computed the average 95% confidence interval width across replicates. These were 84 and 73 (Mg/ha) for the unstratified and stratified cases, respectively. The percent gain in precision from stratification is 13%, reflecting the reduction in average interval width achieved by incorporating strata. This confirms that even modest stratification can yield measurable gains in precision, making it a worthwhile method when stratifying variables are available.
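A sketch of the width comparison described above is shown below; we assume the unstratified replicate file uses the same ci_lower_95 and ci_upper_95 column names as the stratified file.

sys_reps <- read_csv("datasets/HF/sys_agb_ests_1000reps.csv")
strat_reps <- read_csv("datasets/HF/strat_sys_agb_ests_1000reps.csv")
# Average 95% confidence interval width for each estimator.
mean(sys_reps$ci_upper_95 - sys_reps$ci_lower_95)     # Unstratified, ~84 Mg/ha.
mean(strat_reps$ci_upper_95 - strat_reps$ci_lower_95) # Stratified, ~73 Mg/ha.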
13.3.6.3 Group-level estimates from stratified data
So far, our focus has been on estimating AGB within strata and for the overall population. But we’re often also interested in estimates for subgroups of trees within each stratum and across the entire forest. For example, we might want to know how much AGB is tied up in specific species, age classes, forest types, or functional groups.
Let’s walk through one such example.
Eastern hemlock (Tsuga canadensis) is under threat throughout the northeastern U.S. due to the spread of hemlock woolly adelgid (HWA), an invasive insect causing widespread mortality. Since HWA is present in the region of Massachusetts where our data were collected, it’s reasonable to ask how much AGB would be lost if all Eastern hemlock trees in the forest were killed.155
We start by creating a new column in plot_trees
called species_groups
, which assigns each tree to one of two groups of interest:
- "Eastern hemlock" (species == "Tsuga canadensis")
- "Other" (species != "Tsuga canadensis")
Both species_groups
and plot_id
are converted to factors. This prepares the data for use with complete()
, which relies on factor levels to insert rows for combinations of plot and group that may not appear in the data.
plot_trees <- plot_trees %>%
mutate(
species_groups = factor(if_else(species == "Tsuga canadensis",
"Eastern hemlock",
"Other")),
plot_id = factor(plot_id)
)
As noted in Chapter 12, our convention is to use a tree_count
of zero to retain plots with no trees. This keeps their plot indexes in the dataset, ensuring they’re included in summaries and downstream calculations. What’s important is that all plot indexes—even those with no trees—are retained when converting plot_id
to a factor. We used a similar approach when building stand and stock tables in Section 12.6.3.
Next, we compute AGB at the plot level for each species group.
plot_summary <- plot_trees %>%
group_by(plot_id, species_groups) %>%
summarize(
y_hi = sum(TF * tree_count * agb_Mg),
stratum_id = first(stratum_id),
.groups = "drop"
)
This yields one row per plot per group. If no trees from a group were observed in a plot, it won’t appear yet. We’ll fix that next.
The code below uses complete()
to add any missing combinations of plot and species group, inserting explicit zeros. We then use fill()
to recover stratum_id
for the new rows. To avoid crossing strata, we group by plot_id
before filling stratum_id
, ensuring correct propagation.
plot_summary_filled <- plot_summary %>%
complete(plot_id, species_groups, fill = list(y_hi = 0)) %>%
group_by(plot_id) %>%
fill(stratum_id, .direction = "downup") %>%
ungroup()
This is conceptually the same as our earlier use of complete()
in Section 12.6.3 to fill in missing combinations of species and DBH class. Here, we’re doing the same with species groups. After inserting rows, fill()
restores plot-level variables. Grouping by plot_id
ensures this is done within the correct plot.
Take a look at plot_summary_filled
to verify that it looks right—the number of rows should equal the number of plots times the number of species groups.
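A quick check along those lines (which should return TRUE) is:

nrow(plot_summary_filled) ==
  nlevels(plot_trees$plot_id) * nlevels(plot_trees$species_groups)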
We now summarize to the stratum level, computing the mean AGB per hectare (\(\bar{y}_h\)) with associated 95% confidence intervals.
stratum_summary <- plot_summary_filled %>%
group_by(stratum_id, species_groups) %>%
summarize(
y_bar_h = mean(y_hi),
n_h = n(),
s_h = sd(y_hi),
s_y_bar_h = s_h/sqrt(n_h),
t = qt(df = n_h - 1, p = 1 - 0.05 / 2),
ci_lower_h = y_bar_h - t * s_y_bar_h,
ci_upper_h = y_bar_h + t * s_y_bar_h,
.groups = "drop"
)
stratum_summary %>%
select(stratum_id, species_groups, y_bar_h, ci_lower_h, ci_upper_h)
#> # A tibble: 8 × 5
#> stratum_id species_groups y_bar_h ci_lower_h ci_upper_h
#> <dbl> <fct> <dbl> <dbl> <dbl>
#> 1 1 Eastern hemlock 89.8 -678. 857.
#> 2 1 Other 221. -891. 1333.
#> 3 2 Eastern hemlock 255. 101. 410.
#> 4 2 Other 219. 110. 327.
#> 5 3 Eastern hemlock 0 0 0
#> 6 3 Other 63.0 -396. 522.
#> 7 4 Eastern hemlock 48.3 24.3 72.2
#> 8 4 Other 300. 258. 341.
These estimates show that Stratum 2 carries about 250 Mg/ha of Eastern hemlock, which is substantially more than any of the other strata. Not surprisingly, with a sample size of \(n_h = 2\), Strata 1 and 3 have very low precision (i.e., wide confidence intervals). No Eastern hemlock was observed on the plots in Stratum 3, so its estimates are zero. To better understand these differences across the forest, Figure 13.7 shows the estimated mean AGB for Eastern hemlock in each stratum (i.e., y_bar_h
computed above).
FIGURE 13.7: Estimated mean aboveground biomass (AGB) of Eastern hemlock (Tsuga canadensis) per hectare in each stratum. Estimates are based on plot-level measurements and stratified sampling.
To estimate forest-wide AGB by species group, along with standard errors and 95% confidence intervals, we apply the total estimator (13.3) and its associated standard error (13.6). The beauty of this approach is that it’s the same code used to estimate total AGB for all species in the previous section—the only difference is the added group_by(species_groups)
. This again illustrates a key strength of the tidyverse: once the logic for a summary is in place, it’s easy to extend it to subgroups with minimal changes.
# Join the stratum areas.
stratum_summary <- stratum_summary %>%
left_join(stratum_area, by = "stratum_id")
stratum_summary %>%
group_by(species_groups) %>%
summarize(
N = sum(N_h),
t_hat = N * sum(N_h * y_bar_h) / N,
s_t_hat = N * sqrt(1 / N^2 * sum((N_h^2 * s_h^2) / n_h)),
t = qt(df = sum(n_h - 1), p = 1 - 0.05 / 2),
ci_lower = t_hat - t * s_t_hat,
ci_upper = t_hat + t * s_t_hat,
.groups = "drop"
) %>%
select(species_groups, t_hat, ci_lower, ci_upper)
#> # A tibble: 2 × 4
#> species_groups t_hat ci_lower ci_upper
#> <fct> <dbl> <dbl> <dbl>
#> 1 Eastern hemlock 3482. 2134. 4831.
#> 2 Other 9133. 7838. 10428.
So, to answer our initial question: how much AGB would be lost if all Eastern hemlock trees in the forest were killed by HWA? We estimate a total loss of 3,482 Mg, with a 95% confidence interval of (2,134, 4,831) Mg; most of this loss comes from Stratum 2.
13.3.7 Advantages and disadvantages of stratified sampling
Stratified sampling is widely used in forest inventory, especially when auxiliary information is available to define meaningful strata. Below, we summarize some key advantages and disadvantages of this design.
13.3.7.1 Advantages
Improved precision: When strata are well defined, stratified sampling increases efficiency by grouping similar areas together. This structure reduces variability within each group and leads to more precise overall estimates.
Supports subregion estimation: Stratification enables valid estimation not just for the entire population, but also for each individual stratum. This is useful when separate summaries are needed for forest types, disturbance regimes, management units, ownership classes, or other subdivisions.
Enables efficient allocation: With optimum allocation, more samples can be directed to strata that are larger or more variable, improving efficiency without increasing cost.
Compatible with other designs: Different sampling strategies—such as SRS, SYS, or others—can be used within strata, depending on operational needs or design goals. See, for example, Section 13.5.5.
Robust to poor strata choices: Even if strata aren’t optimal, stratified sampling rarely performs worse than unstratified sampling.
Supports post-stratification: If stratum membership isn’t known at the time of sampling, post-stratification can still be used. This allows for flexibility when auxiliary data become available after the field campaign.
13.3.7.2 Disadvantages
Area requirements for pre-stratification: A key requirement of pre-stratification is that the area of each stratum must be known prior to sampling in order to support design-based estimation. This can be a limitation if strata are based on incomplete or evolving information.
Limited allocation flexibility in post-stratification: Post-stratified designs don’t allow proportional or optimum allocation, since sample sizes per stratum aren’t controlled in advance. They may also result in too few samples in some strata to estimate variance reliably.
Added bookkeeping: Stratified sampling requires tracking stratum membership, managing stratum-specific calculations, and combining results with appropriate weights.
13.4 Estimation using covariates
Up to this point, our estimators have relied primarily on direct measurements of the variable of interest, \(y\), collected from sampled units. In many settings, however, we also have access to auxiliary information—additional variables that are already available or inexpensive to obtain. When such auxiliary data are correlated with \(y\), they can be used to improve estimation and increase efficiency. A covariate \(x\) is one form of auxiliary information: it’s a variable that is related to, but distinct from, \(y\). When \(x\) is predictive of \(y\), incorporating it into estimation can reduce standard errors or lower field measurement costs.
We have already seen covariates at work indirectly. For example, in stratified sampling (Section 13.3), a stratifying variable improved estimation by grouping similar units together, which reduced within-stratum variability and led to more precise estimates. Here, we treat \(x\) as an explicit part of estimation.
We consider two common settings.
Regression sampling: The population mean of \(x\) (i.e., \(\mu_x\)) is known and used to improve the estimate of the population mean of \(y\) (i.e., \(\mu_y\)).
Double sampling (two-phase): The population mean of \(x\) is unknown, but \(x\) is inexpensive to measure. A large phase-1 sample is used to measure \(x\) only, followed by a smaller nested phase-2 sample where \(y\) is measured. This yields a large set of \(x\) values (\(x_1\)) and a smaller set of paired observations (\(x_2\), \(y_2\)). The phase-1 information is then used to improve the estimate of \(\mu_y\).
Within each setting, the choice of estimator depends on the nature of the relationship between \(y\) and \(x\).
Regression estimation is preferred when the relationship is approximately linear with a nonzero intercept, and the variability of \(y\) remains roughly constant across values of \(x\) (Figure 13.8(a)).
Ratio estimation is also based on a linear relationship, but one that passes through (or near) the origin. It’s most appropriate when the variability of \(y\) tends to increase proportionally with \(x\) (Figure 13.8(b,c)).
We’ll work with two common ratio estimators.
Ratio-of-means: Estimate a single multiplier \(\hat{R} = \bar{y} / \bar{x}\) and apply it to a known (or well-estimated) population mean of \(x\).
Mean-of-ratios: Estimate the average of unit ratios as \(\hat{R} = \sum_{i=1}^n r_i / n\), where \(r_i = y_i / x_i\), then apply \(\hat{R}\) to a known (or well-estimated) population mean of \(x\).
The ratio-of-means works well when variability in \(y\) increases roughly in proportion to \(x\) (Figure 13.8(b)), while the mean-of-ratios is useful when variability in \(y\) increases with \(x^2\) (Figure 13.8(c)) or when unit ratios are of direct interest.
FIGURE 13.8: Three linear relationship patterns. (a) Nonzero intercept with roughly constant variance. (b) Passes through the origin with variance increasing in proportion to \(x\). (c) Passes through the origin with variance increasing in proportion to \(x^2\).
13.4.1 Regression estimation
A common way to describe the relationship between two variables is with a straight line. Suppose we have a variable of interest \(y\) and a covariate \(x\). The general equation for a line is \[\begin{equation} y = b_0 + b_1x, \tag{13.16} \end{equation}\] where \(b_0\) is the point where the line crosses the \(y\)-axis (the intercept) and \(b_1\) is the slope, which tells us how much \(y\) changes for a one-unit increase in \(x\).
Suppose we have a population of paired values \((x_i, y_i)\) for \(i = 1, \ldots, N\). From this population we select a sample of \(n\) pairs, which could produce a scatter plot like Figure 13.8(a). With these data we can fit a straight line using ordinary least squares regression. This method finds the values of \(b_0\) and \(b_1\) that minimize the squared vertical distances between the observed points and the line. In R, this is done with the lm()
function, which returns estimates for \(b_0\) and \(b_1\). This is simple linear regression in a nutshell.
Our goal here isn’t to build a predictive regression model or test hypotheses about regression coefficients. Instead, we’re after a practical estimator for \(\mu_y\) that borrows strength from a linear relationship with \(x\). We treat the regression equation as a stepping stone: by fitting a line through the sample data, we derive an estimator that incorporates known information about the population mean of \(x\). Let’s walk through how this works.
To see how regression leads to an estimator, note that in ordinary least squares, the fitted regression line always passes through the point formed by the sample means, \((\bar{x}, \bar{y})\). Substituting these values into (13.16) gives \[\begin{equation} \bar{y} = b_0 + b_1 \bar{x}. \tag{13.17} \end{equation}\]
This can be rearranged to express the intercept in terms of sample quantities \[\begin{equation} b_0 = \bar{y} - b_1 \bar{x}. \tag{13.18} \end{equation}\]
Now replace the sample mean \(\bar{x}\) by the known population mean \(\mu_x\) and step through the substitution explicitly. Start with the fitted line at \(x=\mu_x\) \[\begin{equation} \bar{y}_{\text{R}} = b_0 + b_1 \mu_x. \end{equation}\] Substitute (13.18) \[\begin{equation} \bar{y}_{\text{R}} = (\bar{y} - b_1 \bar{x}) + b_1 \mu_x. \end{equation}\] Collect terms to obtain the regression estimator \[\begin{equation} \bar{y}_{\text{R}} = \bar{y} + b_1(\mu_x - \bar{x}). \tag{13.19} \end{equation}\]
The subscript on the regression estimator \(\bar{y}_{\text{R}}\) distinguishes it from the usual sample mean \(\bar{y}\). We’ll carry this subscript forward when referring to associated totals and standard errors.
From (13.19) we see the logic of regression estimation. The usual sample mean \(\bar{y}\) serves as the baseline. The adjustment term \(b_1(\mu_x - \bar{x})\) shifts this baseline according to how far the sample mean of \(x\) is from the known population mean, scaled by the slope. If \(\bar{x}\) is less than \(\mu_x\), the estimate is adjusted upward; if greater, it’s adjusted downward. In this way, the regression estimator incorporates auxiliary information from \(x\) to improve our estimate of \(\mu_y\). The extent of improvement depends on how strong and linear the relationship is between \(x\) and \(y\).
The slope in (13.19) is computed from the sample as \[\begin{equation} b_1 = \frac{S_{xy}}{S_{xx}}, \tag{13.20} \end{equation}\] where \(S\) denotes a sum of squares (or cross-products). In particular, \[\begin{equation} S_{xy} = \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) \end{equation}\] and \[\begin{equation} S_{xx} = \sum_{i=1}^n (x_i - \bar{x})^2. \end{equation}\]
The standard error of \(\bar{y}_{\text{R}}\) can be estimated as \[\begin{equation} s_{\bar{y}_{\text{R}}} = \sqrt{s^2_{y|x} \left(\frac{1}{n} + \frac{(\mu_x - \bar{x})^2}{S_{xx}}\right)\left(1-\frac{n}{N}\right)}, \tag{13.21} \end{equation}\] where the residual variance of \(y\) about the fitted regression line is \[\begin{equation} s^2_{y|x} = \frac{S_{yy} - \frac{S_{xy}^2}{S_{xx}}}{n - 2} \tag{13.22} \end{equation}\] and \[\begin{equation} S_{yy} = \sum_{i=1}^n (y_i - \bar{y})^2. \end{equation}\]
The FPC term should be dropped from (13.21) when appropriate, see Section 12.2.1.
To estimate the population total we scale by \(N\) \[\begin{equation} \hat{t}_{\text{R}} = N \bar{y}_{\text{R}} \tag{13.23} \end{equation}\] and its standard error is \[\begin{equation} s_{\hat{t}_{\text{R}}} = N s_{\bar{y}_{\text{R}}}. \tag{13.24} \end{equation}\]
Confidence intervals for the mean and total follow the usual form \[\begin{align*} \bar{y}_{\text{R}} \pm t_{\text{df}, 1 - \frac{\alpha}{2}} s_{\bar{y}_{\text{R}}},\\ \hat{t}_{\text{R}} \pm t_{\text{df}, 1 - \frac{\alpha}{2}} s_{\hat{t}_{\text{R}}}, \end{align*}\] with degrees of freedom \(df = n - 2\).
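As a minimal sketch, the calculations in (13.19) through (13.22) can be carried out directly in R. Here x
and y
are assumed to hold the sampled pairs, with mu_x
and N
known; these are placeholder names, not objects defined elsewhere in the chapter.
# Regression estimator of the population mean of y.
n <- length(y)
x_bar <- mean(x)
y_bar <- mean(y)
S_xx <- sum((x - x_bar)^2)
S_yy <- sum((y - y_bar)^2)
S_xy <- sum((x - x_bar) * (y - y_bar))
b_1 <- S_xy / S_xx                                  # Slope (13.20).
y_bar_R <- y_bar + b_1 * (mu_x - x_bar)             # Mean (13.19).
s_sq_y_given_x <- (S_yy - S_xy^2 / S_xx) / (n - 2)  # Residual variance (13.22).
# Standard error (13.21); drop the (1 - n / N) FPC under an areal frame.
s_y_bar_R <- sqrt(s_sq_y_given_x *
                  (1 / n + (mu_x - x_bar)^2 / S_xx) *
                  (1 - n / N))
t <- qt(p = 1 - 0.05 / 2, df = n - 2)
c(y_bar_R - t * s_y_bar_R, y_bar_R + t * s_y_bar_R) # 95% CI for the mean.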
Like regression itself, regression estimation works best when two assumptions hold: (1) a roughly linear relationship exists between \(y\) and \(x\), and (2) the variability of \(y\) around the regression line is approximately constant across the range of \(x\). If the first assumption fails, a transformation of \(x\) may help. If the second fails and variance increases with \(x\), a variance-stabilizing transformation may be considered, or ratio estimation (Section 13.4.2) might be more appropriate.
13.4.2 Ratio estimation
Ratio estimation assumes a proportional relationship between \(y\) and \(x\), so that \(y \approx R x\) for some constant multiplier \(R\). This implies the line relating \(y\) and \(x\) passes through (or near) the origin. We estimate \(R\) from the sample as \(\hat{R}\) and then use it to estimate the population mean as \(\bar{y}_{\text{ratio}} = \hat{R} \mu_x\), and the population total as \(\hat{t}_{\text{ratio}} = \hat{R} \tau_x\).
There are two common ways to define \(\hat{R}\), leading to two distinct ratio estimators. To keep things clear, we use the subscript RM for the ratio-of-means estimator and MR for the mean-of-ratios estimator in the expressions that follow.
13.4.2.1 Ratio-of-means
The estimator of the population mean is \[\begin{equation} \bar{y}_{\text{RM}} = \hat{R}\,\mu_x, \tag{13.25} \end{equation}\] where \[\begin{equation} \hat{R} = \frac{\bar{y}}{\bar{x}}. \end{equation}\] Equivalently, this ratio can be written \[\begin{equation} \hat{R} = \frac{\sum_{i=1}^n y_i}{\sum_{i=1}^n x_i}, \end{equation}\] since the factors of \(n\) in the sample means cancel out.
The standard error of \(\bar{y}_{\text{RM}}\) can be estimated as \[\begin{equation} s_{\bar{y}_{\text{RM}}} = \sqrt{\left(\frac{s^2_y + \hat{R}^2 s_x^2 - 2\hat{R}s_{xy}}{n}\right)\left(1-\frac{n}{N}\right)}, \tag{13.26} \end{equation}\] where \(s^2_y\) and \(s^2_x\) are the sample variances of \(y\) and \(x\), and \[\begin{equation} s_{xy} = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) \end{equation}\] is the sample covariance between \(x\) and \(y\). The FPC term should be dropped from (13.26) when appropriate, see Section 12.2.1.
The population total is estimated using \[\begin{equation} \hat{t}_{\text{RM}} = N \bar{y}_{\text{RM}} \tag{13.27} \end{equation}\] and its standard error is \[\begin{equation} s_{\hat{t}_{\text{RM}}} = N s_{\bar{y}_{\text{RM}}}. \tag{13.28} \end{equation}\]
Confidence intervals for the mean and total follow the usual form \[\begin{align*} \bar{y}_{\text{RM}} \pm t_{\text{df}, 1 - \frac{\alpha}{2}} s_{\bar{y}_{\text{RM}}},\\ \hat{t}_{\text{RM}} \pm t_{\text{df}, 1 - \frac{\alpha}{2}} s_{\hat{t}_{\text{RM}}}, \end{align*}\] with degrees of freedom \(df = n - 1\).
13.4.2.2 Mean-of-ratios
The estimator of the population mean is \[\begin{equation} \bar{y}_{\text{MR}} = \hat{R}\mu_x, \tag{13.29} \end{equation}\] where \[\begin{equation} \hat{R} = \frac{1}{n}\sum_{i=1}^n r_i, \tag{13.30} \end{equation}\] and \(r_i = y_i/x_i\) is the unit ratio.
The standard error of \(\bar{y}_{\text{MR}}\) can be estimated as \[\begin{equation} s_{\bar{y}_{\text{MR}}} = \mu_x \sqrt{\frac{s_r^2}{n}\left(1 - \frac{n}{N}\right)}, \tag{13.31} \end{equation}\] where \[\begin{equation} s_r^2 = \frac{1}{n-1}\sum_{i=1}^n \left(r_i - \hat{R}\right)^2. \tag{13.32} \end{equation}\] The FPC term should be dropped from (13.31) when appropriate, see Section 12.2.1.
The population total is estimated using \[\begin{equation} \hat{t}_{\text{MR}} = N \bar{y}_{\text{MR}} \tag{13.33} \end{equation}\] and its standard error is \[\begin{equation} s_{\hat{t}_{\text{MR}}} = N s_{\bar{y}_{\text{MR}}}. \tag{13.34} \end{equation}\]
Confidence intervals for the mean and total follow the usual form \[\begin{align*} \bar{y}_{\text{MR}} \pm t_{\text{df}, 1 - \frac{\alpha}{2}} s_{\bar{y}_{\text{MR}}},\\ \hat{t}_{\text{MR}} \pm t_{\text{df}, 1 - \frac{\alpha}{2}} s_{\hat{t}_{\text{MR}}}, \end{align*}\] with degrees of freedom \(df = n - 1\).
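A comparable sketch for the mean-of-ratios estimator, using the same assumed placeholder inputs (x
, y
, mu_x
, and N
) as the regression sketch in Section 13.4.1, follows.
# Mean-of-ratios estimator of the population mean of y.
n <- length(y)
r <- y / x                                              # Unit ratios.
R_hat <- mean(r)                                        # (13.30).
y_bar_MR <- R_hat * mu_x                                # Mean (13.29).
# Standard error (13.31); drop the (1 - n / N) FPC under an areal frame.
s_y_bar_MR <- mu_x * sqrt((var(r) / n) * (1 - n / N))
t <- qt(p = 1 - 0.05 / 2, df = n - 1)
c(y_bar_MR - t * s_y_bar_MR, y_bar_MR + t * s_y_bar_MR) # 95% CI for the mean.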
13.4.3 Double sampling (Two-phase)
Double sampling is useful when \(x\) is inexpensive to measure but its population mean \(\mu_x\) is unknown. The design proceeds in two phases. In phase 1, we draw a large sample of size \(n_1\) and measure only \(x\). In phase 2, we draw a nested subsample of size \(n_2\) from the phase-1 units and measure \(y\) on those. As a result, we have a large sample of \(x\) values (\(x_1\)) and a smaller set of paired observations (\(x_2\), \(y_2\)).
The two phases play complementary roles. Phase 1 provides a precise estimate of the population mean of \(x\), anchoring the estimation. Phase 2 focuses resources on measuring \(y\), allowing us to estimate how \(y\) varies with \(x\)—for example, through a fitted slope \(b_1\) or a ratio \(\hat{R}\). Just as in regression or ratio estimation, the idea is to start with the mean of \(y\) from the smaller sample and adjust it based on the relationship between \(x\) and \(y\) and how far the phase-2 sample mean \(\bar{x}_2\) deviates from the phase-1 mean \(\bar{x}_1\). In essence, phase 1 tells us where \(x\) is centered, and phase 2 tells us how to translate \(x\) into \(y\). When \(n_1\) is large, most of the uncertainty in the estimator arises from residual variation around the fitted relationship, rather than from estimating \(\bar{x}_1\).
13.4.3.1 Regression estimation
The regression estimator of the population mean under double sampling is \[\begin{equation} \bar{y}_{\text{Rd}} = \bar{y}_2 + b_1\left(\bar{x}_1 - \bar{x}_2\right), \tag{13.35} \end{equation}\] where the slope \(b_1\) is defined in (13.20), but computed here using only the \(n_2\) phase-2 observations.
The standard error of \(\bar{y}_{\text{Rd}}\) can be estimated as \[\begin{equation} s_{\bar{y}_{\text{Rd}}} = \sqrt{s^2_{y|x} \left(\frac{1}{n_2} + \frac{(\bar{x}_1 - \bar{x}_2)^2}{S_{xx}}\right)\left(1 - \frac{n_2}{n_1} \right) + \frac{s^2_y}{n_1}\left(1 - \frac{n_1}{N}\right)}, \tag{13.36} \end{equation}\] where \(s^2_{y|x}\) is the residual variance defined in (13.22), again computed using the phase-2 data.
Note that (13.36) includes two FPC terms: one for the phase-2 sample nested within phase 1, and one for the phase-1 sample drawn from the population. As before, the second FPC can be dropped under an areal frame (see Section 12.2.1).
The population total is estimated as \[\begin{equation} \hat{t}_{\text{Rd}} = N \bar{y}_{\text{Rd}} \tag{13.37} \end{equation}\] with standard error \[\begin{equation} s_{\hat{t}_{\text{Rd}}} = N s_{\bar{y}_{\text{Rd}}}. \tag{13.38} \end{equation}\]
Confidence intervals follow the usual form \[\begin{align*} \bar{y}_{\text{Rd}} \pm t_{\text{df}, 1 - \frac{\alpha}{2}} s_{\bar{y}_{\text{Rd}}},\\ \hat{t}_{\text{Rd}} \pm t_{\text{df}, 1 - \frac{\alpha}{2}} s_{\hat{t}_{\text{Rd}}}, \end{align*}\] with degrees of freedom \(df = n_2 - 2\).
13.4.3.2 Ratio-of-means
The estimator of the population mean is \[\begin{equation} \bar{y}_{\text{RMd}} = \hat{R}\bar{x}_1, \tag{13.39} \end{equation}\] where \[\begin{equation} \hat{R}=\frac{\bar{y}_2}{\bar{x}_2}. \tag{13.40} \end{equation}\]
The standard error of \(\bar{y}_{\text{RMd}}\) can be estimated as \[\begin{equation} s_{\bar{y}_{\text{RMd}}} = \sqrt{\left(1 - \frac{n_2}{n_1}\right)\left(\frac{\bar{x}_1}{\bar{x}_2}\right)^2\left(\frac{s_y^2 + \hat{R}^2s_x^2-2\hat{R}s_{xy}}{n_2}\right) + \frac{s^2_y}{n_1}\left(1 - \frac{n_1}{N}\right)}. \tag{13.41} \end{equation}\] If you’re using an areal sampling frame, it’s appropriate to drop the second FPC term (see Section 12.2.1 for discussion).
The population total is estimated using \[\begin{equation} \hat{t}_{\text{RMd}}=N\bar{y}_{\text{RMd}} \tag{13.42} \end{equation}\] and its standard error is \[\begin{equation} s_{\hat{t}_{\text{RMd}}}=Ns_{\bar{y}_{\text{RMd}}}. \tag{13.43} \end{equation}\]
Confidence intervals for the mean and total follow the usual form \[\begin{align*} \bar{y}_{\text{RMd}} \pm t_{\text{df}, 1 - \frac{\alpha}{2}} s_{\bar{y}_{\text{RMd}}},\\ \hat{t}_{\text{RMd}} \pm t_{\text{df}, 1 - \frac{\alpha}{2}} s_{\hat{t}_{\text{RMd}}}, \end{align*}\] with degrees of freedom \(df = n_2 - 1\).
13.4.3.3 Mean-of-ratios
The estimator of the population mean is \[\begin{equation} \bar{y}_{\text{MRd}} = \hat{R}\,\bar{x}_1, \tag{13.44} \end{equation}\] where \[\begin{equation} \hat{R} = \frac{1}{n_2}\sum_{i=1}^{n_2} r_i, \end{equation}\] and \(r_i = y_i/x_i\) is the unit ratio.
The standard error of \(\bar{y}_{\text{MRd}}\) can be estimated as \[\begin{equation} s_{\bar{y}_{\text{MRd}}} = \sqrt{\bar{x}_1^2\left(\frac{s_r^2}{n_2}\right)\left(1 - \frac{n_2}{n_1}\right) + \frac{s_y^2}{n_1}\left(1 - \frac{n_1}{N}\right)}, \tag{13.45} \end{equation}\] where \(s_r^2\) was defined in (13.32). If you’re using an areal sampling frame, it’s appropriate to drop the second FPC term (see Section 12.2.1 for discussion).
The population total is estimated using \[\begin{equation} \hat{t}_{\text{MRd}} = N \bar{y}_{\text{MRd}} \tag{13.46} \end{equation}\] and its standard error is \[\begin{equation} s_{\hat{t}_{\text{MRd}}} = N s_{\bar{y}_{\text{MRd}}}. \tag{13.47} \end{equation}\]
Confidence intervals for the mean and total follow the usual form \[\begin{align*} \bar{y}_{\text{MRd}} \pm t_{n_2-1, 1 - \frac{\alpha}{2}}\, s_{\bar{y}_{\text{MRd}}},\\ \hat{t}_{\text{MRd}} \pm t_{n_2-1, 1 - \frac{\alpha}{2}}\, s_{\hat{t}_{\text{MRd}}}, \end{align*}\] with degrees of freedom \(df = n_2 - 1\).
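As with the previous estimators, a minimal sketch helps fix the notation. Here x_1
is assumed to hold all phase-1 values, x_2
and y_2
the paired phase-2 observations, and N
the number of population units; these are placeholder names.
# Mean-of-ratios estimator under double sampling.
n_1 <- length(x_1)
n_2 <- length(x_2)
x_bar_1 <- mean(x_1)
r <- y_2 / x_2
R_hat <- mean(r)
y_bar_MRd <- R_hat * x_bar_1                                # Mean (13.44).
# Standard error (13.45); drop the second FPC under an areal frame.
s_y_bar_MRd <- sqrt(x_bar_1^2 * (var(r) / n_2) * (1 - n_2 / n_1) +
                    (var(y_2) / n_1) * (1 - n_1 / N))
t <- qt(p = 1 - 0.05 / 2, df = n_2 - 1)
c(y_bar_MRd - t * s_y_bar_MRd, y_bar_MRd + t * s_y_bar_MRd) # 95% CI.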
13.4.4 Illustrative analysis and extensions
The next few sections illustrate how the estimators we’ve introduced are applied in forest inventory settings. Each example highlights how auxiliary information in the form of a covariate can improve efficiency, and how the choice of estimator depends on available data, relationship between variables, sampling designs, and field constraints.
We begin in Section 13.4.4.1 with a timber sale cruise, where ratio estimation offers a simple and efficient approach to estimating total volume. Section 13.4.4.2 turns to the challenge of quantifying insect damage using remotely sensed data, where double sampling proves useful. Finally, Sections 13.4.4.3 and 13.4.4.4 explore more advanced inventory designs built around ratio and regression estimators.
13.4.4.1 Using ratio estimators to estimate timber volume
This illustration is adapted from a short essay by Ken Desmarais, who at the time was the Assistant District Ranger in the White Mountain National Forest, New Hampshire. The piece, entitled Marcy and Ted Take a Crack at Ratio Estimators, is both fun and informative, and we encourage you to read the original (included in the datasets/WMNF
directory). Here we present the key elements, focusing on the methods and analysis.
13.4.4.1.1 Setting
In this example, the crew needed to mark every merchantable tree and estimate its volume for a spruce–fir timber sale in the White Mountain National Forest. This is a demanding task if DBH and height must be measured for every stem and then entered into a volume equation to calculate volume.
Rather than measure every tree, Marcy, the heroine of the story, suggests using ratio estimation. Her idea is to guess DBH and height for each tree, yielding quick but coarse volume estimates, and then measure a random subset of trees to get accurate volumes. The guesses were made by ocular assessment only—no instruments—just a quick guess at the tree’s DBH class and number of 16-foot logs (height), which were then run through a volume table to produce coarse volume estimates. The ratio estimator is used to correct these guesses, producing an accurate and precise estimate of the total sale volume.
Mapping this back to our regression sampling methods, the coarse (guessed) tree volumes play the role of \(x\). Since every tree is guessed, the population mean \(\mu_x\) is known. The \(y\) values come from trees whose DBH and height were measured and then run through a volume equation. The parameter of interest is \(\mu_y\), the true mean tree volume, which can then be scaled by \(N\) to estimate the total sale volume \(\tau_y = N \mu_y\).
13.4.4.1.2 Sampling design and implementation
Each tree in the sale was marked and assigned an ocular estimate of 2-inch DBH class and number of 16-foot logs. These coarse values were run through a balsam-fir volume table to yield a guessed volume (in board feet).
A subsample of trees was selected for measurement. This subsample had to be chosen using a random mechanism (e.g., SRS or SYS). To simplify implementation in the field, a SYS design was used. After a random start, every 200-th tree received a DBH and height measurement. The crew figured they needed about 20 observations to estimate the relationship between guessed and measured volume, i.e., \(n = 20\) \((x, y)\) pairs should give a good estimate of \(R\). They estimated about 4,000 trees in the sale, so a SYS sampling interval of 200 should yield the desired sample size \(n\).
To start the systematic sample, a random start between 1 and 200 was chosen; the draw produced 97. Starting from the first tree, each was assigned a guessed volume. When the crew reached the 97-th tree, they guessed as usual but also took a measurement. The same was done for the 297-th tree, the 497-th, and so on, until all trees were marked.
13.4.4.1.3 Data and analysis
In the end, \(N\) = 4,181 trees were marked and included in the sale. Table 13.2 shows the \(n = 20\) trees that received both a guess and a measurement. Guessed volumes are coarse values from a volume table; measured volumes are based on DBH and height entered into a volume equation, providing more accurate estimates.156 The mean guessed volume across all \(N\) trees was \(\mu_x = 62.1\) board feet per tree.
Tree | Species | Guessed DBH (in) | Guessed logs (16-ft) | Guessed vol. (bd ft) | Measured DBH (in) | Measured height (ft) | Measured vol. (bd ft) |
---|---|---|---|---|---|---|---|
1 | rs | 10 | 1.5 | 48 | 10.4 | 27 | 57.9 |
2 | bf | 8 | 1.0 | 24 | 8.6 | 18 | 27.4 |
3 | rs | 12 | 1.5 | 74 | 11.8 | 28 | 79.9 |
4 | rs | 10 | 1.5 | 48 | 10.2 | 25 | 52.4 |
5 | rs | 12 | 2.0 | 92 | 12.4 | 36 | 107.0 |
6 | rs | 12 | 2.0 | 92 | 12.1 | 38 | 104.9 |
7 | bf | 10 | 1.5 | 48 | 10.7 | 24 | 57.0 |
8 | rs | 14 | 2.5 | 153 | 14.8 | 43 | 182.7 |
9 | rs | 12 | 2.0 | 92 | 11.5 | 31 | 80.8 |
10 | rs | 10 | 1.5 | 48 | 10.7 | 28 | 63.6 |
11 | rs | 8 | 1.0 | 24 | 7.7 | 18 | 20.6 |
12 | bf | 8 | 1.0 | 24 | 8.6 | 19 | 28.5 |
13 | rs | 8 | 1.0 | 24 | 8.2 | 16 | 22.2 |
14 | rs | 10 | 1.0 | 36 | 10.8 | 18 | 47.6 |
15 | rs | 8 | 1.0 | 24 | 8.1 | 14 | 19.5 |
16 | bf | 12 | 1.5 | 74 | 12.2 | 27 | 84.2 |
17 | rs | 14 | 1.5 | 105 | 14.8 | 28 | 134.6 |
18 | bf | 12 | 2.0 | 92 | 11.7 | 30 | 82.3 |
19 | rs | 10 | 1.0 | 36 | 10.3 | 18 | 42.5 |
20 | rs | 10 | 1.5 | 48 | 9.7 | 25 | 46.5 |
The paired data in Table 13.2 are available in datasets/WMNF/sf_xy.csv
. We read them into sf_xy
and use a scatter plot (Figure 13.9) to characterize the relationship between guessed and measured volumes, helping inform our choice of estimator.
sf_xy <- read_csv("datasets/WMNF/sf_xy.csv")
glimpse(sf_xy)
#> Rows: 20
#> Columns: 8
#> $ tree_id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1…
#> $ species <chr> "rs", "bf", "rs", "rs", "rs", "r…
#> $ dbh_guess <dbl> 10, 8, 12, 10, 12, 12, 10, 14, 1…
#> $ logs_guess <dbl> 1.5, 1.0, 1.5, 1.5, 2.0, 2.0, 1.…
#> $ vol_guess <dbl> 48, 24, 74, 48, 92, 92, 48, 153,…
#> $ dbh_measure <dbl> 10.4, 8.6, 11.8, 10.2, 12.4, 12.…
#> $ ht_measure <dbl> 27, 18, 28, 25, 36, 38, 24, 43, …
#> $ vol_measured <dbl> 57.9, 27.4, 79.9, 52.4, 107.0, 1…
FIGURE 13.9: Guessed (\(x\)) versus measured (\(y\)) tree volume for the paired trees listed in Table 13.2. A simple linear regression line is drawn for exploratory analysis.
The scatter in Figure 13.9 shows a strong positive relationship between guessed (\(x\)) and measured (\(y\)) volumes, and the fitted line crosses the \(y\)-axis near the origin. With only 20 pairs it’s hard to assess variance patterns, but the spread looks roughly constant. This supports using the ratio-of-means estimator defined in Section 13.4.2.1.
The code below computes the ratio-of-means estimates for mean volume (13.25) and total volume (13.27), along with their respective standard errors (13.26) and (13.28), and a 95% confidence interval for the total.
# Some known values.
N <- 4181 # Trees.
mu_x <- 62.1 # (bd ft).
n <- nrow(sf_xy)
# Pull out variables.
x <- sf_xy %>% pull(vol_guess)
y <- sf_xy %>% pull(vol_measured)
# Ratio estimate.
R_hat <- sum(y) / sum(x)
# Mean estimate (bd ft/tree).
y_bar <- R_hat * mu_x
# Total estimate (bd ft).
t_hat <- N * y_bar
# Standard error of the mean.
s_y_bar <- sqrt((var(y) +
R_hat^2 * var(x) -
2 * R_hat * cov(x, y)) / n *
(1 - n / N))
# Standard error of the total.
s_t_hat <- N * s_y_bar
# 95% CI
t <- qt(df = n - 1, p = 1 - 0.05 / 2)
ci_lower <- t_hat - t * s_t_hat
ci_upper <- t_hat + t * s_t_hat
From the code above, after rounding, the computed \(\hat{R}\) is 1.11. This multiplier converts the population mean of \(x\) into an estimate of the population mean of \(y\). The estimated total sale volume is 288,941 board feet, with a 95% confidence interval of (270,274, 307,608).
These results come from just 20 measured trees. Guessing takes time, but it’s a fraction of what it would take to measure DBH and height on every stem. That’s the payoff of a good design: many quick, inexpensive values adjusted by a small set of careful measurements. We’ll return to variations on this theme in the sections that follow.
13.4.4.2 Double sampling for dead-tree totals
In the previous example, the auxiliary variable \(x\) was observed for every tree in the sale. In practice, we often can’t afford to collect \(x\) for the entire population. As described in Section 13.4.3, a common workaround is double sampling, where a relatively inexpensive measurement is taken on a large phase-1 sample, and a more accurate but costly measurement is taken on a smaller, nested phase-2 sample. Remote sensing is often used in phase 1 to provide broad, inexpensive coverage, while phase 2 relies on targeted field data collection for more detailed and accurate measurements.
This example involves a forest manager who wants to estimate the total number of dead trees in a 400-acre area under heavy insect infestation. The area is divided into \(N = 200\) equal-area tiles. Aerial photo counts are collected for a phase-1 sample of \(n_1 = 100\) tiles. A nested phase-2 subsample of \(n_2 = 10\) of those tiles is visited for a field-based mortality count survey. The aerial counts (\(x\)) are fairly accurate, but some subcanopy tree mortality is not detectable—the field survey measurements (\(y\)) are more accurate.
Our goal is to estimate total mortality across the full area using a double sampling estimator. For comparison, we’ll also estimate total mortality using only the phase-2 field measurements, ignoring the phase-1 data.
13.4.4.2.1 Data and setup
Here, the phase-1 and phase-2 data are stored in a single file, with one row per sampled tile.
Column names are:
- tile_id: unique identifier for each sampled tile (\(n_1\) rows);
- x: aerial photo count (available for all \(n_1\) tiles);
- y: field-based mortality count (available for \(n_2\) tiles; NA otherwise).
We begin by reading the data into mort
. From this, we extract three vectors: \(x_1\) contains all phase-1 counts; \(x_2\) is the subset of aerial counts paired with field measurements; and \(y_2\) holds the phase-2 field measurements. These align with the notation used in our estimators.
mort <- read_csv("datasets/insects/mortality_two_phase.csv")
# Extract the phase-1 aerial counts.
x_1 <- mort %>% pull(x)
# Extract the paired phase-2 observations.
x_2 <- mort %>% filter(!is.na(y)) %>% pull(x)
y_2 <- mort %>% filter(!is.na(y)) %>% pull(y)
FIGURE 13.10: Paired aerial (\(x_2\)) and field (\(y_2\)) mortality counts for ten tiles. The fitted regression line is used only to characterize the relationship and help choose an appropriate estimator. The gray dashed line marks the 1:1 reference.
Figure 13.10 shows the paired \((x_2, y_2)\) values for the ten tiles with both aerial and field-based counts. The fitted line has a \(y\)-intercept well above the origin, the relationship appears roughly linear, and the scatter of points about that line has fairly constant variance across values of \(x\). Given these observations, the regression estimator is more appropriate than the ratio estimator. However, for comparison, we’ll apply both the regression and ratio-of-means estimators, and compare their results to a baseline SRS estimate based on the \(n_2\) field measurements alone.
As part of the comparison, we’ll compute the relative standard error (RSE), which is similar in form to the coefficient of variation (CV) but uses the standard error of the estimate rather than the sample standard deviation. The RSE is typically expressed as a percent and is defined as \[\begin{equation} \text{RSE} = 100 \cdot \frac{\text{standard error of the estimate}}{\text{estimate}}. \tag{13.48} \end{equation}\] All else equal, estimators with lower RSE are preferred because they reflect greater precision relative to the magnitude of the estimate.
In this case, since we’re estimating the total number of dead trees, the numerator in (13.48) is the standard error of the total, and the denominator is the corresponding estimated total.
The code below applies the three estimators and collects their results into a summary table.
# Known values.
N <- 200
n_1 <- length(x_1)
n_2 <- length(x_2)
# Means and variances.
x_bar_1 <- mean(x_1)
x_bar_2 <- mean(x_2)
y_bar_2 <- mean(y_2)
s_sq_x_2 <- var(x_2)
s_sq_y_2 <- var(y_2)
s_xy <- cov(x_2, y_2)
# Ratio-of-means (double sampling).
R_hat_RMd <- y_bar_2 / x_bar_2
y_bar_RMd <- R_hat_RMd * x_bar_1
s_y_bar_RMd <- sqrt((1 - n_2 / n_1) *
((x_bar_1 / x_bar_2)^2 *
((s_sq_y_2 +
R_hat_RMd^2 * s_sq_x_2 -
2 * R_hat_RMd * s_xy) / n_2)) +
(s_sq_y_2 / n_1) * (1 - n_1 / N))
t_hat_RMd <- N * y_bar_RMd
s_t_hat_RMd <- N * s_y_bar_RMd
RSE_t_hat_RMd <- 100 * s_t_hat_RMd / t_hat_RMd
# Regression (double sampling).
S_xx <- sum((x_2 - x_bar_2)^2)
S_yy <- sum((y_2 - y_bar_2)^2)
S_xy <- sum((x_2 - x_bar_2) * (y_2 - y_bar_2))
b1 <- S_xy / S_xx
s_sq_y_given_x <- (S_yy - S_xy^2 / S_xx) / (n_2 - 2)
y_bar_Rd <- y_bar_2 + b1 * (x_bar_1 - x_bar_2)
s_y_bar_Rd <- sqrt(s_sq_y_given_x *
(1 / n_2 + (x_bar_1 - x_bar_2)^2 / S_xx) *
(1 - n_2 / n_1) +
(s_sq_y_2 / n_1) *
(1 - n_1 / N)
)
t_hat_Rd <- N * y_bar_Rd
s_t_hat_Rd <- N * s_y_bar_Rd
RSE_t_hat_Rd <- 100 * s_t_hat_Rd / t_hat_Rd
# Phase-2 only, SRS.
y_bar_SRS <- y_bar_2
s_y_bar_SRS <- sqrt((s_sq_y_2 / n_2) * (1 - n_2 / N))
t_hat_SRS <- N * y_bar_SRS
s_t_hat_SRS <- N * s_y_bar_SRS
RSE_t_hat_SRS <- 100 * s_t_hat_SRS / t_hat_SRS
# Output summary.
tibble(
method = c("SRS (Phase-2)", "Ratio (RMd)", "Regression (Rd)"),
y_bar = c(y_bar_SRS, y_bar_RMd, y_bar_Rd),
s_y_bar = c(s_y_bar_SRS, s_y_bar_RMd, s_y_bar_Rd),
t_hat = c(t_hat_SRS, t_hat_RMd, t_hat_Rd),
s_t_hat = c(s_t_hat_SRS, s_t_hat_RMd, s_t_hat_Rd),
RSE_t_hat = c(RSE_t_hat_SRS, RSE_t_hat_RMd, RSE_t_hat_Rd)
)
#> # A tibble: 3 × 6
#> method y_bar s_y_bar t_hat s_t_hat RSE_t_hat
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 SRS (Phase-2) 24.5 2.99 4904. 598. 12.2
#> 2 Ratio (RMd) 23.9 2.21 4788. 442. 9.24
#> 3 Regression (Rd) 24.2 1.06 4836. 212. 4.39
The results in the table above highlight the benefit of incorporating auxiliary information from the phase-1 aerial counts. Both the regression and ratio estimators yield substantially smaller RSE compared with the phase-2-only SRS estimator.
Because these are simulated data, we know the true population values: \(\mu_y = 24.75\) dead trees per tile and \(\tau_y = 4951\) dead trees across the full 400 acres. Based on these known values, both the SRS and regression estimators provide accurate estimates, while the ratio estimator appears to slightly underestimate the total.
As suggested by the exploratory plot in Figure 13.10, the regression estimator is more appropriate than the ratio estimator due to the nonzero intercept. Ratio estimation assumes a linear relationship between \(y\) and \(x\) that passes through the origin, and when this assumption is violated, the estimator can be biased. While the low ratio estimate hints at this bias, a full repeated sampling analysis—like the one in Section 13.3.6.1—would be needed to quantify it. Still, given the evidence from Figure 13.10 and the substantially lower RSE, it’s clear that the regression estimator is both more appropriate and more efficient for these data.
This example underscores the importance of EDA when auxiliary variables are involved. A simple scatterplot with a fitted line can reveal whether a nonzero intercept is warranted, guiding the choice between ratio and regression estimation. In this case, visual inspection of the \((x, y)\) relationship provided an early and reliable indication that regression was the better choice.
13.4.4.3 Big BAF sampling
This section continues our focus on combining inexpensive measurements with a targeted subset of more costly measurements to improve parameter estimates in terms of both statistical and operational efficiency. Big BAF sampling fits squarely within this theme. It builds on the principles of horizontal point sampling (Section 12.4) and introduces an elegant way to obtain per-unit-area and total estimates by splitting the estimation process into two parts: a count-based estimate of basal area (\(\text{BA}\)) per unit area, and a tree-based estimate of volume-to-basal-area ratio (\(\text{VBAR}\)). While the method was originally devised to estimate volume, it can be applied more broadly to other parameters of interest. As initially introduced in Section 12.4.6, think of the “V” in VBAR more generically as the variable of interest.
The Big BAF idea is simple: conduct a tree count at a sampling location using a standard BAF (e.g., one that yields about 4 to 10 “in” trees). These tree counts provide an estimate of \(\text{BA}\) per unit area using point-level summaries, as described in Section 12.4. Then, using a second sweep at the same sampling location, apply a larger BAF to identify a subset of trees to measure. These measurement trees are used to calculate tree-level \(\text{VBAR}\) values by recording both DBH (used to compute \(\text{BA}\)) and volume (or another variable of interest). The average of these tree-level \(\text{VBAR}\) values is computed across all measurement trees—that is, pooled across sampling locations. This average \(\text{VBAR}\) is then multiplied by the \(\text{BA}\) estimate from the tree counts to yield an estimate of volume per unit area.
Counting trees with a standard-size BAF is fast and inexpensive, which makes it feasible to visit many sample locations. Measuring DBH and other tree variables to calculate VBAR, by contrast, is more time-consuming and costly. The Big BAF method balances this tradeoff by using a big BAF to identify a smaller subset of trees to measure at each location. For example, if a standard-size BAF yields an average of 8 count trees per point, a big BAF that is four times larger would result in about 2 measurement trees on average—just one-fourth as many.
Several studies and personal observations have shown that BA per unit area estimates are typically more variable than VBAR estimates, so it’s generally more efficient to invest in a larger number of sampling locations (for BA counts) and fewer tree measurements (for VBAR). The Big BAF design supports this strategy naturally and allows us to concentrate effort where it’s most needed.
Kim Iles conceived the Big BAF method, building on earlier related ideas and methods developed by John F. Bell and others, as well as foundational work by Lewis R. Grosenbaugh in the 1940s and 1950s. A thorough treatment of the topic—including historical background, estimator behavior, variations, and sample size versus cost considerations—can be found in K. Iles (2003), Marshall, Iles, and Bell (2004), Gove et al. (2020), and references therein. These sources provide practical guidance on selecting both the number of sampling locations and the number of measurement trees, along with deeper insights into estimator performance and design tradeoffs.
Big BAF is a form of double sampling with a ratio estimator, although it’s not like the double sampling introduced in Section 13.4.3, where phase-1 and phase-2 are a sample and subsample of sampling units (e.g., plots or points). Under Big BAF, the phases occur at the tree level: phase-1 can be seen as all count trees identified using the standard-size BAF and used to estimate \(\text{BA}\) per unit area through point-level summaries; phase-2 consists of the subset of trees measured at each sampling location using the big BAF and used to estimate \(\text{VBAR}\).157 As we’ll see in the next section, the estimator follows the same spirit as other ratio estimators—correcting a coarse, inexpensive estimate using a more accurate, but costly, measurement.
13.4.4.3.1 Estimation using Big BAF
As described above, the Big BAF estimator comprises two components: (1) the mean basal area per unit area (\(\overline{\text{BA}}\)), based on point-level estimates using a standard-size BAF, and (2) the mean volume-to-basal-area ratio (\(\overline{\text{VBAR}}\)), based on tree-level measurements using a big BAF. Throughout, we’ll use superscripts \(s\) and \(b\) to distinguish quantities associated with the small (standard-size) and big BAFs, respectively.
Following the methods in Section 12.4, the mean basal area per unit area is estimated using \[\begin{equation} \overline{\text{BA}} = \frac{1}{n} \sum_{i=1}^{n} \text{BA}_i, \tag{13.49} \end{equation}\] where \(n\) is the number of sampling locations and \(\text{BA}_i = m^s_i \cdot \text{BAF}^s\) is the observed basal area per unit area at the \(i\)-th location, with \(m^s_i\) being the number of count trees selected using the small BAF (BAF\(^s\)).
We define \(\text{VBAR}\) for the \(j\)-th measurement tree at the \(i\)-th sampling location as
\[\begin{equation}
\text{VBAR}_{ij} = \frac{v_{ij}}{\text{BA}_{ij}}
\end{equation}\]
and compute the mean across all measurement trees as
\[\begin{equation}
\overline{\text{VBAR}} = \frac{1}{m^b} \sum_{i=1}^n \sum_{j=1}^{m^b_i} \text{VBAR}_{ij},
\tag{13.50}
\end{equation}\]
where \(m^b_i\) is the number of measurement trees selected using the big BAF (BAF\(^b\)) at the \(i\)-th sampling location, \(n\) is the number of sampling locations, and \(m^b = \sum_{i=1}^n m^b_i\) is the total number of measurement trees across all locations. Note that, unlike the point-level summary of \(\text{BA}\), which includes locations with zero count trees, the mean \(\overline{\text{VBAR}}\) is defined only over measurement trees, so only locations with at least one measurement tree contribute to the sum (i.e., if the \(i\)-th sampling location has \(m^b_i = 0\), it’s excluded from the sum).
The Big BAF estimator for volume per unit area is then given by \[\begin{equation} \hat{v} = \overline{\text{BA}} \cdot \overline{\text{VBAR}}. \tag{13.51} \end{equation}\]
It’s useful to verify how the units work out when applying (13.51). For example, suppose volume is measured in cubic meters and basal area is reported in square meters per hectare. Then (13.51) with units becomes \[\begin{align} \hat{v}\left(\frac{\text{m}^3}{\text{ha}}\right) &= \overline{\text{BA}}\left(\frac{\text{m}^2}{\text{ha}}\right) \cdot \overline{\text{VBAR}} \left(\frac{\text{m}^3}{\text{m}^2}\right) \nonumber\\ &= \overline{\text{BA}}\left(\frac{\cancel{\text{m}^2}}{\text{ha}}\right) \cdot \overline{\text{VBAR}} \left(\frac{\text{m}^3}{\cancel{\text{m}^2}}\right). \end{align}\] The \(\text{m}^2\) units cancel, yielding the desired volume per unit area.
Because \(\hat{v}\) is the product of two sample means—each estimated from a different sample—its standard error must account for the variability in both components. A common and practical approach is to use Bruce’s formula, which assumes that \(\overline{\text{BA}}\) and \(\overline{\text{VBAR}}\) are uncorrelated and combines their uncertainty in quadrature \[\begin{equation} s_{\hat{v}} \approx \sqrt{ s_{\overline{\text{BA}}}^2 + s_{\overline{\text{VBAR}}}^2 }. \tag{13.52} \end{equation}\] This approach is well established in forestry sampling (Bell 1957; D. Bruce 1961) and is explained in the context of Big BAF estimation by Gove et al. (2020).
Each standard error is estimated from its respective sample data using (11.21).158
The standard error for \(\overline{\text{BA}}\) is
\[\begin{equation}
s_{\overline{\text{BA}}} = \frac{s_{\text{BA}}}{\sqrt{n}},
\tag{13.53}
\end{equation}\]
where
\[\begin{equation}
s_{\text{BA}} = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} \left( \text{BA}_i - \overline{\text{BA}} \right)^2}.
\end{equation}\]
The standard error for \(\overline{\text{VBAR}}\) is
\[\begin{equation}
s_{\overline{\text{VBAR}}} = \frac{s_{\text{VBAR}}}{\sqrt{m^b}},
\tag{13.54}
\end{equation}\]
where
\[\begin{equation}
s_{\text{VBAR}} = \sqrt{ \frac{1}{m^b - 1} \sum_{i=1}^{n} \sum_{j=1}^{m^b_i} \left( \text{VBAR}_{ij} - \overline{\text{VBAR}} \right)^2 }.
\end{equation}\]
Because \(\overline{\text{BA}}\) and \(\overline{\text{VBAR}}\) are on wholly different scales (e.g., \(\text{m}^2/\text{ha}\) and \(\text{m}^3/\text{m}^2\), respectively), we apply (13.52) using percent standard errors. This avoids scale incompatibility and offers a clearer sense of relative precision. Rewriting (13.52) using percent standard errors gives
\[\begin{equation}
s_{\hat{v}\%} \approx \sqrt{s_{\overline{\text{BA}}\%}^2 + s_{\overline{\text{VBAR}}\%}^2},
\tag{13.55}
\end{equation}\]
where the percent standard errors for \(\overline{\text{BA}}\) and \(\overline{\text{VBAR}}\) are defined as
\[\begin{equation}
s_{\overline{\text{BA}}\%} = 100 \cdot \frac{s_{\overline{\text{BA}}}}{\overline{\text{BA}}}
\tag{13.56}
\end{equation}\]
and
\[\begin{equation}
s_{\overline{\text{VBAR}}\%} = 100 \cdot \frac{s_{\overline{\text{VBAR}}}}{\overline{\text{VBAR}}}.
\tag{13.57}
\end{equation}\]
In addition to aligning the two error components on a common scale, expressing the standard errors as percents in (13.55) highlights how each component contributes to the overall uncertainty in \(\hat{v}\). It also provides a clear framework for evaluating tradeoffs between count and measurement effort; see, e.g., K. Iles (2003) and Marshall, Iles, and Bell (2004).
To return the standard error to the original units of \(\hat{v}\), simply multiply the percent standard error by the estimate and divide by 100 \[\begin{equation} s_{\hat{v}} = \hat{v} \cdot \frac{s_{\hat{v}\%}}{100}. \tag{13.58} \end{equation}\]
If a confidence interval is desired, a \(t\)-value can be computed using the Welch–Satterthwaite approximation for degrees of freedom (Welch 1947). This approach is appropriate when combining uncertainty from two independent sample means with unequal variances, and it aligns with our application of Bruce’s formula to estimate the standard error. The approximate degrees of freedom are given by
\[\begin{equation}
\text{df} \approx \frac{
s_{\overline{\text{BA}}\%}^2 + s_{\overline{\text{VBAR}}\%}^2
}{
\frac{s_{\overline{\text{BA}}\%}^4}{n - 1} + \frac{s_{\overline{\text{VBAR}}\%}^4}{m^b - 1}
}.
\tag{13.59}
\end{equation}\]
With this degrees of freedom approximation, the confidence interval for \(\hat{v}\) is \[\begin{equation} \hat{v} \pm t_{\text{df}, 1 - \frac{\alpha}{2}} s_{\hat{v}}. \tag{13.60} \end{equation}\]
To estimate totals, simply multiply the mean estimate \(\hat{v}\) and its standard error \(s_{\hat{v}}\) by the population area.
A few general observations about Big BAF are worth noting. As mentioned earlier, the estimates for BA and VBAR need not come from the same locations or sampling designs; it’s simply convenient to pair them. When sampling locations follow a SYS design, combining the two sampling efforts at the same points helps ensure good spatial coverage and representation of the population.
The method offers great flexibility in how BA and VBAR are sampled, and at what intensity. Working with percent standard errors makes optimal allocation of sampling resources a straightforward exercise, assuming costs can be assigned to sampling locations (i.e., cost per unit of \(n\)), tree counting (for BA), and tree measurement (for VBAR), and that reasonable percent standard error estimates are available (e.g., from prior inventories).
The standard error estimators presented here for \(\overline{\text{VBAR}}\) and \(\hat{v}\) assume independence between variables—that is, they assume that volume and basal area are not correlated. The standard error for \(\overline{\text{VBAR}}\) treats each tree-level ratio as a draw from a single random variable, ignoring any correlation between numerator and denominator. Similarly, Bruce’s formula assumes that \(\overline{\text{BA}}\) and \(\overline{\text{VBAR}}\) are independent and omits a third term in the general error propagation formula that would account for their covariance (see, e.g., Gove et al. (2020)). While these assumptions may not strictly hold, they’re generally reasonable and, even when violated, tend to have little practical impact on estimation in typical forest inventory applications.
As Marshall (Marshall, Iles, and Bell 2004) notes, a useful feature of Big BAF is that it provides an internal check on basal area estimates. Because the count and measurement data use different BAFs, they yield independent estimates of BA. If those estimates differ substantially, it may point to issues with field protocols that warrant review.
If estimates are to be broken down by species, product type, strata, or any other category requiring separate VBAR estimates, care must be taken to ensure enough measurement trees are sampled to support reliable estimation within each group. The more categories considered, the more measurement trees are needed to inform the VBAR estimates—which has cost implications.
13.4.4.3.2 Apply Big BAF
To illustrate Big BAF, we return once more to the Harvard Forest dataset. As in earlier examples, this is a simplified setting, but it allows us to walk through the full workflow using real data and compare sample-based estimates to known population values.
The objective is to estimate mean AGB per hectare for all stems with DBH \(\ge\) 12.7 cm across the 35 ha forest.
We again use the sys_grid()
function, introduced in Section 13.2, to generate a systematic grid of sampling locations across the forest. For this example, the spacing is 125 by 125 meters, yielding \(n = 24\) sampling locations (Figure 13.11).
FIGURE 13.11: Systematic sample locations for the Harvard Forest Big BAF example.
We’ll use metric BAF\(^s\) = 5 for count trees and BAF\(^b\) = 20.25 for measurement trees.159 This setup implies about one measurement tree for every four count trees (since 5/20.25 \(\approx\) 0.25).
For both count and measurement trees, the walkthrough method was used to address boundary slopover bias. Trees with walkthrough points falling outside the boundary received a count of two; see Section 12.3.3 for details. As in previous sections, sampling locations with no trees were assigned a tree count of zero.
In the analysis that follows, we’ll first apply the Big BAF mean and standard error estimators outlined in the previous section. Then, for comparison, we’ll repeat the analysis assuming measurements were taken on all count trees (i.e., all “in” trees under BAF\(^s\)), representing a more traditional cruise.
The data are located in the "datasets/BigBAF/"
directory. "HF_BAF_s_count.csv"
contains the count data collected using BAF\(^s\), and "HF_BAF_b_measure.csv"
contains the measurement data collected using BAF\(^b\). We begin by reading these files and inspecting their structure.
count_trees <- read_csv("datasets/BigBAF/HF_BAF_s_count.csv")
measurement_trees <- read_csv("datasets/BigBAF/HF_BAF_b_measure.csv")
glimpse(count_trees)
#> Rows: 194
#> Columns: 2
#> $ point_id <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3…
#> $ tree_count <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1…
glimpse(measurement_trees)
#> Rows: 54
#> Columns: 4
#> $ point_id <dbl> 1, 1, 2, 3, 4, 4, 4, 5, 6, 6, 6, 7…
#> $ dbh_cm <dbl> 48.0, 48.6, 22.6, 0.0, 15.9, 33.6,…
#> $ agb_Mg <dbl> 2.22710, 2.29694, 0.44771, 0.00000…
#> $ tree_count <dbl> 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1…
Let’s get a sense of the number of count (\(m^s_i\)) and measurement trees (\(m^b_i\)), based on their respective BAFs, across sampling locations (which are uniquely identified using point_id
).
count_trees %>%
group_by(point_id) %>%
summarize(m_s_i = sum(tree_count)) %>%
summarize(min = min(m_s_i), max = max(m_s_i),
mean = mean(m_s_i), total = sum(m_s_i))
#> # A tibble: 1 × 4
#> min max mean total
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 18 8.12 195
measurement_trees %>%
group_by(point_id) %>%
summarize(m_b_i = sum(tree_count)) %>%
summarize(min = min(m_b_i), max = max(m_b_i),
mean = mean(m_b_i), total = sum(m_b_i),
n_zero = sum(m_b_i == 0))
#> # A tibble: 1 × 5
#> min max mean total n_zero
#> <dbl> <dbl> <dbl> <dbl> <int>
#> 1 0 5 2.08 50 4
Given these summaries, as expected, there are roughly four times as many count trees as measurement trees (195 vs. 50 total). The number of count trees per sampling location ranged from 1 to 18, with an average of about 8. This wide range in the number of “in” trees reflects the spatial variability in diameter distributions across the forest. Measurement trees ranged from 0 to 5 per location, with an average of about 2. As indicated by n_zero
, 4 sampling locations had no measurement trees.
Next, using the count trees, we estimate mean basal area (BA, in \(\text{m}^2/\text{ha}\)) using (13.49), standard error of the mean using (13.53), and standard error expressed as a percent using (13.56).
BAF_small <- 5 # Metric BAF^s (m^2/ha per count tree).
BA <- count_trees %>%
  group_by(point_id) %>%
  summarize(m_i = sum(tree_count), # (trees per point).
            BA = m_i*BAF_small) %>% # (m^2/ha per point).
  summarize(n = n(), # Number of points.
            mean = mean(BA), # (m^2/ha).
            se = sd(BA)/sqrt(n()), # (m^2/ha).
            se_percent = 100*se/mean) # Std. err. as % of mean.
BA
#> # A tibble: 1 × 4
#> n mean se se_percent
#> <int> <dbl> <dbl> <dbl>
#> 1 24 40.6 3.90 9.61
Now we estimate mean \(\text{VBAR}\) (\(\text{m}^3/\text{m}^2\)) using (13.50), standard error using (13.54), and standard error expressed as a percent of the mean using (13.57).
Recall that in our cruise data, we included a tree_count
of zero to track which sampling locations had no measurement trees. However, as noted below (13.50), these zeros should not be included in the calculation of \(\overline{\text{VBAR}}\), so we filter them out using filter(tree_count > 0)
in the subsequent piped sequence below.
After removing those rows, tree_count
takes on values of 1 or 2. A value of 2 indicates the tree was double-counted based on the walkthrough method. Because we’re computing the mean across trees (not across sampling points), the cleanest way to handle this is to duplicate the row for any tree with tree_count == 2
. We do this using uncount(tree_count)
from the tidyr
package, which replicates rows according to a count column.160
c <- pi / (10000 * 4) # Converts DBH in cm to basal area in m^2.
VBAR <- measurement_trees %>%
  filter(tree_count > 0) %>%
  uncount(tree_count) %>%
  mutate(BA = c*dbh_cm^2, # (m^2).
         VBAR = agb_Mg/BA) %>% # (Mg/m^2 per tree).
  summarize(m = n(), # Number of trees.
            mean = mean(VBAR), # (Mg/m^2).
            se = sd(VBAR)/sqrt(m), # (Mg/m^2).
            se_percent = 100*se/mean) # Std. err. as % of mean.
VBAR
#> # A tibble: 1 × 4
#> m mean se se_percent
#> <int> <dbl> <dbl> <dbl>
#> 1 50 8.93 0.508 5.69
Lastly, given BA
and VBAR
, we compute the Big BAF estimate of AGB per unit area. The code below implements (13.51) and (13.55), and also converts the percent standard error into the standard error in AGB units using (13.58).
# Big BAF estimate of mean AGB (Mg/ha), following (13.51).
agb_Mg_ha <- BA$mean * VBAR$mean
agb_Mg_ha
#> [1] 362.69
# Standard error as percent of biomass.
se_agb_Mg_percent <- sqrt(BA$se_percent^2 + VBAR$se_percent^2)
# Standard error (Mg/ha).
se_agb_Mg <- agb_Mg_ha * se_agb_Mg_percent/100
se_agb_Mg
#> [1] 40.519
As discussed earlier, it’s common for VBAR to be less variable than BA. This pattern holds for these data as well, with \(s_{\overline{\text{VBAR}}\%} = 5.69\) and \(s_{\overline{\text{BA}}\%} = 9.61\). This suggests that estimation gains would come more from adding sampling locations to improve the estimate of BA, rather than selecting a smaller BAF\(^b\) to increase the number of measurement trees.
We can then construct a confidence interval for the mean AGB estimate using the approximate degrees of freedom formula in (13.59) and the interval expression in (13.60).
BA_df <- BA$n - 1
VBAR_df <- VBAR$m - 1
df <- (BA$se_percent^2 + VBAR$se_percent^2)^2 /
  (BA$se_percent^4 / BA_df + VBAR$se_percent^4 / VBAR_df)
t <- qt(p = 1 - 0.05 / 2, df = df)
c(agb_Mg_ha - t*se_agb_Mg, agb_Mg_ha + t*se_agb_Mg)
#> [1] 280.78 444.60
We see that Big BAF yields a mean AGB of 362.69 Mg/ha with a 95% confidence interval from 280.78 to 444.60 Mg/ha, which covers the true value of 341.17 Mg/ha.
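If a forest-level total is of interest, the per-hectare estimate and its standard error can be scaled by the 35 ha forest area, following the note at the end of Section 13.4.4.3.1. A minimal sketch using the objects computed above:
# Scale the per-hectare AGB estimate and its standard error to the 35 ha forest.
forest_ha <- 35
forest_ha * agb_Mg_ha # Total AGB (Mg).
forest_ha * se_agb_Mg # Standard error of the total (Mg).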
The code below shows what we would get using a traditional cruise that measures every “in” tree using BAF\(^s\). These data are held in "datasets/BigBAF/HF_BAF_s_measure.csv"
.
measurement_trees <- read_csv("datasets/BigBAF/HF_BAF_s_measure.csv")
c <- pi / (10000 * 4)
measure_all <- measurement_trees %>%
mutate(TF = ifelse(tree_count != 0, BAF_small/(c*dbh_cm^2), 0)) %>%
group_by(point_id) %>%
summarize(agb_Mg = sum(TF*tree_count*agb_Mg), # (Mg/ha per point).
m_i = sum(tree_count), # (trees per point).
BA = m_i*BAF_small) %>% # (m^2/ha per point).
summarize(n = n(), # Number of points.
mean = mean(agb_Mg), # (Mg/ha).
se = sd(agb_Mg)/sqrt(n), # (Mg/ha)
se_percent = 100*se/mean) # Std. err. as % of mean.
agb_Mg_ha <- measure_all$mean
se_agb_Mg_percent <- measure_all$se_percent
agb_Mg_ha
#> [1] 349.77
se_agb_Mg_percent
#> [1] 8.3171
The Big BAF approach offers a clear tradeoff: by measuring 145 fewer trees (the difference between the number of BAF\(^s\) and BAF\(^b\) “in” trees), we reduce field effort and associated costs. In exchange, we accept a modest increase in uncertainty—with percent standard error rising from 8.32 to 11.17, or standard error in AGB units from 29.09 to 40.52.
13.4.4.4 3P sampling
Probability proportional to prediction (3P) sampling directs detailed measurements toward the most influential units in a population. It begins by making a quick, inexpensive prediction for every unit—something easy to obtain, like an ocular estimate or a simple field measurement. Units with larger predictions are more likely to be selected for full measurement, focusing effort on those expected to have the greatest influence on population parameter estimates. Predictions and measurements are combined using a ratio estimator that accounts for the unequal selection probabilities, yielding an efficient design that requires relatively few measurements but still produces accurate, stable estimates under reasonable conditions.
Classical 3P sampling was introduced by Lewis R. Grosenbaugh in the 1960s (see, e.g., Grosenbaugh (1965)) and remains the most widely applied version of the method. While originally developed for estimating timber volume, 3P is useful in any setting where the variable of interest is costly to measure but can be predicted with reasonable accuracy. We’ll use total volume to motivate the discussion, but the same logic applies to other forest parameters. Classical 3P sampling involves the following steps.
Predict. Visit every tree and record a quick prediction \(x\) that scales with the variable of interest \(y\). Here, \(x\) is the covariate. This might be an ocular estimate of volume or a simple dimensional measure like DBH or height. The goal isn’t precision, but consistency—\(x\) should be roughly proportional (or transformable to be proportional) to the measurement \(y\).
Select and measure. Use the predictions to assign each tree a selection probability proportional to \(x\). Select a subsample accordingly, and for each selected tree, collect the measurement \(y\).
Estimate. Combine the paired \(x\) and \(y\) values using a mean-of-ratios estimator. This yields an estimate of the population-level ratio between the measurement and the prediction. Scaling this ratio by the known mean or total of \(x\) (recorded for all units) gives an estimate of the mean or total of \(y\). The estimator naturally accounts for the unequal selection probabilities.
Classical 3P is most effective when the variable of interest is expensive to measure, predictions are available for all trees, and the measurement-to-prediction ratio is relatively stable—even if the variable itself is highly variable. The design concentrates measurement effort on the most influential trees while still visiting every tree, making it especially well suited for small areas or situations where all trees must be assessed (e.g., for marking).
In these settings, 3P offers several advantages:
- straightforward implementation with minimal training;
- every tree is visited, avoiding boundary complications;
- high efficiency, often yielding small standard errors relative to effort;
- detailed data that support stand and stock tables by species and diameter class;
- better estimation for rare species.
Although our focus is on classical 3P, several useful variants exist. One example is point-3P sampling, which extends the idea of classical 3P to point sampling. At each point, trees determined to be “in” using the usual angle method (e.g., prism, angle gauge, or relascope) each receive a prediction \(x\) (e.g., DBH or height), which then determines the probability they’re measured for \(y\). It has all the advantages of classical 3P, but doesn’t require every tree in the population to receive a prediction. An overview of point-3P and related methods is provided by P. West (2011).
For a deeper treatment of 3P sampling, K. Iles (2003) provides a comprehensive coverage, including historical context, practical guidance on implementation, and common pitfalls with solutions. For a more theoretical perspective, see Gregoire and Valentine (2007), which situates 3P within the broader class of unequal-probability sampling designs.
The next section further develops the classical 3P design, detailing its implementation step-by-step.
13.4.4.4.1 Sampling design and implementation
General steps in implementing a 3P cruise are as follows.
- Determine the sample size and selection threshold.
- The sample size \(n\) refers to the number of trees that will receive detailed measurements of \(y\). As with SRS, \(n\) is typically chosen based on available resources or an allowable error. The SRS sample size formula from Section 11.3.6 can be used, with the variance (or CV) reflecting anticipated variability in the measurement-to-prediction ratio \(y / x\).
- To implement the selection rule, define a threshold \(\text{max}_z = \hat{t}_x / n\), where \(\hat{t}_x\) is a pre-cruise estimate of the population total of \(x\). The \(i\)-th tree is selected to measure \(y_i\) with probability \(x_i / \text{max}_z\).161 Choosing \(\text{max}_z\) this way yields an expected sample size of \(n\) under the 3P selection rule described below.
- Traverse the population and apply the 3P selection rule.
- Visit each tree in turn:
- record a quick prediction \(x_i\);
- draw a random number \(z_i\) from a uniform distribution \(U(0, \text{max}_z)\);
- if \(x_i \ge z_i\), select the tree for measurement and record \(y_i\); otherwise, retain only \(x_i\).
- Continue until all trees have been visited.
- The result is a complete set of predictions \(\{x_1, \dots, x_N\}\) and a subsample of about \(n\) trees with paired values \((x_i, y_i)\).
Estimation procedures for totals, means, and standard errors based on the 3P sample are presented in Section 13.4.4.4.2.
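To make the selection rule concrete, the short sketch below uses made-up predictions (not data from this chapter) to set the threshold \(\text{max}_z\) and apply the rule to every tree. The achieved sample size lands near the target only when the pre-cruise guess at the total of \(x\) is close to its true total.
# Sketch only: simulate the 3P selection rule for a made-up population.
set.seed(123)
N <- 500
x <- runif(N, min = 0.1, max = 3)   # Quick predictions (e.g., volume in m^3).
n_target <- 25                      # Desired number of measured trees.
t_x_guess <- 800                    # Pre-cruise guess at the total of x.
max_z <- t_x_guess / n_target       # Selection threshold.
z <- runif(N, min = 0, max = max_z) # One draw per tree visited.
selected <- x >= z                  # TRUE means measure y on this tree.
sum(selected)                       # Achieved sample size; close to n_target
                                    # only if t_x_guess is close to sum(x).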
The actual sample size achieved after the cruise depends on your initial estimate \(\hat{t}_x\). If this estimate is smaller than the true total \(\tau_x = \sum^N_{i=1}x_i\), you’ll likely end up with fewer than \(n\) sampled trees. If it’s larger, you’ll end up with more. A poor guess at \(\tau_x\) won’t bias your estimates—it only affects the final sample size.
K. Iles (2003) outlines a practical approach for ensuring that you hit the desired sample size. He suggests deliberately selecting a large \(\hat{t}_x\) that results in too many sampled trees—for example, using a value roughly twice the anticipated \(\tau_x\). As you work through the population, use a random mechanism to measure only half of the selected trees, and flag the others (which he calls “insurance trees”). At the end of the cruise, if your actual sample size is below target, you can return to the insurance trees and randomly select enough of them for measurement to meet your target \(n\).
As noted earlier, \(x\) should be proportional to \(y\), meaning we assume a relationship of the form \(y = Rx\), where \(R\) is a scalar to be estimated. This implies a linear relationship through the origin—when \(x = 0\), then \(y = 0\). For example, if \(y\) is tree volume in m\(^3\), a natural choice for \(x\) is also volume, possibly in different units, with any mismatch absorbed into the estimate of \(R\). More generally, \(x\) can be any quantity that scales proportionally with volume.
A good choice for \(x\) is DBH, for a few reasons. First, it’s often easier to give a quick and reasonably accurate guess of a tree’s DBH than its volume. Second, if DBH is recorded for every tree, we can produce detailed stand and stock tables by diameter class (and species, if recorded). This adds value beyond estimation of totals and means, especially in operational settings.
DBH, however, is not itself proportional to volume, so it must be transformed. One option is to use a DBH–volume table, like the one in Section 13.4.4.1, which requires height as an input but avoids the need to directly estimate volume in the field. If height is not available or practical to obtain, a DBH-based volume equation can be used instead. A simple form is \(\text{volume} = \alpha_0 \text{DBH}^{\alpha_1}\), where \(\alpha_0\) and \(\alpha_1\) are parameters taken from the literature or estimated from prior data. The exponent \(\alpha_1\) ensures that \(x\) and \(y\) are proportional, and \(\alpha_0\) scales the ratio so that \(R\) is approximately one. While this scaling isn’t necessary for valid estimation, it can make it easier to choose \(\text{max}_z\), since we usually have better intuition about total volume than about the total of DBH raised to a power.
If you read other references on 3P, you may come across the term “sure-to-be-measured” trees. In our notation, these are trees with predictions \(x_i\) exceeding the threshold \(\text{max}_z\). Given how we define \(\text{max}_z\), such trees are unlikely unless your pre-cruise estimate \(\hat{t}_x\) is far too low and/or the target sample size \(n\) is very large. When \(x_i > \text{max}_z\), the tree must be selected for measurement and flagged accordingly. These trees are treated as a separate stratum in which all units are measured (i.e., a census), so the standard error for that stratum is zero.162
In some cases—such as for high-value trees or rare species—you might choose to measure all occurrences and treat them as their own stratum. This ensures complete accounting for critical trees while still allowing 3P to be applied to the remainder of the population.
If you anticipate wanting to summarize results by category—such as species, diameter class, or product type—be sure to record those variables during the cruise. These additional data allow estimates to be broken down by group in addition to providing overall totals.
Once all trees have been visited and predictions and measurements collected, the next step is to estimate the population mean, total, and associated standard errors, as outlined in the next section.
13.4.4.4.2 Estimation using 3P
Once predictions \(x_i\) and a subset of paired measurements \(y_i\) have been collected, the goal is to estimate the population mean and total of \(y\). Because units with larger \(x_i\) are more likely to be selected, the measured \(y_i\) values don’t form a simple random sample and we can’t just average them. However, under the assumption that \(y\) is proportional to \(x\), the ratio \(y_i / x_i\) is approximately constant and unaffected by the 3P selection probabilities.
Following the notation from Section 13.4.2.2, we use the mean-of-ratios estimator, expressed as \[\begin{equation*} \hat{R} = \frac{1}{n}\sum_{i=1}^n r_i, \end{equation*}\] where \(r_i = y_i/x_i\) is the unit ratio.
Then, following (13.29) and (13.33), we multiply the estimated ratio \(\hat{R}\) by the known population mean of predictions, \(\mu_x = \tau_x / N\), to estimate the population mean and total \[\begin{equation*} \bar{y} = \hat{R} \mu_x, \end{equation*}\] \[\begin{equation*} \hat{t} = N \bar{y}. \end{equation*}\]
Standard errors, confidence intervals, and interpretation follow directly from the general mean-of-ratios case in Section 13.4.2.2, including use of the FPC. Because the sample is drawn from a known, finite population of size \(N\) (the list of \(N\) trees in the sampling frame), the standard error formulas should include the FPC.
Notice that once we have \(\hat{R}\), we could go directly to estimating the total from the total of predictions, i.e., \(\hat{t} = \hat{R} \tau_x\), which is perfectly valid. However, we’re following the estimators in Section 13.4.2.2, which are written to work up the total from the mean \(\mu_x\) instead.
13.4.4.4.3 Estimation using 3P by group
In many cases, you’ll want to summarize estimates by one or more categorical variables, such as species, DBH class, and/or stand. Even when you’re not specifically interested in stratified summaries, it may be that the relationship between predictions and measured values differs systematically across categories. In these cases, estimation accuracy and precision can often be improved by applying group-specific ratios.
To support such estimation, the relevant categorical variable(s) must be recorded for all trees. For example, if you want estimates by species, then species must be recorded for every tree in the population. The same applies for DBH class, stand, or any other grouping variable of interest.
Let the population be divided into \(H\) groups (e.g., species), with known group sizes \(N_h\) and mean predicted values \(\mu_{x,h}\) (both are known after the cruise). Within each group, using the 3P rule a subsample of \(n_h\) trees is measured to obtain paired \((x_{hi}, y_{hi})\) values. For each group \(h = 1, \dots, H\), we compute the sample mean of unit ratios \[\begin{equation} \hat{R}_h = \frac{1}{n_h} \sum_{i=1}^{n_h} r_{hi}, \end{equation}\] with \(r_{hi} = y_{hi} / x_{hi}\).
The estimated mean for group \(h\) is \[\begin{equation} \bar{y}_h = \hat{R}_h \mu_{x,h} \end{equation}\] and the corresponding total is \[\begin{equation} \hat{t}_h = N_h \bar{y}_h. \end{equation}\]
The sample variance of unit ratios within each group is \[\begin{equation} s_{r,h}^2 = \frac{1}{n_h - 1} \sum_{i=1}^{n_h} \left( r_{hi} - \hat{R}_h \right)^2. \end{equation}\]
The standard error of \(\bar{y}_h\) is estimated as \[\begin{equation} s_{\bar{y}_h} = \mu_{x,h} \sqrt{ \frac{s_{r,h}^2}{n_h} \left( 1 - \frac{n_h}{N_h} \right) } \end{equation}\] and the standard error for \(\hat{t}_h\) is \[\begin{equation} s_{\hat{t}_h} = N_h s_{\bar{y}_h}. \end{equation}\]
To obtain overall estimates, we apply the standard stratified estimators \[\begin{equation} \bar{y} = \frac{1}{N} \sum_{h=1}^H N_h \bar{y}_h, \end{equation}\] \[\begin{equation} \hat{t} = N \bar{y} = \sum_{h=1}^H \hat{t}_h. \end{equation}\]
The standard error of the overall mean is \[\begin{equation} s_{\bar{y}} = \sqrt{ \frac{1}{N^2} \sum_{h=1}^H N_h^2 s_{\bar{y}_h}^2 } \end{equation}\] and the standard error of the total is \[\begin{equation} s_{\hat{t}} = N s_{\bar{y}}. \end{equation}\]
This approach allows 3P sampling to accommodate known groupings in the population, deliver category-specific summaries, and potentially improve estimates of population parameters.
13.4.4.4.4 Applying 3P
We’ll demonstrate 3P using a simulated dataset. Although synthetic, the tree values are derived from allometric equations and reflect the type of data encountered in practice. A key advantage of using simulated data is that the true population parameters are known, allowing us to assess the estimator’s performance directly.
Our goal is to estimate the total volume of a mixed-species stand in coastal British Columbia, Canada. We also want species-specific estimates, as well as stand and stock tables by species and DBH class. Although unknown at the outset, the stand contains \(N = 1{,}000\) trees and two species.
For this analysis, the 3P cruise uses volume predicted from an ocular DBH estimate as the \(x\) covariate, with \(y\) also being volume. Both volume variables are expressed in cubic meters.
We begin by fitting a local DBH–volume equation using data from a mix of the anticipated species. In practice, these data might come from a pilot cruise of the stand or from a previous cruise in a similar stand. Once the DBH–volume equation is in place, we follow the 3P steps outlined in the previous section.
Our local DBH–volume equation takes the form \[\begin{equation} y = \alpha_0 \, x^{\alpha_1}, \tag{13.61} \end{equation}\] where \(y\) is volume (\(m^3\)), \(x\) is DBH (cm), and the parameters are estimated to be \(\alpha_0 = 0.00034\) and \(\alpha_1 = 2.3\).163
Based on similar stands and prior experience, our pre-cruise estimate of total volume is 2,000 \(m^3\). We’ll aim to collect \(n = 25\) measurements to estimate \(R\). Given this information, the selection threshold is \(\text{max}_z = 2000 / 25 = 80\).
We follow the 3P cruise steps.
Each tree in the population is visited, its species is recorded, and an ocular estimate of DBH (cm) is obtained. To speed the process, the ocular estimate places trees into 5 cm DBH classes. This DBH is used to compute a predicted volume using (13.61), which is also recorded.
A tree is selected for measurement if its predicted volume exceeds a random number drawn from \(U(0, 80)\). If selected, its volume is accurately measured and recorded.
Both the predicted volume (i.e., given the DBH prediction) and random number can be generated on a tablet or smartphone as each tree is visited.
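For a single tree, the tablet calculation might look like the sketch below, which plugs an illustrative DBH into (13.61) and compares the predicted volume against a draw from \(U(0, 80)\).
# Per-tree 3P calculation sketch; the DBH value is illustrative only.
a_0 <- 0.00034; a_1 <- 2.3         # Parameters from (13.61).
max_z <- 80                        # Selection threshold from above.
dbh_cm <- 45                       # Ocular DBH class recorded for the tree.
vol_m3_pred <- a_0 * dbh_cm^a_1    # Predicted volume, i.e., the covariate x.
z <- runif(1, min = 0, max = max_z)
vol_m3_pred >= z                   # TRUE means measure this tree's volume y.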
At the end of the cruise, we know \(N = 1000\) and have predicted DBH and volume for all trees. We also have a subsample of measured trees. These data are stored in "datasets/BC_3P/predicted.csv"
and "datasets/BC_3P/measured.csv"
, respectively. Shared columns in both files include tree_id
, species
, dbh_cm_predicted
, and vol_m3_predicted
. The measured file also contains vol_m3_measured
. These files are read below, and the 3P estimator is applied under the assumption of a common \(R\) across species.
predicted <- read_csv("datasets/BC_3P/predicted.csv")
measured <- read_csv("datasets/BC_3P/measured.csv")
predicted %>% glimpse()
#> Rows: 1,000
#> Columns: 4
#> $ tree_id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 1…
#> $ species <chr> "A", "A", "B", "A", "A", "B"…
#> $ dbh_cm_predicted <dbl> 52.5, 37.5, 37.5, 47.5, 52.5…
#> $ vol_m3_predicted <dbl> 2.7064, 1.2609, 1.2609, 2.15…
measured %>% glimpse()
#> Rows: 34
#> Columns: 5
#> $ tree_id <dbl> 4, 43, 54, 57, 66, 81, 93, 9…
#> $ species <chr> "A", "B", "B", "A", "B", "A"…
#> $ dbh_cm_predicted <dbl> 47.5, 42.5, 47.5, 57.5, 52.5…
#> $ vol_m3_predicted <dbl> 2.1564, 1.6752, 2.1564, 3.32…
#> $ vol_m3_measured <dbl> 2.1970, 2.2316, 2.7591, 3.25…
As you can see above, the measured
file has 34 rows, which is our actual sample size \(n\) at the end of the cruise (just a bit off from our target of 25). This might be our first clue that the pre-cruise estimate of total volume was a little low. But that’s okay. It doesn’t bias our estimates, we just ended up measuring a few more trees than anticipated, and our reward is a slightly larger sample to inform the estimates worked up in the code below.
We first compute the population mean of predicted volume, \(\mu_x\), then, within the summarize()
, we estimate the mean-of-ratios estimator \(\hat{R}\) (13.30), mean estimated volume \(\bar{y}\) (13.29), sample variance of the ratios \(s^2_r\) (13.32), standard error of the mean \(s_{\bar{y}}\) (13.31), and the associated estimates for the total volume \(\hat{t}\) and its standard error \(s_{\hat{t}}\) using (13.33) and (13.34), respectively.
# Known from cruise.
N <- nrow(predicted)
# Compute mu_x (and tau_x for fun).
mu_x <- mean(predicted$vol_m3_predicted)
tau_x <- sum(predicted$vol_m3_predicted)
# Estimates for volume.
measured %>%
summarize(
n = n(),
R_hat = mean(vol_m3_measured / vol_m3_predicted),
y_bar = R_hat * mu_x,
s_sq_r = var(vol_m3_measured / vol_m3_predicted),
s_y_bar = mu_x * sqrt(s_sq_r / n * (1 - n / N)),
t_hat = N * y_bar,
s_t_hat = N * s_y_bar,
t = qt(df = n - 1, p = 1 - 0.05 / 2),
t_ci_lower = t_hat - s_t_hat * t,
t_ci_upper = t_hat + s_t_hat * t,
) %>% select(n, R_hat, t_hat, s_t_hat, t_ci_lower, t_ci_upper)
#> # A tibble: 1 × 6
#> n R_hat t_hat s_t_hat t_ci_lower t_ci_upper
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 34 1.05 2653. 62.5 2526. 2780.
So, how’d we do? Our estimated total, with associated 95% confidence interval, is 2,653 (2,526, 2,780) cubic meters. With only \(n = 34\) measurements, we achieved a relative standard error of 2.4% for the estimated total (i.e., \(100 \cdot s_{\hat{t}} / \hat{t}\))!
Because these data are simulated, we know the true volume total is \(\tau_y = 2,590\) cubic meters. For this sample, our 95% confidence interval covers the true value, and the estimated total is remarkably close to the actual volume.
As for the difference between our target and actual sample size \(n\), we had initially guessed that \(\tau_x\) was 2000. After the cruise, we see that this was an underestimate—the actual population total of \(x\) is \(2525\) (from the tau_x
value in the code above). Again, this isn’t a problem; it just means we might end up with a few more measurement trees than intended, as we did here.
Now let’s work up species-specific estimates following the methods in Section 13.4.4.4.3. We simulated these data with two species, labeled A and B. Figure 13.12 shows the relationship between predicted and measured volume by species for the \(n = 34\) trees used to estimate ratios. For exploratory analysis, we added best-fit regression lines by species. The plot suggests that species A and B have different ratios, which in turn suggests we might improve estimation by using species-specific ratios.164
FIGURE 13.12: Predicted versus measured volume by species for the simulated 3P dataset. Lines are fit using least squares regression constrained to pass through the origin and included for visual exploration.
We’ll implement the estimator in steps, starting with species-specific summaries of the predictions, measurements, and estimates. We’ll then apply the stratified estimator to combine these summaries and produce stand-level estimates of the mean and total.
pred_summary <- predicted %>%
group_by(species) %>%
summarize(
N_h = n(),
mu_xh = mean(vol_m3_predicted),
tau_xh = sum(vol_m3_predicted)
)
pred_summary
#> # A tibble: 2 × 4
#> species N_h mu_xh tau_xh
#> <chr> <int> <dbl> <dbl>
#> 1 A 734 2.62 1920.
#> 2 B 266 2.27 605.
From the prediction summary, we see that species A dominates the stand, with roughly three times as many stems (\(N_h\)) and three times the volume (\(\tau_{xh}\)) as species B. We’ll use \(N_h\) and \(\mu_{xh}\) in the subsequent estimation, using the measured subset to compute species-specific ratios and associated variance estimates.
measure_summary <- measured %>%
group_by(species) %>%
summarize(
n_h = n(),
R_hat_h = mean(vol_m3_measured / vol_m3_predicted),
s_sq_rh = var(vol_m3_measured / vol_m3_predicted),
)
measure_summary
#> # A tibble: 2 × 4
#> species n_h R_hat_h s_sq_rh
#> <chr> <int> <dbl> <dbl>
#> 1 A 25 0.975 0.00336
#> 2 B 9 1.26 0.0103
Estimates for species-specific \(\hat{R}_h\) suggest that the relationship between predicted and measured volume varies by species—our predictions tend to overestimate volume for species A and underestimate it for species B. Again, this is okay as long as the over- or underestimation is proportional across the observed pairs of \(x\) and \(y\) values—and, by assumption, across the full population.
For example, \(\hat{R}\) for species B is 1.26, so if we predict volumes of 1, 2, 3, and 4 cubic meters, we expect measured values of 1.26, 2.53, 3.79, and 5.05 cubic meters, respectively. In other words, \(y\) and \(x\) are proportional. This relationship is clear for species B in the exploratory plot shown in Figure 13.12, and it supports our assumption about the linear relationship between predicted and measured volume.
Below are the species-specific estimates, followed by stand-level means and totals obtained by combining across strata. We begin by joining the species population information to measure_summary
.
measure_summary <- left_join(measure_summary,
pred_summary,
by = join_by(species)
)
# Species-specific estimates.
measure_summary %>%
mutate(
y_bar_h = R_hat_h * mu_xh,
s_y_bar_h = mu_xh * sqrt(s_sq_rh / n_h * (1 - n_h / N_h)),
t_hat_h = N_h * y_bar_h,
s_t_hat_h = N_h * s_y_bar_h,
t_h = qt(df = n_h - 1, p = 1 - 0.05 / 2),
t_ci_lower_h = t_hat_h - s_t_hat_h * t_h,
t_ci_upper_h = t_hat_h + s_t_hat_h * t_h,
) %>%
select(n_h, t_hat_h, s_t_hat_h, t_ci_lower_h, t_ci_upper_h)
#> # A tibble: 2 × 5
#> n_h t_hat_h s_t_hat_h t_ci_lower_h t_ci_upper_h
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 25 1871. 21.9 1826. 1916.
#> 2 9 764. 20.1 717. 810.
Using a sample size of \(25\) and \(9\) measurements for species A and B, respectively, we achieved RSE values of 1.2% and 2.6% for the corresponding species totals. As defined in (13.48), the RSE is computed as \(100 \cdot s_{\hat{t}_h} / \hat{t}_h\).
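As a quick arithmetic check using the rounded values printed above:
100 * c(A = 21.9, B = 20.1) / c(1871, 764) # RSE (%) by species, about 1.2 and 2.6.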
Our estimated totals, with associated 95% confidence intervals, are 1,871 (1,826, 1,916) and 764 (717, 810) cubic meters for species A and B, respectively. The true volume totals are A = 1,851 and B = 740 cubic meters.
Now we’ll estimate the stand-level mean and total by combining the species-specific results using their population weights (again following estimators in Section 13.4.4.4.3).
measure_summary %>%
mutate(
y_bar_h = R_hat_h * mu_xh,
s_y_bar_h = mu_xh * sqrt(s_sq_rh / n_h * (1 - n_h / N_h)),
) %>%
summarize(
N = sum(N_h),
y_bar = 1 / N * sum(N_h * y_bar_h),
s_y_bar = sqrt(1 / N^2 * sum(N_h^2 * s_y_bar_h^2)),
t_hat = N * y_bar,
s_t_hat = N * s_y_bar,
t = qt(df = sum(n_h) - n(), p = 1 - 0.05 / 2),
t_ci_lower = t_hat - s_t_hat * t,
t_ci_upper = t_hat + s_t_hat * t
) %>%
select(t_hat, s_t_hat, t_ci_lower, t_ci_upper)
#> # A tibble: 1 × 4
#> t_hat s_t_hat t_ci_lower t_ci_upper
#> <dbl> <dbl> <dbl> <dbl>
#> 1 2635. 29.7 2575. 2696.
The species-specific stratified estimator produced an RSE of 1.1% for the total volume. The estimated total volume, with a 95% confidence interval, is 2,635 (2,575, 2,696) cubic meters. Recall, again, that the true volume total is \(\tau_y = 2,590\) cubic meters.
In the first analysis, we ignored potential species differences and estimated a single pooled ratio, which yielded an RSE of 2.4%. We then allowed for species-specific ratios and applied the stratified estimator, which substantially reduced the RSE to 1.1%—less than half the pooled-ratio value. This demonstrates the value of using category-specific ratios when there are indications of differences (e.g., as seen in exploratory analysis) and enough observations to support separate estimation.
13.4.5 Advantages and disadvantages of estimation with covariates
Estimation with covariates improves the efficiency of inference from both a statistical and field-effort perspective by incorporating auxiliary information—for example, remotely sensed data, ocular estimates, or other quick field measures. Below, we summarize key advantages and disadvantages of this approach.
13.4.5.1 Advantages
Improved precision: When the covariate is strongly related to the variable of interest, ratio or regression estimators can substantially reduce standard errors compared to estimators that ignore the covariate—often achieving better precision with the same or even smaller sample size.
Statistical and practical efficiency: Leveraging inexpensive covariate data allows more precise estimates with less field effort, improving cost-effectiveness without sacrificing reliability.
Flexible and widely applicable: Covariate-based estimators can be combined with other sampling designs such as systematic and stratified, and are especially well suited to two-phase sampling frameworks.
Supports small-area and category-specific inference: When covariates are available for the full population or a large phase-1 sample, estimates can be extended to subregions, species groups, or other categories without requiring additional measurements.
Variety of estimator options: Ratio and regression estimators provide flexible ways to incorporate covariates, depending on the form and strength of the relationship with the variable of interest.
13.4.5.2 Disadvantages
Assumptions: These estimators rely on a linear or proportional relationship between the covariate and the variable of interest. If this assumption is weak or violated, inference may be biased or less precise.
Potential lack of design unbiasedness: Regression estimators are generally not design-unbiased and rely on model assumptions for validity, while ratio estimators may be design-unbiased under some designs.
Increased analytical complexity: Selecting the appropriate estimator, checking assumptions, and diagnosing model fit require more statistical expertise and judgement than simpler design-based approaches.
Sensitivity to covariate quality: Weak or poorly measured covariates may provide little gain in precision and can mislead estimation if their uncertainty is not accounted for.
Risk of extrapolation: Estimators that rely on covariates can produce misleading results if applied outside the range of observed data, particularly when the assumed relationship is weak or poorly supported by the data.
13.5 Cluster sampling
Cluster sampling is commonly used in forest inventory, especially when it’s impractical or costly to sample widely dispersed locations. In this design, the primary sampling unit (PSU) is a cluster of smaller secondary sampling units (SSUs), where measurements are taken. The SSUs are nested within PSUs, introducing additional design considerations compared to the sampling designs introduced earlier.
Cluster sampling is particularly useful when:
- the cost of reaching widely dispersed sampling locations is high;
- operational constraints require grouping sampling locations for efficient data collection;
- measurement of a sampling location is inexpensive relative to travel costs;
- or the sampling frame is only available at the PSU level.
These conditions are common in forest inventory, where measurement costs are low relative to the cost of moving a field crew to a sampling location. Cluster designs help reduce travel time by grouping nearby SSUs into the same PSU, allowing crews to sample multiple locations with minimal movement. This approach is especially useful when sampling large areas. For example, cluster designs are widely used in national forest inventory (NFI) programs, where travel costs are high and field crews must cover broad regions efficiently.
The tradeoff, however, is a potential loss of statistical efficiency. Due to spatial autocorrelation—the tendency for observations located close together to be more similar than those farther apart—SSUs in the same PSU often exhibit less variability than single locations sampled independently across space, such as the SYS designs illustrated in Section 13.2. As a result, cluster sampling typically produces less precise estimates than an SRS of the same number of SSUs. This is a classic cost-precision tradeoff: cluster sampling may reduce field effort but increase sampling error.
13.5.1 Clusters, strata, and systematic sampling
Although clusters (PSUs) and strata both involve dividing the population into groups, their purpose and implementation differ in important ways. In stratified sampling, the population is partitioned into non-overlapping subgroups (strata) using known auxiliary variables, and sampling is conducted independently within each stratum. The goal is to increase precision by concentrating sampling effort within relatively homogeneous subgroups.
In cluster sampling, the goal is not to improve statistical efficiency directly, but to simplify or reduce the cost of data collection. PSUs are often based on geographic proximity or field logistics, not auxiliary variables. A subset of PSUs is selected (typically via SRS or SYS), and measurements are collected from all or some SSUs within each selected PSU.
The contrast can be summarized as follows.
Stratification aims to reduce within-group variability to improve estimation precision.
Clustering often results in within-group similarity, which can reduce precision but improve feasibility—by reducing travel time and logistical burden, particularly when sampling dispersed locations in large or remote areas.
While their motivations differ, both cluster and stratified designs are valid and often used together—for example, in stratified cluster sampling, where PSUs are selected independently within each stratum.
SYS, as described in Section 13.2, can be viewed as a form of cluster sampling with a single PSU made up of \(n\) evenly spaced SSUs. This perspective reinforces why SYS does not have a design-based variance estimator (Section 13.2.5)—with only one randomly selected PSU, there is no way to estimate between-PSU variability.
13.5.2 Elements of cluster sampling design
When implementing a cluster sampling design, there are several key design elements to consider:
Cluster layout and spacing. PSUs may be naturally defined or imposed by the analyst. In the latter case, PSUs are often laid out in a regular pattern, such as a systematic grid. The distance between PSUs affects both cost and precision. Widely spaced PSUs typically reduce redundancy due to spatial autocorrelation, but increase travel effort.
Cluster shape and SSU configuration. SSUs within a PSU are typically arranged in a fixed geometric pattern around an anchor point. Common layouts include lines, triangles, squares, and L-shapes. This configuration influences how well the PSU captures local variability and affects edge effects, travel efficiency, and measurement consistency. SSU size can be scaled to efficiently measure different variables, and nested configurations (e.g., for regeneration sampling) can be used when needed.
Number of clusters and SSUs per cluster. The total number of observed SSUs is the product of the number of PSUs and the number of SSUs per PSU. Increasing the number of PSUs generally improves statistical efficiency more than increasing the number of SSUs within each PSU, since PSUs are more likely to be spatially independent. However, logistical considerations may favor sampling more SSUs per PSU to reduce costs.
Method of selecting clusters. PSUs can be selected using any valid probability sampling method. SRS and SYS are most common. If auxiliary information is available, methods such as stratified or PPS sampling may be used to improve precision while maintaining logistical feasibility.
These elements interact to determine the balance between statistical precision and operational cost. In forest inventory, the goal is often to design a layout that captures sufficient spatial variability while remaining cost-effective and logistically manageable.
13.5.3 Choosing the number of PSUs and SSUs
Unlike stratified sampling, where well-established formulas exist for allocating sample sizes across strata, sample size planning in PSU designs is more heuristic and context-specific. The main tradeoff is between the number of PSUs and the number of SSUs per PSU. Adding SSUs within a PSU improves precision only up to a point—when observations are spatially correlated, the marginal benefit of additional SSUs diminishes. Increasing the number of PSUs is usually more effective for reducing sampling error, since PSUs tend to be more independent.
In most forest inventory settings, these decisions are shaped by cost and logistics. Analytical expressions for optimal allocation exist but require estimates of within- and between-PSU variability, which are rarely available before sampling begins. Some forest inventory programs use simulation to explore tradeoffs under realistic spatial and operational constraints. Additional discussion on cluster sampling design can be found, e.g., in Freese (1962), S. L. Lohr (2019) Section 5.5, and Cochran (1977) Section 9.6.
There are many variations of cluster sampling in both design and estimation. This section focuses on single-stage cluster sampling, where all SSUs within each selected PSU are measured. In many settings, however, measuring every SSU is cost-prohibitive or unnecessary. In such cases, only a subset of SSUs is observed—a design referred to as two-stage (or multistage) cluster sampling, which is covered in Section 13.6.
Here, we focus on a common and relatively simple form of single-stage cluster sampling: selecting PSUs with equal probability and measuring all SSUs within each selected PSU. This provides a useful baseline for understanding and implementing cluster-based designs in forest inventory.
13.5.4 Estimation using cluster sampling
In forest inventory, the total number of SSUs in the population is often unknown. A natural estimator in this case is the cluster ratio estimator, which divides the total measurement across all sampled SSUs by the total number of SSUs observed. It’s especially useful when SSU counts vary across PSUs. The estimator is flexible, easy to implement, and well suited to field conditions. Some bias may occur if SSU counts are related to the measured variable, but this is usually small in practice.
We use the following notation:
- \(N\) number of PSUs in the population;
- \(n\) number of PSUs in the sample;
- \(m_j\) number of SSUs in the \(j\)-th sampled PSU;
- \(\bar{m} = \frac{1}{n} \sum_{j=1}^n m_j\) mean number of SSUs per PSU;
- \(y_{ji}\) value for the \(i\)-th SSU in the \(j\)-th PSU;
- \(y_j = \sum_{i=1}^{m_j} y_{ji}\) total for the \(j\)-th PSU.
The cluster ratio estimator for the population mean \(\mu\) is
\[\begin{equation}
\bar{y} = \frac{\sum_{j=1}^n y_j}{\sum_{j=1}^n m_j}.
\tag{13.62}
\end{equation}\]
Under a finite population, the standard error of \(\bar{y}\) is
\[\begin{equation}
s_{\bar{y}} = \sqrt{ \left(1- \frac{n}{N} \right) \frac{1}{n \bar{M}^2} \frac{ \sum_{j=1}^n (y_j - \bar{y} m_j)^2 }{n - 1} },
\tag{13.63}
\end{equation}\]
where \(\bar{M}\) is the average number of SSUs per PSU in the population. If \(\bar{M}\) is unknown, we use the sample average \(\bar{m}\).
As in previous sampling sections, when working from an areal sampling frame, the finite population correction is omitted and the standard error of \(\bar{y}\) becomes \[\begin{equation} s_{\bar{y}} = \sqrt{ \left( \frac{1}{n \bar{M}^2} \right) \frac{ \sum_{j=1}^n (y_j - \bar{y} m_j)^2 }{n - 1} }, \tag{13.64} \end{equation}\] where, again, if \(\bar{M}\) is not known, it can be replaced with \(\bar{m}\).
Confidence intervals for the mean follow the usual form \[\begin{equation} \bar{y} \pm t_{df, 1 - \frac{\alpha}{2}} s_{\bar{y}}, \tag{13.65} \end{equation}\] with degrees of freedom \(df = n - 1\).
The degrees of freedom reflect the number of sampled PSUs, not SSUs, since PSUs are the units selected by the sampling design. SSUs within a PSU are not independent, so inference must be based on the number of independent PSUs.
If the number of SSUs per PSU is constant, these estimators simplify. But it’s usually better to write code for the more general case, which handles varying SSU counts without much additional effort.
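As a sketch of that more general case, the helper below bundles (13.62), (13.64), and (13.65). It takes vectors of PSU totals \(y_j\) and SSU counts \(m_j\) and uses \(\bar{m}\) in place of \(\bar{M}\); the function name and example inputs are ours, not part of the chapter's datasets.
# Sketch: cluster ratio estimators (13.62), (13.64), and (13.65) given
# PSU totals y_j and SSU counts m_j; m_bar stands in for M_bar.
cluster_ratio_est <- function(y_j, m_j, alpha = 0.05) {
  n <- length(y_j)
  m_bar <- mean(m_j)
  y_bar <- sum(y_j) / sum(m_j)
  s_y_bar <- sqrt(1 / (n * m_bar^2) *
                    sum((y_j - y_bar * m_j)^2) / (n - 1))
  t_crit <- qt(p = 1 - alpha / 2, df = n - 1)
  c(y_bar = y_bar, s_y_bar = s_y_bar,
    ci_lower = y_bar - t_crit * s_y_bar,
    ci_upper = y_bar + t_crit * s_y_bar)
}
# Example call with made-up PSU summaries.
cluster_ratio_est(y_j = c(12, 9, 15, 7), m_j = c(3, 2, 3, 3))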
13.5.5 Stratified cluster sampling
Cluster sampling can be extended to a stratified framework, resulting in stratified cluster sampling. This is useful when both logistical constraints and known auxiliary information are present. As described in Section 13.3.1, the first step is to divide the population into \(H\) non-overlapping strata based on variables expected to influence the parameter of interest. A separate cluster sampling design is then implemented independently within each stratum using the following steps.
Select \(n_h\) PSUs in each stratum using a probability method such as SRS or SYS. At least two PSUs are needed to compute a sample variance.
Within each sampled PSU \(j = 1, \dots, n_h\), measure \(m_{hj}\) SSUs and compute the stratum mean \(\bar{y}_h\) (13.62) and its standard error \(s_{\bar{y}_h}\) (13.63) or (13.64).
Once stratum-level estimates have been computed, the population mean and its standard error are calculated as \[\begin{equation} \bar{y} = \frac{\sum_{h=1}^H N_h \bar{y}_h}{N} \tag{13.66} \end{equation}\] and \[\begin{equation} s_{\bar{y}} = \sqrt{ \frac{ \sum_{h=1}^H N_h^2 s_{\bar{y}_h}^2 }{N^2} }. \tag{13.67} \end{equation}\]
Here, \(N_h\) is the population size in stratum \(h\), and \(N = \sum_{h=1}^H N_h\). In a traditional finite population setting, these are counts of PSUs. In areal sampling, \(N\) and \(N_h\) may represent area, provided that \(\bar{y}_h\) is expressed per unit area (e.g., per acre or hectare).
Confidence intervals for the population mean should use \(n - H\) degrees of freedom, where \(n\) is the total number of PSUs sampled and \(H\) is the number of strata. As in standard cluster sampling, the degrees of freedom reflect the number of PSUs sampled, not the number of SSUs.
Stratified cluster sampling is widely used in forest inventory because it combines the operational benefits of clustering with the statistical efficiency of stratification. Different strata can use different sampling intensities, SSU layouts, or even PSU configurations—what matters is that estimation is carried out separately within each stratum and then combined using the stratified estimators.
13.5.6 Illustrative analyses
13.5.6.1 Cluster sampling
To illustrate cluster sampling, we return to the Harvard Forest dataset. This is again a toy-like example, but it allows us to walk through the full workflow using real data and compare sample-based estimates to the true population values.
The objective of the cruise is to estimate average AGB per hectare and total AGB for the entire forest. The population of interest includes all stems with DBH \(\ge\) 12.7 cm.
We use systematic sampling to select the locations of our PSUs. Each PSU consists of three fixed-area circular SSUs, positioned 10 meters from the PSU center at bearings of \(0^\circ\), \(135^\circ\), and \(225^\circ\). Each SSU has a radius of 5 meters and is used to tally all trees meeting the DBH threshold. See Section 12.3 for more on plot-based sampling.
As described in Section 13.2, we use the sys_grid()
function to generate a systematic grid of PSU centers across the forest. This function constructs the grid around a randomly chosen starting point. For this example, the spacing is 100 by 100 meters, yielding \(n = 35\) PSUs. These grid points serve as the anchor locations for each PSU.
FIGURE 13.13: Cluster sampling with primary sampling units (PSUs), i.e., clusters, placed using a systematic design in the Harvard Forest example.
The cruise tree data are stored in "datasets/HF/HF_cluster_sys.csv"
and read into ssu_trees
below. Use glimpse()
to review the columns and indexing. Notice the SSU ID is nested within the PSU ID, which has implications for how we group the data in subsequent calculations.
ssu_trees <- read_csv("datasets/HF/HF_cluster_sys.csv")
ssu_trees %>% glimpse()
#> Rows: 586
#> Columns: 7
#> $ psu_id <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2…
#> $ ssu_id <dbl> 1, 1, 2, 3, 3, 3, 3, 3, 1, 1, 2, 2…
#> $ tree_id <dbl> 1, 2, 1, 1, 2, 3, 4, 5, 1, 2, 1, 2…
#> $ tree_count <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ agb_Mg <dbl> 0.416741, 0.079753, 0.000000, 0.10…
#> $ dbh_cm <dbl> 24.9, 13.3, 0.0, 14.8, 24.9, 15.8,…
#> $ species <chr> "rubra", "rubra", "no trees", "gra…
Before proceeding with estimation, it’s good practice to confirm how the data are structured. The code below checks the number of PSUs and the number of SSUs per PSU.
# Number of PSUs.
ssu_trees %>%
  summarize(n = n_distinct(psu_id))
#> # A tibble: 1 × 1
#> n
#> <int>
#> 1 35
# Number of SSUs per PSU.
ssu_trees %>%
group_by(psu_id) %>%
summarize(m_j = n_distinct(ssu_id)) %>%
distinct(m_j)
#> # A tibble: 1 × 1
#> m_j
#> <int>
#> 1 3
As expected, each PSU contains three SSUs.
We’ll now walk through the estimation workflow using the estimators from Section 13.5.4. We begin by expanding AGB to a per-hectare basis using the tree factor for the 5-meter radius SSU plots. Then we compute SSU-level summaries. Since the SSU index is only unique within its PSU, we must group by both PSU and SSU in the code below.
# Add tree factor for the given plot area.
r <- 5 # SSU plot radius in meters.
ssu_trees <- ssu_trees %>%
mutate(TF = 10000 / (pi * r^2))
# Compute the SSU plot-level AGB estimates.
ssu_summary <- ssu_trees %>%
group_by(psu_id, ssu_id) %>%
summarize(y_ji = sum(TF * tree_count * agb_Mg))
Next, on our way to computing the overall mean, we summarize to the PSU level by calculating \(y_j\) and \(m_j\) for each PSU. Although ssu_summary
is already grouped by psu_id
, we explicitly redefine the grouping in the code below for clarity.
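# Summarize to the PSU level: y_j and m_j for each PSU.
psu_summary <- ssu_summary %>%
  group_by(psu_id) %>%
  summarize(
    y_j = sum(y_ji),
    m_j = n()
  )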
Given psu_summary
, we apply the estimators from (13.62) and (13.64) to compute the estimated population mean and its standard error, along with the associated 95% confidence interval. We then multiply by the total forest area to estimate forest-wide biomass at the end of the piped sequence.
N <- 35 # Forest area in hectares.
psu_summary %>%
summarize(
y_bar = sum(y_j) / sum(m_j),
n = n(),
m_bar = mean(m_j),
s_y_bar = sqrt(1 / (n * m_bar^2) *
sum((y_j - y_bar * m_j)^2) / (n - 1)),
t = qt(df = n - 1, p = 1 - 0.05 / 2),
y_ci_lower = y_bar - t * s_y_bar,
y_ci_upper = y_bar + t * s_y_bar
) %>%
select(y_bar, y_ci_lower, y_ci_upper) %>%
mutate(
t_hat = N * y_bar,
t_ci_lower = N * y_ci_lower, t_ci_upper = N * y_ci_upper
)
#> # A tibble: 1 × 6
#> y_bar y_ci_lower y_ci_upper t_hat t_ci_lower t_ci_upper
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 306. 253. 358. 10693. 8868. 12519.
The census-based AGB for the Harvard Forest is 341.17 Mg/ha, with a forest-wide total of 11,941 Mg. Our estimated mean and total, with 95% confidence intervals, were 305.53 (253.37, 357.69) Mg/ha and 10,693 (8,868, 12,519) Mg, respectively. Based on this sample, both confidence intervals successfully captured the true values.
13.5.6.2 Stratified cluster sampling
Next, let’s extend the cluster design to take advantage of the Harvard Forest strata introduced in Section 13.3.6 and shown in Figure 13.3.
We’ll use the same systematic sample of PSUs from the previous section. Now, with strata imposed, any SSU whose center falls in a different stratum than its PSU is removed. This is visible in Figure 13.14, where some PSUs near stratum boundaries have fewer than three SSUs.
One SSU's plot boundary extends beyond its stratum boundary, so a boundary slopover correction is applied. As in the stratified sampling illustration, we use the mirage correction: if an SSU overlaps the stratum boundary and a tree falls within its mirage correction region, that tree is counted twice by assigning it a tree_count
of two (see Section 12.3.3).
FIGURE 13.14: Stratified cluster sampling in the Harvard Forest example, with primary sampling units (PSUs), i.e., clusters, placed using a systematic design.
The cruise tree data are stored in "datasets/HF/HF_strat_cluster_sys.csv"
and read into ssu_trees
below. Use glimpse()
to review the columns and indexing. Notice that SSU IDs are nested within PSU IDs, which are in turn nested within stratum IDs. This nesting structure has implications for how we group the data in subsequent calculations.
ssu_trees <- read_csv("datasets/HF/HF_strat_cluster_sys.csv")
ssu_trees %>% glimpse()
#> Rows: 556
#> Columns: 8
#> $ psu_id <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2…
#> $ ssu_id <dbl> 1, 1, 2, 3, 3, 3, 3, 3, 1, 1, 2, 2…
#> $ tree_id <dbl> 1, 2, 1, 1, 2, 3, 4, 5, 1, 2, 1, 2…
#> $ stratum_id <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4…
#> $ tree_count <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ agb_Mg <dbl> 0.416741, 0.079753, 0.000000, 0.10…
#> $ dbh_cm <dbl> 24.9, 13.3, 0.0, 14.8, 24.9, 15.8,…
#> $ species <chr> "Quercus rubra", "Quercus rubra", …
Before proceeding with estimation, let’s examine the data structure by summarizing the number of PSUs and the number of SSUs per PSU within each stratum.
# Number of PSUs within each stratum.
ssu_trees %>%
group_by(stratum_id) %>%
summarize(n = n_distinct(psu_id))
#> # A tibble: 4 × 2
#> stratum_id n
#> <dbl> <int>
#> 1 1 2
#> 2 2 9
#> 3 3 3
#> 4 4 21
# Number of SSUs per PSU within each stratum.
ssu_trees %>%
group_by(stratum_id, psu_id) %>%
summarize(m_j = n_distinct(ssu_id)) %>%
distinct(m_j)
#> # A tibble: 8 × 2
#> # Groups: stratum_id [4]
#> stratum_id m_j
#> <dbl> <int>
#> 1 1 2
#> 2 1 3
#> 3 2 2
#> 4 2 3
#> 5 3 2
#> 6 3 3
#> 7 4 3
#> 8 4 2
As seen in Figure 13.14, the number of SSUs per PSU varies within some strata.
We’ll now walk through the estimation workflow using the estimators from Section 13.5.5. We begin by expanding AGB to a per-hectare basis using the tree factor for the 5-meter radius SSU plots. Next, we compute stratum-specific summaries at the SSU and then PSU levels to prepare for applying the stratified mean and standard error estimators (13.2) and (13.5), respectively. Finally, we use these results to compute a 95% confidence interval for the population mean.
# Add tree factor for the given plot area.
r <- 5 # SSU plot radius in meters.
ssu_trees <- ssu_trees %>%
mutate(TF = 10000 / (pi * r^2))
# Compute the SSU plot-level AGB estimates.
ssu_summary <- ssu_trees %>%
group_by(stratum_id, psu_id, ssu_id) %>%
summarize(y_hji = sum(TF * tree_count * agb_Mg))
Next, for each stratum, we summarize to the PSU level by calculating \(y_{hj}\) and \(m_{hj}\), and then to the stratum level by computing \(\bar{y}_h\) and \(s_{\bar{y}_h}\) via (13.62) and (13.64), respectively. Although ssu_summary
is already grouped by stratum_id
and psu_id
, we explicitly redefine the grouping in the code below for clarity.
stratum_summary <- ssu_summary %>%
group_by(stratum_id, psu_id) %>%
summarize(
y_hj = sum(y_hji),
m_hj = n()
) %>%
summarize(
y_bar_h = sum(y_hj) / sum(m_hj),
n_h = n(),
m_bar_h = mean(m_hj),
s_y_bar_h = sqrt(1 / (n_h * m_bar_h^2) *
sum((y_hj - y_bar_h * m_hj)^2) / (n_h - 1))
)
stratum_summary
#> # A tibble: 4 × 5
#> stratum_id y_bar_h n_h m_bar_h s_y_bar_h
#> <dbl> <dbl> <int> <dbl> <dbl>
#> 1 1 223. 2 2.5 3.58
#> 2 2 392. 9 2.89 57.0
#> 3 3 55.4 3 2.33 25.0
#> 4 4 314. 21 2.86 28.3
Given the stratum-specific estimates of the mean and its standard error, we apply the stratified estimators (13.66) and (13.67) in the code below.
stratum_summary <- left_join(stratum_summary, stratum_area,
by = "stratum_id"
)
# Estimate population mean and 95% CI.
stratum_summary %>%
summarize(
N = sum(N_h),
y_bar = sum(N_h * y_bar_h) / N,
s_y_bar = sqrt(sum(N_h^2 * s_y_bar_h^2) / N^2),
t = qt(df = sum(n_h - 1), p = 1 - 0.05 / 2),
ci_lower = y_bar - t * s_y_bar,
ci_upper = y_bar + t * s_y_bar
) %>%
select(y_bar, ci_lower, ci_upper)
#> # A tibble: 1 × 3
#> y_bar ci_lower ci_upper
#> <dbl> <dbl> <dbl>
#> 1 314. 266. 362.
The width of this 95% confidence interval is narrower than that estimated using unstratified cluster sampling in Section 13.5.6.1 (by about 9 Mg/ha). However, this could simply be due to chance, since the comparison is based on a single sample. To more reliably assess whether stratification improves efficiency in this setting, we turn to a repeated sampling analysis in the next section.
13.5.6.3 Comparing stratified and unstratified cluster sampling
Let’s compare coverage rates and average 95% confidence interval widths for the unstratified and stratified cluster sampling examples from the two preceding sections. We performed a similar assessment for stratified versus unstratified sampling in Sections 13.3.6.1 and 13.3.6.2. As before, we repeatedly sample from the Harvard Forest population to empirically evaluate the performance of each estimator.
In this case, we fix the total number of PSUs and use SYS to generate 1,000 independent samples. For each sample, we compute the mean AGB and its 95% confidence interval using both unstratified and stratified cluster estimators. This allows us to compare both coverage properties and the precision of resulting estimates.
We expect roughly 95% of confidence intervals to capture the true population mean, and, if stratification is effective, we should see narrower intervals on average. The results bear this out: the empirical coverage rates were 98.1% and 95.1% for the unstratified and stratified estimators, respectively. The average confidence interval widths were 100 and 89 (Mg/ha), again for the unstratified and stratified cases. The percent gain in precision from stratification is 11%, reflecting the reduction in average interval width achieved by incorporating strata.
You can reproduce these results using the following replicate estimate files located in the datasets/HF/
directory:
- clust_sys_agb_ests_1000reps.csv (unstratified),
- strat_clust_sys_agb_ests_1000reps.csv (stratified).
Interestingly, the gains in precision from stratification are similar to those observed in Section 13.3.6.2, despite the use of clustered PSUs with differing plot sizes and configurations. These gains highlight the informational value of effective stratification, independent of the specific sampling design.
13.5.7 Advantages and disadvantages of cluster sampling
Cluster sampling is widely used in forest inventory because it can substantially reduce the cost and complexity of field data collection. It’s especially useful when measurement is inexpensive but travel between sampling locations is costly—a common situation in large-area surveys. For these reasons, cluster designs are a core feature of many NFI programs.
13.5.7.1 Advantages
Operational efficiency: By grouping nearby SSUs into PSUs, cluster sampling reduces travel time and limits the need to move frequently between dispersed sampling locations.
Simplified logistics: Field crews can complete several measurements within a single PSU, minimizing the time and effort required to reposition across the landscape.
Aligned with operational structure: Many forest monitoring programs are built around geographic or administrative PSUs, making cluster sampling a natural choice.
13.5.7.2 Disadvantages
Reduced statistical efficiency: SSUs in the same PSU are often spatially autocorrelated and provide less independent information than SSUs sampled across the full population.
Smaller effective sample size: Because randomization occurs at the PSU level, the effective sample size reflects the number of PSUs rather than the total number of SSUs, which can reduce precision.
Despite these limitations, cluster sampling remains a valuable option when logistical constraints are high and moderate precision is acceptable. It often offers the best tradeoff between field efficiency and statistical rigor in large-area inventory.
13.6 Multistage sampling
Multistage sampling is a generalization of cluster sampling in which selection occurs in two or more successive stages. Rather than drawing a sample directly from the population, units are chosen in stages: larger primary sampling units (PSUs) are selected first, and smaller secondary sampling units (SSUs) are then sampled within them.
While multistage designs can have any number of stages, two-stage sampling is by far the most common. In this design the stages are:
a sample of PSUs is selected (e.g., stands, management units, or other areal units used to partition the population of interest);
within each sampled PSU, a further sample of SSUs is selected (e.g., plots, points, or some other sampling unit delineation).
This two-stage structure offers flexibility, allowing tailoring of sampling intensity within PSUs, accommodating operational constraints, and reducing travel costs by concentrating sampling effort.
13.6.1 Connection to cluster sampling
Two-stage sampling is closely related to cluster sampling. The cluster design introduced in Section 13.5 is a special case of two-stage sampling where PSUs are selected using SRS or SYS, and all SSUs within each selected PSU are measured.
In two-stage sampling, instead of measuring every SSU in each selected PSU, a probability sample of SSUs is taken at the second stage. This distinction is important: in cluster sampling, sampling error reflects only variation among PSUs; in two-stage sampling, both stages involve probabilistic selection and contribute to the total variance, and estimation must account for this hierarchical structure.
13.6.2 Why use multistage sampling?
Multistage sampling is often chosen because it aligns with how forests are managed and how fieldwork is organized. For example, it:
takes advantage of existing natural, management, or administrative units (e.g., stands, ownerships) as convenient PSUs;
reduces travel time and cost by allowing crews to work intensively within a few selected PSUs rather than dispersing effort across the entire area;
uses auxiliary data, such as stand area or previous inventory estimates, often available at the PSU level, to guide sampling;
allows strategic selection of PSUs using sampling methods that better capture variability in the population and the contribution of each PSU to the parameter of interest.
This design adapts easily to real-world conditions. The number of SSUs per PSU can be varied, different sampling methods can be used at each stage, and unequal probabilities can be applied to match sampling effort with variation in the population.
13.6.3 Strategies for selecting PSUs
The first step in two-stage sampling is selecting PSUs. Common strategies include the following.
Simple random sampling (SRS): if no prior information is available, draw a random sample of PSUs, giving each an equal selection probability.
Stratification: if PSU-level variables are available (e.g., area, expected volume, productivity class), stratify PSUs and sample within each stratum to improve precision (see Section 13.3.1).
Systematic sampling: if PSU-level variables are known, sort PSUs by one of these variables (e.g., area, expected volume) and apply systematic sampling with a random start to ensure broad representation.
Probability proportional to size (PPS): if each PSU has a known measure of size or importance (e.g., area, expected volume), select PSUs with probabilities proportional to that measure.
These approaches can be combined. For example, PSUs might be stratified by size, then selected using equal probability, systematic, or PPS methods within each stratum. Stratum-specific two-stage estimates for means, totals, and standard errors can then be combined as described in Section 13.5.5.
Availability of auxiliary data usually guides the choice of PSU selection method and two-stage estimator. Without auxiliary data, equal probability sampling is the default. When auxiliary variables are available, they can be used to reduce within-stratum variance or concentrate sampling in strata that contribute most to the parameter of interest. When a size-related variable is available and correlated with the parameter, PPS methods can improve efficiency by allocating sampling effort in proportion to each PSU’s contribution.
The choice of two-stage estimator also depends on the sampling approach at the second stage. Some estimators assume a finite population of SSUs, while others treat the second-stage population as effectively infinite—a view that aligns with point and plot sampling in forest inventory.
We present the estimators’ more general forms, allowing for PSUs of differing sizes and varying numbers of SSUs per PSU. When PSU sizes are equal or the number of SSUs is constant across PSUs, simplified versions of these estimators can be found in Cochran (1977), Thompson (2012), S. L. Lohr (2019), and other survey sampling references.
Because two-stage designs involve both between-PSU and within-PSU variation, the resulting formulas can appear long and a bit daunting. Treated as step-by-step recipes and paired with the tools in Section 10.2, however, they’re straightforward to implement. In the end, each estimator delivers a single mean and total, along with their standard errors.
13.6.4 Estimation using two-stage sampling
We use the following notation:
- \(N\) number of PSUs in the population;
- \(n\) number of PSUs in the sample;
- \(m_j\) number of sampled SSUs in the \(j\)-th sampled PSU;
- \(M_j\) number of SSUs in the \(j\)-th sampled PSU;
- \(M = \sum^N_{j = 1} M_j\) number of SSUs in the population;
- \(y_{ji}\) value for the \(i\)-th SSU in the \(j\)-th PSU;
- \(\bar{y}_j = \frac{1}{m_j} \sum_{i=1}^{m_j} y_{ji}\) sample mean within PSU \(j\);
- \(s_j^2 = \frac{1}{m_j - 1} \sum_{i=1}^{m_j} (y_{ji} - \bar{y}_j)^2\) sample variance within PSU \(j\).
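In code, the within-PSU quantities \(m_j\), \(\bar{y}_j\), and \(s^2_j\) come from a single group_by() and summarize() pass over SSU-level data. Here’s a minimal sketch using a small made-up data frame; ssu_data, psu_id, and y_ji are illustrative names, not objects from this chapter’s datasets.
library(dplyr)

# Made-up SSU-level data: 4 sampled PSUs, each with a handful of plot (SSU)
# values y_ji (illustrative only).
set.seed(13)
ssu_data <- data.frame(
  psu_id = rep(1:4, times = c(5, 3, 6, 4)),
  y_ji = rnorm(18, mean = 30, sd = 8)
)

# Per-PSU summaries used by the two-stage estimators.
ssu_data %>%
  group_by(psu_id) %>%
  summarize(
    m_j = n(),            # number of sampled SSUs in PSU j
    y_bar_j = mean(y_ji), # within-PSU sample mean
    s2_j = var(y_ji)      # within-PSU sample variance
  )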
13.6.4.1 Sampling PSUs with equal probability
An unbiased estimator for the population total \(\tau\) is \[\begin{equation} \hat{t} = \frac{N}{n}\sum^n_{j=1}M_j\bar{y}_j. \tag{13.68} \end{equation}\]
Given \(\hat{t}\), the population mean \(\mu\) is estimated as
\[\begin{equation}
\bar{y} = \frac{\hat{t}}{M}.
\tag{13.69}
\end{equation}\]
The standard error for \(\hat{t}\) is estimated using
\[\begin{equation}
s_{\hat{t}} = \sqrt{
\underbrace{\frac{N^2}{n}\left(1 - \frac{n}{N}\right)s^2}_{\text{between-PSU variance}}
+
\underbrace{\frac{N}{n} \sum_{j = 1}^n \frac{M_j^2}{m_j}\left(1 - \frac{m_j}{M_j}\right)s^2_j}_{\text{within-PSU variance}}
},
\tag{13.70}
\end{equation}\]
where \(s^2\) is the among-PSU sample variance, estimated as
\[\begin{equation}
s^2 = \frac{1}{n - 1} \sum_{j = 1}^n \left(M_j \bar{y}_j - \frac{\hat{t}}{N}\right)^2.
\end{equation}\]
Given \(s_{\hat{t}}\), the standard error for \(\bar{y}\) is
\[\begin{equation}
s_{\bar{y}} = \frac{s_{\hat{t}}}{M}.
\tag{13.71}
\end{equation}\]
Equation (13.70) shows how total variance is partitioned into two components.
The first term (between-PSU variance) captures variation among PSU totals. It includes a finite population correction (FPC) because PSUs are sampled without replacement from a finite set. That is, the number of PSUs in the population is known and countable (see Section 11.3).
The second term (within-PSU variance) captures variation among SSUs within each PSU. SSUs are also sampled without replacement from a finite set, so this term includes an FPC as well.
This estimator is unbiased for the population total and mean, as well as their variances.
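To make the recipe concrete, (13.68) through (13.71) can be written as a short dplyr summary. The sketch below assumes a data frame psu_data with one row per sampled PSU and columns M_j, m_j, y_bar_j, and s2_j (matching the notation above); the function name and arguments are illustrative rather than part of this chapter’s workflow.
library(dplyr)

two_stage_unbiased <- function(psu_data, N, M) {
  n <- nrow(psu_data)
  psu_data %>%
    summarize(
      t_hat = N / n * sum(M_j * y_bar_j),                    # (13.68)
      y_bar = t_hat / M,                                     # (13.69)
      s2 = 1 / (n - 1) * sum((M_j * y_bar_j - t_hat / N)^2), # among-PSU variance
      s_t_hat = sqrt(
        N^2 / n * (1 - n / N) * s2 +                          # between-PSU term
          N / n * sum(M_j^2 / m_j * (1 - m_j / M_j) * s2_j)   # within-PSU term
      ),                                                     # (13.70)
      s_y_bar = s_t_hat / M                                  # (13.71)
    )
}
Given such a psu_data frame, the function returns the estimated total, mean, and their standard errors in a single row.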
If the number of SSUs within selected PSUs is not known or is assumed infinite—as in areal sampling with points or plots—the ratio estimator defined next can be used. In many cases, even when SSUs are assumed finite and counts are known, the ratio estimator provides more stable or efficient estimates than the unbiased estimator defined above.
Although the ratio estimator presented below is biased, the bias is often small and represents a reasonable tradeoff for the gains in accuracy and precision it can provide. We’ll quantify this bias using an applied example in Section 13.6.5.
The population mean \(\mu\) is estimated using \[\begin{equation} \bar{y} = \frac{\sum^n_{j=1}M_j\bar{y}_j}{\sum^n_{j=1}M_j}, \tag{13.72} \end{equation}\] and its standard error is estimated as
\[\begin{equation} s_{\bar{y}} = \bar{y}\sqrt{\left(\frac{n}{n-1}\right)\left(\frac{\sum^n_{j=1}M_j^2}{\left(\sum^n_{j=1}M_j\right)^2} + \frac{\sum^n_{j=1}\left(M_j\bar{y}_j\right)^2}{\left(\sum^n_{j=1}M_j\bar{y}_j\right)^2} - \frac{2\sum^n_{j=1}M_j^2\bar{y}_j}{\left(\sum^n_{j=1}M_j\right)\left(\sum^n_{j=1}M_j\bar{y}_j\right)}\right)\left(1-\frac{n}{N}\right)}. \tag{13.73} \end{equation}\]
Importantly, for the ratio estimator, \(M_j\) can represent a measure of PSU size or relative size; it doesn’t need to be the actual count of SSUs in the PSU.
Given the sample mean and its standard error, the total and its standard error are computed as
\[\begin{equation}
\hat{t} = M\bar{y}
\tag{13.74}
\end{equation}\]
and
\[\begin{equation}
s_{\hat{t}} = M s_{\bar{y}},
\tag{13.75}
\end{equation}\]
where again, \(M = \sum^N_{j=1} M_j\) is the sum of PSU size measures across all PSUs—either actual SSU counts or relative measures of PSU size.
For both the unbiased and ratio estimators, confidence intervals for the mean and total follow the usual form \[\begin{align*} \bar{y} \pm t_{df, 1 - \frac{\alpha}{2}} s_{\bar{y}}, \\ \hat{t} \pm t_{df, 1 - \frac{\alpha}{2}} s_{\hat{t}}, \end{align*}\] with degrees of freedom \(df = n - 1\).
13.6.4.2 Sampling PSUs with probability proportional to size
When auxiliary information on PSU size is available before sampling, efficiency can be improved by selecting PSUs with PPS. Size is often measured as area, but any variable positively correlated with the parameter of interest can serve. This is the same idea you saw in point sampling (Section 12.4), where trees were selected with probability proportional to their basal area, and in 3P sampling (Section 13.4.4.4), where trees were selected with probability proportional to their predicted values. Here, we apply the concept to PSUs, selecting them with probability proportional to the number of SSUs or some other measure of PSU size.
We assume the following design elements.
Each PSU \(j\) has a selection probability \(\pi_j = M_j / M\).
Sampling is done with replacement, with \(n\) independent draws. If a PSU is selected more than once, a new SSU sample is drawn independently each time.
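For example, a with-replacement PPS draw of PSUs takes only a few lines of R. The PSU frame below is made up for illustration, and M_j stands in for whatever size measure is available.
library(dplyr)

# Made-up PSU frame: 200 PSUs with size measures M_j (illustrative only).
set.seed(13)
psu_frame <- data.frame(psu_id = 1:200, M_j = runif(200, 10, 150))

# Draw n PSUs with replacement, with selection probability pi_j = M_j / M.
n <- 10
pps_draws <- psu_frame %>%
  mutate(pi_j = M_j / sum(M_j)) %>%
  slice_sample(n = n, weight_by = pi_j, replace = TRUE)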
The population total \(\tau\) is estimated using \[\begin{equation} \hat{t} = \frac{1}{n} \sum_{j = 1}^n \frac{M_j \bar{y}_j}{\pi_j}, \tag{13.76} \end{equation}\] where the sum is taken over the \(n\) PSUs selected with replacement. Some PSUs may appear more than once, and each appearance contributes an independent term to the sum.
The population mean is estimated as
\[\begin{equation}
\bar{y} = \frac{\hat{t}}{M}.
\tag{13.77}
\end{equation}\]
The standard error for \(\hat{t}\) is estimated using
\[\begin{equation}
s_{\hat{t}} = \sqrt{\frac{1}{n(n - 1)} \sum_{j = 1}^n \left( \frac{M_j \bar{y}_j}{\pi_j} - \hat{t} \right)^2 }.
\tag{13.78}
\end{equation}\]
Given \(s_{\hat{t}}\), the standard error for \(\bar{y}\) is
\[\begin{equation}
s_{\bar{y}} = \frac{s_{\hat{t}}}{M}.
\tag{13.79}
\end{equation}\]
Confidence intervals for the mean and total follow the usual form \[\begin{align*} \bar{y} \pm t_{df, 1 - \frac{\alpha}{2}} s_{\bar{y}}, \\ \hat{t} \pm t_{df, 1 - \frac{\alpha}{2}} s_{\hat{t}}, \end{align*}\] with degrees of freedom \(df = n - 1\).
This estimator is unbiased for both the total and mean and their associated variances. PPS sampling is especially useful when there is high variation in PSU size and that size is related to the parameter of interest. By allocating sampling effort in proportion to PSU contribution to the population parameter, PPS sampling can yield lower standard errors compared with equal-probability sampling.
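Like the equal probability estimators, (13.76) through (13.79) reduce to a short summary over the selected PSUs. The sketch below assumes a data frame pps_psu_data with one row per PPS draw (a PSU selected twice contributes two rows) and columns M_j, pi_j, and y_bar_j; the names are illustrative rather than part of this chapter’s workflow.
library(dplyr)

pps_two_stage <- function(pps_psu_data, M) {
  n <- nrow(pps_psu_data)
  pps_psu_data %>%
    summarize(
      t_hat = 1 / n * sum(M_j * y_bar_j / pi_j),                # (13.76)
      y_bar = t_hat / M,                                        # (13.77)
      s_t_hat = sqrt(1 / (n * (n - 1)) *
                       sum((M_j * y_bar_j / pi_j - t_hat)^2)),  # (13.78)
      s_y_bar = s_t_hat / M                                     # (13.79)
    )
}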
13.6.5 Illustrative analysis
We’ll demonstrate the two-stage ratio estimator using a motivating example based on simulated data. While the dataset is synthetic, the scenario reflects the kind of feasibility analysis forestry consultants and analysts routinely perform.
Suppose we’re hired by a mass timber mill and fabrication firm exploring the viability of locating a facility in the region shown in Figure 13.15. The firm expects the state’s Department of Natural Resources (DNR) to be its primary supplier and wants to assess how much suitable biomass is available on DNR-managed lands around the proposed facility. This kind of pre-operational analysis is often referred to as a woodbasket study, where the woodbasket refers to the supply of accessible wood within a defined region.
FIGURE 13.15: Conifer and mixed hardwood/softwood stands managed by the Michigan Department of Natural Resources within a one-hour drive of the proposed mass timber mill and fabrication facility. These stands define the population of primary sampling units (PSUs). A simple random sample of 50 PSUs was selected for inventory.
To identify the region of interest, the firm determines that a one-hour travel time is the upper limit for economically feasible wood hauling. We use this threshold to compute an isochrone—a polygon showing the area reachable within a fixed travel time using the road network. In this case, the isochrone includes all locations accessible within one hour of the proposed facility.165 The fabrication process relies primarily on softwoods, though the firm is experimenting with some hardwood inputs.
Our parameters of interest are the mean per-acre and total dry weight of aboveground biomass (AGB) on all DNR conifer and mixed hardwood/softwood stands within the one-hour isochrone.
The population consists of 8927 stands, and our cruising budget supports a sample of 50. Stands serve as the PSUs, and the SSUs are inventory plots placed within each one. We’ll use 1/10-acre circular plots laid out on a regular 7-by-7 chain (462 ft) grid—a systematic design commonly used in operational inventory work throughout the region.
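As a quick check on what that grid implies, 10 square chains make an acre, so each plot represents 7 × 7 = 49 square chains, or 4.9 acres, which amounts to roughly a 2% cruise. The arithmetic, with illustrative object names, is below.
plot_ac <- 0.1        # plot area in acres
grid_ac <- 7 * 7 / 10 # acres represented by each plot (10 square chains per acre)
plot_ac / grid_ac     # sampling intensity, about 0.02, i.e., roughly a 2% cruise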
Figure 13.15 shows the full PSU population along with 50 PSUs selected using SRS. Figure 13.16 highlights a region containing several selected PSUs, and Figure 13.17 provides a closer view of these PSUs and their SSUs.
FIGURE 13.16: Zoomed-in view from Figure 13.15, highlighting three selected primary sampling units (PSUs).
FIGURE 13.17: The three primary sampling units (PSUs) shown in Figure 13.16. PSU IDs are 1895, 1791, and 1948 for (a), (b), and (c), respectively. Secondary sampling units (SSUs) were placed using systematic sampling. Each circular inventory plot is colored by the observed aboveground biomass (AGB).
Because the data are simulated, the true population values are known: \(\mu\) = 30.69 tons/ac and \(\tau\) = 11,100,216 tons. This allows us to evaluate estimator performance against these known parameters.
We begin by estimating mean and total AGB using a sample of 50 PSUs selected using SRS. Under this design, each PSU in the population has an equal probability of selection. The following three data files are used in this analysis.
- datasets/mass_timber/all_psu.csv contains the full set of PSUs in the population. Each row is a PSU, and columns hold a unique PSU index and area in acres.
- datasets/mass_timber/sampled_equal_prob_psu.csv contains the sampled PSUs. These were drawn from the full set of PSUs using dplyr::slice_sample() with \(n=50\) and all arguments set to their default values.
- datasets/mass_timber/sampled_ssu_equal_prob_psu.csv contains the SSUs for each sampled PSU. Each row corresponds to a tree measured on the 1/10-acre circular plot, and columns are: psu_id and ssu_id, which together uniquely identify each tree; tree_count, which indicates how many trees are associated with each row (used to account for zero-tree plots and trees double-counted via the mirage method); and agb_ton_tree, the dry weight AGB in tons for each tree—our variable of interest, the \(y_{ji}\)’s.
The code below reads and takes a quick look at each data file.
all_psu <-
  read_csv("datasets/mass_timber/all_psu.csv")
all_psu %>% glimpse()
#> Rows: 8,927
#> Columns: 2
#> $ psu_id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,…
#> $ ac <dbl> 44.565, 21.690, 51.851, 131.325, 75.43…
sampled_psu <-
read_csv("datasets/mass_timber/sampled_equal_prob_psu.csv")
sampled_psu %>% glimpse()
#> Rows: 50
#> Columns: 2
#> $ psu_id <dbl> 1017, 8004, 4775, 8462, 4050, 8789, 13…
#> $ ac <dbl> 30.787, 18.785, 11.723, 16.064, 38.443…
ssu_trees <-
read_csv("datasets/mass_timber/sampled_ssu_equal_prob_psu.csv")
ssu_trees %>% glimpse()
#> Rows: 7,750
#> Columns: 4
#> $ psu_id <dbl> 1017, 1017, 1017, 1017, 1017, 10…
#> $ ssu_id <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ tree_count <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ agb_ton_tree <dbl> 0.16786, 0.16999, 0.14626, 0.151…
We’ll now rename a few columns to align with our notation and create variables needed in subsequent steps.
all_psu <- all_psu %>% rename(A_j = ac)
sampled_psu <- sampled_psu %>% rename(A_j = ac)
N <- nrow(all_psu)
n <- nrow(sampled_psu)
A <- sum(all_psu$A_j)
# Add the tree factor for 1/10-acre plot.
r <- sqrt(0.1 * 43560 / pi)
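# r is the plot radius in feet; the tree factor TF expands each tallied tree
# to a per-acre basis and works out to 1 / 0.1 = 10 for a 1/10-acre plot.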
ssu_trees <- ssu_trees %>% mutate(TF = 43560 / (pi * r^2))
The ratio estimator for the mean, total, and their associated standard errors—as defined in (13.72), (13.74), (13.73), and (13.75)—is implemented in two steps. First, we summarize SSU-level information within each PSU. Second, we compute the required quantities across sampled PSUs to produce the population estimates. Here, our measure of PSU size is area, i.e., \(M_j = A_j\), which is joined to psu_summary
in the final piped expression below.
psu_summary <- ssu_trees %>%
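  # First summarize(): per-acre AGB for each plot (SSU), giving the y_ji values.
  # Second summarize(): average the SSU values within each PSU to get y_bar_j.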
group_by(psu_id, ssu_id) %>%
summarize(y_ji = sum(TF * tree_count * agb_ton_tree)) %>%
summarize(
y_bar_j = mean(y_ji)
) %>%
left_join(sampled_psu, by = join_by(psu_id))
psu_summary
#> # A tibble: 50 × 3
#> psu_id y_bar_j A_j
#> <dbl> <dbl> <dbl>
#> 1 465 54.2 145.
#> 2 526 56.5 126.
#> 3 878 10.9 33.8
#> 4 1017 33.0 30.8
#> 5 1069 56.4 29.6
#> 6 1129 8.76 26.2
#> 7 1222 26.3 11.1
#> 8 1301 48.7 27.9
#> 9 1530 53.3 41.7
#> 10 1791 49.4 148.
#> # ℹ 40 more rows
pop_summary <- psu_summary %>%
summarize(
y_bar = sum(A_j * y_bar_j) / sum(A_j),
s_y_bar = y_bar *
sqrt(n / (n - 1) *
(sum(A_j^2) / sum(A_j)^2 +
sum((A_j * y_bar_j)^2) / sum(A_j * y_bar_j)^2 -
2 * sum(A_j^2 * y_bar_j) / (sum(A_j) * sum(A_j * y_bar_j))
) * (1 - n / N)),
tau_hat = A * y_bar,
s_tau_hat = A * s_y_bar
)
pop_summary
#> # A tibble: 1 × 4
#> y_bar s_y_bar tau_hat s_tau_hat
#> <dbl> <dbl> <dbl> <dbl>
#> 1 35.6 2.93 12860648. 1059993.
Our estimated mean and total, with 95% confidence intervals, were 35.56 (29.67, 41.45) tons/ac and 12,860,648 (10,730,514, 14,990,783) tons, respectively. Based on this sample, both confidence intervals captured the true values, which are \(\mu\) = 30.69 tons/ac and \(\tau\) = 11,100,216 tons.
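For reference, one way to compute those intervals from pop_summary (a sketch, not necessarily how the values above were produced) is shown below.
# 95% confidence intervals using t with df = n - 1 (here n = 50 sampled PSUs).
t_crit <- qt(0.975, df = n - 1)

pop_summary %>%
  mutate(
    y_bar_lwr = y_bar - t_crit * s_y_bar,
    y_bar_upr = y_bar + t_crit * s_y_bar,
    tau_lwr = tau_hat - t_crit * s_tau_hat,
    tau_upr = tau_hat + t_crit * s_tau_hat
  )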
13.6.5.1 Reanalysis using SYS of PSU
In Section 13.2 we applied SYS to areal sampling frames. The same design can also be used with a list sampling frame, where it’s generally referred to as systematic list sampling. When no auxiliary information is available, SYS from an unordered list produces results similar to SRS. However, if an auxiliary variable is available for all units and is correlated with the parameter of interest, sorting the list by that variable before sampling can improve efficiency. The gain comes from implicit stratification: a SYS sample from a sorted list tends to span the full range of the ordering variable, yielding balanced coverage without explicitly defining strata. Inclusion probabilities remain equal, so standard equal-probability estimators still apply.
Here we replace the equal-probability SRS of PSUs used for the ratio estimator with SYS applied to the PSU list sorted by area. We use circular systematic sampling (Section 13.6.5.1.1), which ensures one selection per interval and avoids duplicates. As shown later, sorting the list by PSU area modestly reduces RMSE and improves confidence-interval coverage. The takeaway: SYS can be applied to list sampling frames, and when a relevant auxiliary variable exists, applying SYS along its gradient can yield more efficient estimates.
13.6.5.1.1 Circular systematic sampling for a list
In our analysis, the “list” refers to the population of PSUs, identified by the unique index psu_id
in the all_psu
data frame. Circular SYS treats this list as if it were arranged around a circle: choose a random start, take units at fixed intervals, and wrap to the beginning when you reach the end, continuing until \(n\) distinct PSUs are selected. This preserves equal inclusion probabilities, avoids selecting the same unit twice, and—when the list is sorted by a relevant auxiliary variable—can produce a well-spread sample (see Cochran (1977), Section 8.1; K. Iles (2003)).
The function sys_circular(N, n)
in scripts/utils.R
implements this method, returning the row indices to select. In the code below, sample_index
holds these indices. The piped sequence then orders all_psu
by acres, selects the sampled rows, and assigns them to sampled_psu
. SSUs would then be sampled within each PSU (virtually here) to form estimates.
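The selection step might look like the sketch below. The sys_circular() shown here is a simple illustrative implementation that may differ from the version in scripts/utils.R, and the piped sequence follows the description above.
# Circular systematic sampling: random start on a "circle" of N list positions,
# then steps of floor(N / n), wrapping past N, which yields n distinct indices.
sys_circular <- function(N, n) {
  k <- floor(N / n)
  start <- sample.int(N, 1)
  (start + (0:(n - 1)) * k - 1) %% N + 1
}

sample_index <- sys_circular(N, n)

sampled_psu <- all_psu %>%
  arrange(A_j) %>%    # sort the PSU list by area
  slice(sample_index) # keep the circular SYS selections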
13.6.5.1.2 Repeated-sampling comparison
Here we assess the performance of the two-stage ratio estimator when PSUs are selected using either SRS or circular SYS. For each of 1,000 replicate samples, we:
- drew \(n = 50\) PSUs using simple random sampling from the unsorted list;
- drew \(n = 50\) PSUs using circular systematic sampling from the list sorted by PSU area (see Section 13.6.5.1.1).
In both cases, we applied the two-stage ratio estimator to the sampled PSUs and recorded point estimates, standard errors, and confidence intervals. Estimator performance under each sampling approach is evaluated using the following summaries.
Bias: the average difference between the estimate from each sample and the true population mean (see Section 11.2.3 for the formal definition). Closer to zero is better.
RMSE: the square root of the mean squared error (see Section 11.2.3), combining both bias and variability into a single measure of accuracy. Smaller is better.
CI width: the average width of the 95% confidence interval, providing a measure of estimator precision. Narrower is better, as long as CI cover is close to its nominal 95%.
CI cover: the proportion of confidence intervals that contain the true population mean, expressed as a percentage. This measures how well the estimator’s uncertainty estimates are calibrated. Should be close to the nominal value of 95%.
TABLE 13.3: Repeated-sampling performance of the two-stage ratio estimator with PSUs selected by SRS and circular SYS (1,000 replicates).

| PSU selection | Bias (tons/ac) | RMSE (tons/ac) | CI width (tons/ac) | CI cover (%) |
|---|---|---|---|---|
| SRS | -0.39 | 2.6 | 9.2 | 92 |
| SYS | -0.34 | 2.6 | 9.5 | 94 |
Summaries across replicates are presented in Table 13.3. As noted earlier, the ratio estimator is biased; however, the results show that this bias is relatively small (compared with the true mean biomass \(\mu\) = 30.69254 tons/ac). Bias is an inherent property of the estimator, and since both sampling approaches (i.e., SRS and SYS) produce equal probability samples, it’s reassuring that they result in nearly identical estimates of bias. In this example, SYS produced slightly lower RMSE and slightly wider 95% confidence intervals, with coverage closer to the nominal 95%.
These modest gains in RMSE and confidence interval performance afforded by SYS are not surprising, as per-acre biomass is not strongly correlated with stand size in this dataset. While that relationship is weak here, applying SYS to a list sorted by a relevant auxiliary variable can improve efficiency by spreading the sample more evenly across the range of that variable. When the ordering variable is closely related to the parameter of interest, this implicit stratification can substantially reduce variance and enhance confidence interval coverage relative to SRS.
This analysis used the two-stage ratio estimator, which permits the number of SSUs in the \(j\)-th sampled PSU to be represented by a relative size—in our case, the sampled PSU’s area \(A_j\). Many two-stage estimators commonly found in the literature—including the unbiased and PPS-based methods presented here—assume a finite population of SSUs. This limits their applicability when SSUs are sampled from an areal frame, where the population is effectively infinite, as is typical in operational forest inventories. Some extensions accommodate PPS sampling with infinite SSU populations within a design-based framework (e.g., Gregoire and Valentine (2007), Chapter 12), but many practical approaches rely on model-assisted or model-based methods (e.g., Särndal, Swensson, and Wretman (1992); Mandallaz (2007); Banerjee (2024)). These represent important extensions used in practice, though a full treatment is beyond the scope of this book.
13.6.6 Advantages and disadvantages of two-stage sampling
Two-stage sampling is a flexible design that can balance statistical efficiency with operational practicality. It’s especially useful when the population is naturally grouped into larger units (PSUs) that can be further subdivided into smaller units (SSUs).
13.6.6.1 Advantages
Fieldwork can be concentrated within a manageable set of PSUs, reducing travel time and logistical complexity compared to SRS of all SSUs.
Allows the use of equal probability selection or PPS at the first stage, depending on whether auxiliary size information is available.
Flexible first-stage selection strategies allow for balancing statistical efficiency with field logistics, such as ordering PSUs spatially to improve coverage or selecting PPS to better represent larger units.
Well suited for inventories covering large, heterogeneous regions where direct sampling of all SSUs is impractical.
13.6.6.2 Disadvantages
If the number of SSUs per PSU is small or if there is high within-PSU variation, precision can be lower than for single-stage designs with the same total sample size.
Equal probability designs can perform poorly when PSU sizes vary greatly and totals are correlated with size, leading to inflated variance and wider confidence intervals.
Many commonly used two-stage estimators assume a finite population of SSUs. These estimators are not appropriate for forest inventory data collected using an areal sampling frame, where the SSU population is effectively infinite. This limits the applicability of certain standard design-based estimators in forestry contexts.
13.7 Summary
This chapter introduced probability sampling designs and estimation strategies widely used in forest inventory, including systematic, stratified, regression-assisted, cluster, and multistage designs. These designs are not only tools for valid inference—they also provide opportunities to reduce variance, improve precision, and make better use of limited time and budget.
We began with SYS, the most commonly used design in forestry. SYS places sampling locations on a regular grid to improve spatial balance and field efficiency. We showed how to generate and position grids using R, and noted that SYS can also be applied to list frames when sorted by an auxiliary variable.
Stratified sampling was presented as a way to improve precision by dividing the population into homogeneous groups. We reviewed allocation strategies and discussed how stratification supports both inference and logistical planning. Ratio and regression estimators were introduced as tools for incorporating auxiliary variables, with examples illustrating their application within and across strata.
We then applied these estimation strategies and design principles to specialized forestry contexts, including Big BAF and 3P sampling. Both designs follow the same theme introduced in the regression and ratio estimation section: using quick, inexpensive measurements across the full population (or a large subset), and correcting them with a smaller subset of accurate measurements to improve inference.
Finally, we introduced cluster and two-stage sampling designs, which are useful when the population is naturally grouped or when field effort must be concentrated in fewer locations. These designs help reduce travel and logistical costs by allowing multiple observations within a selected unit, while still supporting valid and efficient estimation through appropriate weighting and summarization.
Implementing these designs in R using tidyverse workflows offers major practical advantages. Code built around tools like group_by()
, summarize()
, and spatial operations from the sf
package allows analysts to write modular, transparent routines that mirror the structure of the design itself. This leads to workflows that are not only efficient to execute, but also reproducible, auditable, and easily adapted to different forest inventory contexts—scaling from small studies to regional and national assessments.
13.8 Exercises
TBW
References
While the focus is on using SYS to position sampling locations within an areal sampling frame, the same idea applies to list sampling frames, where SYS is often used to select units from a sorted list. We return to this in Section 13.6.5.1.1 as part of a list-sampling example.↩︎
Any of these objects can be saved to file using
st_write()
from the sf package to be read by GIS or GPS software.↩︎
The specific projection is NAD83 / Pennsylvania South (ft), which corresponds to EPSG:2271.↩︎
In
plot_sys_grid()
, notice how the geom_sf() layers automatically render each sf object based on its geometry type—points, lines, or polygons—making it easy to build clean and consistent maps. You can modify the function to adjust colors, symbols, or other plot features to suit your needs.↩︎
We use a
for
loop here to save space. Compare with the hardcoded approach in Section 11.3.6.1.↩︎
The
bounds
object is projected in meters. You can check an sf object’s distance units using st_crs(bounds)$units or units::deparse_unit(st_distance(bounds)).↩︎
This works because
plot_id
is unique across strata. If that’s not the case, for example if plot IDs are reused within each stratum, we’d need to include stratum_id in the grouping to avoid mixing trees from different strata that share the same plot ID.↩︎
At the time of writing, much of the hemlock at Harvard Forest and across the broader region had already succumbed to HWA (D. A. Orwig et al. 2012).↩︎
Volume for guessed trees is based on Girard Form Class 78, International 1/4-inch rule in Table 21 of Mesavage and Girard (1956) (8-inch DBH volumes interpolated from the published 10-inch values). For example, looking at the first row in Table 13.2, a 10-inch DBH tree with 1.5 16-foot logs has a volume of 48 board feet according to Table 21. Volumes for measured trees use the corresponding Form Class 78, International 1/4-inch formula published in Wiant (1986). This formula was used to create R functions in Exercise 5.3. For instance, to calculate the measured volume of a 10.4-inch DBH tree with 27 feet of merchantable height (i.e., 27/16=1.6875 16-foot logs), use
vol_I(10.4, 27/16)
, which returns 57.9 board feet after rounding.↩︎
Technically, the VBAR estimate could come from a separate set of sampling locations or even a different sampling design; it’s just convenient to take spatially paired measurements.↩︎
As noted by Gregoire and Valentine (2007) (p. 259), the \(\overline{\text{VBAR}}\) estimator is a mean-of-ratios, and the corresponding standard error estimator—which accounts for each ratio as a function of two random variables—could be used. However, simulation results in Gove et al. (2020) show that treating tree-level VBARs as independent draws from a single random variable and applying the usual sample variance formula yields nearly identical inferences in practice.↩︎
The wide-scale relascope includes larger BAF increments, such as 20.25 \(m^2/\text{ha}\), suitable for Big BAF sampling. To convert from metric to English units (\(\text{ft}^2/\text{ac}\)), multiply by 4.356; the chosen BAFs correspond approximately to English BAFs of 21.8 (small) and 88.2 (big).↩︎
We didn’t cover the
tidyr::count()
and uncount() functions, but do check them out—they’re very handy.↩︎
Trees with \(x_i > \text{max}_z\) are always selected. See the discussion of "sure-to-be-measured" trees later in this section.↩︎
As we’ll see later, the 3P estimator includes a finite population correction (FPC), so this census stratum’s standard error will automatically be zero.↩︎
You can reproduce the estimates for \(\alpha_0\) and \(\alpha_1\) using
You can reproduce the estimates for \(\alpha_0\) and \(\alpha_1\) using
nls(vol_m3 ~ alpha_0 * dbh_cm^alpha_1, start = list(alpha_0 = 0.001, alpha_1 = 2.5), data = dbh_vol)
with dbh_vol read from "datasets/BC_3P/dbh_vol.csv".↩︎
Since these are simulated data, we know they were generated with slightly different ratios.↩︎
This type of road-based travel-time polygon can be computed in R using the
osrm
package, which provides an interface to the Open Source Routing Machine. The package supports travel-time and routing calculations using OpenStreetMap’s road network data. Other tools and packages are also available for generating isochrones, depending on your needs and data sources.↩︎