Chapter 13 Sampling designs, implementation, and analysis
In Chapters 11 and 12, we focused on inference using sample data collected under an SRS design. In this chapter, we introduce several additional probability sampling designs commonly used in forestry. Recall that a sampling design is a set of rules for selecting units from a population and using the resulting data to generate statistically valid estimates for parameters of interest. We focus on five designs: 1) systematic, 2) stratified, 3) regression, 4) cluster, and 5) multistage sampling. These designs differ in how they configure and select observations and in the estimators they prescribe. In most cases, a design is chosen to improve statistical efficiency, logistical efficiency, or both.
As discussed in Section 12.1, forest inventory typically relies on an areal sampling frame, where sampling locations are identified by randomly placing points within the frame’s spatial extent. Under plot sampling, the location might mark the center of a circular plot, a corner of a rectangular plot, or an anchor point for more complex plot shapes. Under point sampling, a location might serve as the point from which a forester projects a critical angle to identify measurement trees. Unless stated otherwise, the sampling designs described in this chapter can be used with either plot or point sampling—that is, the design governs the selection of sampling locations, while a separate set of rules determines how trees are selected around each location.
As in other mensuration texts (Kershaw et al. 2016; Burkhart, Avery, and Bullock 2018), we present estimators associated with each design. The text by Iles (K. Iles 2003) is especially valuable for readers seeking practical guidance: it expands on these methods, discusses their implementation in the field, and provides real-world perspectives and advancements. Our added contribution is to show how these estimators can be implemented using tidyverse
tools and other R packages. Emphasis is placed on building computing workflows for efficient and repeatable analysis.
13.1 A gentle introduction to sf
Because sampling in forestry is inherently spatial—we’re selecting locations within an areal sampling frame—it’s helpful to start working with spatially aware data structures. In R, the primary tool for handling spatial vector data is the sf
package.
If you’ve used GIS software before, many of the concepts will feel familiar. Spatial vector data come in three main types: points (e.g., sampling locations), lines (e.g., transects or travel paths), and polygons (e.g., stands, management units, or forest extents). The sf
package represents each of these as a special kind of data frame with an attached spatial geometry column.
You can work with sf
objects much like regular data frames, using familiar tidyverse
tools such as filter()
, select()
, and mutate()
. We’ll introduce key sf
data structures and functions as needed throughout the chapter. For more background, see the sf
documentation at https://r-spatial.github.io/sf or the package vignette via vignette("sf1")
.
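To make this concrete, here's a minimal sketch showing how a few sampling locations become an sf point object and how it responds to the usual tidyverse verbs. The coordinates, IDs, and coordinate reference system (EPSG 32619) are made up for illustration.

library(sf)
library(tidyverse)

# Three hypothetical sampling locations with projected coordinates (m).
pts <- tibble(point_id = 1:3,
              x = c(0, 100, 200),
              y = c(0, 50, 100)) %>%
  st_as_sf(coords = c("x", "y"), crs = 32619)

# sf objects behave like data frames with an attached geometry column.
pts %>% filter(point_id > 1)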
13.2 Systematic sampling
Systematic sampling (SYS) is the most widely used probability sampling design in forest inventory. In SYS, sampling locations are arranged in a fixed pattern across the population. To introduce randomness and ensure unbiased estimation of means and totals, the pattern—typically a rectangular grid—is positioned using a randomly chosen starting point. As illustrated in Figure 1.2, sampling locations are usually aligned along straight lines. The distance between locations on a line, known as the sampling interval, and the distance between lines can differ, creating a rectangular grid. This is commonly referred to as line-plot sampling, which is our focus here.148
SYS designs are popular in forestry for several reasons. Compared to SRS, their advantages include the following.
Sampling locations are evenly distributed across the population, improving the reliability of mean and total estimates. In contrast, SRS locations may, by chance, miss important spatial features.
Sampling locations are easier and faster to find, reducing cost. The cruiser can follow straight lines between locations, which minimizes travel time and simplifies relocation during remeasurement.
The grid is typically positioned using GIS or spatial software (e.g., R), and the resulting points are uploaded to a GPS device for navigation. Alternatively, the cruiser may locate the first sampling point in the field and lay out the remainder using the sampling interval and grid orientation.
Key considerations when implementing a SYS design include:
the pattern and orientation of the grid;
the method used to randomly position the grid.
These points are covered in the next few sections.
13.2.1 Grid pattern, number of grid points, and orientation
In practice, either a rectangular or square grid is used. A rectangular grid defines the spacing between sampling locations along a line, denoted \(D_p\) (with the subscript \(p\) reminding us this spacing is for “plots” or “points”), and the spacing between lines, \(D_l\). A square grid is a special case where \(D_{\text{sq}} = D_p = D_l\). Unless there’s a compelling reason to use unequal spacing, a square grid is a reasonable and efficient default.
Given a sample size \(n\) and total area \(A\), the spacing between locations in a square grid is given by
\[\begin{equation} D_{\text{sq}} = \sqrt{\frac{cA}{n}}, \end{equation}\]
where \(c\) is a conversion factor to express area in square distance units. For example, use \(c = 10,\!000\ \text{m}^2/\text{ha}\) if \(A\) is in hectares and \(D_{\text{sq}}\) is in meters, or \(c = 43,\!560\ \text{ft}^2/\text{ac}\) if \(A\) is in acres and \(D_{\text{sq}}\) is in feet.
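For example, a quick calculation of the square-grid spacing for a hypothetical 100 ac tract and a target of 30 sampling locations (values made up for illustration) looks like this.

A <- 100          # Tract area (ac).
n <- 30           # Target number of sampling locations.
c_conv <- 43560   # Conversion factor (ft^2/ac).
D_sq <- sqrt(c_conv * A / n)
D_sq              # Spacing in feet, about 381 ft (roughly 5.8 chains).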
As introduced in Section 11.3.6, sample size \(n\) should be based on the level of precision needed to meet the survey’s objectives. However, in practice, it’s common for experienced foresters to adopt a sampling intensity that works well for their forest type and inferential objectives. For example, the forester who conducted the Elk County cruise described in Section 1.2.2 typically uses a seven-by-seven chain SYS grid.
If you do use a sample size calculation, as recommended in Section 11.3.6, it’s likely that the number of grid points won’t exactly match the computed \(n\) due to how the grid falls within the boundary. In such cases, we agree with the following sage words.
My advice is simply to take the number of points that are indicated [by the grid], and live with the slight variability. Sample sizes that are “needed” … are just estimated anyway.
— Kim Iles
If you really must have a specific sample size, then the best approach is to specify a denser grid than needed and randomly or systematically thin points until the target sample size is reached (see, e.g., K. Iles (2003)).
The grid orientation can be adjusted to improve field logistics. For example, aligning lines with a forest boundary may simplify navigation. This was the approach used in the Elk County cruise (Figure 1.2).
Although rare, it’s possible for a population to exhibit a systematic pattern that aligns with the grid’s spacing and orientation, which could introduce bias. For example, a grid aligned with plantation rows might consistently place sampling locations between rows, leading to an unrepresentative sample. In such cases, rotating the grid to avoid alignment with periodic features can mitigate bias.
If a known gradient exists in the population—for example, productivity increases with elevation—orienting the grid so that its lines run perpendicular to the gradient can improve sample representativeness.
13.2.2 Positioning the grid
To avoid bias, the grid must be randomly placed. This is typically done by selecting a random point start within the areal sampling frame and extending the grid in each direction from that point. If rotation is applied, it should be performed about this random point start.
The random point start can be one of the sampling locations or used as an anchor point from which to extend the grid. In the development that follows, we treat the starting point as a location on the sampling grid.
In the next section, we introduce a function that generates systematic sampling grids in R. This function allows users to specify grid spacing, orientation, and how sampling locations are ordered. The code and examples also help highlight some practical considerations when laying out systematic designs.
13.2.3 Creating a systematic sampling grid with sys_grid()
This section is presented in the style of a condensed R help page to document the sys_grid()
function. For a deeper understanding, you’re encouraged to read through the function source code in "scripts/utils.R"
(see Section 5.4). The code comments walk through each step, and many of the elements should already be familiar.
Description
Creates a systematic grid of points within a polygonal boundary using a random start. The grid is defined by a fixed spacing between lines (D_l
) and between points along each line (D_p
). The grid can be rotated and the points ordered from a specified origin to facilitate field navigation.
Usage
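The usage line didn't survive rendering here; the sketch below is inferred from the arguments that follow, and the argument order is an assumption.

sys_grid(boundary, D_l, D_p, angle = 0, origin = "se")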
Arguments
- boundary: An sf polygon object defining the sampling boundary. Must be projected with linear units.
- D_l: Distance between sampling lines (x-direction) in the same units as boundary.
- D_p: Distance between points along a line (y-direction) in the same units as boundary.
- angle: Optional rotation angle (in degrees) applied to the grid. Default is 0.
- origin: Optional character string indicating the origin corner used to order points. One of "se", "sw", "ne", or "nw" (default is "se").
Details
The sys_grid()
function creates a rectangular grid of points with spacing determined by D_l
and D_p
, starting from a randomly placed point within the boundary polygon. Here’s how it works.
- A random point start is placed within the boundary.
- A full grid is generated around that point, extended enough to accommodate possible rotation.
- Points are ordered in a snaking pattern from the specified origin.
- If specified, the grid is rotated by angle degrees about the random point start.
- The grid is clipped to the polygon, and the resulting points are labeled and returned.

Ordering by origin can help plan field navigation, especially when following lines. The choice of origin affects how line_id, point_id_in_line, and point_id are assigned, but not the design itself.
Value
Returns a list with the following components.149
- grid: An sf point object with columns:
  - point_id: unique ID, ordered from the origin;
  - line_id: identifier for each line;
  - point_id_in_line: position along each line.
  Columns in boundary are appended.
- path: An sf LINESTRING connecting the points in sampling order.
- random_point_start: The initial random point start (an sf point object).
- boundary: The input polygon(s) used to clip the grid.
- type: A string used internally to verify the object when plotting.
Use plot_sys_grid()
to visualize the resulting grid and path.
13.2.4 Systematic sampling grid illustration with sys_grid()
Let’s apply sys_grid()
to set up a SYS design for the Elk County cruise described in Section 1.2.2. In that cruise, the forester used a square grid with sampling locations spaced every seven chains (recall that one chain equals 66 feet).
The sys_grid()
function expects the boundary polygon to be in a projection that uses the same distance units as D_l
and D_p
. The Elk County property is stored in the "datasets/PA/Elk_bounds"
ESRI shapefile (Environmental Systems Research Institute 1998), which uses a geographic coordinate system (i.e., decimal degrees of latitude and longitude). Decimal degrees aren’t an intuitive measurement system for field work, so instead we want the boundary in a projected coordinate system that allows us to specify D_l
and D_p
in feet or meters.
After reading the shapefile using st_read()
, the code below reprojects the boundary to a projection appropriate for northern Pennsylvania and consistent with the cruise protocol. This is done using st_transform()
.150
# Read in the bounding polygon.
bounds <- st_read("datasets/PA", "Elk_bounds", quiet = TRUE)
bounds <- st_transform(bounds, crs = st_crs(2271))
Once reprojected, bounds
has units in feet, so we specify D_l
and D_p
in feet when calling sys_grid()
. Because we don’t specify values for angle
and origin
, the defaults of 0
and "se"
, respectively, are used. To ensure the same random point start is generated each time the code is run, we set the random seed prior to calling the function.
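The call itself isn't shown in the rendered output. A sketch consistent with the seven-by-seven chain protocol and the variations shown later (and assuming sys_grid() has been sourced from "scripts/utils.R") is given below.

set.seed(1)
sys <- sys_grid(bounds, D_l = 7 * 66, D_p = 7 * 66)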
We can visualize the resulting SYS design by passing the sys
object to the plot_sys_grid()
function, defined in the "scripts/utils.R"
script. The resulting map is shown in Figure 13.1. We encourage you to explore the plot_sys_grid()
function to see how it builds the graphic using ggplot()
. Compared to the simpler examples in Chapter 9, this plot is a bit more involved, with a customized legend and multiple spatial layers.151
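The plotting call is a one-liner, assuming sys holds the grid created above.

plot_sys_grid(sys)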
FIGURE 13.1: Systematic sampling design for an Elk County property cruise.
Let’s create a few variations on the design shown in Figure 13.1 by rotating the grid and choosing a different origin—while keeping the same random point start. This isolates the effect of orientation and ordering. To reduce code clutter, we don’t show the calls to plot_sys_grid()
or the patchwork
functions used to produce the composite graphic shown in Figure 13.2.
# (a) Rotate grid 45 degrees.
set.seed(1)
sys_a <- sys_grid(bounds,
D_l = 7 * 66, D_p = 7 * 66, angle = 45)
# (b) Northwest corner start.
set.seed(1)
sys_b <- sys_grid(bounds,
D_l = 7 * 66, D_p = 7 * 66, origin = "nw")
# (c) Rotation and northwest corner start.
set.seed(1)
sys_c <- sys_grid(bounds,
D_l = 7 * 66, D_p = 7 * 66,
angle = 45, origin = "nw")
FIGURE 13.2: Systematic sampling design for an Elk County cruise under different grid rotations and starting origins.
Notice in Figures 13.1 and 13.2 that the only random component of systematic sampling is the initial point start. Once that point is chosen, all other sampling locations are fixed by the grid’s spacing and rotation. Thus, the layout looks highly structured, but it still retains a probabilistic foundation because the sample depends on a randomly chosen start. This single source of randomness is what allows SYS to be used for statistical inference, subject to the assumptions and caveats outlined in the next section.
With the sampling grid in place, we now turn to estimation using data collected under systematic sampling.
13.2.5 Estimation using SYS
The SYS estimators for the population mean \(\mu\) (11.1) and total \(\tau\) (11.6) are the same sample mean \(\bar{y}\) (11.15) and total \(\hat{t}\) (11.22) used under SRS (Section 11.3); both remain unbiased under SYS.
There is, however, no valid estimator for the sample variance (and thus no estimator for the standard error of the mean) under SYS. The common workaround is to apply the SRS estimators for variance (11.17) and standard error (11.21).
In theory, these SRS-based estimators can either overstate or understate variability, and it’s not straightforward to know which case applies without additional information about the population. In practice, however, they usually perform well enough to be useful and are widely adopted.
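As a concrete sketch of this workaround, the SRS formulas from Section 11.3 are simply applied to the SYS plot values; the AGB values below are hypothetical.

y <- c(210, 185, 240, 205, 230, 195, 220, 215) # Plot-level AGB (Mg/ha).
n <- length(y)
y_bar <- mean(y)            # Estimate of the mean.
s_y_bar <- sd(y) / sqrt(n)  # SRS standard error, used as an approximation.
c(y_bar, s_y_bar)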
The reason SYS lacks a valid variance estimator is fairly intuitive. Recall from Section 11.3 that the sample variance estimator for SRS assumes the \(n\) observations are drawn independently and at random from the population. This independence is critical: it justifies the use of \(n - 1\) in the denominator of the variance estimator (11.17) (the subtraction of one reflects that one piece of information has already been used to estimate the sample mean).
By contrast, under SYS a single random start point determines the entire sample. Once the start is fixed, all subsequent locations are determined by the grid layout, so the observations are not independent. From this perspective, SYS effectively behaves like a sample of size one. Plugging \(n = 1\) into the SRS variance estimator (11.17) leads to division by zero, which is undefined. This highlights the core issue: SYS does not yield a set of independent observations, so the usual SRS variance estimators do not apply.
Despite these limitations, the SRS variance estimator is commonly used as a practical approximation for SYS. In most applications it provides a reasonable estimate of sample variance and, by extension, the standard error of the mean. Still, interpretation requires care: how the population is structured, and how that structure interacts with the systematic grid, strongly influences the reliability of variance estimates. See Kershaw et al. (2016) and K. Iles (2003) for further discussion, including cautions and justification for why the SRS variance estimator is often considered acceptable for SYS data.
13.2.6 Advantages and disadvantages of systematic sampling
Systematic sampling offers several practical benefits and is widely used in forestry, but like all designs, it involves trade-offs. Here we summarize its main strengths and limitations, particularly in comparison to SRS.
Advantages
Efficient field layout: SYS designs are easy to implement in the field. Sampling locations fall on a predictable grid, allowing cruisers to follow straight lines, reducing travel time and simplifying navigation.
Improved spatial coverage: Grid-based sampling ensures sampling locations are evenly distributed across the population, helping to capture spatial heterogeneity and reduce the risk of missing important features—a common concern with SRS.
Simplifies remeasurement: Because sampling locations follow a fixed, repeatable pattern, they’re easier to relocate during future inventory cycles.
Cost-effective: The combination of reduced travel time and simplified logistics typically results in lower per-unit sampling cost compared to SRS.
Good performance in many populations: SYS often yields more precise estimates than SRS for the same sample size, particularly when the population lacks strong periodicity that aligns with the grid. When a broad spatial trend is present, careful grid orientation can actually improve precision—for example, by laying the grid across the direction of the trend to ensure full coverage.
Disadvantages
No valid variance estimator: SYS lacks a design-based estimator for the sample variance and standard error. While the SRS estimator is commonly used as an approximation, it may over- or underestimate uncertainty, depending on the population structure.
Potential bias from periodicity: If the population exhibits spatial patterns that align with the grid spacing or orientation (e.g., plantation rows, windrow lines, soil type bands), SYS can produce biased estimates.
Requires spatial data infrastructure: Implementing SYS effectively typically relies on GIS tools and spatial data, which may be a barrier in some settings.
Despite these drawbacks, SYS remains a go-to design in operational forestry due to its simplicity, spatial balance, and low implementation cost. When used thoughtfully—with attention to population structure and proper randomization of the starting point—SYS can yield highly effective inventory data.
13.3 Stratified sampling
In forest inventory, it’s common to estimate population parameters not only for the entire area of interest but also for smaller subareas, or strata. Stratification refers to dividing the population into non-overlapping subareas, either to improve statistical efficiency or to ensure estimates are available for specific subareas of interest. Ideally, strata are defined using characteristics expected to influence the parameter being estimated, making them internally homogeneous but meaningfully different from one another. Common stratification variables include species, age, productivity, and management unit.
Stratified sampling involves drawing a probability sample independently within each stratum. While sampling within strata is most often done using SRS or SYS, other probability sampling methods can be used. The main benefits of stratification are potential gains in statistical efficiency, i.e., reduced standard errors, and the ability to produce both overall and stratum-specific estimates.
Stratification is often implemented in one of two ways.
Pre-stratification: Strata are delineated before sampling begins, letting us allocate stratum-specific sample sizes based on stratum size, expected variability, and sampling costs. This gives full control over the number of observations placed in each stratum. In this case, the area of each stratum must be known ahead of time.
Post-stratification: Here, sampling proceeds without regard to strata, and strata are assigned after the fact based on information collected during or after sampling. This still supports stratum-specific estimation, but the per-stratum sample sizes aren’t under our control. The area of each stratum must be determined after sampling.
Both approaches are valid. The main advantage of pre-stratification is the ability to apply optimum allocation—assigning more observations to strata that are large or exhibit greater variability, and fewer to those that are small, more homogeneous, or costly to sample. This improves statistical efficiency while controlling sampling costs.
There are substantial statistical and cost-saving benefits to stratification. Even a poor choice of strata yields unbiased estimates and, in almost all practical cases, performs no worse than unstratified sampling. At worst, you’ll end up with about the same standard error as you’d get without stratification; at best, you can dramatically improve statistical and cost efficiency.
Stratified sampling proceeds in two steps.
Assignment: Before any measurements are taken (in the case of pre-stratification), areas of the population—such as stands or management units—are assigned to strata using available information.
Measurement: A sample is then drawn within each stratum. At least two observations per stratum are needed to compute within-stratum variance.
13.3.1 Choosing stratum membership
Stratification begins by subdividing the population into distinct strata based on available information. Assignment can be guided by prior inventory data (e.g., species, age, height, volume, or biomass), expert judgment, or ancillary information like topography, canopy structure, or remote sensing-based spectral or structural measurements. Strata may also align with management units or areas where a separate estimate is desired.
The key requirement is that strata don’t overlap. For example, if strata are defined using polygons—where one or more polygons comprise a stratum—those polygons must be spatially distinct, i.e., their boundaries don’t overlap. Each location in the population must belong to one and only one stratum.
Useful strata group together areas that are similar to each other and different from the rest. Ideally, variance within a stratum is low, and variance between strata is large. These characteristics improve statistical efficiency.
A stratum can consist of spatially contiguous or disjoint areas. For example, a stratum may comprise spatially adjacent stands or stands scattered across the forest. It’s the internal consistency of each stratum that counts—not whether the areas are next to each other on the map.
In the next section, we introduce the estimators used in stratified sampling.
13.3.2 Estimation using stratified sampling
Throughout this section, and in the subsequent illustration, we assume that sampling within each stratum is done using SRS or SYS, and that SRS estimators are used for the stratum means and variances. Other within-stratum designs are possible but may require different estimators.
We introduce the following notation:
- \(H\) number of strata in the population;
- \(N_h\) population size in the \(h\)-th stratum;
- \(N = \sum_{h = 1}^H N_h\) total population size across all strata;
- \(n_h\) sample size in the \(h\)-th stratum;
- \(n = \sum_{h = 1}^H n_h\) total sample size across all strata.
If we’re working from an areal sampling frame, then \(N\) and \(N_h\) are expressed in area units, e.g., acres or hectares.
The stratum population mean \(\mu_h\) is estimated using the sample mean \[\begin{equation} \bar{y}_h = \frac{\sum_{i = 1}^{n_h} y_{hi}}{n_h}, \tag{13.1} \end{equation}\] where \(y_{hi}\) is the \(i\)-th observation in the \(h\)-th stratum.
The overall population mean \(\mu\) (i.e., across all strata) is estimated using \[\begin{equation} \bar{y} = \frac{\sum_{h = 1}^H N_h \bar{y}_h}{N}. \tag{13.2} \end{equation}\] This is a weighted sum of the stratum-specific sample means, with weights given by each stratum’s proportion of the total population size (or area), i.e., \(N_h / N\).
The population total \(\tau\) is estimated using \[\begin{equation} \hat{t} = \sum_{h = 1}^H N_h \bar{y}_h = N \bar{y}. \tag{13.3} \end{equation}\]
The sample variance within each stratum, \(s^2_h\), is estimated using the SRS variance estimator (11.17). If standard errors and confidence intervals are desired for individual strata, use (11.20) or (11.21) and (11.31), respectively.
Assuming a finite population, the standard error of the overall mean \(\bar{y}\) is \[\begin{equation} s_{\bar{y}} = \sqrt{ \frac{1}{N^2} \sum_{h = 1}^H \left[ \frac{N_h^2 s_h^2}{n_h} \left( \frac{N_h - n_h}{N_h} \right) \right] }. \tag{13.4} \end{equation}\] If we’re working from an areal sampling frame (i.e., an infinite population), the FPC is omitted (see Section 12.2), giving \[\begin{equation} s_{\bar{y}} = \sqrt{ \frac{1}{N^2} \sum_{h = 1}^H \frac{N_h^2 s_h^2}{n_h} }. \tag{13.5} \end{equation}\]
The standard error for the total estimate \(\hat{t}\) is \[\begin{equation} s_{\hat{t}} = N s_{\bar{y}}. \tag{13.6} \end{equation}\]
Confidence intervals for the mean follow the usual form \[\begin{equation} \bar{y} \pm t_{df, 1 - \frac{\alpha}{2}} s_{\bar{y}}, \tag{13.7} \end{equation}\] with degrees of freedom computed as \[\begin{equation} df = \sum_{h = 1}^H (n_h - 1) = n - H. \tag{13.8} \end{equation}\]
Confidence intervals for the total \(\hat{t}\) follow the same form using \(s_{\hat{t}}\).
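Here's a compact numeric sketch of these estimators under an areal sampling frame; all stratum summaries are hypothetical.

N_h <- c(10, 30, 60)          # Stratum areas (ha).
n_h <- c(4, 8, 12)            # Plots per stratum.
y_bar_h <- c(150, 220, 310)   # Stratum means (Mg/ha).
s_h <- c(60, 90, 110)         # Stratum standard deviations (Mg/ha).
N <- sum(N_h)
y_bar <- sum(N_h * y_bar_h) / N                         # (13.2).
t_hat <- N * y_bar                                      # (13.3).
s_y_bar <- sqrt(sum(N_h^2 * s_h^2 / n_h) / N^2)         # (13.5).
s_t_hat <- N * s_y_bar                                  # (13.6).
t_val <- qt(1 - 0.05 / 2, df = sum(n_h) - length(N_h))  # df = n - H (13.8).
c(y_bar - t_val * s_y_bar, y_bar + t_val * s_y_bar)     # 95% CI (13.7).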
13.3.2.1 Sampling in stratum and independence
In the current setup, we assume \(n_h\) observations within each stratum are drawn independently and at random from the stratum population. In practice, however, SYS is more commonly used. For example, you might place a separate systematic grid within each stratum using a random point start, as described in Section 13.2.2. The grids may differ in spacing or rotation across strata.
The usual concern with SYS remains: with only one random start per stratum, the \(n_h\) observations are not independent. This has implications for inference. For example, the degrees of freedom in (13.8) assume independent observations within each stratum. If a single systematic grid is used to sample the stratum population, that assumption is violated. In such cases, (13.8) may be overstated, and resulting confidence intervals may be misleading.
It’s also common to use a single SYS grid that spans all strata. This is often convenient because the cruiser can follow continuous straight sampling lines that cross stratum boundaries. In this case, only one random point start is used to position the SYS grid over the entire population. Here, we’re even further from the assumption of \(n\) independent units of information.
SYS is widely used in practice and is adopted in our illustration in Section 13.3.6. As discussed in Section 13.2, it’s important to rotate the grid to avoid alignment with periodic features. In most applications, such alignment is rare, and the risk of bias or distortion of standard errors is negligible.
13.3.3 Sample size determination
After identifying strata (in the case of pre-stratification) and selecting a sampling design, the next step is to determine the total sample size \(n\). This sample size can either be fixed in advance based on available resources or determined based on a desired level of precision for estimating a population mean or total.
Stratified sampling supports two common allocation strategies: proportional allocation and optimum allocation. Proportional allocation assigns sample size to each stratum based solely on its relative area. Optimum allocation also considers variability, and if cost is taken into account, balances that against the relative cost of sampling. The goal is to allocate effort where it provides the greatest reduction in the variance of the estimated mean (and hence its standard error) per unit cost.
When the total sample size \(n\) is unknown, it can be estimated using equations similar to those for SRS (see Section 11.3.6). These equations have the same circularity issue as before: \(n\) appears on both sides of the equation. As with SRS, this can be resolved using the iterative algorithm described in Section 11.3.6.
Importantly, these sample size estimates are intended to meet allowable error targets for the overall population, not for individual strata.
13.3.3.1 Total sample size with proportional allocation
As introduced in Section 11.3.6, we can estimate sample size given an allowable error \(E\) that is in the same units as \(\bar{y}\) and an estimate for each stratum’s sample variance \(s^2_h\). Recall, because we’ve not taken the sample, the value for \(s^2_h\) must come from somewhere else, e.g., from prior experience or inventory data from a similar population.
The total sample size required under proportional allocation for a finite population is given by \[\begin{equation} n = \frac{N \sum_{h=1}^H N_h s_h^2}{\frac{N^2 E^2}{t^2} + \sum_{h=1}^H N_h s_h^2}, \tag{13.9} \end{equation}\] where \(t\) is the \(t\)-value corresponding to the desired confidence level (\(1 - \alpha\)).
If the population is assumed infinite (i.e., under an areal sampling frame), the simplified formula is \[\begin{equation} n = \frac{t^2 \sum_{h=1}^H N_h s_h^2}{N E^2}. \tag{13.10} \end{equation}\]
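As a quick sketch, (13.10) can be evaluated directly once stratum areas and anticipated standard deviations are in hand. The values below are hypothetical, and in practice the t-value would be refined iteratively as in Section 11.3.6.

N_h <- c(10, 30, 60)   # Stratum areas (ha).
s_h <- c(80, 100, 120) # Anticipated stratum standard deviations (Mg/ha).
N <- sum(N_h)
E <- 25                # Allowable error (Mg/ha).
t <- 2                 # Rough t-value for ~95% confidence.
ceiling(t^2 * sum(N_h * s_h^2) / (N * E^2))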
13.3.3.2 Total sample size with optimum allocation
Optimum allocation minimizes the variance of the stratified mean by accounting for both stratum size and variability. The total sample size for a finite population is \[\begin{equation} n = \frac{\left( \sum_{h=1}^H N_h s_h \right)^2}{\frac{N^2 E^2}{t^2} + \sum_{h=1}^H N_h s_h^2}. \tag{13.11} \end{equation}\]
For an infinite population, this becomes \[\begin{equation} n = \frac{t^2 \left( \sum_{h=1}^H N_h s_h \right)^2}{N^2 E^2}. \tag{13.12} \end{equation}\]
13.3.4 Allocation of samples to strata
Given the total sample size \(n\), the next step is to allocate observations among the strata using equations specific to proportional or optimum allocation. Whichever approach was used to determine \(n\) should also be used to allocate samples to the strata.
13.3.4.1 Proportional allocation
In proportional allocation, each stratum receives a share of the total sample based solely on its size (e.g., area). The sample size in stratum \(h\) is given by \[\begin{equation} n_h = n \frac{N_h}{N}. \tag{13.13} \end{equation}\] This approach is easy to apply but ignores differences in variability across strata.
13.3.4.2 Optimum allocation
In optimum allocation, sample sizes are chosen to minimize the variance of the stratified mean. The number of samples in stratum \(h\) is computed as \[\begin{equation} n_h = n \frac{N_h s_h}{\sum_{h=1}^H N_h s_h}. \tag{13.14} \end{equation}\] This method improves efficiency by assigning more samples to strata that are larger or more variable.
13.3.4.3 Optimum allocation with cost constraints
If sampling costs differ between strata, we can modify the optimum allocation formula to incorporate cost per sample. Let \(c_h\) denote the cost of sampling a unit in stratum \(h\). The cost-adjusted optimum allocation is \[\begin{equation} n_h = n \frac{N_h s_h / \sqrt{c_h}}{\sum_{h=1}^H N_h s_h / \sqrt{c_h}}. \tag{13.15} \end{equation}\]
This formula reduces sample sizes in strata that are more expensive to survey, while still accounting for size and variability. It’s particularly useful in operational forest inventories where travel cost, terrain, or accessibility varies across the landscape.
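The three allocation rules (13.13) through (13.15) differ only in their weights, as the following sketch shows for a hypothetical design; the areas, standard deviations, and relative costs are made up.

n <- 60                  # Total sample size.
N_h <- c(10, 30, 60)     # Stratum areas (ha).
s_h <- c(80, 100, 120)   # Anticipated stratum standard deviations.
c_h <- c(1, 1.5, 3)      # Relative cost per plot.
n * N_h / sum(N_h)                                       # Proportional (13.13).
n * N_h * s_h / sum(N_h * s_h)                           # Optimum (13.14).
n * (N_h * s_h / sqrt(c_h)) / sum(N_h * s_h / sqrt(c_h)) # Optimum with cost (13.15).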
13.3.5 Creating a stratified systematic sampling grid
Once strata have been defined and sample sizes allocated (either proportionally or optimally), the next step is to lay out a systematic sampling grid within each stratum. This can be done using functions that extend the sys_grid()
approach to stratified designs.
One option is to build stratum-specific grids with the sys_grid()
function described in Section 13.2.3. This gives full flexibility in grid spacing, orientation, and pattern, and can be useful when strata are large or widely separated.
In many applications, however, a separate grid for each stratum complicates field logistics. When strata are small or adjacent, it’s often easier for field crews to work from a single systematic grid that covers the entire population. The two functions below support this approach by generating a unified grid across all strata under either proportional or optimum allocation. As in Section 13.2.3, the documentation is presented in the style of condensed R help pages.
13.3.5.1 Grid generation for proportional allocation strat_prop_sys_grid()
Description
Generates a systematic sampling grid using fixed spacing between lines (D_l
) and points (D_p
) across the entire area. Points are assigned to strata based on spatial intersection with the input boundary. This is designed for proportional allocation, where grid density is uniform across strata. Built on sys_grid()
with added logic to track stratum membership and within-stratum ordering.
Usage
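As before, the usage line is inferred from the arguments that follow; the argument order is an assumption.

strat_prop_sys_grid(boundary, D_l, D_p, angle = 0, origin = "se",
                    strat_id_col)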
Arguments
- boundary: An sf polygon object defining the stratum boundaries. Must be projected with linear units.
- D_l: Distance between sampling lines (x-direction) in the same units as boundary.
- D_p: Distance between points along a line (y-direction) in the same units as boundary.
- angle: Optional rotation angle (in degrees). Default is 0.
- origin: Corner from which to order lines and points. One of "se", "sw", "ne", or "nw". Default is "se".
- strat_id_col: Name of the column in boundary that identifies strata.
Details
This function builds a systematic grid across the union of stratum polygons using sys_grid()
, then assigns each point to a stratum via spatial intersection. Each sampling location is tagged with its unique point ID as well as its position within each line and stratum. This approach simplifies logistics when a single, uniform grid is acceptable for all strata.
Value
Returns a list with the following components.
- grid: An sf object of all points, with columns for unique point_id, stratum_id, and other useful indexing IDs. Columns in boundary are appended.
- path: An sf LINESTRING connecting grid points in order.
- random_point_start: The initial random point start.
- boundary: The input polygon with stratum attributes.
- type: A string used internally for validation when plotting.
Use plot_sys_grid()
to visualize the resulting grid and path.
13.3.5.2 Grid generation for optimum allocation strat_opt_sys_grid()
Description
Creates a stratified systematic sampling grid that supports optimum allocation, assigning different numbers of grid points to each stratum. Builds a dense grid using sys_grid()
, then intersects it with stratum polygons and thins the points in each stratum to approximate the target sample sizes. The final number of points per stratum may not match the targets exactly, but the function tries to get as close as possible. Supports both random and sequential thinning.
Usage
strat_opt_sys_grid(boundary, D_l, angle = 0, origin = "se",
                   n_col, strat_id_col, thinning_method = "random",
                   D_p = NULL)
Arguments
- boundary: An sf polygon object defining the stratum boundaries. Must be projected with linear units.
- D_l: Distance between sampling lines (x-direction) in the same units as boundary.
- angle: Optional rotation angle (in degrees). Default is 0.
- origin: Corner from which to order lines and points. One of "se", "sw", "ne", or "nw". Default is "se".
- n_col: Name of the column in boundary containing target sample sizes per stratum.
- strat_id_col: Name of the column in boundary that identifies strata.
- thinning_method: Method used to thin grid points within each stratum. Either "random" or "sequential". Default is "random".
- D_p: Optional distance between points along a line. If not supplied, it's computed automatically from the stratum with the densest sampling, using the same units as boundary.
Details
This function is designed for optimum allocation, where sample size varies across strata. It uses sys_grid()
to build an initial grid, assigns each point to a stratum, then selects a subset of points from each stratum using the specified thinning method. Additional columns in the output identify both the full grid and the subset of points selected for measurement.
Value
Returns a list with the following components.
- grid: An sf object of all points, with a logical measure column and columns for unique point_id, stratum_id, and other useful indexing IDs. Columns in boundary are appended.
- measure_path: A LINESTRING showing the path through the selected (measure = TRUE) points.
- path: The full path through all grid points.
- random_point_start: The initial random point start.
- boundary: The original input polygon.
- type: A string used internally for validation when plotting.
Use plot_strat_opt_sys_grid()
to visualize the sampled grid and path.
13.3.6 Illustrative analysis
We use the Harvard Forest dataset introduced in Section 1.2.5 to illustrate the application of stratified estimators. While this is a small and somewhat toy-like example, it’s useful because a complete census of trees is available—allowing us to compare our estimates directly to the true population values.
Figure 13.3 shows the extent of the 35 ha forest, partitioned into four strata. Strata 1 and 4 are mixed conifer-hardwood forest, Stratum 2 is conifer-dominated, and Stratum 3 is a sparsely forested wetland.
FIGURE 13.3: Harvard Forest population extent partitioned into four strata.
Our goal is to estimate average AGB per hectare and total AGB for each stratum and for the entire forest. The population of interest includes all stems 12.7 cm DBH and larger. We’ll also break out estimates by species groups of interest.
The cruise will use fixed-area circular plots with a radius of 9 meters. At each selected sampling location, all trees meeting the DBH threshold will be measured. See Section 12.3 for additional details on plot-based sampling.
We’d like our AGB estimates to be within \(\pm\) 35 Mg/ha of the population value with 95% confidence. This allowable error corresponds to roughly 10% of the expected AGB.
The cruise will use a stratified systematic sampling design, with sample sizes determined by optimum allocation as described in the preceding section.
TABLE 13.1: Stratum areas (\(N_h\)) and anticipated AGB standard deviations (\(s_h\)) used in the sample size calculations.

| Stratum ID | \(N_h\) (ha) | \(s_h\) (Mg/ha) |
|---|---|---|
| 1 | 1.24 | 100 |
| 2 | 8.93 | 100 |
| 3 | 2.25 | 75 |
| 4 | 22.58 | 125 |
We begin by computing the total sample size using the stratum-specific information in Table 13.1 and the optimum allocation formula (13.12). As discussed in Section 11.3.6, this formula involves a circular dependency: we’re solving for \(n\) on the left, but also need it to determine the appropriate \(t\)-value on the right, which depends on the degrees of freedom.
To resolve this circular dependency, we use the iterative algorithm described in Section 11.3.6. The code below applies that method using a for
loop, as introduced in Section 5.2.3, to iteratively solve for \(n\) until it converges.152
H <- 4 # Number of strata.
N_h <- c(1.24, 8.93, 2.25, 22.58) # Stratum area (ha).
N <- sum(N_h) # Total area (ha).
E <- 35 # Allowable error for AGB (Mg/ha).
alpha <- 0.05 # For 95% confidence level.
s_h <- c(100, 100, 75, 125) # Stratum standard deviation (Mg/ha).
n_start <- 25 # Initial guess for total sample size.
n_prev <- n_start
for (i in 1:3) {
  t <- qt(p = 1 - alpha / 2, df = n_prev - H) # df = n - H
  n_next <- ceiling((t / (N * E))^2 * sum(N_h * s_h)^2)
  cat("Iteration", i, ": n =", n_next, "\n")
  n_prev <- n_next
}
#> Iteration 1 : n = 47
#> Iteration 2 : n = 44
#> Iteration 3 : n = 44
The algorithm converged quickly to \(n\) = 44. In the code below, we allocate the total sample size among strata using the optimum allocation formula (13.14).
n <- n_prev # Total sample size from the iterative calculation above.
n_h <- n * (N_h * s_h) / sum(N_h * s_h)
n_h <- ceiling(n_h) # Round up to the nearest whole number.
n_h
#> [1] 2 10 2 31
Note that due to rounding, we end up with \(n\) = 45 plots in the final design.
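A quick check confirms the realized total.

sum(n_h)
#> [1] 45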
Now, let’s build our sampling grid using strat_opt_sys_grid()
. As described in Section 13.3.5, this function requires an sf
polygon object that delineates the stratum boundaries. As briefly noted in Section 13.1, sf
objects are data frames with a spatial geometry column and can be manipulated using tidyverse
functions.
The code below reads in the Harvard Forest stratum boundaries as an sf
polygon object, adds a new column for the target stratum-specific sample sizes, and prints the data frame portion of the object so we can inspect its columns. The st_drop_geometry()
function removes the spatial geometry column to simplify printed output.
bounds <- st_read("datasets/HF", "HF_stratified", quiet = TRUE)
bounds <- bounds %>% mutate(n_h = n_h, .after = stratum_id)
bounds %>% st_drop_geometry()
#> stratum_id n_h ha
#> 1 1 2 1.2361
#> 2 2 10 8.9315
#> 3 3 2 2.2494
#> 4 4 31 22.5830
With a unique identifier column for each stratum (stratum_id
) and the target sample sizes (n_h
), we’re ready to pass bounds
to strat_opt_sys_grid()
. This and other sampling functions are defined in "scripts/utils.R"
.
Before calling strat_opt_sys_grid()
in the code below, we set a random seed to ensure reproducible results.
source("scripts/utils.R")
set.seed(1)
strat_sys <- strat_opt_sys_grid(bounds, D_l = 50, n_col = "n_h",
strat_id_col = "stratum_id",
thinning_method = "random")
The call to strat_opt_sys_grid()
above sets the distance between lines to 50 meters and allows the function to compute the sampling interval along lines (D_p
) automatically.153
FIGURE 13.4: Systematic sampling with stratification for the Harvard Forest example. Illustrates random removal of sampling locations to approximate the target optimum sample size.
Figure 13.4, created using plot_strat_opt_sys_grid(strat_sys, path = "lines")
, shows the full systematic grid, the selected sampling locations, and those removed from the grid to meet the target sample sizes (labeled as “Removed sampling location” in the figure legend).
If you only want to visualize the final sampling locations, set the path
argument in plot_strat_opt_sys_grid()
to "points"
as shown below in Figure 13.5.
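Using the strat_sys object created above, that call looks like this.

plot_strat_opt_sys_grid(strat_sys, path = "points")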
FIGURE 13.5: Systematic sampling with stratification for the Harvard Forest example. Illustrates the path between sampling locations.
We can confirm the final sample size by stratum using the code below, which shows the target sample sizes were achieved. This won’t always be the case, depending on how the sampling grid intersects with each stratum.
strat_sys$grid %>%
filter(measure) %>%
group_by(stratum_id) %>%
summarize(realized_n_h = n()) %>%
st_drop_geometry()
#> # A tibble: 4 × 2
#> stratum_id realized_n_h
#> * <int> <int>
#> 1 1 2
#> 2 2 10
#> 3 3 2
#> 4 4 31
Given the sampling locations, we’re ready to head to the field (virtually, in this case). For our virtual cruise, we’ll use a 9 meter radius plot centered on the sampling locations. These plot areas are shown as gray circles in Figure 13.6.
As always, if any portion of a plot extends beyond a boundary, then a boundary correction method must be applied—this includes stratum boundaries. Figure 13.6 highlights the three plots with boundary slopover and applies the mirage method described in Section 12.3.3.
FIGURE 13.6: Systematic sampling with stratification for the Harvard Forest example. Fixed-area plots are shown at each sampling location, including those that extend beyond stratum boundaries and require boundary slopover correction. The mirage method is used for these plots; see Section 12.3.3 for details.
The cruise tree data are stored in "datasets/HF/HF_strat_sys_n45.csv"
and are read into plot_trees
below. Each row represents one or more trees. The dataset includes a plot_id
column that uniquely identifies plots across all strata, and a tree_id
column that identifies individual trees within each plot. The stratum_id
column indicates the stratum in which the plot falls. Tree counts are recorded in the tree_count
column, which equals 2 when a tree is double-counted using mirage correction. Additional columns include agb_Mg
(aboveground biomass in metric tons), dbh_cm
(diameter at breast height in centimeters), and species
(tree species).
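The read itself isn't shown in the rendered text; a sketch of the likely calls that produced the output below (assuming glimpse() was used to print the columns) is:

plot_trees <- read_csv("datasets/HF/HF_strat_sys_n45.csv")
glimpse(plot_trees)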
#> Rows: 797
#> Columns: 7
#> $ plot_id <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2…
#> $ tree_id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, …
#> $ stratum_id <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4…
#> $ tree_count <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ agb_Mg <dbl> 0.086166, 1.853968, 2.227097, 1.27…
#> $ dbh_cm <dbl> 14.4, 44.6, 48.0, 38.4, 27.0, 41.8…
#> $ species <chr> "Acer rubrum", "Quercus rubra", "Q…
Given these tree data, and following the steps laid out in Section 12.6.1, our first task is to expand tree-level agb_Mg
measurements to a per-hectare basis using the appropriate tree factor, then summarize these values at the plot level. These plot-level values correspond to the \(y_{hi}\) terms in (13.1), i.e., the \(i\)-th plot measurement in the \(h\)-th stratum.
To compute these plot-level AGB values, we group by plot_id
and use summarize()
to calculate y_hi
, the plot-level AGB in Mg/ha. We also retain each plot’s stratum_id
using first(stratum_id)
.154
# Add tree factor for the given plot area.
r <- 9 # Plot radius in meters.
plot_trees <- plot_trees %>%
mutate(TF = 10000 / (pi * r^2))
# Compute plot-level AGB estimates.
plot_summary <- plot_trees %>%
group_by(plot_id) %>%
summarize(y_hi = sum(TF * tree_count * agb_Mg),
stratum_id = first(stratum_id))
plot_summary
#> # A tibble: 45 × 3
#> plot_id y_hi stratum_id
#> <dbl> <dbl> <dbl>
#> 1 1 468. 4
#> 2 2 197. 4
#> 3 3 253. 4
#> 4 4 419. 4
#> 5 5 174. 4
#> 6 6 388. 4
#> 7 7 579. 4
#> 8 8 346. 4
#> 9 9 364. 4
#> 10 10 158. 4
#> # ℹ 35 more rows
Next, we compute the stratum-level mean \(\bar{y}_h\) and standard deviation \(s_h\) by grouping plot_summary
by stratum_id
. We also compute the number of plots within each stratum for convenient reference in the next step.
stratum_summary <- plot_summary %>%
group_by(stratum_id) %>%
summarize(
y_bar_h = mean(y_hi),
s_h = sd(y_hi),
n_h = n_distinct(plot_id)
)
stratum_summary
#> # A tibble: 4 × 4
#> stratum_id y_bar_h s_h n_h
#> <dbl> <dbl> <dbl> <int>
#> 1 1 311. 38.3 2
#> 2 2 474. 140. 10
#> 3 3 63.0 51.0 2
#> 4 4 348. 108. 31
These stratum-level estimates are useful when reporting means, totals, and associated uncertainty for individual strata. For now, though, we’re focused on estimating the overall population across strata.
The estimators for the stratified population mean and its standard error are given in (13.2) and (13.5), respectively. Both require \(n_h\), \(N_h\), and the stratum-level summaries. To streamline the workflow, we join the stratum areas to stratum_summary
before applying the estimators and computing 95% confidence intervals.
# Join stratum areas.
stratum_area <- tibble(
stratum_id = 1:4,
N_h = c(1.24, 8.93, 2.25, 22.58)
)
stratum_summary <- left_join(stratum_summary, stratum_area,
by = "stratum_id")
# Estimate population mean and 95% CI.
stratum_summary %>%
summarize(
N = sum(N_h),
y_bar = sum(N_h * y_bar_h) / N,
s_y_bar = sqrt(1 / N^2 * sum((N_h^2 * s_h^2) / n_h)),
t = qt(df = sum(n_h - 1), p = 1 - 0.05 / 2),
ci_lower = y_bar - t * s_y_bar,
ci_upper = y_bar + t * s_y_bar
) %>%
select(y_bar, ci_lower, ci_upper)
#> # A tibble: 1 × 3
#> y_bar ci_lower ci_upper
#> <dbl> <dbl> <dbl>
#> 1 360. 326. 395.
To estimate the population total and its confidence interval in the code below, we adapt the code above using the total estimator (13.3) and its standard error (13.6).
stratum_summary %>%
summarize(
N = sum(N_h),
t_hat = N * sum(N_h * y_bar_h) / N, # Or just sum(N_h * y_bar_h).
s_t_hat = N * sqrt(1 / N^2 * sum((N_h^2 * s_h^2) / n_h)),
t = qt(df = sum(n_h - 1), p = 1 - 0.05 / 2),
ci_lower = t_hat - t * s_t_hat,
ci_upper = t_hat + t * s_t_hat
) %>%
select(t_hat, ci_lower, ci_upper)
#> # A tibble: 1 × 3
#> t_hat ci_lower ci_upper
#> <dbl> <dbl> <dbl>
#> 1 12615. 11411. 13819.
13.3.6.1 Checking coverage and allowable error
How well did our estimates perform in terms of uncertainty quantification? The census-based AGB for the Harvard Forest is 341.17 Mg/ha, with a forest-wide total of 11,941 Mg. Our estimated mean and total, with 95% confidence intervals, were 360.44 (326.03, 394.84) Mg/ha and 12,615 (11,411, 13,819) Mg, respectively. Based on this sample, both confidence intervals successfully captured the true values.
If the theory holds, we’d expect about 95% of confidence intervals from repeated samples to cover the true mean. Because we have a census, we can perform repeated sampling to empirically check that the coverage rate matches this expectation. We can also evaluate whether our optimum allocation strategy meets the target allowable error of \(E = 35\) Mg/ha with 95% confidence.
Using the estimated \(n\) and \(n_h\), we drew 1,000 samples from the population using stratified SYS. For each, we computed the mean AGB and its 95% confidence interval. These 1,000 replicate estimates are stored in "datasets/HF/strat_sys_agb_ests_1000reps.csv"
. The code below reads in these estimates and computes two key empirical rates: the proportion of intervals that covered the true mean, and the proportion with absolute error less than the allowable error \(E = 35\) Mg/ha. Both rates meet or slightly exceed 95%, which is good evidence that the design-based theory, paired here with the SRS variance approximation, is holding up in this example!
agb_ests_reps <- read_csv("datasets/HF/strat_sys_agb_ests_1000reps.csv")
# Compute the true mean AGB (Mg/ha) from the census.
y_true <- read_csv("datasets/HF/HF_trees.csv") %>%
filter(dbh_cm > 12.7) %>%
summarize(y_true = sum(agb_Mg) / N) %>%
pull()
# Percent of confidence intervals that cover the true mean.
mean(y_true > agb_ests_reps$ci_lower_95 &
y_true < agb_ests_reps$ci_upper_95) * 100
#> [1] 98.1
# Percent of estimates within the allowable error E
mean(abs(agb_ests_reps$y_bar - y_true) < E) * 100
#> [1] 97.3
13.3.6.2 Efficiency gains from stratification
Stratified sampling is especially effective when variation within strata is smaller than variation between them, allowing for more precise estimates than an unstratified sample of the same size. In our Harvard Forest example, we defined four strata. With the exception of the small Stratum 3, the others were fairly similar, so we might not expect dramatic efficiency gains. As a reminder, efficiency here means achieving greater precision—that is, a smaller standard error—for the same number of plots compared to an unstratified design.
To compare estimator performance (i.e., unstratified vs. stratified), we repeated the procedure from Section 13.3.6.1, drawing 1,000 independent samples from the same population. For each sample, we computed the mean AGB and its 95% confidence interval using both unstratified and stratified estimators.
The unstratified replicate estimates are stored in "datasets/HF/sys_agb_ests_1000reps.csv"
. The stratified replicates are the same as those used in Section 13.3.6.1. For each estimator, we computed the average 95% confidence interval width across replicates. These were 84 and 73 (Mg/ha) for the unstratified and stratified cases, respectively. The percent gain in precision from stratification is 13%, reflecting the reduction in average interval width achieved by incorporating strata. This confirms that even modest stratification can yield measurable gains in precision, making it a worthwhile method when stratifying variables are available.
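A sketch of the width comparison described above is shown below; we assume the unstratified replicate file uses the same ci_lower_95 and ci_upper_95 column names as the stratified file.

sys_reps <- read_csv("datasets/HF/sys_agb_ests_1000reps.csv")
strat_reps <- read_csv("datasets/HF/strat_sys_agb_ests_1000reps.csv")
# Average 95% confidence interval width for each estimator.
mean(sys_reps$ci_upper_95 - sys_reps$ci_lower_95)     # Unstratified, ~84 Mg/ha.
mean(strat_reps$ci_upper_95 - strat_reps$ci_lower_95) # Stratified, ~73 Mg/ha.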
13.3.6.3 Group-level estimates from stratified data
So far, our focus has been on estimating AGB within strata and for the overall population. But we’re often also interested in estimates for subgroups of trees within each stratum and across the entire forest. For example, we might want to know how much AGB is tied up in specific species, age classes, forest types, or functional groups.
Let’s walk through one such example.
Eastern hemlock (Tsuga canadensis) is under threat throughout the northeastern U.S. due to the spread of hemlock woolly adelgid (HWA), an invasive insect causing widespread mortality. Since HWA is present in the region of Massachusetts where our data were collected, it’s reasonable to ask how much AGB would be lost if all Eastern hemlock trees in the forest were killed.155
We start by creating a new column in plot_trees
called species_groups
, which assigns each tree to one of two groups of interest:
- "Eastern hemlock" (species == "Tsuga canadensis")
- "Other" (species != "Tsuga canadensis")
Both species_groups
and plot_id
are converted to factors. This prepares the data for use with complete()
, which relies on factor levels to insert rows for combinations of plot and group that may not appear in the data.
plot_trees <- plot_trees %>%
mutate(
species_groups = factor(if_else(species == "Tsuga canadensis",
"Eastern hemlock",
"Other")),
plot_id = factor(plot_id)
)
As noted in Chapter 12, our convention is to use a tree_count
of zero to retain plots with no trees. This keeps their plot indexes in the dataset, ensuring they’re included in summaries and downstream calculations. What’s important is that all plot indexes—even those with no trees—are retained when converting plot_id
to a factor. We used a similar approach when building stand and stock tables in Section 12.6.3.
Next, we compute AGB at the plot level for each species group.
plot_summary <- plot_trees %>%
group_by(plot_id, species_groups) %>%
summarize(
y_hi = sum(TF * tree_count * agb_Mg),
stratum_id = first(stratum_id),
.groups = "drop"
)
This yields one row per plot per group. If no trees from a group were observed in a plot, it won’t appear yet. We’ll fix that next.
The code below uses complete()
to add any missing combinations of plot and species group, inserting explicit zeros. We then use fill()
to recover stratum_id
for the new rows. To avoid crossing strata, we group by plot_id
before filling stratum_id
, ensuring correct propagation.
plot_summary_filled <- plot_summary %>%
complete(plot_id, species_groups, fill = list(y_hi = 0)) %>%
group_by(plot_id) %>%
fill(stratum_id, .direction = "downup") %>%
ungroup()
This is conceptually the same as our earlier use of complete()
in Section 12.6.3 to fill in missing combinations of species and DBH class. Here, we’re doing the same with species groups. After inserting rows, fill()
restores plot-level variables. Grouping by plot_id
ensures this is done within the correct plot.
Take a look at plot_summary_filled
to verify that it looks right—the number of rows should equal the number of plots times the number of species groups.
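A quick check along those lines (which should return TRUE) is:

nrow(plot_summary_filled) ==
  nlevels(plot_trees$plot_id) * nlevels(plot_trees$species_groups)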
We now summarize to the stratum level, computing the mean AGB per hectare (\(\bar{y}_h\)) with associated 95% confidence intervals.
stratum_summary <- plot_summary_filled %>%
group_by(stratum_id, species_groups) %>%
summarize(
y_bar_h = mean(y_hi),
n_h = n(),
s_h = sd(y_hi),
s_y_bar_h = s_h/sqrt(n_h),
t = qt(df = n_h - 1, p = 1 - 0.05 / 2),
ci_lower_h = y_bar_h - t * s_y_bar_h,
ci_upper_h = y_bar_h + t * s_y_bar_h,
.groups = "drop"
)
stratum_summary %>%
select(stratum_id, species_groups, y_bar_h, ci_lower_h, ci_upper_h)
#> # A tibble: 8 × 5
#> stratum_id species_groups y_bar_h ci_lower_h ci_upper_h
#> <dbl> <fct> <dbl> <dbl> <dbl>
#> 1 1 Eastern hemlock 89.8 -678. 857.
#> 2 1 Other 221. -891. 1333.
#> 3 2 Eastern hemlock 255. 101. 410.
#> 4 2 Other 219. 110. 327.
#> 5 3 Eastern hemlock 0 0 0
#> 6 3 Other 63.0 -396. 522.
#> 7 4 Eastern hemlock 48.3 24.3 72.2
#> 8 4 Other 300. 258. 341.
These estimates show that Stratum 2 carries about 250 Mg/ha of Eastern hemlock, which is substantially more than any of the other strata. Not surprisingly, with a sample size of \(n_h = 2\), Strata 1 and 3 have very low precision (i.e., wide confidence intervals). No Eastern hemlock was observed on the plots in Stratum 3, so its estimates are zero. To better understand these differences across the forest, Figure 13.7 shows the estimated mean AGB for Eastern hemlock in each stratum (i.e., y_bar_h
computed above).
FIGURE 13.7: Estimated mean aboveground biomass (AGB) of Eastern hemlock (Tsuga canadensis) per hectare in each stratum. Estimates are based on plot-level measurements and stratified sampling.
To estimate forest-wide AGB by species group, along with standard errors and 95% confidence intervals, we apply the total estimator (13.3) and its associated standard error (13.6). The beauty of this approach is that it’s the same code used to estimate total AGB for all species in the previous section—the only difference is the added group_by(species_groups)
. This again illustrates a key strength of the tidyverse: once the logic for a summary is in place, it’s easy to extend it to subgroups with minimal changes.
# Join the stratum areas.
stratum_summary <- stratum_summary %>%
left_join(stratum_area, by = "stratum_id")
stratum_summary %>%
group_by(species_groups) %>%
summarize(
N = sum(N_h),
t_hat = N * sum(N_h * y_bar_h) / N,
s_t_hat = N * sqrt(1 / N^2 * sum((N_h^2 * s_h^2) / n_h)),
t = qt(df = sum(n_h - 1), p = 1 - 0.05 / 2),
ci_lower = t_hat - t * s_t_hat,
ci_upper = t_hat + t * s_t_hat,
.groups = "drop"
) %>%
select(species_groups, t_hat, ci_lower, ci_upper)
#> # A tibble: 2 × 4
#> species_groups t_hat ci_lower ci_upper
#> <fct> <dbl> <dbl> <dbl>
#> 1 Eastern hemlock 3482. 2134. 4831.
#> 2 Other 9133. 7838. 10428.
So, to answer our initial question: how much AGB would be lost if all Eastern hemlock trees in the forest were killed by HWA? We estimate a total loss of 3,482 Mg, with a 95% confidence interval of (2,134, 4,831) Mg; most of this loss comes from Stratum 2.
13.3.7 Advantages and disadvantages of stratified sampling
Stratified sampling is widely used in forest inventory, especially when auxiliary information is available to define meaningful strata. Below, we summarize some key advantages and disadvantages of this design.
13.3.7.1 Advantages
Improved precision: When strata are well defined, stratified sampling increases efficiency by grouping similar areas together. This structure reduces variability within each group and leads to more precise overall estimates.
Supports subregion estimation: Stratification enables valid estimation not just for the entire population, but also for each individual stratum. This is useful when separate summaries are needed for forest types, disturbance regimes, management units, ownership classes, or other subdivisions.
Enables efficient allocation: With optimum allocation, more samples can be directed to strata that are larger or more variable, improving efficiency without increasing cost.
Compatible with other designs: Different sampling strategies—such as SRS, SYS, or others—can be used within strata, depending on operational needs or design goals. See, for example, Section 13.5.5.
Robust to poor strata choices: Even if strata aren’t optimal, stratified sampling rarely performs worse than unstratified sampling.
Supports post-stratification: If stratum membership isn’t known at the time of sampling, post-stratification can still be used. This allows for flexibility when auxiliary data become available after the field campaign.
13.3.7.2 Disadvantages
Area requirements for pre-stratification: A key requirement of pre-stratification is that the area of each stratum must be known prior to sampling in order to support design-based estimation. This can be a limitation if strata are based on incomplete or evolving information.
Limited allocation flexibility in post-stratification: Post-stratified designs don’t allow proportional or optimum allocation, since sample sizes per stratum aren’t controlled in advance. They may also result in too few samples in some strata to estimate variance reliably.
Added bookkeeping: Stratified sampling requires tracking stratum membership, managing stratum-specific calculations, and combining results with appropriate weights.
13.4 Estimation using covariates
Up to this point, our estimators have relied primarily on direct measurements of the variable of interest, \(y\), collected from sampled units. In many settings, however, we also have access to auxiliary information—additional variables that are already available or inexpensive to obtain. When such auxiliary data are correlated with \(y\), they can be used to improve estimation and increase efficiency. A covariate \(x\) is one form of auxiliary information: it’s a variable that is related to, but distinct from, \(y\). When \(x\) is predictive of \(y\), incorporating it into estimation can reduce standard errors or lower field measurement costs.
We have already seen covariates at work indirectly. For example, in stratified sampling (Section 13.3), a stratifying variable improved estimation by grouping similar units together, which reduced within-stratum variability and led to more precise estimates. Here, we treat \(x\) as an explicit part of estimation.
We consider two common settings.
Regression sampling: The population mean of \(x\) (i.e., \(\mu_x\)) is known and used to improve the estimate of the population mean of \(y\) (i.e., \(\mu_y\)).
Double sampling (two-phase): The population mean of \(x\) is unknown, but \(x\) is inexpensive to measure. A large phase-1 sample is used to measure \(x\) only, followed by a smaller nested phase-2 sample where \(y\) is measured. This yields a large set of \(x\) values (\(x_1\)) and a smaller set of paired observations (\(x_2\), \(y_2\)). The phase-1 information is then used to improve the estimate of \(\mu_y\).
Within each setting, the choice of estimator depends on the nature of the relationship between \(y\) and \(x\).
Regression estimation is preferred when the relationship is approximately linear with a nonzero intercept, and the variability of \(y\) remains roughly constant across values of \(x\) (Figure 13.8(a)).
Ratio estimation is also based on a linear relationship, but one that passes through (or near) the origin. It’s most appropriate when the variability of \(y\) tends to increase proportionally with \(x\) (Figure 13.8(b,c)).
We’ll work with two common ratio estimators.
Ratio-of-means: Estimate a single multiplier \(\hat{R} = \bar{y} / \bar{x}\) and apply it to a known (or well-estimated) population mean of \(x\).
Mean-of-ratios: Estimate the average of unit ratios as \(\hat{R} = \sum_{i=1}^n r_i / n\), where \(r_i = y_i / x_i\), then apply \(\hat{R}\) to a known (or well-estimated) population mean of \(x\).
The ratio-of-means works well when variability in \(y\) increases roughly in proportion to \(x\) (Figure 13.8(b)), while the mean-of-ratios is useful when variability in \(y\) increases with \(x^2\) (Figure 13.8(c)) or when unit ratios are of direct interest.
FIGURE 13.8: Three linear relationship patterns. (a) Nonzero intercept with roughly constant variance. (b) Passes through the origin with variance increasing in proportion to \(x\). (c) Passes through the origin with variance increasing in proportion to \(x^2\).
13.4.1 Regression estimation
A common way to describe the relationship between two variables is with a straight line. Suppose we have a variable of interest \(y\) and a covariate \(x\). The general equation for a line is \[\begin{equation} y = b_0 + b_1x, \tag{13.16} \end{equation}\] where \(b_0\) is the point where the line crosses the \(y\)-axis (the intercept) and \(b_1\) is the slope, which tells us how much \(y\) changes for a one-unit increase in \(x\).
Suppose we have a population of paired values \((x_i, y_i)\) for \(i = 1, \ldots, N\). From this population we select a sample of \(n\) pairs, which could produce a scatter plot like Figure 13.8(a). With these data we can fit a straight line using ordinary least squares regression. This method finds the values of \(b_0\) and \(b_1\) that minimize the squared vertical distances between the observed points and the line. In R, this is done with the lm()
function, which returns estimates for \(b_0\) and \(b_1\). This is simple linear regression in a nutshell.
Our goal here isn’t to build a predictive regression model or test hypotheses about regression coefficients. Instead, we’re after a practical estimator for \(\mu_y\) that borrows strength from a linear relationship with \(x\). We treat the regression equation as a stepping stone: by fitting a line through the sample data, we derive an estimator that incorporates known information about the population mean of \(x\). Let’s walk through how this works.
To see how regression leads to an estimator, note that in ordinary least squares, the fitted regression line always passes through the point formed by the sample means, \((\bar{x}, \bar{y})\). Substituting these values into (13.16) gives \[\begin{equation} \bar{y} = b_0 + b_1 \bar{x}. \tag{13.17} \end{equation}\]
This can be rearranged to express the intercept in terms of sample quantities \[\begin{equation} b_0 = \bar{y} - b_1 \bar{x}. \tag{13.18} \end{equation}\]
Now replace the sample mean \(\bar{x}\) by the known population mean \(\mu_x\) and step through the substitution explicitly. Start with the fitted line at \(x=\mu_x\) \[\begin{equation} \bar{y}_{\text{R}} = b_0 + b_1 \mu_x. \end{equation}\] Substitute (13.18) \[\begin{equation} \bar{y}_{\text{R}} = (\bar{y} - b_1 \bar{x}) + b_1 \mu_x. \end{equation}\] Collect terms to obtain the regression estimator \[\begin{equation} \bar{y}_{\text{R}} = \bar{y} + b_1(\mu_x - \bar{x}). \tag{13.19} \end{equation}\]
The subscript on the regression estimator \(\bar{y}_{\text{R}}\) distinguishes it from the usual sample mean \(\bar{y}\). We’ll carry this subscript forward when referring to associated totals and standard errors.
From (13.19) we see the logic of regression estimation. The usual sample mean \(\bar{y}\) serves as the baseline. The adjustment term \(b_1(\mu_x - \bar{x})\) shifts this baseline according to how far the sample mean of \(x\) is from the known population mean, scaled by the slope. If \(\bar{x}\) is less than \(\mu_x\), the estimate is adjusted upward; if greater, it’s adjusted downward. In this way, the regression estimator incorporates auxiliary information from \(x\) to improve our estimate of \(\mu_y\). The extent of improvement depends on how strong and linear the relationship is between \(x\) and \(y\).
The slope in (13.19) is computed from the sample as \[\begin{equation} b_1 = \frac{S_{xy}}{S_{xx}}, \tag{13.20} \end{equation}\] where \(S\) denotes a sum of squares (or cross-products). In particular, \[\begin{equation} S_{xy} = \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) \end{equation}\] and \[\begin{equation} S_{xx} = \sum_{i=1}^n (x_i - \bar{x})^2. \end{equation}\]
The standard error of \(\bar{y}_{\text{R}}\) can be estimated as \[\begin{equation} s_{\bar{y}_{\text{R}}} = \sqrt{s^2_{y|x} \left(\frac{1}{n} + \frac{(\mu_x - \bar{x})^2}{S_{xx}}\right)\left(1-\frac{n}{N}\right)}, \tag{13.21} \end{equation}\] where the residual variance of \(y\) about the fitted regression line is \[\begin{equation} s^2_{y|x} = \frac{S_{yy} - \frac{S_{xy}^2}{S_{xx}}}{n - 2} \tag{13.22} \end{equation}\] and \[\begin{equation} S_{yy} = \sum_{i=1}^n (y_i - \bar{y})^2. \end{equation}\]
The FPC term should be dropped from (13.21) when appropriate, see Section 12.2.1.
To estimate the population total we scale by \(N\) \[\begin{equation} \hat{t}_{\text{R}} = N \bar{y}_{\text{R}} \tag{13.23} \end{equation}\] and its standard error is \[\begin{equation} s_{\hat{t}_{\text{R}}} = N s_{\bar{y}_{\text{R}}}. \tag{13.24} \end{equation}\]
Confidence intervals for the mean and total follow the usual form \[\begin{align*} \bar{y}_{\text{R}} \pm t_{\text{df}, 1 - \frac{\alpha}{2}} s_{\bar{y}_{\text{R}}},\\ \hat{t}_{\text{R}} \pm t_{\text{df}, 1 - \frac{\alpha}{2}} s_{\hat{t}_{\text{R}}}, \end{align*}\] with degrees of freedom \(df = n - 2\).
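As a minimal sketch, the calculations in (13.19) through (13.22) can be carried out directly in R. Here x
and y
are assumed to hold the sampled pairs, with mu_x
and N
known; these are placeholder names, not objects defined elsewhere in the chapter.
# Regression estimator of the population mean of y.
n <- length(y)
x_bar <- mean(x)
y_bar <- mean(y)
S_xx <- sum((x - x_bar)^2)
S_yy <- sum((y - y_bar)^2)
S_xy <- sum((x - x_bar) * (y - y_bar))
b_1 <- S_xy / S_xx                                  # Slope (13.20).
y_bar_R <- y_bar + b_1 * (mu_x - x_bar)             # Mean (13.19).
s_sq_y_given_x <- (S_yy - S_xy^2 / S_xx) / (n - 2)  # Residual variance (13.22).
# Standard error (13.21); drop the (1 - n / N) FPC under an areal frame.
s_y_bar_R <- sqrt(s_sq_y_given_x *
                  (1 / n + (mu_x - x_bar)^2 / S_xx) *
                  (1 - n / N))
t <- qt(p = 1 - 0.05 / 2, df = n - 2)
c(y_bar_R - t * s_y_bar_R, y_bar_R + t * s_y_bar_R) # 95% CI for the mean.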
Like regression itself, regression estimation works best when two assumptions hold: (1) a roughly linear relationship exists between \(y\) and \(x\), and (2) the variability of \(y\) around the regression line is approximately constant across the range of \(x\). If the first assumption fails, a transformation of \(x\) may help. If the second fails and variance increases with \(x\), a variance-stabilizing transformation may be considered, or ratio estimation (Section 13.4.2) might be more appropriate.
13.4.2 Ratio estimation
Ratio estimation assumes a proportional relationship between \(y\) and \(x\), so that \(y \approx R x\) for some constant multiplier \(R\). This implies the line relating \(y\) and \(x\) passes through (or near) the origin. We estimate \(R\) from the sample as \(\hat{R}\) and then use it to estimate the population mean as \(\bar{y}_{\text{ratio}} = \hat{R} \mu_x\), and the population total as \(\hat{t}_{\text{ratio}} = \hat{R} \tau_x\).
There are two common ways to define \(\hat{R}\), leading to two distinct ratio estimators. To keep things clear, we use the subscript RM for the ratio-of-means estimator and MR for the mean-of-ratios estimator in the expressions that follow.
13.4.2.1 Ratio-of-means
The estimator of the population mean is \[\begin{equation} \bar{y}_{\text{RM}} = \hat{R}\,\mu_x, \tag{13.25} \end{equation}\] where \[\begin{equation} \hat{R} = \frac{\bar{y}}{\bar{x}}. \end{equation}\] Equivalently, this ratio can be written \[\begin{equation} \hat{R} = \frac{\sum_{i=1}^n y_i}{\sum_{i=1}^n x_i}, \end{equation}\] since the factors of \(n\) in the sample means cancel out.
The standard error of \(\bar{y}_{\text{RM}}\) can be estimated as \[\begin{equation} s_{\bar{y}_{\text{RM}}} = \sqrt{\left(\frac{s^2_y + \hat{R}^2 s_x^2 - 2\hat{R}s_{xy}}{n}\right)\left(1-\frac{n}{N}\right)}, \tag{13.26} \end{equation}\] where \(s^2_y\) and \(s^2_x\) are the sample variances of \(y\) and \(x\), and \[\begin{equation} s_{xy} = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) \end{equation}\] is the sample covariance between \(x\) and \(y\). The FPC term should be dropped from (13.26) when appropriate, see Section 12.2.1.
The population total is estimated using \[\begin{equation} \hat{t}_{\text{RM}} = N \bar{y}_{\text{RM}} \tag{13.27} \end{equation}\] and its standard error is \[\begin{equation} s_{\hat{t}_{\text{RM}}} = N s_{\bar{y}_{\text{RM}}}. \tag{13.28} \end{equation}\]
Confidence intervals for the mean and total follow the usual form \[\begin{align*} \bar{y}_{\text{RM}} \pm t_{\text{df}, 1 - \frac{\alpha}{2}} s_{\bar{y}_{\text{RM}}},\\ \hat{t}_{\text{RM}} \pm t_{\text{df}, 1 - \frac{\alpha}{2}} s_{\hat{t}_{\text{RM}}}, \end{align*}\] with degrees of freedom \(df = n - 1\).
13.4.2.2 Mean-of-ratios
The estimator of the population mean is \[\begin{equation} \bar{y}_{\text{MR}} = \hat{R}\mu_x, \tag{13.29} \end{equation}\] where \[\begin{equation} \hat{R} = \frac{1}{n}\sum_{i=1}^n r_i, \tag{13.30} \end{equation}\] and \(r_i = y_i/x_i\) is the unit ratio.
The standard error of \(\bar{y}_{\text{MR}}\) can be estimated as \[\begin{equation} s_{\bar{y}_{\text{MR}}} = \mu_x \sqrt{\frac{s_r^2}{n}\left(1 - \frac{n}{N}\right)}, \tag{13.31} \end{equation}\] where \[\begin{equation} s_r^2 = \frac{1}{n-1}\sum_{i=1}^n \left(r_i - \hat{R}\right)^2. \tag{13.32} \end{equation}\] The FPC term should be dropped from (13.31) when appropriate, see Section 12.2.1.
The population total is estimated using \[\begin{equation} \hat{t}_{\text{MR}} = N \bar{y}_{\text{MR}} \tag{13.33} \end{equation}\] and its standard error is \[\begin{equation} s_{\hat{t}_{\text{MR}}} = N s_{\bar{y}_{\text{MR}}}. \tag{13.34} \end{equation}\]
Confidence intervals for the mean and total follow the usual form \[\begin{align*} \bar{y}_{\text{MR}} \pm t_{\text{df}, 1 - \frac{\alpha}{2}} s_{\bar{y}_{\text{MR}}},\\ \hat{t}_{\text{MR}} \pm t_{\text{df}, 1 - \frac{\alpha}{2}} s_{\hat{t}_{\text{MR}}}, \end{align*}\] with degrees of freedom \(df = n - 1\).
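A comparable sketch for the mean-of-ratios estimator, using the same assumed placeholder inputs (x
, y
, mu_x
, and N
) as the regression sketch in Section 13.4.1, follows.
# Mean-of-ratios estimator of the population mean of y.
n <- length(y)
r <- y / x                                              # Unit ratios.
R_hat <- mean(r)                                        # (13.30).
y_bar_MR <- R_hat * mu_x                                # Mean (13.29).
# Standard error (13.31); drop the (1 - n / N) FPC under an areal frame.
s_y_bar_MR <- mu_x * sqrt((var(r) / n) * (1 - n / N))
t <- qt(p = 1 - 0.05 / 2, df = n - 1)
c(y_bar_MR - t * s_y_bar_MR, y_bar_MR + t * s_y_bar_MR) # 95% CI for the mean.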
13.4.3 Double sampling (Two-phase)
Double sampling is useful when \(x\) is inexpensive to measure but its population mean \(\mu_x\) is unknown. The design proceeds in two phases. In phase 1, we draw a large sample of size \(n_1\) and measure only \(x\). In phase 2, we draw a nested subsample of size \(n_2\) from the phase-1 units and measure \(y\) on those. As a result, we have a large sample of \(x\) values (\(x_1\)) and a smaller set of paired observations (\(x_2\), \(y_2\)).
The two phases play complementary roles. Phase 1 provides a precise estimate of the population mean of \(x\), anchoring the estimation. Phase 2 focuses resources on measuring \(y\), allowing us to estimate how \(y\) varies with \(x\)—for example, through a fitted slope \(b_1\) or a ratio \(\hat{R}\). Just as in regression or ratio estimation, the idea is to start with the mean of \(y\) from the smaller sample and adjust it based on the relationship between \(x\) and \(y\) and how far the phase-2 sample mean \(\bar{x}_2\) deviates from the phase-1 mean \(\bar{x}_1\). In essence, phase 1 tells us where \(x\) is centered, and phase 2 tells us how to translate \(x\) into \(y\). When \(n_1\) is large, most of the uncertainty in the estimator arises from residual variation around the fitted relationship, rather than from estimating \(\bar{x}_1\).
13.4.3.1 Regression estimation
The regression estimator of the population mean under double sampling is \[\begin{equation} \bar{y}_{\text{Rd}} = \bar{y}_2 + b_1\left(\bar{x}_1 - \bar{x}_2\right), \tag{13.35} \end{equation}\] where the slope \(b_1\) is defined in (13.20), but computed here using only the \(n_2\) phase-2 observations.
The standard error of \(\bar{y}_{\text{Rd}}\) can be estimated as \[\begin{equation} s_{\bar{y}_{\text{Rd}}} = \sqrt{s^2_{y|x} \left(\frac{1}{n_2} + \frac{(\bar{x}_1 - \bar{x}_2)^2}{S_{xx}}\right)\left(1 - \frac{n_2}{n_1} \right) + \frac{s^2_y}{n_1}\left(1 - \frac{n_1}{N}\right)}, \tag{13.36} \end{equation}\] where \(s^2_{y|x}\) is the residual variance defined in (13.22), again computed using the phase-2 data.
Note that (13.36) includes two FPC terms: one for the phase-2 sample nested within phase 1, and one for the phase-1 sample drawn from the population. As before, the second FPC can be dropped under an areal frame (see Section 12.2.1).
The population total is estimated as \[\begin{equation} \hat{t}_{\text{Rd}} = N \bar{y}_{\text{Rd}} \tag{13.37} \end{equation}\] with standard error \[\begin{equation} s_{\hat{t}_{\text{Rd}}} = N s_{\bar{y}_{\text{Rd}}}. \tag{13.38} \end{equation}\]
Confidence intervals follow the usual form \[\begin{align*} \bar{y}_{\text{Rd}} \pm t_{\text{df}, 1 - \frac{\alpha}{2}} s_{\bar{y}_{\text{Rd}}},\\ \hat{t}_{\text{Rd}} \pm t_{\text{df}, 1 - \frac{\alpha}{2}} s_{\hat{t}_{\text{Rd}}}, \end{align*}\] with degrees of freedom \(df = n_2 - 2\).
13.4.3.2 Ratio-of-means
The estimator of the population mean is \[\begin{equation} \bar{y}_{\text{RMd}} = \hat{R}\bar{x}_1, \tag{13.39} \end{equation}\] where \[\begin{equation} \hat{R}=\frac{\bar{y}_2}{\bar{x}_2}. \tag{13.40} \end{equation}\]
The standard error of \(\bar{y}_{\text{RMd}}\) can be estimated as \[\begin{equation} s_{\bar{y}_{\text{RMd}}} = \sqrt{\left(1 - \frac{n_2}{n_1}\right)\left(\frac{\bar{x}_1}{\bar{x}_2}\right)^2\left(\frac{s_y^2 + \hat{R}^2s_x^2-2\hat{R}s_{xy}}{n_2}\right) + \frac{s^2_y}{n_1}\left(1 - \frac{n_1}{N}\right)}. \tag{13.41} \end{equation}\] If you’re using an areal sampling frame, it’s appropriate to drop the second FPC term (see Section 12.2.1 for discussion).
The population total is estimated using \[\begin{equation} \hat{t}_{\text{RMd}}=N\bar{y}_{\text{RMd}} \tag{13.42} \end{equation}\] and its standard error is \[\begin{equation} s_{\hat{t}_{\text{RMd}}}=Ns_{\bar{y}_{\text{RMd}}}. \tag{13.43} \end{equation}\]
Confidence intervals for the mean and total follow the usual form \[\begin{align*} \bar{y}_{\text{RMd}} \pm t_{\text{df}, 1 - \frac{\alpha}{2}} s_{\bar{y}_{\text{RMd}}},\\ \hat{t}_{\text{RMd}} \pm t_{\text{df}, 1 - \frac{\alpha}{2}} s_{\hat{t}_{\text{RMd}}}, \end{align*}\] with degrees of freedom \(df = n_2 - 1\).
13.4.3.3 Mean-of-ratios
The estimator of the population mean is \[\begin{equation} \bar{y}_{\text{MRd}} = \hat{R}\,\bar{x}_1, \tag{13.44} \end{equation}\] where \[\begin{equation} \hat{R} = \frac{1}{n_2}\sum_{i=1}^{n_2} r_i, \end{equation}\] and \(r_i = y_i/x_i\) is the unit ratio.
The standard error of \(\bar{y}_{\text{MRd}}\) can be estimated as \[\begin{equation} s_{\bar{y}_{\text{MRd}}} = \sqrt{\bar{x}_1^2\left(\frac{s_r^2}{n_2}\right)\left(1 - \frac{n_2}{n_1}\right) + \frac{s_y^2}{n_1}\left(1 - \frac{n_1}{N}\right)}, \tag{13.45} \end{equation}\] where \(s_r^2\) was defined in (13.32). If you’re using an areal sampling frame, it’s appropriate to drop the second FPC term (see Section 12.2.1 for discussion).
The population total is estimated using \[\begin{equation} \hat{t}_{\text{MRd}} = N \bar{y}_{\text{MRd}} \tag{13.46} \end{equation}\] and its standard error is \[\begin{equation} s_{\hat{t}_{\text{MRd}}} = N s_{\bar{y}_{\text{MRd}}}. \tag{13.47} \end{equation}\]
Confidence intervals for the mean and total follow the usual form \[\begin{align*} \bar{y}_{\text{MRd}} \pm t_{n_2-1, 1 - \frac{\alpha}{2}}\, s_{\bar{y}_{\text{MRd}}},\\ \hat{t}_{\text{MRd}} \pm t_{n_2-1, 1 - \frac{\alpha}{2}}\, s_{\hat{t}_{\text{MRd}}}, \end{align*}\] with degrees of freedom \(df = n_2 - 1\).
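As with the previous estimators, a minimal sketch helps fix the notation. Here x_1
is assumed to hold all phase-1 values, x_2
and y_2
the paired phase-2 observations, and N
the number of population units; these are placeholder names.
# Mean-of-ratios estimator under double sampling.
n_1 <- length(x_1)
n_2 <- length(x_2)
x_bar_1 <- mean(x_1)
r <- y_2 / x_2
R_hat <- mean(r)
y_bar_MRd <- R_hat * x_bar_1                                # Mean (13.44).
# Standard error (13.45); drop the second FPC under an areal frame.
s_y_bar_MRd <- sqrt(x_bar_1^2 * (var(r) / n_2) * (1 - n_2 / n_1) +
                    (var(y_2) / n_1) * (1 - n_1 / N))
t <- qt(p = 1 - 0.05 / 2, df = n_2 - 1)
c(y_bar_MRd - t * s_y_bar_MRd, y_bar_MRd + t * s_y_bar_MRd) # 95% CI.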
13.4.4 Illustrative analysis and extensions
The next few sections illustrate how the estimators we’ve introduced are applied in forest inventory settings. Each example highlights how auxiliary information in the form of a covariate can improve efficiency, and how the choice of estimator depends on available data, relationship between variables, sampling designs, and field constraints.
We begin in Section 13.4.4.1 with a timber sale cruise, where ratio estimation offers a simple and efficient approach to estimating total volume. Section 13.4.4.2 turns to the challenge of quantifying insect damage using remotely sensed data, where double sampling proves useful. Finally, Sections 13.4.4.3 and 13.4.4.4 explore more advanced inventory designs built around ratio and regression estimators.
13.4.4.1 Using ratio estimators to estimate timber volume
This illustration is adapted from a short essay by Ken Desmarais, who at the time was the Assistant District Ranger in the White Mountain National Forest, New Hampshire. The piece, entitled Marcy and Ted Take a Crack at Ratio Estimators, is both fun and informative, and we encourage you to read the original (included in the datasets/WMNF
directory). Here we present the key elements, focusing on the methods and analysis.
13.4.4.1.1 Setting
In this example, the crew needed to mark every merchantable tree and estimate its volume for a spruce–fir timber sale in the White Mountain National Forest. This is a demanding task if DBH and height must be measured for every stem and then entered into a volume equation to calculate volume.
Rather than measure every tree, Marcy, the heroine of the story, suggests using ratio estimation. Her idea is to guess DBH and height for each tree, yielding quick but coarse volume estimates, and then measure a random subset of trees to get accurate volumes. The guesses were made by ocular assessment only—no instruments—just a quick guess at the tree’s DBH class and number of 16-foot logs (height), which were then run through a volume table to produce coarse volume estimates. The ratio estimator is used to correct these guesses, producing an accurate and precise estimate of the total sale volume.
Mapping this back to our regression sampling methods, the coarse (guessed) tree volumes play the role of \(x\). Since every tree is guessed, the population mean \(\mu_x\) is known. The \(y\) values come from trees whose DBH and height were measured and then run through a volume equation. The parameter of interest is \(\mu_y\), the true mean tree volume, which can then be scaled by \(N\) to estimate the total sale volume \(\tau_y = N \mu_y\).
13.4.4.1.2 Sampling design and implementation
Each tree in the sale was marked and assigned an ocular estimate of 2-inch DBH class and number of 16-foot logs. These coarse values were run through a balsam-fir volume table to yield a guessed volume (in board feet).
A subsample of trees was selected for measurement. This subsample had to be chosen using a random mechanism (e.g., SRS or SYS). To simplify implementation in the field, a SYS design was used. After a random start, every 200-th tree received a DBH and height measurement. The crew figured they needed about 20 observations to estimate the relationship between guessed and measured volume, i.e., \(n = 20\) \((x, y)\) pairs should give a good estimate of \(R\). They estimated about 4,000 trees in the sale, so a SYS sampling interval of 200 should yield the desired sample size \(n\).
To start the systematic sample, a random start between 1 and 200 was chosen; the draw produced 97. Starting from the first tree, each was assigned a guessed volume. When the crew reached the 97-th tree, they guessed as usual but also took a measurement. The same was done for the 297-th tree, the 497-th, and so on, until all trees were marked.
13.4.4.1.3 Data and analysis
In the end, \(N\) = 4,181 trees were marked and included in the sale. Table 13.2 shows the \(n = 20\) trees that received both a guess and a measurement. Guessed volumes are coarse values from a volume table; measured volumes are based on DBH and height entered into a volume equation, providing more accurate estimates.156 The mean guessed volume across all \(N\) trees was \(\mu_x = 62.1\) board feet per tree.
Tree | Species | Guessed DBH (in) | Guessed logs (16-ft) | Guessed vol. (bd ft) | Measured DBH (in) | Measured height (ft) | Measured vol. (bd ft) |
---|---|---|---|---|---|---|---|
1 | rs | 10 | 1.5 | 48 | 10.4 | 27 | 57.9 |
2 | bf | 8 | 1.0 | 24 | 8.6 | 18 | 27.4 |
3 | rs | 12 | 1.5 | 74 | 11.8 | 28 | 79.9 |
4 | rs | 10 | 1.5 | 48 | 10.2 | 25 | 52.4 |
5 | rs | 12 | 2.0 | 92 | 12.4 | 36 | 107.0 |
6 | rs | 12 | 2.0 | 92 | 12.1 | 38 | 104.9 |
7 | bf | 10 | 1.5 | 48 | 10.7 | 24 | 57.0 |
8 | rs | 14 | 2.5 | 153 | 14.8 | 43 | 182.7 |
9 | rs | 12 | 2.0 | 92 | 11.5 | 31 | 80.8 |
10 | rs | 10 | 1.5 | 48 | 10.7 | 28 | 63.6 |
11 | rs | 8 | 1.0 | 24 | 7.7 | 18 | 20.6 |
12 | bf | 8 | 1.0 | 24 | 8.6 | 19 | 28.5 |
13 | rs | 8 | 1.0 | 24 | 8.2 | 16 | 22.2 |
14 | rs | 10 | 1.0 | 36 | 10.8 | 18 | 47.6 |
15 | rs | 8 | 1.0 | 24 | 8.1 | 14 | 19.5 |
16 | bf | 12 | 1.5 | 74 | 12.2 | 27 | 84.2 |
17 | rs | 14 | 1.5 | 105 | 14.8 | 28 | 134.6 |
18 | bf | 12 | 2.0 | 92 | 11.7 | 30 | 82.3 |
19 | rs | 10 | 1.0 | 36 | 10.3 | 18 | 42.5 |
20 | rs | 10 | 1.5 | 48 | 9.7 | 25 | 46.5 |
The paired data in Table 13.2 are available in datasets/WMNF/sf_xy.csv
. We read them into sf_xy
and use a scatter plot (Figure 13.9) to characterize the relationship between guessed and measured volumes, helping inform our choice of estimator.
sf_xy <- read_csv("datasets/WMNF/sf_xy.csv")
glimpse(sf_xy)
#> Rows: 20
#> Columns: 8
#> $ tree_id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1…
#> $ species <chr> "rs", "bf", "rs", "rs", "rs", "r…
#> $ dbh_guess <dbl> 10, 8, 12, 10, 12, 12, 10, 14, 1…
#> $ logs_guess <dbl> 1.5, 1.0, 1.5, 1.5, 2.0, 2.0, 1.…
#> $ vol_guess <dbl> 48, 24, 74, 48, 92, 92, 48, 153,…
#> $ dbh_measure <dbl> 10.4, 8.6, 11.8, 10.2, 12.4, 12.…
#> $ ht_measure <dbl> 27, 18, 28, 25, 36, 38, 24, 43, …
#> $ vol_measured <dbl> 57.9, 27.4, 79.9, 52.4, 107.0, 1…
FIGURE 13.9: Guessed (\(x\)) versus measured (\(y\)) tree volume for the paired trees listed in Table 13.2. A simple linear regression line is drawn for exploratory analysis.
The scatter in Figure 13.9 shows a strong positive relationship between guessed (\(x\)) and measured (\(y\)) volumes, and the fitted line crosses the \(y\)-axis near the origin. With only 20 pairs it’s hard to assess variance patterns, but the spread looks roughly constant. This supports using the ratio-of-means estimator defined in Section 13.4.2.1.
The code below computes the ratio-of-means estimates for mean volume (13.25) and total volume (13.27), along with their respective standard errors (13.26) and (13.28), and a 95% confidence interval for the total.
# Some known values.
N <- 4181 # Trees.
mu_x <- 62.1 # (bd ft).
n <- nrow(sf_xy)
# Pull out variables.
x <- sf_xy %>% pull(vol_guess)
y <- sf_xy %>% pull(vol_measured)
# Ratio estimate.
R_hat <- sum(y) / sum(x)
# Mean estimate (bd ft/tree).
y_bar <- R_hat * mu_x
# Total estimate (bd ft).
t_hat <- N * y_bar
# Standard error of the mean.
s_y_bar <- sqrt((var(y) +
R_hat^2 * var(x) -
2 * R_hat * cov(x, y)) / n *
(1 - n / N))
# Standard error of the total.
s_t_hat <- N * s_y_bar
# 95% CI
t <- qt(df = n - 1, p = 1 - 0.05 / 2)
ci_lower <- t_hat - t * s_t_hat
ci_upper <- t_hat + t * s_t_hat
From the code above, after rounding, the computed \(\hat{R}\) is 1.11. This multiplier converts the population mean of \(x\) into an estimate of the population mean of \(y\). The estimated total sale volume is 288,941 board feet, with a 95% confidence interval of (270,274, 307,608).
These results come from just 20 measured trees. Guessing takes time, but it’s a fraction of what it would take to measure DBH and height on every stem. That’s the payoff of a good design: many quick, inexpensive values adjusted by a small set of careful measurements. We’ll return to variations on this theme in the sections that follow.
13.4.4.2 Double sampling for dead-tree totals
In the previous example, the auxiliary variable \(x\) was observed for every tree in the sale. In practice, we often can’t afford to collect \(x\) for the entire population. As described in Section 13.4.3, a common workaround is double sampling, where a relatively inexpensive measurement is taken on a large phase-1 sample, and a more accurate but costly measurement is taken on a smaller, nested phase-2 sample. Remote sensing is often used in phase 1 to provide broad, inexpensive coverage, while phase 2 relies on targeted field data collection for more detailed and accurate measurements.
This example involves a forest manager who wants to estimate the total number of dead trees in a 400-acre area under heavy insect infestation. The area is divided into \(N = 200\) equal-area tiles. Aerial photo counts are collected for a phase-1 sample of \(n_1 = 100\) tiles. A nested phase-2 subsample of \(n_2 = 10\) of those tiles is visited for a field-based mortality count survey. The aerial counts (\(x\)) are fairly accurate, but some subcanopy tree mortality is not detectable—the field survey measurements (\(y\)) are more accurate.
Our goal is to estimate total mortality across the full area using a double sampling estimator. For comparison, we’ll also estimate total mortality using only the phase-2 field measurements, ignoring the phase-1 data.
13.4.4.2.1 Data and setup
Here, the phase-1 and phase-2 data are stored in a single file, with one row per sampled tile.
Column names are:
- tile_id: unique identifier for each sampled tile (\(n_1\) rows);
- x: aerial photo count (available for all \(n_1\) tiles);
- y: field-based mortality count (available for \(n_2\) tiles; NA otherwise).
We begin by reading the data into mort
. From this, we extract three vectors: \(x_1\) contains all phase-1 counts; \(x_2\) is the subset of aerial counts paired with field measurements; and \(y_2\) holds the phase-2 field measurements. These align with the notation used in our estimators.
mort <- read_csv("datasets/insects/mortality_two_phase.csv")
# Extract the phase-1 aerial counts.
x_1 <- mort %>% pull(x)
# Extract the paired phase-2 observations.
x_2 <- mort %>% filter(!is.na(y)) %>% pull(x)
y_2 <- mort %>% filter(!is.na(y)) %>% pull(y)
FIGURE 13.10: Paired aerial (\(x_2\)) and field (\(y_2\)) mortality counts for ten tiles. The fitted regression line is used only to characterize the relationship and help choose an appropriate estimator. The gray dashed line marks the 1:1 reference.
Figure 13.10 shows the paired \((x_2, y_2)\) values for the ten tiles with both aerial and field-based counts. The fitted line has a \(y\)-intercept well above the origin, the relationship appears roughly linear, and the scatter of points about that line has fairly constant variance across values of \(x\). Given these observations, the regression estimator is more appropriate than the ratio estimator. However, for comparison, we’ll apply both the regression and ratio-of-means estimators, and compare their results to a baseline SRS estimate based on the \(n_2\) field measurements alone.
As part of the comparison, we’ll compute the relative standard error (RSE), which is similar in form to the coefficient of variation (CV) but uses the standard error of the estimate rather than the sample standard deviation. The RSE is typically expressed as a percent and is defined as \[\begin{equation} \text{RSE} = 100 \cdot \frac{\text{standard error of the estimate}}{\text{estimate}}. \tag{13.48} \end{equation}\] All else equal, estimators with lower RSE are preferred because they reflect greater precision relative to the magnitude of the estimate.
In this case, since we’re estimating the total number of dead trees, the numerator in (13.48) is the standard error of the total, and the denominator is the corresponding estimated total.
The code below applies the three estimators and collects their results into a summary table.
# Known values.
N <- 200
n_1 <- length(x_1)
n_2 <- length(x_2)
# Means and variances.
x_bar_1 <- mean(x_1)
x_bar_2 <- mean(x_2)
y_bar_2 <- mean(y_2)
s_sq_x_2 <- var(x_2)
s_sq_y_2 <- var(y_2)
s_xy <- cov(x_2, y_2)
# Ratio-of-means (double sampling).
R_hat_RMd <- y_bar_2 / x_bar_2
y_bar_RMd <- R_hat_RMd * x_bar_1
s_y_bar_RMd <- sqrt((1 - n_2 / n_1) *
((x_bar_1 / x_bar_2)^2 *
((s_sq_y_2 +
R_hat_RMd^2 * s_sq_x_2 -
2 * R_hat_RMd * s_xy) / n_2)) +
(s_sq_y_2 / n_1) * (1 - n_1 / N))
t_hat_RMd <- N * y_bar_RMd
s_t_hat_RMd <- N * s_y_bar_RMd
RSE_t_hat_RMd <- 100 * s_t_hat_RMd / t_hat_RMd
# Regression (double sampling).
S_xx <- sum((x_2 - x_bar_2)^2)
S_yy <- sum((y_2 - y_bar_2)^2)
S_xy <- sum((x_2 - x_bar_2) * (y_2 - y_bar_2))
b1 <- S_xy / S_xx
s_sq_y_given_x <- (S_yy - S_xy^2 / S_xx) / (n_2 - 2)
y_bar_Rd <- y_bar_2 + b1 * (x_bar_1 - x_bar_2)
s_y_bar_Rd <- sqrt(s_sq_y_given_x *
(1 / n_2 + (x_bar_1 - x_bar_2)^2 / S_xx) *
(1 - n_2 / n_1) +
(s_sq_y_2 / n_1) *
(1 - n_1 / N)
)
t_hat_Rd <- N * y_bar_Rd
s_t_hat_Rd <- N * s_y_bar_Rd
RSE_t_hat_Rd <- 100 * s_t_hat_Rd / t_hat_Rd
# Phase-2 only, SRS.
y_bar_SRS <- y_bar_2
s_y_bar_SRS <- sqrt((s_sq_y_2 / n_2) * (1 - n_2 / N))
t_hat_SRS <- N * y_bar_SRS
s_t_hat_SRS <- N * s_y_bar_SRS
RSE_t_hat_SRS <- 100 * s_t_hat_SRS / t_hat_SRS
# Output summary.
tibble(
method = c("SRS (Phase-2)", "Ratio (RMd)", "Regression (Rd)"),
y_bar = c(y_bar_SRS, y_bar_RMd, y_bar_Rd),
s_y_bar = c(s_y_bar_SRS, s_y_bar_RMd, s_y_bar_Rd),
t_hat = c(t_hat_SRS, t_hat_RMd, t_hat_Rd),
s_t_hat = c(s_t_hat_SRS, s_t_hat_RMd, s_t_hat_Rd),
RSE_t_hat = c(RSE_t_hat_SRS, RSE_t_hat_RMd, RSE_t_hat_Rd)
)
#> # A tibble: 3 × 6
#> method y_bar s_y_bar t_hat s_t_hat RSE_t_hat
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 SRS (Phase-2) 24.5 2.99 4904. 598. 12.2
#> 2 Ratio (RMd) 23.9 2.21 4788. 442. 9.24
#> 3 Regression (Rd) 24.2 1.06 4836. 212. 4.39
The results in the table above highlight the benefit of incorporating auxiliary information from the phase-1 aerial counts. Both the regression and ratio estimators yield substantially smaller RSE compared with the phase-2-only SRS estimator.
Because these are simulated data, we know the true population values: \(\mu_y = 24.75\) dead trees per tile and \(\tau_y = 4951\) dead trees across the full 400 acres. Based on these known values, both the SRS and regression estimators provide accurate estimates, while the ratio estimator appears to slightly underestimate the total.
As suggested by the exploratory plot in Figure 13.10, the regression estimator is more appropriate than the ratio estimator due to the nonzero intercept. Ratio estimation assumes a linear relationship between \(y\) and \(x\) that passes through the origin, and when this assumption is violated, the estimator can be biased. While the low ratio estimate hints at this bias, a full repeated sampling analysis—like the one in Section 13.3.6.1—would be needed to quantify it. Still, given the evidence from Figure 13.10 and the substantially lower RSE, it’s clear that the regression estimator is both more appropriate and more efficient for these data.
This example underscores the importance of EDA when auxiliary variables are involved. A simple scatterplot with a fitted line can reveal whether a nonzero intercept is warranted, guiding the choice between ratio and regression estimation. In this case, visual inspection of the \((x, y)\) relationship provided an early and reliable indication that regression was the better choice.
13.4.4.3 Big BAF sampling
This section continues our focus on combining inexpensive measurements with a targeted subset of more costly measurements to improve parameter estimates in terms of both statistical and operational efficiency. Big BAF sampling fits squarely within this theme. It builds on the principles of horizontal point sampling (Section 12.4) and introduces an elegant way to obtain per-unit-area and total estimates by splitting the estimation process into two parts: a count-based estimate of basal area (\(\text{BA}\)) per unit area, and a tree-based estimate of volume-to-basal-area ratio (\(\text{VBAR}\)). While the method was originally devised to estimate volume, it can be applied more broadly to other parameters of interest. As initially introduced in Section 12.4.6, think of the “V” in VBAR more generically as the variable of interest.
The Big BAF idea is simple: conduct a tree count at a sampling location using a standard BAF (e.g., one that yields about 4 to 10 “in” trees). These tree counts provide an estimate of \(\text{BA}\) per unit area using point-level summaries, as described in Section 12.4. Then, using a second sweep at the same sampling location, apply a larger BAF to identify a subset of trees to measure. These measurement trees are used to calculate tree-level \(\text{VBAR}\) values by recording both DBH (used to compute \(\text{BA}\)) and volume (or another variable of interest). The average of these tree-level \(\text{VBAR}\) values is computed across all measurement trees—that is, pooled across sampling locations. This average \(\text{VBAR}\) is then multiplied by the \(\text{BA}\) estimate from the tree counts to yield an estimate of volume per unit area.
Counting trees with a standard-size BAF is fast and inexpensive, which makes it feasible to visit many sample locations. Measuring DBH and other tree variables to calculate VBAR, by contrast, is more time-consuming and costly. The Big BAF method balances this tradeoff by using a big BAF to identify a smaller subset of trees to measure at each location. For example, if a standard-size BAF yields an average of 8 count trees per point, a big BAF that is four times larger would result in about 2 measurement trees on average—just one-fourth as many.
Several studies and personal observations have shown that BA per unit area estimates are typically more variable than VBAR estimates, so it’s generally more efficient to invest in a larger number of sampling locations (for BA counts) and fewer tree measurements (for VBAR). The Big BAF design supports this strategy naturally and allows us to concentrate effort where it’s most needed.
Kim Iles conceived the Big BAF method, building on earlier related ideas and methods developed by John F. Bell and others, as well as foundational work by Lewis R. Grosenbaugh in the 1940s and 1950s. A thorough treatment of the topic—including historical background, estimator behavior, variations, and sample size versus cost considerations—can be found in K. Iles (2003), Marshall, Iles, and Bell (2004), Gove et al. (2020), and references therein. These sources provide practical guidance on selecting both the number of sampling locations and the number of measurement trees, along with deeper insights into estimator performance and design tradeoffs.
Big BAF is a form of double sampling with a ratio estimator, although it’s not like the double sampling introduced in Section 13.4.3, where phase-1 and phase-2 are a sample and subsample of sampling units (e.g., plots or points). Under Big BAF, the phases occur at the tree level: phase-1 can be seen as all count trees identified using the standard-size BAF and used to estimate \(\text{BA}\) per unit area through point-level summaries; phase-2 consists of the subset of trees measured at each sampling location using the big BAF and used to estimate \(\text{VBAR}\).157 As we’ll see in the next section, the estimator follows the same spirit as other ratio estimators—correcting a coarse, inexpensive estimate using a more accurate, but costly, measurement.
13.4.4.3.1 Estimation using Big BAF
As described above, the Big BAF estimator comprises two components: (1) the mean basal area per unit area (\(\overline{\text{BA}}\)), based on point-level estimates using a standard-size BAF, and (2) the mean volume-to-basal-area ratio (\(\overline{\text{VBAR}}\)), based on tree-level measurements using a big BAF. Throughout, we’ll use superscripts \(s\) and \(b\) to distinguish quantities associated with the small (standard-size) and big BAFs, respectively.
Following the methods in Section 12.4, the mean basal area per unit area is estimated using \[\begin{equation} \overline{\text{BA}} = \frac{1}{n} \sum_{i=1}^{n} \text{BA}_i, \tag{13.49} \end{equation}\] where \(n\) is the number of sampling locations and \(\text{BA}_i = m^s_i \cdot \text{BAF}^s\) is the observed basal area per unit area at the \(i\)-th location, with \(m^s_i\) being the number of count trees selected using the small BAF (BAF\(^s\)).
We define \(\text{VBAR}\) for the \(j\)-th measurement tree at the \(i\)-th sampling location as
\[\begin{equation}
\text{VBAR}_{ij} = \frac{v_{ij}}{\text{BA}_{ij}}
\end{equation}\]
and compute the mean across all measurement trees as
\[\begin{equation}
\overline{\text{VBAR}} = \frac{1}{m^b} \sum_{i=1}^n \sum_{j=1}^{m^b_i} \text{VBAR}_{ij},
\tag{13.50}
\end{equation}\]
where \(m^b_i\) is the number of measurement trees selected using the big BAF (BAF\(^b\)) at the \(i\)-th sampling location, \(n\) is the number of sampling locations, and \(m^b = \sum_{i=1}^n m^b_i\) is the total number of measurement trees across all locations. Note that, unlike the point-level summary of \(\text{BA}\), which includes locations with zero count trees, the mean \(\overline{\text{VBAR}}\) is defined only over measurement trees, so only locations with at least one measurement tree contribute to the sum (i.e., if the \(i\)-th sampling location has \(m^b_i = 0\), it’s excluded from the sum).
The Big BAF estimator for volume per unit area is then given by \[\begin{equation} \hat{v} = \overline{\text{BA}} \cdot \overline{\text{VBAR}}. \tag{13.51} \end{equation}\]
It’s useful to verify how the units work out when applying (13.51). For example, suppose volume is measured in cubic meters and basal area is reported in square meters per hectare. Then (13.51) with units becomes \[\begin{align} \hat{v}\left(\frac{\text{m}^3}{\text{ha}}\right) &= \overline{\text{BA}}\left(\frac{\text{m}^2}{\text{ha}}\right) \cdot \overline{\text{VBAR}} \left(\frac{\text{m}^3}{\text{m}^2}\right) \nonumber\\ &= \overline{\text{BA}}\left(\frac{\cancel{\text{m}^2}}{\text{ha}}\right) \cdot \overline{\text{VBAR}} \left(\frac{\text{m}^3}{\cancel{\text{m}^2}}\right). \end{align}\] The \(\text{m}^2\) units cancel, yielding the desired volume per unit area.
Because \(\hat{v}\) is the product of two sample means—each estimated from a different sample—its standard error must account for the variability in both components. A common and practical approach is to use Bruce’s formula, which assumes that \(\overline{\text{BA}}\) and \(\overline{\text{VBAR}}\) are uncorrelated and combines their uncertainty in quadrature \[\begin{equation} s_{\hat{v}} \approx \sqrt{ s_{\overline{\text{BA}}}^2 + s_{\overline{\text{VBAR}}}^2 }. \tag{13.52} \end{equation}\] This approach is well established in forestry sampling (Bell 1957; D. Bruce 1961) and is explained in the context of Big BAF estimation by Gove et al. (2020).
Each standard error is estimated from its respective sample data using (11.21).158
The standard error for \(\overline{\text{BA}}\) is
\[\begin{equation}
s_{\overline{\text{BA}}} = \frac{s_{\text{BA}}}{\sqrt{n}},
\tag{13.53}
\end{equation}\]
where
\[\begin{equation}
s_{\text{BA}} = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} \left( \text{BA}_i - \overline{\text{BA}} \right)^2}.
\end{equation}\]
The standard error for \(\overline{\text{VBAR}}\) is
\[\begin{equation}
s_{\overline{\text{VBAR}}} = \frac{s_{\text{VBAR}}}{\sqrt{m^b}},
\tag{13.54}
\end{equation}\]
where
\[\begin{equation}
s_{\text{VBAR}} = \sqrt{ \frac{1}{m^b - 1} \sum_{i=1}^{n} \sum_{j=1}^{m^b_i} \left( \text{VBAR}_{ij} - \overline{\text{VBAR}} \right)^2 }.
\end{equation}\]
Because \(\overline{\text{BA}}\) and \(\overline{\text{VBAR}}\) are on wholly different scales (e.g., \(\text{m}^2/\text{ha}\) and \(\text{m}^3/\text{m}^2\), respectively), we apply (13.52) using percent standard errors. This avoids scale incompatibility and offers a clearer sense of relative precision. Rewriting (13.52) using percent standard errors gives
\[\begin{equation}
s_{\hat{v}\%} \approx \sqrt{s_{\overline{\text{BA}}\%}^2 + s_{\overline{\text{VBAR}}\%}^2},
\tag{13.55}
\end{equation}\]
where the percent standard errors for \(\overline{\text{BA}}\) and \(\overline{\text{VBAR}}\) are defined as
\[\begin{equation}
s_{\overline{\text{BA}}\%} = 100 \cdot \frac{s_{\overline{\text{BA}}}}{\overline{\text{BA}}}
\tag{13.56}
\end{equation}\]
and
\[\begin{equation}
s_{\overline{\text{VBAR}}\%} = 100 \cdot \frac{s_{\overline{\text{VBAR}}}}{\overline{\text{VBAR}}}.
\tag{13.57}
\end{equation}\]
In addition to aligning the two error components on a common scale, expressing the standard errors as percents in (13.55) highlights how each component contributes to the overall uncertainty in \(\hat{v}\). It also provides a clear framework for evaluating tradeoffs between count and measurement effort; see, e.g., K. Iles (2003) and Marshall, Iles, and Bell (2004).
To return the standard error to the original units of \(\hat{v}\), simply multiply the percent standard error by the estimate and divide by 100 \[\begin{equation} s_{\hat{v}} = \hat{v} \cdot \frac{s_{\hat{v}\%}}{100}. \tag{13.58} \end{equation}\]
If a confidence interval is desired, a \(t\)-value can be computed using the Welch–Satterthwaite approximation for degrees of freedom (Welch 1947). This approach is appropriate when combining uncertainty from two independent sample means with unequal variances, and it aligns with our application of Bruce’s formula to estimate the standard error. The approximate degrees of freedom are given by
\[\begin{equation}
\text{df} \approx \frac{
s_{\overline{\text{BA}}\%}^2 + s_{\overline{\text{VBAR}}\%}^2
}{
\frac{s_{\overline{\text{BA}}\%}^4}{n - 1} + \frac{s_{\overline{\text{VBAR}}\%}^4}{m^b - 1}
}.
\tag{13.59}
\end{equation}\]
With this degrees of freedom approximation, the confidence interval for \(\hat{v}\) is \[\begin{equation} \hat{v} \pm t_{\text{df}, 1 - \frac{\alpha}{2}} s_{\hat{v}}. \tag{13.60} \end{equation}\]
To estimate totals, simply multiply the mean estimate \(\hat{v}\) and its standard error \(s_{\hat{v}}\) by the population area.
A few general observations about Big BAF are worth noting. As mentioned earlier, the estimates for BA and VBAR need not come from the same locations or sampling designs; it’s simply convenient to pair them. When sampling locations follow a SYS design, combining the two sampling efforts at the same points helps ensure good spatial coverage and representation of the population.
The method offers great flexibility in how BA and VBAR are sampled, and at what intensity. Working with percent standard errors makes optimal allocation of sampling resources a straightforward exercise, assuming costs can be assigned to sampling locations (i.e., cost per unit of \(n\)), tree counting (for BA), and tree measurement (for VBAR), and that reasonable percent standard error estimates are available (e.g., from prior inventories).
The standard error estimators presented here for \(\overline{\text{VBAR}}\) and \(\hat{v}\) assume independence between variables—that is, they assume that volume and basal area are not correlated. The standard error for \(\overline{\text{VBAR}}\) treats each tree-level ratio as a draw from a single random variable, ignoring any correlation between numerator and denominator. Similarly, Bruce’s formula assumes that \(\overline{\text{BA}}\) and \(\overline{\text{VBAR}}\) are independent and omits a third term in the general error propagation formula that would account for their covariance (see, e.g., Gove et al. (2020)). While these assumptions may not strictly hold, they’re generally reasonable and, even when violated, tend to have little practical impact on estimation in typical forest inventory applications.
As Marshall (Marshall, Iles, and Bell 2004) notes, a useful feature of Big BAF is that it provides an internal check on basal area estimates. Because the count and measurement data use different BAFs, they yield independent estimates of BA. If those estimates differ substantially, it may point to issues with field protocols that warrant review.
If estimates are to be broken down by species, product type, strata, or any other category requiring separate VBAR estimates, care must be taken to ensure enough measurement trees are sampled to support reliable estimation within each group. The more categories considered, the more measurement trees are needed to inform the VBAR estimates—which has cost implications.
13.4.4.3.2 Apply Big BAF
To illustrate Big BAF, we return once more to the Harvard Forest dataset. As in earlier examples, this is a simplified setting, but it allows us to walk through the full workflow using real data and compare sample-based estimates to known population values.
The objective is to estimate mean AGB per hectare for all stems with DBH \(\ge\) 12.7 cm across the 35 ha forest.
We again use the sys_grid()
function, introduced in Section 13.2, to generate a systematic grid of sampling locations across the forest. For this example, the spacing is 125 by 125 meters, yielding \(n = 24\) sampling locations (Figure 13.11).
FIGURE 13.11: Systematic sample locations for the Harvard Forest Big BAF example.
We’ll use metric BAF\(^s\) = 5 for count trees and BAF\(^b\) = 20.25 for measurement trees.159 This setup implies about one measurement tree for every four count trees (since 5/20.25 \(\approx\) 0.25).
For both count and measurement trees, the walkthrough method was used to address boundary slopover bias. Trees with walkthrough points falling outside the boundary received a count of two; see Section 12.3.3 for details. As in previous sections, sampling locations with no trees were assigned a tree count of zero.
In the analysis that follows, we’ll first apply the Big BAF mean and standard error estimators outlined in the previous section. Then, for comparison, we’ll repeat the analysis assuming measurements were taken on all count trees (i.e., all “in” trees under BAF\(^s\)), representing a more traditional cruise.
The data are located in the "datasets/BigBAF/"
directory. "HF_BAF_s_count.csv"
contains the count data collected using BAF\(^s\), and "HF_BAF_b_measure.csv"
contains the measurement data collected using BAF\(^b\). We begin by reading these files and inspecting their structure.
count_trees <- read_csv("datasets/BigBAF/HF_BAF_s_count.csv")
measurement_trees <- read_csv("datasets/BigBAF/HF_BAF_b_measure.csv")
glimpse(count_trees)
#> Rows: 194
#> Columns: 2
#> $ point_id <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3…
#> $ tree_count <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1…
glimpse(measurement_trees)
#> Rows: 54
#> Columns: 4
#> $ point_id <dbl> 1, 1, 2, 3, 4, 4, 4, 5, 6, 6, 6, 7…
#> $ dbh_cm <dbl> 48.0, 48.6, 22.6, 0.0, 15.9, 33.6,…
#> $ agb_Mg <dbl> 2.22710, 2.29694, 0.44771, 0.00000…
#> $ tree_count <dbl> 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1…
Let’s get a sense of the number of count (\(m^s_i\)) and measurement trees (\(m^b_i\)), based on their respective BAFs, across sampling locations (which are uniquely identified using point_id
).
count_trees %>%
group_by(point_id) %>%
summarize(m_s_i = sum(tree_count)) %>%
summarize(min = min(m_s_i), max = max(m_s_i),
mean = mean(m_s_i), total = sum(m_s_i))
#> # A tibble: 1 × 4
#> min max mean total
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 18 8.12 195
measurement_trees %>%
group_by(point_id) %>%
summarize(m_b_i = sum(tree_count)) %>%
summarize(min = min(m_b_i), max = max(m_b_i),
mean = mean(m_b_i), total = sum(m_b_i),
n_zero = sum(m_b_i == 0))
#> # A tibble: 1 × 5
#> min max mean total n_zero
#> <dbl> <dbl> <dbl> <dbl> <int>
#> 1 0 5 2.08 50 4
Given these summaries, as expected, there are roughly four times as many count trees as measurement trees (195 vs. 50 total). The number of count trees per sampling location ranged from 1 to 18, with an average of about 8. This wide range in the number of “in” trees reflects the spatial variability in diameter distributions across the forest. Measurement trees ranged from 0 to 5 per location, with an average of about 2. As indicated by n_zero
, 4 sampling locations had no measurement trees.
Next, using the count trees, we estimate mean basal area (BA, in \(\text{m}^2/\text{ha}\)) using (13.49), standard error of the mean using (13.53), and standard error expressed as a percent using (13.56).
BAF_small <- 5 # Metric BAF^s (m^2/ha per count tree).
BA <- count_trees %>%
  group_by(point_id) %>%
  summarize(m_i = sum(tree_count), # (trees per point).
            BA = m_i*BAF_small) %>% # (m^2/ha per point).
  summarize(n = n(), # Number of points.
            mean = mean(BA), # (m^2/ha).
            se = sd(BA)/sqrt(n()), # (m^2/ha).
            se_percent = 100*se/mean) # Std. err. as % of mean.
BA
#> # A tibble: 1 × 4
#> n mean se se_percent
#> <int> <dbl> <dbl> <dbl>
#> 1 24 40.6 3.90 9.61
Now we estimate mean \(\text{VBAR}\) (\(\text{m}^3/\text{m}^2\)) using (13.50), standard error using (13.54), and standard error expressed as a percent of the mean using (13.57).
Recall that in our cruise data, we included a tree_count
of zero to track which sampling locations had no measurement trees. However, as noted below (13.50), these zeros should not be included in the calculation of \(\overline{\text{VBAR}}\), so we filter them out using filter(tree_count > 0)
in the subsequent piped sequence below.
After removing those rows, tree_count
takes on values of 1 or 2. A value of 2 indicates the tree was double-counted based on the walkthrough method. Because we’re computing the mean across trees (not across sampling points), the cleanest way to handle this is to duplicate the row for any tree with tree_count == 2
. We do this using uncount(tree_count)
from the tidyr
package, which replicates rows according to a count column.160
c <- pi / (10000 * 4) # Converts DBH in cm to basal area in m^2.
VBAR <- measurement_trees %>%
  filter(tree_count > 0) %>%
  uncount(tree_count) %>%
  mutate(BA = c*dbh_cm^2, # (m^2).
         VBAR = agb_Mg/BA) %>% # (Mg/m^2 per tree).
  summarize(m = n(), # Number of trees.
            mean = mean(VBAR), # (Mg/m^2).
            se = sd(VBAR)/sqrt(m), # (Mg/m^2).
            se_percent = 100*se/mean) # Std. err. as % of mean.
VBAR
#> # A tibble: 1 × 4
#> m mean se se_percent
#> <int> <dbl> <dbl> <dbl>
#> 1 50 8.93 0.508 5.69
Lastly, given BA
and VBAR
, we compute the Big BAF estimate of AGB per unit area. The code below implements (13.51) and (13.55), and also converts the percent standard error into the standard error in AGB units using (13.58).
# Big BAF estimate of mean AGB (Mg/ha), following (13.51).
agb_Mg_ha <- BA$mean * VBAR$mean
agb_Mg_ha
#> [1] 362.69
# Standard error as percent of biomass.
se_agb_Mg_percent <- sqrt(BA$se_percent^2 + VBAR$se_percent^2)
# Standard error (Mg/ha).
se_agb_Mg <- agb_Mg_ha * se_agb_Mg_percent/100
se_agb_Mg
#> [1] 40.519
As discussed earlier, it’s common for VBAR to be less variable than BA. This pattern holds for these data as well, with \(s_{\overline{\text{VBAR}}\%} = 5.69\) and \(s_{\overline{\text{BA}}\%} = 9.61\). This suggests that estimation gains would come more from adding sampling locations to improve the estimate of BA, rather than selecting a smaller BAF\(^b\) to increase the number of measurement trees.
We can then construct a confidence interval for the mean AGB estimate using the approximate degrees of freedom formula in (13.59) and the interval expression in (13.60).
BA_df <- BA$n - 1
VBAR_df <- VBAR$m - 1
df <- (BA$se_percent^2 + VBAR$se_percent^2)^2 /
  (BA$se_percent^4 / BA_df + VBAR$se_percent^4 / VBAR_df)
t <- qt(p = 1 - 0.05 / 2, df = df)
c(agb_Mg_ha - t*se_agb_Mg, agb_Mg_ha + t*se_agb_Mg)
#> [1] 280.78 444.60
We see that Big BAF yields a mean AGB of 362.69 Mg/ha with a 95% confidence interval from 280.78 to 444.60 Mg/ha, which covers the true value of 341.17 Mg/ha.
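If a forest-level total is of interest, the per-hectare estimate and its standard error can be scaled by the 35 ha forest area, following the note at the end of Section 13.4.4.3.1. A minimal sketch using the objects computed above:
# Scale the per-hectare AGB estimate and its standard error to the 35 ha forest.
forest_ha <- 35
forest_ha * agb_Mg_ha # Total AGB (Mg).
forest_ha * se_agb_Mg # Standard error of the total (Mg).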
The code below shows what we would get using a traditional cruise that measures every “in” tree using BAF\(^s\). These data are held in "datasets/BigBAF/HF_BAF_s_measure.csv"
.
measurement_trees <- read_csv("datasets/BigBAF/HF_BAF_s_measure.csv")
c <- pi / (10000 * 4)
measure_all <- measurement_trees %>%
mutate(TF = ifelse(tree_count != 0, BAF_small/(c*dbh_cm^2), 0)) %>%
group_by(point_id) %>%
summarize(agb_Mg = sum(TF*tree_count*agb_Mg), # (Mg/ha per point).
m_i = sum(tree_count), # (trees per point).
BA = m_i*BAF_small) %>% # (m^2/ha per point).
summarize(n = n(), # Number of points.
mean = mean(agb_Mg), # (Mg/ha).
se = sd(agb_Mg)/sqrt(n), # (Mg/ha)
se_percent = 100*se/mean) # Std. err. as % of mean.
agb_Mg_ha <- measure_all$mean
se_agb_Mg_percent <- measure_all$se_percent
agb_Mg_ha
#> [1] 349.77
se_agb_Mg_percent
#> [1] 8.3171
The Big BAF approach offers a clear tradeoff: by measuring 145 fewer trees (the difference between the number of BAF\(^s\) and BAF\(^b\) “in” trees), we reduce field effort and associated costs. In exchange, we accept a modest increase in uncertainty—with percent standard error rising from 8.32 to 11.17, or standard error in AGB units from 29.09 to 40.52.
13.4.4.4 3P sampling
Probability proportional to prediction (3P) sampling directs detailed measurements toward the most influential units in a population. It begins by making a quick, inexpensive prediction for every unit—something easy to obtain, like an ocular estimate or a simple field measurement. Units with larger predictions are more likely to be selected for full measurement, focusing effort on those expected to have the greatest influence on population parameter estimates. Predictions and measurements are combined using a ratio estimator that accounts for the unequal selection probabilities, yielding an efficient design that requires relatively few measurements but still produces accurate, stable estimates under reasonable conditions.
Classical 3P sampling was introduced by Lewis R. Grosenbaugh in the 1960s (see, e.g., Grosenbaugh (1965)) and remains the most widely applied version of the method. While originally developed for estimating timber volume, 3P is useful in any setting where the variable of interest is costly to measure but can be predicted with reasonable accuracy. We’ll use total volume to motivate the discussion, but the same logic applies to other forest parameters. Classical 3P sampling involves the following steps.
Predict. Visit every tree and record a quick prediction \(x\) that scales with the variable of interest \(y\). Here, \(x\) is the covariate. This might be an ocular estimate of volume or a simple dimensional measure like DBH or height. The goal isn’t precision, but consistency—\(x\) should be roughly proportional (or transformable to be proportional) to the measurement \(y\).
Select and measure. Use the predictions to assign each tree a selection probability proportional to \(x\). Select a subsample accordingly, and for each selected tree, collect the measurement \(y\).
Estimate. Combine the paired \(x\) and \(y\) values using a mean-of-ratios estimator. This yields an estimate of the population-level ratio between the measurement and the prediction. Scaling this ratio by the known mean or total of \(x\) (recorded for all units) gives an estimate of the mean or total of \(y\). The estimator naturally accounts for the unequal selection probabilities.
Classical 3P is most effective when the variable of interest is expensive to measure, predictions are available for all trees, and the measurement-to-prediction ratio is relatively stable—even if the variable itself is highly variable. The design concentrates measurement effort on the most influential trees while still visiting every tree, making it especially well suited for small areas or situations where all trees must be assessed (e.g., for marking).
In these settings, 3P offers several advantages:
- straightforward implementation with minimal training;
- every tree is visited, avoiding boundary complications;
- high efficiency, often yielding small standard errors relative to effort;
- detailed data that support stand and stock tables by species and diameter class;
- better estimation for rare species.
Although our focus is on classical 3P, several useful variants exist. One example is point-3P sampling, which extends the idea of classical 3P to point sampling. At each point, trees determined to be “in” using the usual angle method (e.g., prism, angle gauge, or relascope) each receive a prediction \(x\) (e.g., DBH or height), which then determines the probability they’re measured for \(y\). It has all the advantages of classical 3P, but doesn’t require every tree in the population to receive a prediction. An overview of point-3P and related methods is provided by P. West (2011).
For a deeper treatment of 3P sampling, K. Iles (2003) provides a comprehensive coverage, including historical context, practical guidance on implementation, and common pitfalls with solutions. For a more theoretical perspective, see Gregoire and Valentine (2007), which situates 3P within the broader class of unequal-probability sampling designs.
The next section further develops the classical 3P design, detailing its implementation step-by-step.
13.4.4.4.1 Sampling design and implementation
General steps in implementing a 3P cruise are as follows.
- Determine the sample size and selection threshold.
- The sample size \(n\) refers to the number of trees that will receive detailed measurements of \(y\). As with SRS, \(n\) is typically chosen based on available resources or an allowable error. The SRS sample size formula from Section 11.3.6 can be used, with the variance (or CV) reflecting anticipated variability in the measurement-to-prediction ratio \(y / x\).
- To implement the selection rule, define a threshold \(\text{max}_z = \hat{t}_x / n\), where \(\hat{t}_x\) is a pre-cruise estimate of the population total of \(x\). The \(i\)-th tree is selected to measure \(y_i\) with probability \(x_i / \text{max}_z\).161 Choosing \(\text{max}_z\) this way yields an expected sample size of \(n\) under the 3P selection rule described below.
- Traverse the population and apply the 3P selection rule.
- Visit each tree in turn:
- record a quick prediction \(x_i\);
- draw a random number \(z_i\) from a uniform distribution \(U(0, \text{max}_z)\);
- if \(x_i \ge z_i\), select the tree for measurement and record \(y_i\); otherwise, retain only \(x_i\).
- Continue until all trees have been visited.
- The result is a complete set of predictions \(\{x_1, \dots, x_N\}\) and a subsample of about \(n\) trees with paired values \((x_i, y_i)\).
Estimation procedures for totals, means, and standard errors based on the 3P sample are presented in Section 13.4.4.4.2.
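To make the selection rule concrete, the short sketch below uses made-up predictions (not data from this chapter) to set the threshold \(\text{max}_z\) and apply the rule to every tree. The achieved sample size lands near the target only when the pre-cruise guess at the total of \(x\) is close to its true total.
# Sketch only: simulate the 3P selection rule for a made-up population.
set.seed(123)
N <- 500
x <- runif(N, min = 0.1, max = 3)   # Quick predictions (e.g., volume in m^3).
n_target <- 25                      # Desired number of measured trees.
t_x_guess <- 800                    # Pre-cruise guess at the total of x.
max_z <- t_x_guess / n_target       # Selection threshold.
z <- runif(N, min = 0, max = max_z) # One draw per tree visited.
selected <- x >= z                  # TRUE means measure y on this tree.
sum(selected)                       # Achieved sample size; close to n_target
                                    # only if t_x_guess is close to sum(x).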
The actual sample size achieved after the cruise depends on your initial estimate \(\hat{t}_x\). If this estimate is smaller than the true total \(\tau_x = \sum^N_{i=1}x_i\), you’ll likely end up with fewer than \(n\) sampled trees. If it’s larger, you’ll end up with more. A poor guess at \(\tau_x\) won’t bias your estimates—it only affects the final sample size.
K. Iles (2003) outlines a practical approach for ensuring that you hit the desired sample size. He suggests deliberately selecting a large \(\hat{t}_x\) that results in too many sampled trees—for example, using a value roughly twice the anticipated \(\tau_x\). As you work through the population, use a random mechanism to measure only half of the selected trees, and flag the others (which he calls “insurance trees”). At the end of the cruise, if your actual sample size is below target, you can return to the insurance trees and randomly select enough of them for measurement to meet your target \(n\).
As noted earlier, \(x\) should be proportional to \(y\), meaning we assume a relationship of the form \(y = Rx\), where \(R\) is a scalar to be estimated. This implies a linear relationship through the origin—when \(x = 0\), then \(y = 0\). For example, if \(y\) is tree volume in m\(^3\), a natural choice for \(x\) is also volume, possibly in different units, with any mismatch absorbed into the estimate of \(R\). More generally, \(x\) can be any quantity that scales proportionally with volume.
A good choice for \(x\) is DBH, for a few reasons. First, it’s often easier to give a quick and reasonably accurate guess of a tree’s DBH than its volume. Second, if DBH is recorded for every tree, we can produce detailed stand and stock tables by diameter class (and species, if recorded). This adds value beyond estimation of totals and means, especially in operational settings.
DBH, however, is not itself proportional to volume, so it must be transformed. One option is to use a DBH–volume table, like the one in Section 13.4.4.1, which requires height as an input but avoids the need to directly estimate volume in the field. If height is not available or practical to obtain, a DBH-based volume equation can be used instead. A simple form is \(\text{volume} = \alpha_0 \text{DBH}^{\alpha_1}\), where \(\alpha_0\) and \(\alpha_1\) are parameters taken from the literature or estimated from prior data. The exponent \(\alpha_1\) ensures that \(x\) and \(y\) are proportional, and \(\alpha_0\) scales the ratio so that \(R\) is approximately one. While this scaling isn’t necessary for valid estimation, it can make it easier to choose \(\text{max}_z\), since we usually have better intuition about total volume than about the total of DBH raised to a power.
If you read other references on 3P, you may come across the term “sure-to-be-measured” trees. In our notation, these are trees with predictions \(x_i\) exceeding the threshold \(\text{max}_z\). Given how we define \(\text{max}_z\), such trees are unlikely unless your pre-cruise estimate \(\hat{t}_x\) is far too low and/or the target sample size \(n\) is very large. When \(x_i > \text{max}_z\), the tree must be selected for measurement and flagged accordingly. These trees are treated as a separate stratum in which all units are measured (i.e., a census), so the standard error for that stratum is zero.162
In some cases—such as for high-value trees or rare species—you might choose to measure all occurrences and treat them as their own stratum. This ensures complete accounting for critical trees while still allowing 3P to be applied to the remainder of the population.
If you anticipate wanting to summarize results by category—such as species, diameter class, or product type—be sure to record those variables during the cruise. These additional data allow estimates to be broken down by group in addition to providing overall totals.
Once all trees have been visited and predictions and measurements collected, the next step is to estimate the population mean, total, and associated standard errors, as outlined in the next section.
13.4.4.4.2 Estimation using 3P
Once predictions \(x_i\) and a subset of paired measurements \(y_i\) have been collected, the goal is to estimate the population mean and total of \(y\). Because units with larger \(x_i\) are more likely to be selected, the measured \(y_i\) values don’t form a simple random sample and we can’t just average them. However, under the assumption that \(y\) is proportional to \(x\), the ratio \(y_i / x_i\) is approximately constant and unaffected by the 3P selection probabilities.
Following the notation from Section 13.4.2.2, we use the mean-of-ratios estimator, expressed as \[\begin{equation*} \hat{R} = \frac{1}{n}\sum_{i=1}^n r_i, \end{equation*}\] where \(r_i = y_i/x_i\) is the unit ratio.
Then, following (13.29) and (13.33), we multiply the estimated ratio \(\hat{R}\) by the known population mean of predictions, \(\mu_x = \tau_x / N\), to estimate the population mean and total \[\begin{equation*} \bar{y} = \hat{R} \mu_x, \end{equation*}\] \[\begin{equation*} \hat{t} = N \bar{y}. \end{equation*}\]
Standard errors, confidence intervals, and interpretation follow directly from the general mean-of-ratios case in Section 13.4.2.2, including use of the FPC. Because the sample is drawn from a known, finite population of size \(N\) (the list of \(N\) trees in the sampling frame), the standard error formulas should include the FPC.
Notice that once we have \(\hat{R}\), we could go directly to estimating the total from the total of predictions, i.e., \(\hat{t} = \hat{R} \tau_x\), which is perfectly valid. However, we’re following the estimators in Section 13.4.2.2, which are written to work up the total from the mean \(\mu_x\) instead.
13.4.4.4.3 Estimation using 3P by group
In many cases, you’ll want to summarize estimates by one or more categorical variables, such as species, DBH class, and/or stand. Even when you’re not specifically interested in stratified summaries, it may be that the relationship between predictions and measured values differs systematically across categories. In these cases, estimation accuracy and precision can often be improved by applying group-specific ratios.
To support such estimation, the relevant categorical variable(s) must be recorded for all trees. For example, if you want estimates by species, then species must be recorded for every tree in the population. The same applies for DBH class, stand, or any other grouping variable of interest.
Let the population be divided into \(H\) groups (e.g., species), with known group sizes \(N_h\) and mean predicted values \(\mu_{x,h}\) (both are known after the cruise). Within each group, using the 3P rule a subsample of \(n_h\) trees is measured to obtain paired \((x_{hi}, y_{hi})\) values. For each group \(h = 1, \dots, H\), we compute the sample mean of unit ratios \[\begin{equation} \hat{R}_h = \frac{1}{n_h} \sum_{i=1}^{n_h} r_{hi}, \end{equation}\] with \(r_{hi} = y_{hi} / x_{hi}\).
The estimated mean for group \(h\) is \[\begin{equation} \bar{y}_h = \hat{R}_h \mu_{x,h} \end{equation}\] and the corresponding total is \[\begin{equation} \hat{t}_h = N_h \bar{y}_h. \end{equation}\]
The sample variance of unit ratios within each group is \[\begin{equation} s_{r,h}^2 = \frac{1}{n_h - 1} \sum_{i=1}^{n_h} \left( r_{hi} - \hat{R}_h \right)^2. \end{equation}\]
The standard error of \(\bar{y}_h\) is estimated as \[\begin{equation} s_{\bar{y}_h} = \mu_{x,h} \sqrt{ \frac{s_{r,h}^2}{n_h} \left( 1 - \frac{n_h}{N_h} \right) } \end{equation}\] and the standard error for \(\hat{t}_h\) is \[\begin{equation} s_{\hat{t}_h} = N_h s_{\bar{y}_h}. \end{equation}\]
To obtain overall estimates, we apply the standard stratified estimators \[\begin{equation} \bar{y} = \frac{1}{N} \sum_{h=1}^H N_h \bar{y}_h, \end{equation}\] \[\begin{equation} \hat{t} = N \bar{y} = \sum_{h=1}^H \hat{t}_h. \end{equation}\]
The standard error of the overall mean is \[\begin{equation} s_{\bar{y}} = \sqrt{ \frac{1}{N^2} \sum_{h=1}^H N_h^2 s_{\bar{y}_h}^2 } \end{equation}\] and the standard error of the total is \[\begin{equation} s_{\hat{t}} = N s_{\bar{y}}. \end{equation}\]
This approach allows 3P sampling to accommodate known groupings in the population, deliver category-specific summaries, and potentially improve estimates of population parameters.
13.4.4.4.4 Applying 3P
We’ll demonstrate 3P using a simulated dataset. Although synthetic, the tree values are derived from allometric equations and reflect the type of data encountered in practice. A key advantage of using simulated data is that the true population parameters are known, allowing us to assess the estimator’s performance directly.
Our goal is to estimate the total volume of a mixed-species stand in coastal British Columbia, Canada. We also want species-specific estimates, as well as stand and stock tables by species and DBH class. Although unknown at the outset, the stand contains \(N = 1{,}000\) trees and two species.
For this analysis, the 3P cruise uses volume predicted from an ocular DBH estimate as the \(x\) covariate, with \(y\) also being volume. Both volume variables are expressed in cubic meters.
We begin by fitting a local DBH–volume equation using data from a mix of the anticipated species. In practice, these data might come from a pilot cruise of the stand or from a previous cruise in a similar stand. Once the DBH–volume equation is in place, we follow the 3P steps outlined in the previous section.
Our local DBH–volume equation takes the form \[\begin{equation} y = \alpha_0 \, x^{\alpha_1}, \tag{13.61} \end{equation}\] where \(y\) is volume (\(m^3\)), \(x\) is DBH (cm), and the parameters are estimated to be \(\alpha_0 = 0.00034\) and \(\alpha_1 = 2.3\).163
Based on similar stands and prior experience, our pre-cruise estimate of total volume is 2,000 \(m^3\). We’ll aim to collect \(n = 25\) measurements to estimate \(R\). Given this information, the selection threshold is \(\text{max}_z = 2000 / 25 = 80\).
We follow the 3P cruise steps.
Each tree in the population is visited, its species is recorded, and an ocular estimate of DBH (cm) is obtained. To speed the process, the ocular estimate places trees into 5 cm DBH classes. This DBH is used to compute a predicted volume using (13.61), which is also recorded.
A tree is selected for measurement if its predicted volume exceeds a random number drawn from \(U(0, 80)\). If selected, its volume is accurately measured and recorded.
Both the predicted volume (i.e., given the DBH prediction) and random number can be generated on a tablet or smartphone as each tree is visited.
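For a single tree, the tablet calculation might look like the sketch below, which plugs an illustrative DBH into (13.61) and compares the predicted volume against a draw from \(U(0, 80)\).
# Per-tree 3P calculation sketch; the DBH value is illustrative only.
a_0 <- 0.00034; a_1 <- 2.3         # Parameters from (13.61).
max_z <- 80                        # Selection threshold from above.
dbh_cm <- 45                       # Ocular DBH class recorded for the tree.
vol_m3_pred <- a_0 * dbh_cm^a_1    # Predicted volume, i.e., the covariate x.
z <- runif(1, min = 0, max = max_z)
vol_m3_pred >= z                   # TRUE means measure this tree's volume y.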
At the end of the cruise, we know \(N = 1000\) and have predicted DBH and volume for all trees. We also have a subsample of measured trees. These data are stored in "datasets/BC_3P/predicted.csv"
and "datasets/BC_3P/measured.csv"
, respectively. Shared columns in both files include tree_id
, species
, dbh_cm_predicted
, and vol_m3_predicted
. The measured file also contains vol_m3_measured
. These files are read below, and the 3P estimator is applied under the assumption of a common \(R\) across species.
predicted <- read_csv("datasets/BC_3P/predicted.csv")
measured <- read_csv("datasets/BC_3P/measured.csv")
predicted %>% glimpse()
#> Rows: 1,000
#> Columns: 4
#> $ tree_id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 1…
#> $ species <chr> "A", "A", "B", "A", "A", "B"…
#> $ dbh_cm_predicted <dbl> 52.5, 37.5, 37.5, 47.5, 52.5…
#> $ vol_m3_predicted <dbl> 2.7064, 1.2609, 1.2609, 2.15…
measured %>% glimpse()
#> Rows: 34
#> Columns: 5
#> $ tree_id <dbl> 4, 43, 54, 57, 66, 81, 93, 9…
#> $ species <chr> "A", "B", "B", "A", "B", "A"…
#> $ dbh_cm_predicted <dbl> 47.5, 42.5, 47.5, 57.5, 52.5…
#> $ vol_m3_predicted <dbl> 2.1564, 1.6752, 2.1564, 3.32…
#> $ vol_m3_measured <dbl> 2.1970, 2.2316, 2.7591, 3.25…
As you can see above, the measured
file has 34 rows, which is our actual sample size \(n\) at the end of the cruise (just a bit off from our target of 25). This might be our first clue that the pre-cruise estimate of total volume was a little low. But that’s okay. It doesn’t bias our estimates, we just ended up measuring a few more trees than anticipated, and our reward is a slightly larger sample to inform the estimates worked up in the code below.
We first compute the population mean of predicted volume, \(\mu_x\), then, within the summarize()
, we estimate the mean-of-ratios estimator \(\hat{R}\) (13.30), mean estimated volume \(\bar{y}\) (13.29), sample variance of the ratios \(s^2_r\) (13.32), standard error of the mean \(s_{\bar{y}}\) (13.31), and the associated estimates for the total volume \(\hat{t}\) and its standard error \(s_{\hat{t}}\) using (13.33) and (13.34), respectively.
# Known from cruise.
N <- nrow(predicted)
# Compute mu_x (and tau_x for fun).
mu_x <- mean(predicted$vol_m3_predicted)
tau_x <- sum(predicted$vol_m3_predicted)
# Estimates for volume.
measured %>%
summarize(
n = n(),
R_hat = mean(vol_m3_measured / vol_m3_predicted),
y_bar = R_hat * mu_x,
s_sq_r = var(vol_m3_measured / vol_m3_predicted),
s_y_bar = mu_x * sqrt(s_sq_r / n * (1 - n / N)),
t_hat = N * y_bar,
s_t_hat = N * s_y_bar,
t = qt(df = n - 1, p = 1 - 0.05 / 2),
t_ci_lower = t_hat - s_t_hat * t,
t_ci_upper = t_hat + s_t_hat * t,
) %>% select(n, R_hat, t_hat, s_t_hat, t_ci_lower, t_ci_upper)
#> # A tibble: 1 × 6
#> n R_hat t_hat s_t_hat t_ci_lower t_ci_upper
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 34 1.05 2653. 62.5 2526. 2780.
So, how’d we do? Our estimated total, with associated 95% confidence interval, is 2,653 (2,526, 2,780) cubic meters. With only \(n = 34\) measurements, we achieved a relative standard error of 2.4% for the estimated total (i.e., \(100 \cdot s_{\hat{t}} / \hat{t}\))!
Because these data are simulated, we know the true volume total is \(\tau_y = 2,590\) cubic meters. For this sample, our 95% confidence interval covers the true value, and the estimated total is remarkably close to the actual volume.
As for the difference between our target and actual sample size \(n\), we had initially guessed that \(\tau_x\) was 2000. After the cruise, we see that this was an underestimate—the actual population total of \(x\) is \(2525\) (from the tau_x
value in the code above). Again, this isn’t a problem; it just means we might end up with a few more measurement trees than intended, as we did here.
Now let’s work up species-specific estimates following the methods in Section 13.4.4.4.3. We simulated these data with two species, labeled A and B. Figure 13.12 shows the relationship between predicted and measured volume by species for the \(n = 34\) trees used to estimate ratios. For exploratory analysis, we added best-fit regression lines by species. The plot suggests that species A and B have different ratios, which in turn suggests we might improve estimation by using species-specific ratios.164
FIGURE 13.12: Predicted versus measured volume by species for the simulated 3P dataset. Lines are fit using least squares regression constrained to pass through the origin and included for visual exploration.
We’ll implement the estimator in steps, starting with species-specific summaries of the predictions, measurements, and estimates. We’ll then apply the stratified estimator to combine these summaries and produce stand-level estimates of the mean and total.
pred_summary <- predicted %>%
group_by(species) %>%
summarize(
N_h = n(),
mu_xh = mean(vol_m3_predicted),
tau_xh = sum(vol_m3_predicted)
)
pred_summary
#> # A tibble: 2 × 4
#> species N_h mu_xh tau_xh
#> <chr> <int> <dbl> <dbl>
#> 1 A 734 2.62 1920.
#> 2 B 266 2.27 605.
From the prediction summary, we see that species A dominates the stand, with roughly three times as many stems (\(N_h\)) and three times the volume (\(\tau_{xh}\)) as species B. We’ll use \(N_h\) and \(\mu_{xh}\) in the subsequent estimation, using the measured subset to compute species-specific ratios and associated variance estimates.
measure_summary <- measured %>%
group_by(species) %>%
summarize(
n_h = n(),
R_hat_h = mean(vol_m3_measured / vol_m3_predicted),
s_sq_rh = var(vol_m3_measured / vol_m3_predicted),
)
measure_summary
#> # A tibble: 2 × 4
#> species n_h R_hat_h s_sq_rh
#> <chr> <int> <dbl> <dbl>
#> 1 A 25 0.975 0.00336
#> 2 B 9 1.26 0.0103
Estimates for species-specific \(\hat{R}_h\) suggest that the relationship between predicted and measured volume varies by species—our predictions tend to overestimate volume for species A and underestimate it for species B. Again, this is okay as long as the over- or underestimation is proportional across the observed pairs of \(x\) and \(y\) values—and, by assumption, across the full population.
For example, \(\hat{R}\) for species B is 1.26, so if we predict volumes of 1, 2, 3, and 4 cubic meters, we expect measured values of 1.26, 2.53, 3.79, and 5.05 cubic meters, respectively. In other words, \(y\) and \(x\) are proportional. This relationship is clear for species B in the exploratory plot shown in Figure 13.12, and it supports our assumption about the linear relationship between predicted and measured volume.
Below are the species-specific estimates, followed by stand-level means and totals obtained by combining across strata. We begin by joining the species population information to measure_summary
.
measure_summary <- left_join(measure_summary,
pred_summary,
by = join_by(species)
)
# Species-specific estimates.
measure_summary %>%
mutate(
y_bar_h = R_hat_h * mu_xh,
s_y_bar_h = mu_xh * sqrt(s_sq_rh / n_h * (1 - n_h / N_h)),
t_hat_h = N_h * y_bar_h,
s_t_hat_h = N_h * s_y_bar_h,
t_h = qt(df = n_h - 1, p = 1 - 0.05 / 2),
t_ci_lower_h = t_hat_h - s_t_hat_h * t_h,
t_ci_upper_h = t_hat_h + s_t_hat_h * t_h,
) %>%
select(n_h, t_hat_h, s_t_hat_h, t_ci_lower_h, t_ci_upper_h)
#> # A tibble: 2 × 5
#> n_h t_hat_h s_t_hat_h t_ci_lower_h t_ci_upper_h
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 25 1871. 21.9 1826. 1916.
#> 2 9 764. 20.1 717. 810.
Using a sample size of \(25\) and \(9\) measurements for species A and B, respectively, we achieved RSE values of 1.2% and 2.6% for the corresponding species totals. As defined in (13.48), the RSE is computed as \(100 \cdot s_{\hat{t}_h} / \hat{t}_h\).
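As a quick arithmetic check using the rounded values printed above:
100 * c(A = 21.9, B = 20.1) / c(1871, 764) # RSE (%) by species, about 1.2 and 2.6.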
Our estimated totals, with associated 95% confidence intervals, are 1,871 (1,826, 1,916) and 764 (717, 810) cubic meters for species A and B, respectively. The true volume totals are A = 1,851 and B = 740 cubic meters.
Now we’ll estimate the stand-level mean and total by combining the species-specific results using their population weights (again following estimators in Section 13.4.4.4.3).
measure_summary %>%
mutate(
y_bar_h = R_hat_h * mu_xh,
s_y_bar_h = mu_xh * sqrt(s_sq_rh / n_h * (1 - n_h / N_h)),
) %>%
summarize(
N = sum(N_h),
y_bar = 1 / N * sum(N_h * y_bar_h),
s_y_bar = sqrt(1 / N^2 * sum(N_h^2 * s_y_bar_h^2)),
t_hat = N * y_bar,
s_t_hat = N * s_y_bar,
t = qt(df = sum(n_h) - n(), p = 1 - 0.05 / 2),
t_ci_lower = t_hat - s_t_hat * t,
t_ci_upper = t_hat + s_t_hat * t
) %>%
select(t_hat, s_t_hat, t_ci_lower, t_ci_upper)
#> # A tibble: 1 × 4
#> t_hat s_t_hat t_ci_lower t_ci_upper
#> <dbl> <dbl> <dbl> <dbl>
#> 1 2635. 29.7 2575. 2696.
The species-specific stratified estimator produced an RSE of 1.1% for the total volume. The estimated total volume, with a 95% confidence interval, is 2,635 (2,575, 2,696) cubic meters. Recall, again, that the true volume total is \(\tau_y = 2,590\) cubic meters.
In the first analysis, we ignored potential species differences and estimated a single pooled ratio, which yielded an RSE of 2.4%. We then allowed for species-specific ratios and applied the stratified estimator, which substantially reduced the RSE to 1.1%—less than half the pooled-ratio value. This demonstrates the value of using category-specific ratios when there are indications of differences (e.g., as seen in exploratory analysis) and enough observations to support separate estimation.
13.4.5 Advantages and disadvantages of estimation with covariates
Estimation with covariates improves the efficiency of inference from both a statistical and field-effort perspective by incorporating auxiliary information—for example, remotely sensed data, ocular estimates, or other quick field measures. Below, we summarize key advantages and disadvantages of this approach.
13.4.5.1 Advantages
Improved precision: When the covariate is strongly related to the variable of interest, ratio or regression estimators can substantially reduce standard errors compared to estimators that ignore the covariate—often achieving better precision with the same or even smaller sample size.
Statistical and practical efficiency: Leveraging inexpensive covariate data allows more precise estimates with less field effort, improving cost-effectiveness without sacrificing reliability.
Flexible and widely applicable: Covariate-based estimators can be combined with other sampling designs such as systematic and stratified, and are especially well suited to two-phase sampling frameworks.
Supports small-area and category-specific inference: When covariates are available for the full population or a large phase-1 sample, estimates can be extended to subregions, species groups, or other categories without requiring additional measurements.
Variety of estimator options: Ratio and regression estimators provide flexible ways to incorporate covariates, depending on the form and strength of the relationship with the variable of interest.
13.4.5.2 Disadvantages
Assumptions: These estimators rely on a linear or proportional relationship between the covariate and the variable of interest. If this assumption is weak or violated, inference may be biased or less precise.
Potential lack of design unbiasedness: Regression estimators are generally not design-unbiased and rely on model assumptions for validity, while ratio estimators may be design-unbiased under some designs.
Increased analytical complexity: Selecting the appropriate estimator, checking assumptions, and diagnosing model fit require more statistical expertise and judgement than simpler design-based approaches.
Sensitivity to covariate quality: Weak or poorly measured covariates may provide little gain in precision and can mislead estimation if their uncertainty is not accounted for.
Risk of extrapolation: Estimators that rely on covariates can produce misleading results if applied outside the range of observed data, particularly when the assumed relationship is weak or poorly supported by the data.
13.5 Cluster sampling
Cluster sampling is commonly used in forest inventory, especially when it’s impractical or costly to sample widely dispersed locations. In this design, the primary sampling unit (PSU) is a cluster of smaller secondary sampling units (SSUs), where measurements are taken. The SSUs are nested within PSUs, introducing additional design considerations compared to the sampling designs introduced earlier.
Cluster sampling is particularly useful when:
- the cost of reaching widely dispersed sampling locations is high;
- operational constraints require grouping sampling locations for efficient data collection;
- measurement of a sampling location is inexpensive relative to travel costs;
- or the sampling frame is only available at the PSU level.
These conditions are common in forest inventory, where measurement costs are low relative to the cost of moving a field crew to a sampling location. Cluster designs help reduce travel time by grouping nearby SSUs into the same PSU, allowing crews to sample multiple locations with minimal movement. This approach is especially useful when sampling large areas. For example, cluster designs are widely used in national forest inventory (NFI) programs, where travel costs are high and field crews must cover broad regions efficiently.
The tradeoff, however, is a potential loss of statistical efficiency. Due to spatial autocorrelation—the tendency for observations located close together to be more similar than those farther apart—SSUs in the same PSU often exhibit less variability than single locations sampled independently across space, such as the SYS designs illustrated in Section 13.2. As a result, cluster sampling typically produces less precise estimates than an SRS of the same number of SSUs. This is a classic cost-precision tradeoff: cluster sampling may reduce field effort but increase sampling error.
13.5.1 Clusters, strata, and systematic sampling
Although clusters (PSUs) and strata both involve dividing the population into groups, their purpose and implementation differ in important ways. In stratified sampling, the population is partitioned into non-overlapping subgroups (strata) using known auxiliary variables, and sampling is conducted independently within each stratum. The goal is to increase precision by concentrating sampling effort within relatively homogeneous subgroups.
In cluster sampling, the goal is not to improve statistical efficiency directly, but to simplify or reduce the cost of data collection. PSUs are often based on geographic proximity or field logistics, not auxiliary variables. A subset of PSUs is selected (typically via SRS or SYS), and measurements are collected from all or some SSUs within each selected PSU.
The contrast can be summarized as follows.
Stratification aims to reduce within-group variability to improve estimation precision.
Clustering often results in within-group similarity, which can reduce precision but improve feasibility—by reducing travel time and logistical burden, particularly when sampling dispersed locations in large or remote areas.
While their motivations differ, both cluster and stratified designs are valid and often used together—for example, in stratified cluster sampling, where PSUs are selected independently within each stratum.
SYS, as described in Section 13.2, can be viewed as a form of cluster sampling with a single PSU made up of \(n\) evenly spaced SSUs. This perspective reinforces why SYS does not have a design-based variance estimator (Section 13.2.5)—with only one randomly selected PSU, there is no way to estimate between-PSU variability.
13.5.2 Elements of cluster sampling design
When implementing a cluster sampling design, there are several key design elements to consider:
Cluster layout and spacing. PSUs may be naturally defined or imposed by the analyst. In the latter case, PSUs are often laid out in a regular pattern, such as a systematic grid. The distance between PSUs affects both cost and precision. Widely spaced PSUs typically reduce redundancy due to spatial autocorrelation, but increase travel effort.
Cluster shape and SSU configuration. SSUs within a PSU are typically arranged in a fixed geometric pattern around an anchor point. Common layouts include lines, triangles, squares, and L-shapes. This configuration influences how well the PSU captures local variability and affects edge effects, travel efficiency, and measurement consistency. SSU size can be scaled to efficiently measure different variables, and nested configurations (e.g., for regeneration sampling) can be used when needed.
Number of clusters and SSUs per cluster. The total number of observed SSUs is the product of the number of PSUs and the number of SSUs per PSU. Increasing the number of PSUs generally improves statistical efficiency more than increasing the number of SSUs within each PSU, since PSUs are more likely to be spatially independent. However, logistical considerations may favor sampling more SSUs per PSU to reduce costs.
Method of selecting clusters. PSUs can be selected using any valid probability sampling method. SRS and SYS are most common. If auxiliary information is available, methods such as stratified or PPS sampling may be used to improve precision while maintaining logistical feasibility.
These elements interact to determine the balance between statistical precision and operational cost. In forest inventory, the goal is often to design a layout that captures sufficient spatial variability while remaining cost-effective and logistically manageable.
13.5.3 Choosing the number of PSUs and SSUs
Unlike stratified sampling, where well-established formulas exist for allocating sample sizes across strata, sample size planning in PSU designs is more heuristic and context-specific. The main tradeoff is between the number of PSUs and the number of SSUs per PSU. Adding SSUs within a PSU improves precision only up to a point—when observations are spatially correlated, the marginal benefit of additional SSUs diminishes. Increasing the number of PSUs is usually more effective for reducing sampling error, since PSUs tend to be more independent.
In most forest inventory settings, these decisions are shaped by cost and logistics. Analytical expressions for optimal allocation exist but require estimates of within- and between-PSU variability, which are rarely available before sampling begins. Some forest inventory programs use simulation to explore tradeoffs under realistic spatial and operational constraints. Additional discussion on cluster sampling design can be found, e.g., in Freese (1962), S. L. Lohr (2019) Section 5.5, and Cochran (1977) Section 9.6.
There are many variations of cluster sampling in both design and estimation. This section focuses on single-stage cluster sampling, where all SSUs within each selected PSU are measured. In many settings, however, measuring every SSU is cost-prohibitive or unnecessary. In such cases, only a subset of SSUs is observed—a design referred to as two-stage (or multistage) cluster sampling, which is covered in Section 13.6.
Here, we focus on a common and relatively simple form of single-stage cluster sampling: selecting PSUs with equal probability and measuring all SSUs within each selected PSU. This provides a useful baseline for understanding and implementing cluster-based designs in forest inventory.
13.5.4 Estimation using cluster sampling
In forest inventory, the total number of SSUs in the population is often unknown. A natural estimator in this case is the cluster ratio estimator, which divides the total measurement across all sampled SSUs by the total number of SSUs observed. It’s especially useful when SSU counts vary across PSUs. The estimator is flexible, easy to implement, and well suited to field conditions. Some bias may occur if SSU counts are related to the measured variable, but this is usually small in practice.
We use the following notation:
- \(N\) number of PSUs in the population;
- \(n\) number of PSUs in the sample;
- \(m_j\) number of SSUs in the \(j\)-th sampled PSU;
- \(\bar{m} = \frac{1}{n} \sum_{j=1}^n m_j\) mean number of SSUs per PSU;
- \(y_{ji}\) value for the \(i\)-th SSU in the \(j\)-th PSU;
- \(y_j = \sum_{i=1}^{m_j} y_{ji}\) total for the \(j\)-th PSU.
The cluster ratio estimator for the population mean \(\mu\) is
\[\begin{equation}
\bar{y} = \frac{\sum_{j=1}^n y_j}{\sum_{j=1}^n m_j}.
\tag{13.62}
\end{equation}\]
Under a finite population, the standard error of \(\bar{y}\) is
\[\begin{equation}
s_{\bar{y}} = \sqrt{ \left(1- \frac{n}{N} \right) \frac{1}{n \bar{M}^2} \frac{ \sum_{j=1}^n (y_j - \bar{y} m_j)^2 }{n - 1} },
\tag{13.63}
\end{equation}\]
where \(\bar{M}\) is the average number of SSUs per PSU in the population. If \(\bar{M}\) is unknown, we use the sample average \(\bar{m}\).
As in previous sampling sections, when working from an areal sampling frame, the finite population correction is omitted and the standard error of \(\bar{y}\) becomes \[\begin{equation} s_{\bar{y}} = \sqrt{ \left( \frac{1}{n \bar{M}^2} \right) \frac{ \sum_{j=1}^n (y_j - \bar{y} m_j)^2 }{n - 1} }, \tag{13.64} \end{equation}\] where, again, if \(\bar{M}\) is not known, it can be replaced with \(\bar{m}\).
Confidence intervals for the mean follow the usual form \[\begin{equation} \bar{y} \pm t_{df, 1 - \frac{\alpha}{2}} s_{\bar{y}}, \tag{13.65} \end{equation}\] with degrees of freedom \(df = n - 1\).
The degrees of freedom reflect the number of sampled PSUs, not SSUs, since PSUs are the units selected by the sampling design. SSUs within a PSU are not independent, so inference must be based on the number of independent PSUs.
If the number of SSUs per PSU is constant, these estimators simplify. But it’s usually better to write code for the more general case, which handles varying SSU counts without much additional effort.
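As a sketch of that more general case, the helper below bundles (13.62), (13.64), and (13.65). It takes vectors of PSU totals \(y_j\) and SSU counts \(m_j\) and uses \(\bar{m}\) in place of \(\bar{M}\); the function name and example inputs are ours, not part of the chapter's datasets.
# Sketch: cluster ratio estimators (13.62), (13.64), and (13.65) given
# PSU totals y_j and SSU counts m_j; m_bar stands in for M_bar.
cluster_ratio_est <- function(y_j, m_j, alpha = 0.05) {
  n <- length(y_j)
  m_bar <- mean(m_j)
  y_bar <- sum(y_j) / sum(m_j)
  s_y_bar <- sqrt(1 / (n * m_bar^2) *
                    sum((y_j - y_bar * m_j)^2) / (n - 1))
  t_crit <- qt(p = 1 - alpha / 2, df = n - 1)
  c(y_bar = y_bar, s_y_bar = s_y_bar,
    ci_lower = y_bar - t_crit * s_y_bar,
    ci_upper = y_bar + t_crit * s_y_bar)
}
# Example call with made-up PSU summaries.
cluster_ratio_est(y_j = c(12, 9, 15, 7), m_j = c(3, 2, 3, 3))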
13.5.5 Stratified cluster sampling
Cluster sampling can be extended to a stratified framework, resulting in stratified cluster sampling. This is useful when both logistical constraints and known auxiliary information are present. As described in Section 13.3.1, the first step is to divide the population into \(H\) non-overlapping strata based on variables expected to influence the parameter of interest. A separate cluster sampling design is then implemented independently within each stratum using the following steps.
Select \(n_h\) PSUs in each stratum using a probability method such as SRS or SYS. At least two PSUs are needed to compute a sample variance.
Within each sampled PSU \(j = 1, \dots, n_h\), measure \(m_{hj}\) SSUs and compute the stratum mean \(\bar{y}_h\) (13.62) and its standard error \(s_{\bar{y}_h}\) (13.63) or (13.64).
Once stratum-level estimates have been computed, the population mean and its standard error are calculated as \[\begin{equation} \bar{y} = \frac{\sum_{h=1}^H N_h \bar{y}_h}{N} \tag{13.66} \end{equation}\] and \[\begin{equation} s_{\bar{y}} = \sqrt{ \frac{ \sum_{h=1}^H N_h^2 s_{\bar{y}_h}^2 }{N^2} }. \tag{13.67} \end{equation}\]
Here, \(N_h\) is the population size in stratum \(h\), and \(N = \sum_{h=1}^H N_h\). In a traditional finite population setting, these are counts of PSUs. In areal sampling, \(N\) and \(N_h\) may represent area, provided that \(\bar{y}_h\) is expressed per unit area (e.g., per acre or hectare).
Confidence intervals for the population mean should use \(n - H\) degrees of freedom, where \(n\) is the total number of PSUs sampled and \(H\) is the number of strata. As in standard cluster sampling, the degrees of freedom reflect the number of PSUs sampled, not the number of SSUs.
Stratified cluster sampling is widely used in forest inventory because it combines the operational benefits of clustering with the statistical efficiency of stratification. Different strata can use different sampling intensities, SSU layouts, or even PSU configurations—what matters is that estimation is carried out separately within each stratum and then combined using the stratified estimators.
13.5.6 Illustrative analyses
13.5.6.1 Cluster sampling
To illustrate cluster sampling, we return to the Harvard Forest dataset. This is again a toy-like example, but it allows us to walk through the full workflow using real data and compare sample-based estimates to the true population values.
The objective of the cruise is to estimate average AGB per hectare and total AGB for the entire forest. The population of interest includes all stems with DBH \(\ge\) 12.7 cm.
We use systematic sampling to select the locations of our PSUs. Each PSU consists of three fixed-area circular SSUs, positioned 10 meters from the PSU center at bearings of \(0^\circ\), \(135^\circ\), and \(225^\circ\). Each SSU has a radius of 5 meters and is used to tally all trees meeting the DBH threshold. See Section 12.3 for more on plot-based sampling.
As described in Section 13.2, we use the sys_grid()
function to generate a systematic grid of PSU centers across the forest. This function constructs the grid around a randomly chosen starting point. For this example, the spacing is 100 by 100 meters, yielding \(n = 35\) PSUs. These grid points serve as the anchor locations for each PSU.
FIGURE 13.13: Cluster sampling with primary sampling units (PSUs), i.e., clusters, placed using a systematic design in the Harvard Forest example.
The cruise tree data are stored in "datasets/HF/HF_cluster_sys.csv"
and read into ssu_trees
below. Use glimpse()
to review the columns and indexing. Notice the SSU ID is nested within the PSU ID, which has implications for how we group the data in subsequent calculations.
ssu_trees <- read_csv("datasets/HF/HF_cluster_sys.csv")
ssu_trees %>% glimpse()
#> Rows: 586
#> Columns: 7
#> $ psu_id <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2…
#> $ ssu_id <dbl> 1, 1, 2, 3, 3, 3, 3, 3, 1, 1, 2, 2…
#> $ tree_id <dbl> 1, 2, 1, 1, 2, 3, 4, 5, 1, 2, 1, 2…
#> $ tree_count <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ agb_Mg <dbl> 0.416741, 0.079753, 0.000000, 0.10…
#> $ dbh_cm <dbl> 24.9, 13.3, 0.0, 14.8, 24.9, 15.8,…
#> $ species <chr> "rubra", "rubra", "no trees", "gra…
Before proceeding with estimation, it’s good practice to confirm how the data are structured. The code below checks the number of PSUs and the number of SSUs per PSU.
# Number of PSUs.
ssu_trees %>%
  summarize(n = n_distinct(psu_id))
#> # A tibble: 1 × 1
#> n
#> <int>
#> 1 35
# Number of SSUs per PSU.
ssu_trees %>%
group_by(psu_id) %>%
summarize(m_j = n_distinct(ssu_id)) %>%
distinct(m_j)
#> # A tibble: 1 × 1
#> m_j
#> <int>
#> 1 3
As expected, each PSU contains three SSUs.
We’ll now walk through the estimation workflow using the estimators from Section 13.5.4. We begin by expanding AGB to a per-hectare basis using the tree factor for the 5-meter radius SSU plots. Then we compute SSU-level summaries. Since the SSU index is only unique within its PSU, we must group by both PSU and SSU in the code below.
# Add tree factor for the given plot area.
r <- 5 # SSU plot radius in meters.
ssu_trees <- ssu_trees %>%
mutate(TF = 10000 / (pi * r^2))
# Compute the SSU plot-level AGB estimates.
ssu_summary <- ssu_trees %>%
group_by(psu_id, ssu_id) %>%
summarize(y_ji = sum(TF * tree_count * agb_Mg))
Next, on our way to computing the overall mean, we summarize to the PSU level by calculating \(y_j\) and \(m_j\) for each PSU. Although ssu_summary
is already grouped by psu_id
, we explicitly redefine the grouping in the code below for clarity.
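# Summarize to the PSU level: y_j and m_j for each PSU.
psu_summary <- ssu_summary %>%
  group_by(psu_id) %>%
  summarize(
    y_j = sum(y_ji),
    m_j = n()
  )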
Given psu_summary
, we apply the estimators from (13.62) and (13.64) to compute the estimated population mean and its standard error, along with the associated 95% confidence interval. We then multiply by the total forest area to estimate forest-wide biomass at the end of the piped sequence.
N <- 35 # Forest area in hectares.
psu_summary %>%
summarize(
y_bar = sum(y_j) / sum(m_j),
n = n(),
m_bar = mean(m_j),
s_y_bar = sqrt(1 / (n * m_bar^2) *
sum((y_j - y_bar * m_j)^2) / (n - 1)),
t = qt(df = n - 1, p = 1 - 0.05 / 2),
y_ci_lower = y_bar - t * s_y_bar,
y_ci_upper = y_bar + t * s_y_bar
) %>%
select(y_bar, y_ci_lower, y_ci_upper) %>%
mutate(
t_hat = N * y_bar,
t_ci_lower = N * y_ci_lower, t_ci_upper = N * y_ci_upper
)
#> # A tibble: 1 × 6
#> y_bar y_ci_lower y_ci_upper t_hat t_ci_lower t_ci_upper
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 306. 253. 358. 10693. 8868. 12519.
The census-based AGB for the Harvard Forest is 341.17 Mg/ha, with a forest-wide total of 11,941 Mg. Our estimated mean and total, with 95% confidence intervals, were 305.53 (253.37, 357.69) Mg/ha and 10,693 (8,868, 12,519) Mg, respectively. Based on this sample, both confidence intervals successfully captured the true values.
13.5.6.2 Stratified cluster sampling
Next, let’s extend the cluster design to take advantage of the Harvard Forest strata introduced in Section 13.3.6 and shown in Figure 13.3.
We’ll use the same systematic sample of PSUs from the previous section. Now, with strata imposed, any SSU whose center falls in a different stratum than its PSU is removed. This is visible in Figure 13.14, where some PSUs near stratum boundaries have fewer than three SSUs.
One SSU's plot boundary extends beyond its stratum boundary, so a boundary slopover correction is applied. As in the stratified sampling illustration, we use the mirage correction: if an SSU overlaps the stratum boundary and a tree falls within its mirage correction region, that tree is counted twice by assigning it a tree_count
of two (see Section 12.3.3).
FIGURE 13.14: Stratified cluster sampling in the Harvard Forest example, with primary sampling units (PSUs), i.e., clusters, placed using a systematic design.
The cruise tree data are stored in "datasets/HF/HF_strat_cluster_sys.csv"
and read into ssu_trees
below. Use glimpse()
to review the columns and indexing. Notice that SSU IDs are nested within PSU IDs, which are in turn nested within stratum IDs. This nesting structure has implications for how we group the data in subsequent calculations.
ssu_trees <- read_csv("datasets/HF/HF_strat_cluster_sys.csv")
ssu_trees %>% glimpse()
#> Rows: 556
#> Columns: 8
#> $ psu_id <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2…
#> $ ssu_id <dbl> 1, 1, 2, 3, 3, 3, 3, 3, 1, 1, 2, 2…
#> $ tree_id <dbl> 1, 2, 1, 1, 2, 3, 4, 5, 1, 2, 1, 2…
#> $ stratum_id <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4…
#> $ tree_count <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ agb_Mg <dbl> 0.416741, 0.079753, 0.000000, 0.10…
#> $ dbh_cm <dbl> 24.9, 13.3, 0.0, 14.8, 24.9, 15.8,…
#> $ species <chr> "Quercus rubra", "Quercus rubra", …
Before proceeding with estimation, let’s examine the data structure by summarizing the number of PSUs and the number of SSUs per PSU within each stratum.
# Number of PSUs within each stratum.
ssu_trees %>%
group_by(stratum_id) %>%
summarize(n = n_distinct(psu_id))
#> # A tibble: 4 × 2
#> stratum_id n
#> <dbl> <int>
#> 1 1 2
#> 2 2 9
#> 3 3 3
#> 4 4 21
# Number of SSUs per PSU within each stratum.
ssu_trees %>%
group_by(stratum_id, psu_id) %>%
summarize(m_j = n_distinct(ssu_id)) %>%
distinct(m_j)
#> # A tibble: 8 × 2
#> # Groups: stratum_id [4]
#> stratum_id m_j
#> <dbl> <int>
#> 1 1 2
#> 2 1 3
#> 3 2 2
#> 4 2 3
#> 5 3 2
#> 6 3 3
#> 7 4 3
#> 8 4 2
As seen in Figure 13.14, the number of SSUs per PSU varies within some strata.
We’ll now walk through the estimation workflow using the estimators from Section 13.5.5. We begin by expanding AGB to a per-hectare basis using the tree factor for the 5-meter radius SSU plots. Next, we compute stratum-specific summaries at the SSU and then PSU levels to prepare for applying the stratified mean and standard error estimators (13.2) and (13.5), respectively. Finally, we use these results to compute a 95% confidence interval for the population mean.
# Add tree factor for the given plot area.
r <- 5 # SSU plot radius in meters.
ssu_trees <- ssu_trees %>%
mutate(TF = 10000 / (pi * r^2))
# Compute the SSU plot-level AGB estimates.
ssu_summary <- ssu_trees %>%
group_by(stratum_id, psu_id, ssu_id) %>%
summarize(y_hji = sum(TF * tree_count * agb_Mg))
Next, for each stratum, we summarize to the PSU level by calculating \(y_{hj}\) and \(m_{hj}\), and then to the stratum level by computing \(\bar{y}_h\) and \(s_{\bar{y}_h}\) via (13.62) and (13.64), respectively. Although ssu_summary
is already grouped by stratum_id
and psu_id
, we explicitly redefine the grouping in the code below for clarity.
stratum_summary <- ssu_summary %>%
group_by(stratum_id, psu_id) %>%
summarize(
y_hj = sum(y_hji),
m_hj = n()
) %>%
summarize(
y_bar_h = sum(y_hj) / sum(m_hj),
n_h = n(),
m_bar_h = mean(m_hj),
s_y_bar_h = sqrt(1 / (n_h * m_bar_h^2) *
sum((y_hj - y_bar_h * m_hj)^2) / (n_h - 1))
)
stratum_summary
#> # A tibble: 4 × 5
#> stratum_id y_bar_h n_h m_bar_h s_y_bar_h
#> <dbl> <dbl> <int> <dbl> <dbl>
#> 1 1 223. 2 2.5 3.58
#> 2 2 392. 9 2.89 57.0
#> 3 3 55.4 3 2.33 25.0
#> 4 4 314. 21 2.86 28.3
Given the stratum-specific estimates of the mean and its standard error, we apply the stratified estimators (13.66) and (13.67) in the code below.
stratum_summary <- left_join(stratum_summary, stratum_area,
by = "stratum_id"
)
# Estimate population mean and 95% CI.
stratum_summary %>%
summarize(
N = sum(N_h),
y_bar = sum(N_h * y_bar_h) / N,
s_y_bar = sqrt(sum(N_h^2 * s_y_bar_h^2) / N^2),
t = qt(df = sum(n_h - 1), p = 1 - 0.05 / 2),
ci_lower = y_bar - t * s_y_bar,
ci_upper = y_bar + t * s_y_bar
) %>%
select(y_bar, ci_lower, ci_upper)
#> # A tibble: 1 × 3
#> y_bar ci_lower ci_upper
#> <dbl> <dbl> <dbl>
#> 1 314. 266. 362.
The width of this 95% confidence interval is narrower than that estimated using unstratified cluster sampling in Section 13.5.6.1 (by about 9 Mg/ha). However, this could simply be due to chance, since the comparison is based on a single sample. To more reliably assess whether stratification improves efficiency in this setting, we turn to a repeated sampling analysis in the next section.
13.5.6.3 Comparing stratified and unstratified cluster sampling
Let’s compare coverage rates and average 95% confidence interval widths for the unstratified and stratified cluster sampling examples from the two preceding sections. We performed a similar assessment for stratified versus unstratified sampling in Sections 13.3.6.1 and 13.3.6.2. As before, we repeatedly sample from the Harvard Forest population to empirically evaluate the performance of each estimator.
In this case, we fix the total number of PSUs and use SYS to generate 1,000 independent samples. For each sample, we compute the mean AGB and its 95% confidence interval using both unstratified and stratified cluster estimators. This allows us to compare both coverage properties and the precision of resulting estimates.
We expect roughly 95% of confidence intervals to capture the true population mean, and, if stratification is effective, we should see narrower intervals on average. The results bear this out: the empirical coverage rates were 98.1% and 95.1% for the unstratified and stratified estimators, respectively. The average confidence interval widths were 100 and 89 (Mg/ha), again for the unstratified and stratified cases. The percent gain in precision from stratification is 11%, reflecting the reduction in average interval width achieved by incorporating strata.
You can reproduce these results using the following replicate estimate files located in the datasets/HF/
directory:
- clust_sys_agb_ests_1000reps.csv (unstratified),
- strat_clust_sys_agb_ests_1000reps.csv (stratified).
Interestingly, the gains in precision from stratification are similar to those observed in Section 13.3.6.2, despite the use of clustered PSUs with differing plot sizes and configurations. These gains highlight the informational value of effective stratification, independent of the specific sampling design.
13.5.7 Advantages and disadvantages of cluster sampling
Cluster sampling is widely used in forest inventory because it can substantially reduce the cost and complexity of field data collection. It’s especially useful when measurement is inexpensive but travel between sampling locations is costly—a common situation in large-area surveys. For these reasons, cluster designs are a core feature of many NFI programs.
13.5.7.1 Advantages
Operational efficiency: By grouping nearby SSUs into PSUs, cluster sampling reduces travel time and limits the need to move frequently between dispersed sampling locations.
Simplified logistics: Field crews can complete several measurements within a single PSU, minimizing the time and effort required to reposition across the landscape.
Aligned with operational structure: Many forest monitoring programs are built around geographic or administrative PSUs, making cluster sampling a natural choice.
13.5.7.2 Disadvantages
Reduced statistical efficiency: SSUs in the same PSU are often spatially autocorrelated and provide less independent information than SSUs sampled across the full population.
Smaller effective sample size: Because randomization occurs at the PSU level, the effective sample size reflects the number of PSUs rather than the total number of SSUs, which can reduce precision.
Despite these limitations, cluster sampling remains a valuable option when logistical constraints are high and moderate precision is acceptable. It often offers the best tradeoff between field efficiency and statistical rigor in large-area inventory.
13.6 Multistage sampling
Multistage sampling is a generalization of cluster sampling in which selection occurs in two or more successive stages. Rather than drawing a sample directly from the population, units are chosen in stages: larger primary sampling units (PSUs) are selected first, and smaller secondary sampling units (SSUs) are then sampled within them.
While multistage designs can have any number of stages, two-stage sampling is by far the most common. In this design the stages are:
a sample of PSUs is selected (e.g., stands, management units, or other areal units used to partition the population of interest);
within each sampled PSU, a further sample of SSUs is selected (e.g., plots, points, or some other sampling unit delineation).
This two-stage structure offers flexibility, allowing tailoring of sampling intensity within PSUs, accommodating operational constraints, and reducing travel costs by concentrating sampling effort.
13.6.1 Connection to cluster sampling
Two-stage sampling is closely related to cluster sampling. The cluster design introduced in Section 13.5 is a special case of two-stage sampling where PSUs are selected using SRS or SYS, and all SSUs within each selected PSU are measured.
In two-stage sampling, instead of measuring every SSU in each selected PSU, a probability sample of SSUs is taken at the second stage. This distinction is important: in cluster sampling, sampling error reflects only variation among PSUs; in two-stage sampling, both stages involve probabilistic selection and contribute to the total variance, and estimation must account for this hierarchical structure.
13.6.2 Why use multistage sampling?
Multistage sampling is often chosen because it aligns with how forests are managed and how fieldwork is organized. For example, it:
takes advantage of existing natural, management, or administrative units (e.g., stands, ownerships) as convenient PSUs;
reduces travel time and cost by allowing crews to work intensively within a few selected PSUs rather than dispersing effort across the entire area;
uses auxiliary data, such as stand area or previous inventory estimates, often available at the PSU level, to guide sampling;
allows strategic selection of PSUs using sampling methods that better capture variability in the population and the contribution of each PSU to the parameter of interest.
This design adapts easily to real-world conditions. The number of SSUs per PSU can be varied, different sampling methods can be used at each stage, and unequal probabilities can be applied to match sampling effort with variation in the population.
13.6.3 Strategies for selecting PSUs
The first step in two-stage sampling is selecting PSUs. Common strategies include the following.
Simple random sampling (SRS): if no prior information is available, draw a random sample of PSUs, giving each an equal selection probability.
Stratification: if PSU-level variables are available (e.g., area, expected volume, productivity class), stratify PSUs and sample within each stratum to improve precision (see Section 13.3.1).
Systematic sampling: if PSU-level variables are known, sort PSUs by one of these variables (e.g., area, expected volume) and apply systematic sampling with a random start to ensure broad representation.
Probability proportional to size (PPS): if each PSU has a known measure of size or importance (e.g., area, expected volume), select PSUs with probabilities proportional to that measure.
These approaches can be combined. For example, PSUs might be stratified by size, then selected using equal probability, systematic, or PPS methods within each stratum. Stratum-specific two-stage estimates for means, totals, and standard errors can then be combined as described in Section 13.5.5.
Availability of auxiliary data usually guides the choice of PSU selection method and two-stage estimator. Without auxiliary data, equal probability sampling is the default. When auxiliary variables are available, they can be used to reduce within-stratum variance or concentrate sampling in strata that contribute most to the parameter of interest. When a size-related variable is available and correlated with the parameter, PPS methods can improve efficiency by allocating sampling effort in proportion to each PSU’s contribution.
The choice of two-stage estimator also depends on the sampling approach at the second stage. Some estimators assume a finite population of SSUs, while others treat the second-stage population as effectively infinite—a view that aligns with point and plot sampling in forest inventory.
We present the estimators’ more general forms, allowing for PSUs of differing sizes and varying numbers of SSUs per PSU. When PSU sizes are equal or the number of SSUs is constant across PSUs, simplified versions of these estimators can be found in Cochran (1977), Thompson (2012), S. L. Lohr (2019), and other survey sampling references.
Because two-stage designs involve both between-PSU and within-PSU variation, the resulting formulas can appear long and a bit daunting. Treated as step-by-step recipes and paired with the tools in Section 10.2, however, they’re straightforward to implement. In the end, each estimator delivers a single mean and total, along with their standard errors.
13.6.4 Estimation using two-stage sampling
We use the following notation:
- \(N\) number of PSUs in the population;
- \(n\) number of PSUs in the sample;
- \(m_j\) number of sampled SSUs in the \(j\)-th sampled PSU;
- \(M_j\) number of SSUs in the \(j\)-th sampled PSU;
- \(M = \sum^N_{j = 1} M_j\) number of SSUs in the population;
- \(y_{ji}\) value for the \(i\)-th SSU in the \(j\)-th PSU;
- \(\bar{y}_j = \frac{1}{m_j} \sum_{i=1}^{m_j} y_{ji}\) sample mean within PSU \(j\);
- \(s_j^2 = \frac{1}{m_j - 1} \sum_{i=1}^{m_j} (y_{ji} - \bar{y}_j)^2\) sample variance within PSU \(j\).
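In code, the within-PSU quantities \(m_j\), \(\bar{y}_j\), and \(s^2_j\) come from a single group_by() and summarize() pass over SSU-level data. Here’s a minimal sketch using a small made-up data frame; ssu_data, psu_id, and y_ji are illustrative names, not objects from this chapter’s datasets.
library(dplyr)

# Made-up SSU-level data: 4 sampled PSUs, each with a handful of plot (SSU)
# values y_ji (illustrative only).
set.seed(13)
ssu_data <- data.frame(
  psu_id = rep(1:4, times = c(5, 3, 6, 4)),
  y_ji = rnorm(18, mean = 30, sd = 8)
)

# Per-PSU summaries used by the two-stage estimators.
ssu_data %>%
  group_by(psu_id) %>%
  summarize(
    m_j = n(),            # number of sampled SSUs in PSU j
    y_bar_j = mean(y_ji), # within-PSU sample mean
    s2_j = var(y_ji)      # within-PSU sample variance
  )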
13.6.4.1 Sampling PSUs with equal probability
An unbiased estimator for the population total \(\tau\) is \[\begin{equation} \hat{t} = \frac{N}{n}\sum^n_{j=1}M_j\bar{y}_j. \tag{13.68} \end{equation}\]
Given \(\hat{t}\), the population mean \(\mu\) is estimated as
\[\begin{equation}
\bar{y} = \frac{\hat{t}}{M}.
\tag{13.69}
\end{equation}\]
The standard error for \(\hat{t}\) is estimated using
\[\begin{equation}
s_{\hat{t}} = \sqrt{
\underbrace{\frac{N^2}{n}\left(1 - \frac{n}{N}\right)s^2}_{\text{between-PSU variance}}
+
\underbrace{\frac{N}{n} \sum_{j = 1}^n \frac{M_j^2}{m_j}\left(1 - \frac{m_j}{M_j}\right)s^2_j}_{\text{within-PSU variance}}
},
\tag{13.70}
\end{equation}\]
where \(s^2\) is the among-PSU sample variance, estimated as
\[\begin{equation}
s^2 = \frac{1}{n - 1} \sum_{j = 1}^n \left(M_j \bar{y}_j - \frac{\hat{t}}{N}\right)^2.
\end{equation}\]
Given \(s_{\hat{t}}\), the standard error for \(\bar{y}\) is
\[\begin{equation}
s_{\bar{y}} = \frac{s_{\hat{t}}}{M}.
\tag{13.71}
\end{equation}\]
Equation (13.70) shows how total variance is partitioned into two components.
The first term (between-PSU variance) captures variation among PSU totals. It includes a finite population correction (FPC) because PSUs are sampled without replacement from a finite set. That is, the number of PSUs in the population is known and countable (see Section 11.3).
The second term (within-PSU variance) captures variation among SSUs within each PSU. SSUs are also sampled without replacement from a finite set, so this term includes an FPC as well.
This estimator is unbiased for the population total and mean, as well as their variances.
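To make the recipe concrete, (13.68) through (13.71) can be written as a short dplyr summary. The sketch below assumes a data frame psu_data with one row per sampled PSU and columns M_j, m_j, y_bar_j, and s2_j (matching the notation above); the function name and arguments are illustrative rather than part of this chapter’s workflow.
library(dplyr)

two_stage_unbiased <- function(psu_data, N, M) {
  n <- nrow(psu_data)
  psu_data %>%
    summarize(
      t_hat = N / n * sum(M_j * y_bar_j),                    # (13.68)
      y_bar = t_hat / M,                                     # (13.69)
      s2 = 1 / (n - 1) * sum((M_j * y_bar_j - t_hat / N)^2), # among-PSU variance
      s_t_hat = sqrt(
        N^2 / n * (1 - n / N) * s2 +                          # between-PSU term
          N / n * sum(M_j^2 / m_j * (1 - m_j / M_j) * s2_j)   # within-PSU term
      ),                                                     # (13.70)
      s_y_bar = s_t_hat / M                                  # (13.71)
    )
}
Given such a psu_data frame, the function returns the estimated total, mean, and their standard errors in a single row.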
If the number of SSUs within selected PSUs is not known or is assumed infinite—as in areal sampling with points or plots—the ratio estimator defined next can be used. In many cases, even when SSUs are assumed finite and counts are known, the ratio estimator provides more stable or efficient estimates than the unbiased estimator defined above.
Although the ratio estimator presented below is biased, the bias is often small and represents a reasonable tradeoff for the gains in accuracy and precision it can provide. We’ll quantify this bias using an applied example in Section 13.6.5.
The population mean \(\mu\) is estimated using \[\begin{equation} \bar{y} = \frac{\sum^n_{j=1}M_j\bar{y}_j}{\sum^n_{j=1}M_j}, \tag{13.72} \end{equation}\] and its standard error is estimated as
\[\begin{equation} s_{\bar{y}} = \bar{y}\sqrt{\left(\frac{n}{n-1}\right)\left(\frac{\sum^n_{j=1}M_j^2}{\left(\sum^n_{j=1}M_j\right)^2} + \frac{\sum^n_{j=1}\left(M_j\bar{y}_j\right)^2}{\left(\sum^n_{j=1}M_j\bar{y}_j\right)^2} - \frac{2\sum^n_{j=1}M_j^2\bar{y}_j}{\left(\sum^n_{j=1}M_j\right)\left(\sum^n_{j=1}M_j\bar{y}_j\right)}\right)\left(1-\frac{n}{N}\right)}. \tag{13.73} \end{equation}\]
Importantly, for the ratio estimator, \(M_j\) can represent a measure of PSU size or relative size; it doesn’t need to be the actual count of SSUs in the PSU.
Given the sample mean and its standard error, the total and its standard error are computed as
\[\begin{equation}
\hat{t} = M\bar{y}
\tag{13.74}
\end{equation}\]
and
\[\begin{equation}
s_{\hat{t}} = M s_{\bar{y}},
\tag{13.75}
\end{equation}\]
where again, \(M = \sum^N_{j=1} M_j\) is the sum of PSU size measures across all PSUs—either actual SSU counts or relative measures of PSU size.
For both the unbiased and ratio estimators, confidence intervals for the mean and total follow the usual form \[\begin{align*} \bar{y} \pm t_{df, 1 - \frac{\alpha}{2}} s_{\bar{y}}, \\ \hat{t} \pm t_{df, 1 - \frac{\alpha}{2}} s_{\hat{t}}, \end{align*}\] with degrees of freedom \(df = n - 1\).
13.6.4.2 Sampling PSUs with probability proportional to size
When auxiliary information on PSU size is available before sampling, efficiency can be improved by selecting PSUs with PPS. Size is often measured as area, but any variable positively correlated with the parameter of interest can serve. This is the same idea you saw in point sampling (Section 12.4), where trees were selected with probability proportional to their basal area, and in 3P sampling (Section 13.4.4.4), where trees were selected with probability proportional to their predicted values. Here, we apply the concept to PSUs, selecting them with probability proportional to the number of SSUs or some other measure of PSU size.
We assume the following design elements.
Each PSU \(j\) has a selection probability \(\pi_j = M_j / M\).
Sampling is done with replacement, with \(n\) independent draws. If a PSU is selected more than once, a new SSU sample is drawn independently each time.
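For example, a with-replacement PPS draw of PSUs takes only a few lines of R. The PSU frame below is made up for illustration, and M_j stands in for whatever size measure is available.
library(dplyr)

# Made-up PSU frame: 200 PSUs with size measures M_j (illustrative only).
set.seed(13)
psu_frame <- data.frame(psu_id = 1:200, M_j = runif(200, 10, 150))

# Draw n PSUs with replacement, with selection probability pi_j = M_j / M.
n <- 10
pps_draws <- psu_frame %>%
  mutate(pi_j = M_j / sum(M_j)) %>%
  slice_sample(n = n, weight_by = pi_j, replace = TRUE)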
The population total \(\tau\) is estimated using \[\begin{equation} \hat{t} = \frac{1}{n} \sum_{j = 1}^n \frac{M_j \bar{y}_j}{\pi_j}, \tag{13.76} \end{equation}\] where the sum is taken over the \(n\) PSUs selected with replacement. Some PSUs may appear more than once, and each appearance contributes an independent term to the sum.
The population mean is estimated as
\[\begin{equation}
\bar{y} = \frac{\hat{t}}{M}.
\tag{13.77}
\end{equation}\]
The standard error for \(\hat{t}\) is estimated using
\[\begin{equation}
s_{\hat{t}} = \sqrt{\frac{1}{n(n - 1)} \sum_{j = 1}^n \left( \frac{M_j \bar{y}_j}{\pi_j} - \hat{t} \right)^2 }.
\tag{13.78}
\end{equation}\]
Given \(s_{\hat{t}}\), the standard error for \(\bar{y}\) is
\[\begin{equation}
s_{\bar{y}} = \frac{s_{\hat{t}}}{M}.
\tag{13.79}
\end{equation}\]
Confidence intervals for the mean and total follow the usual form \[\begin{align*} \bar{y} \pm t_{df, 1 - \frac{\alpha}{2}} s_{\bar{y}}, \\ \hat{t} \pm t_{df, 1 - \frac{\alpha}{2}} s_{\hat{t}}, \end{align*}\] with degrees of freedom \(df = n - 1\).
This estimator is unbiased for both the total and mean and their associated variances. PPS sampling is especially useful when there is high variation in PSU size and that size is related to the parameter of interest. By allocating sampling effort in proportion to PSU contribution to the population parameter, PPS sampling can yield lower standard errors compared with equal-probability sampling.
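Like the equal probability estimators, (13.76) through (13.79) reduce to a short summary over the selected PSUs. The sketch below assumes a data frame pps_psu_data with one row per PPS draw (a PSU selected twice contributes two rows) and columns M_j, pi_j, and y_bar_j; the names are illustrative rather than part of this chapter’s workflow.
library(dplyr)

pps_two_stage <- function(pps_psu_data, M) {
  n <- nrow(pps_psu_data)
  pps_psu_data %>%
    summarize(
      t_hat = 1 / n * sum(M_j * y_bar_j / pi_j),                # (13.76)
      y_bar = t_hat / M,                                        # (13.77)
      s_t_hat = sqrt(1 / (n * (n - 1)) *
                       sum((M_j * y_bar_j / pi_j - t_hat)^2)),  # (13.78)
      s_y_bar = s_t_hat / M                                     # (13.79)
    )
}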
13.6.5 Illustrative analysis
We’ll demonstrate the two-stage ratio estimator using a motivating example based on simulated data. While the dataset is synthetic, the scenario reflects the kind of feasibility analysis forestry consultants and analysts routinely perform.
Suppose we’re hired by a mass timber mill and fabrication firm exploring the viability of locating a facility in the region shown in Figure 13.15. The firm expects the state’s Department of Natural Resources (DNR) to be its primary supplier and wants to assess how much suitable biomass is available on DNR-managed lands around the proposed facility. This kind of pre-operational analysis is often referred to as a woodbasket study, where the woodbasket refers to the supply of accessible wood within a defined region.
FIGURE 13.15: Conifer and mixed hardwood/softwood stands managed by the Michigan Department of Natural Resources within a one-hour drive of the proposed mass timber mill and fabrication facility. These stands define the population of primary sampling units (PSUs). A simple random sample of 50 PSUs was selected for inventory.
To identify the region of interest, the firm determines that a one-hour travel time is the upper limit for economically feasible wood hauling. We use this threshold to compute an isochrone—a polygon showing the area reachable within a fixed travel time using the road network. In this case, the isochrone includes all locations accessible within one hour of the proposed facility.165 The fabrication process relies primarily on softwoods, though the firm is experimenting with some hardwood inputs.
Our parameters of interest are the mean per-acre and total dry weight of aboveground biomass (AGB) on all DNR conifer and mixed hardwood/softwood stands within the one-hour isochrone.
The population consists of 8927 stands, and our cruising budget supports a sample of 50. Stands serve as the PSUs, and the SSUs are inventory plots placed within each one. We’ll use 1/10-acre circular plots laid out on a regular 7-by-7 chain (462 ft) grid—a systematic design commonly used in operational inventory work throughout the region.
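As a quick check on what that grid implies, 10 square chains make an acre, so each plot represents 7 × 7 = 49 square chains, or 4.9 acres, which amounts to roughly a 2% cruise. The arithmetic, with illustrative object names, is below.
plot_ac <- 0.1        # plot area in acres
grid_ac <- 7 * 7 / 10 # acres represented by each plot (10 square chains per acre)
plot_ac / grid_ac     # sampling intensity, about 0.02, i.e., roughly a 2% cruise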
Figure 13.15 shows the full PSU population along with 50 PSUs selected using SRS. Figure 13.16 highlights a region containing several selected PSUs, and Figure 13.17 provides a closer view of these PSUs and their SSUs.
FIGURE 13.16: Zoomed-in view from Figure 13.15, highlighting three selected primary sampling units (PSUs).
FIGURE 13.17: The three primary sampling units (PSUs) shown in Figure 13.16. PSU IDs are 1895, 1791, and 1948 for (a), (b), and (c), respectively. Secondary sampling units (SSUs) were placed using systematic sampling. Each circular inventory plot is colored by the observed aboveground biomass (AGB).
Because the data are simulated, the true population values are known: \(\mu\) = 30.69 tons/ac and \(\tau\) = 11,100,216 tons. This allows us to evaluate estimator performance against these known parameters.
We begin by estimating mean and total AGB using a sample of 50 PSUs selected using SRS. Under this design, each PSU in the population has an equal probability of selection. The following three data files are used in this analysis.
- datasets/mass_timber/all_psu.csv contains the full set of PSUs in the population. Each row is a PSU, and columns hold a unique PSU index and area in acres.
- datasets/mass_timber/sampled_equal_prob_psu.csv contains the sampled PSUs. These were drawn from the full set of PSUs using dplyr::slice_sample() with \(n=50\) and all arguments set to their default values.
- datasets/mass_timber/sampled_ssu_equal_prob_psu.csv contains the SSUs for each sampled PSU. Each row corresponds to a tree measured on the 1/10-acre circular plot, and columns are: psu_id and ssu_id, which together uniquely identify each tree; tree_count, which indicates how many trees are associated with each row (used to account for zero-tree plots and trees double-counted via the mirage method); and agb_ton_tree, the dry weight AGB in tons for each tree—our variable of interest, the \(y_{ji}\)’s.
The code below reads and takes a quick look at each data file.
all_psu <-
  read_csv("datasets/mass_timber/all_psu.csv")
all_psu %>% glimpse()
#> Rows: 8,927
#> Columns: 2
#> $ psu_id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,…
#> $ ac <dbl> 44.565, 21.690, 51.851, 131.325, 75.43…
sampled_psu <-
read_csv("datasets/mass_timber/sampled_equal_prob_psu.csv")
sampled_psu %>% glimpse()
#> Rows: 50
#> Columns: 2
#> $ psu_id <dbl> 1017, 8004, 4775, 8462, 4050, 8789, 13…
#> $ ac <dbl> 30.787, 18.785, 11.723, 16.064, 38.443…
ssu_trees <-
read_csv("datasets/mass_timber/sampled_ssu_equal_prob_psu.csv")
ssu_trees %>% glimpse()
#> Rows: 7,750
#> Columns: 4
#> $ psu_id <dbl> 1017, 1017, 1017, 1017, 1017, 10…
#> $ ssu_id <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ tree_count <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ agb_ton_tree <dbl> 0.16786, 0.16999, 0.14626, 0.151…
We’ll now rename a few columns to align with our notation and create variables needed in subsequent steps.
all_psu <- all_psu %>% rename(A_j = ac)
sampled_psu <- sampled_psu %>% rename(A_j = ac)
N <- nrow(all_psu)
n <- nrow(sampled_psu)
A <- sum(all_psu$A_j)
# Add the tree factor for 1/10-acre plot.
r <- sqrt(0.1 * 43560 / pi)
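# r is the plot radius in feet; the tree factor TF expands each tallied tree
# to a per-acre basis and works out to 1 / 0.1 = 10 for a 1/10-acre plot.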
ssu_trees <- ssu_trees %>% mutate(TF = 43560 / (pi * r^2))
The ratio estimator for the mean, total, and their associated standard errors—as defined in (13.72), (13.74), (13.73), and (13.75)—is implemented in two steps. First, we summarize SSU-level information within each PSU. Second, we compute the required quantities across sampled PSUs to produce the population estimates. Here, our measure of PSU size is area, i.e., \(M_j = A_j\), which is joined to psu_summary
in the final piped expression below.
psu_summary <- ssu_trees %>%
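  # First summarize(): per-acre AGB for each plot (SSU), giving the y_ji values.
  # Second summarize(): average the SSU values within each PSU to get y_bar_j.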
group_by(psu_id, ssu_id) %>%
summarize(y_ji = sum(TF * tree_count * agb_ton_tree)) %>%
summarize(
y_bar_j = mean(y_ji)
) %>%
left_join(sampled_psu, by = join_by(psu_id))
psu_summary
#> # A tibble: 50 × 3
#> psu_id y_bar_j A_j
#> <dbl> <dbl> <dbl>
#> 1 465 54.2 145.
#> 2 526 56.5 126.
#> 3 878 10.9 33.8
#> 4 1017 33.0 30.8
#> 5 1069 56.4 29.6
#> 6 1129 8.76 26.2
#> 7 1222 26.3 11.1
#> 8 1301 48.7 27.9
#> 9 1530 53.3 41.7
#> 10 1791 49.4 148.
#> # ℹ 40 more rows
pop_summary <- psu_summary %>%
summarize(
y_bar = sum(A_j * y_bar_j) / sum(A_j),
s_y_bar = y_bar *
sqrt(n / (n - 1) *
(sum(A_j^2) / sum(A_j)^2 +
sum((A_j * y_bar_j)^2) / sum(A_j * y_bar_j)^2 -
2 * sum(A_j^2 * y_bar_j) / (sum(A_j) * sum(A_j * y_bar_j))
) * (1 - n / N)),
tau_hat = A * y_bar,
s_tau_hat = A * s_y_bar
)
pop_summary
#> # A tibble: 1 × 4
#> y_bar s_y_bar tau_hat s_tau_hat
#> <dbl> <dbl> <dbl> <dbl>
#> 1 35.6 2.93 12860648. 1059993.
Our estimated mean and total, with 95% confidence intervals, were 35.56 (29.67, 41.45) tons/ac and 12,860,648 (10,730,514, 14,990,783) tons, respectively. Based on this sample, both confidence intervals captured the true values, which are \(\mu\) = 30.69 tons/ac and \(\tau\) = 11,100,216 tons.
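For reference, one way to compute those intervals from pop_summary (a sketch, not necessarily how the values above were produced) is shown below.
# 95% confidence intervals using t with df = n - 1 (here n = 50 sampled PSUs).
t_crit <- qt(0.975, df = n - 1)

pop_summary %>%
  mutate(
    y_bar_lwr = y_bar - t_crit * s_y_bar,
    y_bar_upr = y_bar + t_crit * s_y_bar,
    tau_lwr = tau_hat - t_crit * s_tau_hat,
    tau_upr = tau_hat + t_crit * s_tau_hat
  )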
13.6.5.1 Reanalysis using SYS of PSU
In Section 13.2 we applied SYS to areal sampling frames. The same design can also be used with a list sampling frame, where it’s generally referred to as systematic list sampling. When no auxiliary information is available, SYS from an unordered list produces results similar to SRS. However, if an auxiliary variable is available for all units and is correlated with the parameter of interest, sorting the list by that variable before sampling can improve efficiency. The gain comes from implicit stratification: a SYS sample from a sorted list tends to span the full range of the ordering variable, yielding balanced coverage without explicitly defining strata. Inclusion probabilities remain equal, so standard equal-probability estimators still apply.
Here we replace the equal-probability SRS of PSUs used for the ratio estimator with SYS applied to the PSU list sorted by area. We use circular systematic sampling (Section 13.6.5.1.1), which ensures one selection per interval and avoids duplicates. As shown later, sorting the list by PSU area modestly reduces RMSE and improves confidence-interval coverage. The takeaway: SYS can be applied to list sampling frames, and when a relevant auxiliary variable exists, applying SYS along its gradient can yield more efficient estimates.
13.6.5.1.1 Circular systematic sampling for a list
In our analysis, the “list” refers to the population of PSUs, identified by the unique index psu_id
in the all_psu
data frame. Circular SYS treats this list as if it were arranged around a circle: choose a random start, take units at fixed intervals, and wrap to the beginning when you reach the end, continuing until \(n\) distinct PSUs are selected. This preserves equal inclusion probabilities, avoids selecting the same unit twice, and—when the list is sorted by a relevant auxiliary variable—can produce a well-spread sample (see Cochran (1977), Section 8.1; K. Iles (2003)).
The function sys_circular(N, n)
in scripts/utils.R
implements this method, returning the row indices to select. In the code below, sample_index
holds these indices. The piped sequence then orders all_psu
by acres, selects the sampled rows, and assigns them to sampled_psu
. SSUs would then be sampled within each PSU (virtually here) to form estimates.
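The selection step might look like the sketch below. The sys_circular() shown here is a simple illustrative implementation that may differ from the version in scripts/utils.R, and the piped sequence follows the description above.
# Circular systematic sampling: random start on a "circle" of N list positions,
# then steps of floor(N / n), wrapping past N, which yields n distinct indices.
sys_circular <- function(N, n) {
  k <- floor(N / n)
  start <- sample.int(N, 1)
  (start + (0:(n - 1)) * k - 1) %% N + 1
}

sample_index <- sys_circular(N, n)

sampled_psu <- all_psu %>%
  arrange(A_j) %>%    # sort the PSU list by area
  slice(sample_index) # keep the circular SYS selections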
13.6.5.1.2 Repeated-sampling comparison
Here we assess the performance of the two-stage ratio estimator when PSUs are selected using either SRS or circular SYS. For each of 1,000 replicate samples, we:
- drew \(n = 50\) PSUs using simple random sampling from the unsorted list;
- drew \(n = 50\) PSUs using circular systematic sampling from the list sorted by PSU area (see Section 13.6.5.1.1).
In both cases, we applied the two-stage ratio estimator to the sampled PSUs and recorded point estimates, standard errors, and confidence intervals. Estimator performance under each sampling approach is evaluated using the following summaries.
Bias: the average difference between the estimate from each sample and the true population mean (see Section 11.2.3 for the formal definition). Closer to zero is better.
RMSE: the square root of the mean squared error (see Section 11.2.3), combining both bias and variability into a single measure of accuracy. Smaller is better.
CI width: the average width of the 95% confidence interval, providing a measure of estimator precision. Narrower is better, as long as CI cover is close to its nominal 95%.
CI cover: the proportion of confidence intervals that contain the true population mean, expressed as a percentage. This measures how well the estimator’s uncertainty estimates are calibrated. Should be close to the nominal value of 95%.
TABLE 13.3: Repeated-sampling performance of the two-stage ratio estimator with PSUs selected by SRS and circular SYS (1,000 replicates).

| PSU selection | Bias (tons/ac) | RMSE (tons/ac) | CI width (tons/ac) | CI cover (%) |
|---|---|---|---|---|
| SRS | -0.39 | 2.6 | 9.2 | 92 |
| SYS | -0.34 | 2.6 | 9.5 | 94 |
Summaries across replicates are presented in Table 13.3. As noted earlier, the ratio estimator is biased; however, the results show that this bias is relatively small (compared with the true mean biomass \(\mu\) = 30.69254 tons/ac). Bias is an inherent property of the estimator, and since both sampling approaches (i.e., SRS and SYS) produce equal probability samples, it’s reassuring that they result in nearly identical estimates of bias. In this example, SYS produced slightly lower RMSE and slightly wider 95% confidence intervals, with coverage closer to the nominal 95%.
These modest gains in RMSE and confidence interval performance afforded by SYS are not surprising, as per-acre biomass is not strongly correlated with stand size in this dataset. While that relationship is weak here, applying SYS to a list sorted by a relevant auxiliary variable can improve efficiency by spreading the sample more evenly across the range of that variable. When the ordering variable is closely related to the parameter of interest, this implicit stratification can substantially reduce variance and enhance confidence interval coverage relative to SRS.
This analysis used the two-stage ratio estimator, which permits the number of SSUs in the \(j\)-th sampled PSU to be represented by a relative size—in our case, the sampled PSU’s area \(A_j\). Many two-stage estimators commonly found in the literature—including the unbiased and PPS-based methods presented here—assume a finite population of SSUs. This limits their applicability when SSUs are sampled from an areal frame, where the population is effectively infinite, as is typical in operational forest inventories. Some extensions accommodate PPS sampling with infinite SSU populations within a design-based framework (e.g., Gregoire and Valentine (2007), Chapter 12), but many practical approaches rely on model-assisted or model-based methods (e.g., Särndal, Swensson, and Wretman (1992); Mandallaz (2007); Banerjee (2024)). These represent important extensions used in practice, though a full treatment is beyond the scope of this book.
13.6.6 Advantages and disadvantages of two-stage sampling
Two-stage sampling is a flexible design that can balance statistical efficiency with operational practicality. It’s especially useful when the population is naturally grouped into larger units (PSUs) that can be further subdivided into smaller units (SSUs).
13.6.6.1 Advantages
Fieldwork can be concentrated within a manageable set of PSUs, reducing travel time and logistical complexity compared to SRS of all SSUs.
Allows the use of equal probability selection or PPS at the first stage, depending on whether auxiliary size information is available.
Flexible first-stage selection strategies allow for balancing statistical efficiency with field logistics, such as ordering PSUs spatially to improve coverage or selecting PPS to better represent larger units.
Well suited for inventories covering large, heterogeneous regions where direct sampling of all SSUs is impractical.
13.6.6.2 Disadvantages
If the number of SSUs per PSU is small or if there is high within-PSU variation, precision can be lower than for single-stage designs with the same total sample size.
Equal probability designs can perform poorly when PSU sizes vary greatly and totals are correlated with size, leading to inflated variance and wider confidence intervals.
Many commonly used two-stage estimators assume a finite population of SSUs. These estimators are not appropriate for forest inventory data collected using an areal sampling frame, where the SSU population is effectively infinite. This limits the applicability of certain standard design-based estimators in forestry contexts.
13.7 Summary
This chapter introduced probability sampling designs and estimation strategies widely used in forest inventory, including systematic, stratified, regression-assisted, cluster, and multistage designs. These designs are not only tools for valid inference—they also provide opportunities to reduce variance, improve precision, and make better use of limited time and budget.
We began with SYS, the most commonly used design in forestry. SYS places sampling locations on a regular grid to improve spatial balance and field efficiency. We showed how to generate and position grids using R, and noted that SYS can also be applied to list frames when sorted by an auxiliary variable.
Stratified sampling was presented as a way to improve precision by dividing the population into homogeneous groups. We reviewed allocation strategies and discussed how stratification supports both inference and logistical planning. Ratio and regression estimators were introduced as tools for incorporating auxiliary variables, with examples illustrating their application within and across strata.
We then applied these estimation strategies and design principles to specialized forestry contexts, including Big BAF and 3P sampling. Both designs follow the same theme introduced in the regression and ratio estimation section: using quick, inexpensive measurements across the full population (or a large subset), and correcting them with a smaller subset of accurate measurements to improve inference.
Finally, we introduced cluster and two-stage sampling designs, which are useful when the population is naturally grouped or when field effort must be concentrated in fewer locations. These designs help reduce travel and logistical costs by allowing multiple observations within a selected unit, while still supporting valid and efficient estimation through appropriate weighting and summarization.
Implementing these designs in R using tidyverse workflows offers major practical advantages. Code built around tools like group_by()
, summarize()
, and spatial operations from the sf
package allows analysts to write modular, transparent routines that mirror the structure of the design itself. This leads to workflows that are not only efficient to execute, but also reproducible, auditable, and easily adapted to different forest inventory contexts—scaling from small studies to regional and national assessments.
13.8 Exercises
TBW
References
While the focus is on using SYS to position sampling locations within an areal sampling frame, the same idea applies to list sampling frames, where SYS is often used to select units from a sorted list. We return to this in Section 13.6.5.1.1 as part of a list-sampling example.↩︎
Any of these objects can be saved to file using
st_write()
from the sf package to be read by GIS or GPS software.↩︎
The specific projection is NAD83 / Pennsylvania South (ft), which corresponds to EPSG:2271.↩︎
In
plot_sys_grid()
, notice how the geom_sf() layers automatically render each sf object based on its geometry type—points, lines, or polygons—making it easy to build clean and consistent maps. You can modify the function to adjust colors, symbols, or other plot features to suit your needs.↩︎
We use a
for
loop here to save space. Compare with the hardcoded approach in Section 11.3.6.1.↩︎
The
bounds
object is projected in meters. You can check an sf object’s distance units using st_crs(bounds)$units or units::deparse_unit(st_distance(bounds)).↩︎
This works because
plot_id
is unique across strata. If that’s not the case, for example if plot IDs are reused within each stratum, we’d need to include stratum_id in the grouping to avoid mixing trees from different strata that share the same plot ID.↩︎
At the time of writing, much of the hemlock at Harvard Forest and across the broader region had already succumbed to HWA (D. A. Orwig et al. 2012).↩︎
Volume for guessed trees is based on Girard Form Class 78, International 1/4-inch rule in Table 21 of Mesavage and Girard (1956) (8-inch DBH volumes interpolated from the published 10-inch values). For example, looking at the first row in Table 13.2, a 10-inch DBH tree with 1.5 16-foot logs has a volume of 48 board feet according to Table 21. Volumes for measured trees use the corresponding Form Class 78, International 1/4-inch formula published in Wiant (1986). This formula was used to create R functions in Exercise 5.3. For instance, to calculate the measured volume of a 10.4-inch DBH tree with 27 feet of merchantable height (i.e., 27/16=1.6875 16-foot logs), use
vol_I(10.4, 27/16)
, which returns 57.9 board feet after rounding.↩︎
Technically, the VBAR estimate could come from a separate set of sampling locations or even a different sampling design; it’s just convenient to take spatially paired measurements.↩︎
As noted by Gregoire and Valentine (2007) (p. 259), the \(\overline{\text{VBAR}}\) estimator is a mean-of-ratios, and the corresponding standard error estimator—which accounts for each ratio as a function of two random variables—could be used. However, simulation results in Gove et al. (2020) show that treating tree-level VBARs as independent draws from a single random variable and applying the usual sample variance formula yields nearly identical inferences in practice.↩︎
The wide-scale relascope includes larger BAF increments, such as 20.25 \(m^2/\text{ha}\), suitable for Big BAF sampling. To convert from metric to English units (\(\text{ft}^2/\text{ac}\)), multiply by 4.356; the chosen BAFs correspond approximately to English BAFs of 21.8 (small) and 88.2 (big).↩︎
We didn’t cover the
tidyr::count()
and uncount() functions, but do check them out—they’re very handy.↩︎
Trees with \(x_i > \text{max}_z\) are always selected. See the discussion of "sure-to-be-measured" trees later in this section.↩︎
As we’ll see later, the 3P estimator includes a finite population correction (FPC), so this census stratum’s standard error will automatically be zero.↩︎
You can reproduce the estimates for \(\alpha_0\) and \(\alpha_1\) using
You can reproduce the estimates for \(\alpha_0\) and \(\alpha_1\) using
nls(vol_m3 ~ alpha_0 * dbh_cm^alpha_1, start = list(alpha_0 = 0.001, alpha_1 = 2.5), data = dbh_vol)
with dbh_vol read from "datasets/BC_3P/dbh_vol.csv".↩︎
Since these are simulated data, we know they were generated with slightly different ratios.↩︎
This type of road-based travel-time polygon can be computed in R using the
osrm
package, which provides an interface to the Open Source Routing Machine. The package supports travel-time and routing calculations using OpenStreetMap’s road network data. Other tools and packages are also available for generating isochrones, depending on your needs and data sources.↩︎