Chapter 13 Sampling designs, implementation, and analysis
In Chapters 11 and 12, we focused on inference using sample data collected via an SRS design. In this chapter, we introduce several additional probability sampling designs commonly used in forestry. Recall that a sampling design is a set of rules for selecting units from a population and using the resulting data to generate statistically valid estimates for parameters of interest. We focus on four designs: 1) systematic sampling, 2) stratified sampling, 3) cluster sampling, and 4) multistage sampling. These designs differ in how they configure and select observations and in the estimators they prescribe. In most cases, a design is chosen to improve efficiency from a statistical, logistical, or combined standpoint.
As discussed in Section 12.1, forest inventory typically relies on an areal sampling frame, where sampling locations are identified by randomly placing points within the frame’s spatial extent. Under plot sampling, the location might mark the center of a circular plot, a corner of a rectangular plot, or an anchor point for more complex plot shapes. Under point sampling, a location might serve as the point from which a forester projects the discerning angle to identify measurement trees. Unless stated otherwise, the sampling designs described in this chapter can be used with either plot or point sampling—that is, the design governs the selection of sampling locations, while a separate set of rules determines how trees are selected around each location.
After introducing the designs, we shift attention to the use of auxiliary information, which we refer to as sampling with covariates. In forestry, we’re often interested in estimating parameters based on variables that are difficult or expensive to measure. Sampling with covariates improves precision by incorporating other variables that are easier to measure and correlated with the variable of interest.
The remainder of the chapter is organized as follows. Sections 13.2 through 13.5 describe systematic, stratified, cluster, and multistage sampling designs. Section 13.6 covers strategies for incorporating auxiliary information. As in other mensuration texts (Kershaw et al. 2016; Burkhart, Avery, and Bullock 2018), we present estimators associated with each design. Our added value is in showing how to implement these estimators using tidyverse tools and other R packages. Emphasis is placed on building computing workflows for efficient and repeatable analysis.
13.1 A gentle introduction to sf
Because sampling in forestry is inherently spatial—we’re selecting locations within an areal sampling frame—it’s helpful to start working with spatially aware data structures. In R, the primary tool for handling spatial vector data is the sf package.
If you’ve used GIS software before, many of the concepts will feel familiar. Spatial vector data come in three main types: points (e.g., sampling locations), lines (e.g., transects or travel paths), and polygons (e.g., stand boundaries or management units). The sf package represents each of these as a special kind of data frame with an attached geometry column.
You can work with sf objects much like regular data frames, using familiar tidyverse tools such as filter(), select(), and mutate(). We’ll introduce key sf data structures and functions as needed throughout the chapter. For more background, see the sf documentation at https://r-spatial.github.io/sf or the package vignette via vignette("sf1").
We’ll explore sf in more detail in Section 14.1.5.
13.2 Systematic sampling
Systematic sampling (SYS) is one of the most widely used probability sampling designs in forest inventory. Sampling locations are arranged in a fixed pattern across the population. To introduce randomness and ensure unbiased estimation of means and totals, the pattern—typically a rectangular grid—is positioned using a randomly chosen starting point. As illustrated in Figure 1.2, sampling locations are usually aligned along straight lines. The distance between locations on a line, known as the sampling interval, and the distance between lines can differ, creating a rectangular grid. This is commonly referred to as line-plot sampling, which is our focus here.
SYS designs are popular in forestry for several reasons. Compared to SRS, their advantages include:
- Sampling locations are evenly distributed across the population, improving the reliability of mean and total estimates. In contrast, SRS locations may, by chance, miss important spatial features.
- Sampling locations are easier and faster to find, reducing cost. The cruiser can follow straight lines between locations, which minimizes travel time and simplifies relocation during remeasurement.
The grid is typically positioned using GIS or spatial software (e.g., R), and the resulting points are uploaded to a GPS device for navigation. Alternatively, the cruiser may locate the first sampling point in the field and lay out the remainder using the sampling interval and grid orientation.
Key considerations when implementing a SYS design include:
- the pattern and orientation of the grid;
- the method used to randomly position the grid.
These points are covered in the next few sections.
13.2.1 Grid pattern, number of grid points, and orientation
In practice, either a rectangular or square grid is used. A rectangular grid defines the spacing between sampling locations along a line, denoted \(D_p\) (think “plot” or “point”), and the spacing between lines, \(D_l\). A square grid is a special case where \(D_{\text{sq}} = D_p = D_l\). Unless there’s a compelling reason to use unequal spacing, a square grid is a reasonable and efficient default.
Given a sample size \(n\) and total area \(A\), the spacing between locations in a square grid is given by
\[\begin{equation} D_{\text{sq}} = \sqrt{\frac{cA}{n}}, \end{equation}\]
where \(c\) is a conversion factor to express area in square distance units. For example, use \(c = 10,\!000\ \text{m}^2/\text{ha}\) if \(A\) is in hectares and \(D_{\text{sq}}\) is in meters, or \(c = 43,\!560\ \text{ft}^2/\text{ac}\) if \(A\) is in acres and \(D_{\text{sq}}\) is in feet.
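As a quick sketch in R (the tract area and sample size here are hypothetical):
# Square-grid spacing for a 35 ha tract and a target of 44 plots.
A <- 35       # Area (ha).
n <- 44       # Target sample size.
conv <- 10000 # Conversion factor c (m^2 per ha).
D_sq <- sqrt(conv * A / n)
D_sq          # About 89 m between points and between lines.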
As introduced in Section 11.3.6, sample size \(n\) should be based on the level of precision needed to meet the survey’s objectives. However, in practice, it’s common for experienced foresters to adopt a sampling intensity that works well for their forest type and inferential objectives. For example, the forester who conducted the Elk County cruise described in Section 1.2.2 typically uses a seven-by-seven chain SYS grid.
If you do use a sample size calculation, as recommended in Section 11.3.6, it’s likely that the number of grid points won’t exactly match the computed \(n\) due to how the grid falls within the boundary. In such cases, we agree with the following sage words.
My advice is simply to take the number of points that are indicated [by the grid], and live with the slight variability. Sample sizes that are “needed” … are just estimated anyway.
— Kim Iles
If you really must have a specific sample size, then the best approach is to specify a denser grid than needed and randomly or systematically thin points until the target sample size is reached (see, e.g., K. Iles (2003)).
The grid orientation can be adjusted to improve field logistics. For example, aligning lines with a forest boundary may simplify navigation. This was the approach used in the Elk County cruise (Figure 1.2).
Although rare, it’s possible for a population to exhibit a systematic pattern that aligns with the grid’s spacing and orientation, which could introduce bias. For example, a grid aligned with plantation rows might consistently place sampling locations between rows, leading to an unrepresentative sample. In such cases, rotating the grid to avoid alignment with periodic features can mitigate bias.
If a known gradient exists in the population—for example, productivity increasing with elevation—orienting the grid perpendicular to that gradient can improve the representativeness of the sample.
13.2.2 Positioning the grid
To avoid bias, the grid must be randomly placed. This is typically done by selecting a random point start within the areal sampling frame and extending the grid in each direction from that point. If rotation is applied, it should be performed about this random point start.
The random point start can be one of the sampling locations or used as an anchor point from which to extend the grid. In the development that follows, we treat the starting point as a location on the sampling grid.
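As a minimal sketch, assuming an sf polygon object bounds in a projected coordinate system (one is read in Section 13.2.4), a random point start can be drawn with st_sample():
library(sf)
set.seed(1)
# Draw one uniformly random location within the boundary polygon.
random_start <- st_sample(bounds, size = 1)
The sys_grid() function introduced next wraps this step together with grid construction, rotation, and clipping.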
In the next section, we introduce a function that generates systematic sampling grids in R. This function allows users to specify grid spacing, orientation, and how sampling locations are ordered. The code and examples also help highlight some practical considerations when laying out systematic designs.
13.2.3 Creating a systematic sampling grid with sys_grid()
This section is formatted like a mini R manual page—a style introduced earlier in the book—to briefly document the sys_grid() function. You’re encouraged to read through the function source code in "scripts/utils.R" (see Section 5.4) to deepen your understanding. The comments in the code walk through each step, and many elements should already be familiar.
Description
Creates a systematic grid of points within a polygonal boundary using a random start. The grid is defined by a fixed spacing between lines (D_l) and between points along each line (D_p). The grid can be rotated and the points ordered from a specified origin to facilitate field navigation.
Usage
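The call signature isn’t shown here; reconstructed from the argument list below, and mirroring the strat_opt_sys_grid() signature in Section 13.3.5.2, it is presumably
sys_grid(boundary, D_l, D_p, angle = 0, origin = "se")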
Arguments
- boundary: An sf polygon object defining the sampling boundary. Must be projected with linear units.
- D_l: Distance between sampling lines (x-direction) in the same units as boundary.
- D_p: Distance between points along a line (y-direction) in the same units as boundary.
- angle: Optional rotation angle (in degrees) applied to the grid. Default is 0.
- origin: Optional character string indicating the origin corner used to order points. One of "se", "sw", "ne", or "nw" (default is "se").
Details
The sys_grid() function creates a rectangular grid of points with spacing determined by D_l and D_p, starting from a randomly placed point within the boundary polygon. Here’s how it works:
- A random point start is placed within the boundary.
- A full grid is generated around that point, extended enough to accommodate possible rotation.
- Points are ordered in a snaking pattern from the specified origin.
- If specified, the grid is rotated by angle degrees about the random point start.
- The grid is clipped to the polygon, and the resulting points are labeled and returned.
Ordering by origin can help plan field navigation, especially when following lines. The choice of origin affects how line_id, point_id_in_line, and point_id are assigned, but not the design itself.
Value
Returns a list with the following components:
- grid: An sf point object with columns point_id (unique ID, ordered from the origin), line_id (identifier for each line), and point_id_in_line (position along each line). Columns in boundary are appended.
- path: An sf LINESTRING connecting the points in sampling order.
- random_point_start: The initial random point start (an sf point object).
- boundary: The input polygon(s) used to clip the grid.
- type: A string used internally to verify the object when plotting.
Use plot_sys_grid() to visualize the resulting grid and path.
13.2.4 Systematic sampling grid illustration with sys_grid()
Let’s apply sys_grid() to set up a SYS design for the Elk County cruise described in Section 1.2.2. In that cruise, the forester used a square grid with sampling locations spaced every seven chains (recall that one chain equals 66 feet).
The sys_grid() function expects the boundary polygon to be in a projection that uses the same distance units as D_l and D_p. The Elk County property is stored in the "datasets/PA/Elk_bounds" ESRI shapefile (Environmental Systems Research Institute 1998), which uses a geographic coordinate system (i.e., decimal degrees of latitude and longitude)—not an intuitive measurement system for field work. Instead, we want the boundary in a projected coordinate system that allows us to specify D_l and D_p in feet or meters.
After reading the shapefile using st_read(), the code below reprojects the boundary to a projection appropriate for northern Pennsylvania and consistent with the cruise protocol. This is done using st_transform().
# Read in the bounding polygon.
library(sf) # Provides st_read(), st_transform(), and friends.
bounds <- st_read("datasets/PA", "Elk_bounds", quiet = TRUE)
bounds <- st_transform(bounds, crs = st_crs(2271))
Once reprojected, bounds has units in feet, so we specify D_l and D_p in feet when calling sys_grid(). Because we don’t specify values for angle and origin, the defaults of 0 and "se", respectively, are used. To ensure the same random point start is generated each time the code is run, we set the random seed prior to calling the function.
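The original call isn’t reproduced here; a reconstruction consistent with the seven-by-seven chain design and the variations shown later is:
set.seed(1)
sys <- sys_grid(bounds, D_l = 7 * 66, D_p = 7 * 66)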
We can visualize the resulting SYS design by passing the sys object to the plot_sys_grid() function, defined in the "scripts/utils.R" script. The resulting map is shown in Figure 13.1. We encourage you to explore the plot_sys_grid() function to see how it builds the graphic using ggplot(). Compared to the simpler examples in Chapter 9, this plot is a bit more involved, with a customized legend and multiple spatial layers.
FIGURE 13.1: Systematic sampling design for an Elk County property cruise.
Let’s create a few variations on the design shown in Figure 13.1 by rotating the grid and choosing a different origin—while keeping the same random point start. This isolates the effect of orientation and ordering. To reduce code clutter, we don’t show the calls to plot_sys_grid() or the patchwork functions used to produce the composite graphic shown in Figure 13.2.
# (a) Rotate grid 45 degrees.
set.seed(1)
sys_a <- sys_grid(bounds,
D_l = 7 * 66, D_p = 7 * 66, angle = 45)
# (b) Northwest corner start.
set.seed(1)
sys_b <- sys_grid(bounds,
D_l = 7 * 66, D_p = 7 * 66, origin = "nw")
# (c) Rotation and northwest corner start.
set.seed(1)
sys_c <- sys_grid(bounds,
D_l = 7 * 66, D_p = 7 * 66,
angle = 45, origin = "nw")
FIGURE 13.2: Systematic sampling design for an Elk County cruise under different grid rotations and starting origins.
Notice in Figures 13.1 and 13.2 that the random point start remains in the same location—it’s the only random component in the systematic sampling design. Once the starting point is fixed, all other sampling locations are determined by the specified spacing and rotation. This highlights a key feature of systematic sampling: although the layout may appear highly structured, the randomness introduced by the initial point ensures the design retains its probabilistic foundation. As a result, estimators derived from such a design can still be used for valid statistical inference under appropriate assumptions.
With a sampling grid in place, we now turn to estimation using data collected under systematic sampling.
13.2.5 Estimation using SYS
The SYS estimators for the population mean \(\mu\) (11.1) and total \(\tau\) (11.6) are the same sample mean \(\bar{y}\) (11.15) and total \(\hat{t}\) (11.22) used under SRS (Section 11.3); both remain unbiased under SYS.
There is, however, no unbiased estimator for the sample variance (and thus no estimator for the standard error of the mean) under SYS. The most common solution is to apply the SRS estimators for variance (11.17) and standard error (11.21).
In some cases, however, the SRS variance and standard error of the mean estimators can overestimate uncertainty, yielding conservative confidence intervals. In others, they can underestimate it. Unfortunately, it’s not straightforward to know which case you’re in without additional information about the population.
It’s easy to build some intuition about why SYS lacks a valid variance estimator. Recall from Section 11.3 that the sample variance estimator used for SRS assumes the \(n\) observations are drawn independently and at random from the population. This independence is critical: it justifies the use of \(n - 1\) in the denominator of the variance estimator (11.17), which reflects the number of independent units of information available to estimate population variance.
Now contrast that with SYS. Here, a single random point start determines the entire sample. All subsequent locations are fixed by the grid layout. So, while we observe \(n\) units, only one is selected at random. The remaining \(n - 1\) are not independent—they’re determined once the start is fixed.
From this perspective, SYS behaves like a sample of size one. Plugging \(n = 1\) into the SRS variance estimator (11.17) results in division by zero, which is undefined. This highlights the core issue: SYS does not produce a sample of independent observations, so the usual SRS variance estimators don’t apply.
Again, despite these limitations, the SRS variance estimator is commonly used as a practical approximation for SYS. In many situations, it provides a reasonable estimate for the sample variance and, hence, the standard error of the mean. Still, understanding how the population is structured—and how that structure interacts with the sampling grid—is key to interpreting results. See Kershaw et al. (2016) and K. Iles (2003) for further discussion, including cautions and justification for why the SRS variance estimator is often considered acceptable for SYS data.
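As a minimal sketch with hypothetical plot-level values, the SRS approximation applied to SYS data looks like:
# Hypothetical plot-level AGB (Mg/ha) from a SYS sample.
y <- c(312, 268, 401, 355, 290, 378, 330, 296)
n <- length(y)
y_bar <- mean(y)        # Unbiased estimate of the population mean.
s2 <- var(y)            # SRS sample variance (11.17) as an approximation.
s_y_bar <- sqrt(s2 / n) # Approximate standard error of the mean.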
13.2.6 Advantages and disadvantages of systematic sampling
Systematic sampling offers several practical benefits and is widely used in forestry, but like all designs, it involves trade-offs. Here we summarize its main strengths and limitations, particularly in comparison to SRS.
13.2.6.1 Advantages
Efficient field layout: SYS designs are easy to implement in the field. Sampling locations fall on a predictable grid, allowing cruisers to follow straight lines, reducing travel time and simplifying navigation.
Improved spatial coverage: Grid-based sampling ensures sampling locations are evenly distributed across the population, helping to capture spatial heterogeneity and reduce the risk of missing important features—a common concern with SRS.
Simplifies remeasurement: Because sampling locations follow a fixed, repeatable pattern, they are easier to relocate during future inventory cycles.
Cost-effective: The combination of reduced travel time and simplified logistics typically results in lower per-unit sampling cost compared to SRS.
Good performance in many populations: SYS often yields more precise estimates than SRS for the same sample size, particularly when the population lacks strong periodicity that aligns with the grid. When a broad spatial trend is present, careful grid orientation can actually improve precision—for example, by laying the grid across the direction of the trend to ensure full coverage.
13.2.6.2 Disadvantages
No valid variance estimator: SYS lacks a design-based estimator for the sample variance and standard error. While the SRS estimator is commonly used as an approximation, it may over- or underestimate uncertainty, depending on the population structure.
Potential bias from periodicity: If the population exhibits spatial patterns that align with the grid spacing or orientation (e.g., plantation rows, windrow lines, soil type bands), SYS can produce biased estimates.
Requires spatial data infrastructure: Implementing SYS effectively typically relies on GIS tools and spatial data, which may be a barrier in some settings.
Despite these drawbacks, SYS remains a go-to design in operational forestry due to its simplicity, spatial balance, and low implementation cost. When used thoughtfully—with attention to population structure and proper randomization of the starting point—SYS can yield highly effective inventory data.
13.3 Stratified sampling
In forest inventory, it’s common to estimate population parameters not only for the entire area of interest but also for smaller subareas, or strata. Stratification refers to dividing the population into non-overlapping subareas based on characteristics expected to influence the parameter of interest. These strata are internally homogeneous but meaningfully different from one another. Common stratification variables include species, age, productivity, and management unit.
Stratified sampling involves drawing a probability sample independently within each stratum. While sampling within strata is most often done using SRS or SYS, other probability sampling methods can be used. The main benefits of stratification are potential gains in statistical efficiency, i.e., reduced standard errors, and the ability to produce both overall and stratum-specific estimates.
Stratification is often implemented in one of two ways:
Pre-stratification: Strata are delineated before sampling begins, letting us allocate stratum-specific sample sizes based on stratum size, expected variability, and sampling costs. This gives full control over the number of observations placed in each stratum. In this case, the area of each stratum must be known ahead of time.
Post-stratification: Here, sampling proceeds without regard to strata, and strata are assigned after the fact based on information collected during or after sampling. This still supports stratum-specific estimation, but the per-stratum sample sizes aren’t under our control. The area of each stratum must be determined after sampling.
Both approaches are valid. The main advantage of pre-stratification is the ability to apply optimum allocation—assigning more observations to strata that are large or exhibit greater variability, and fewer to those that are small, more homogeneous, or costly to sample. This improves statistical efficiency while controlling sampling costs.
There are substantial statistical and cost-saving benefits to stratification. Even a poor choice of strata yields unbiased estimates and, in almost all practical cases, performs no worse than unstratified sampling. At worst, you’ll end up with about the same standard error as you’d get without stratification; at best, you can dramatically improve statistical and cost efficiency.
Stratified sampling proceeds in two steps:
Assignment: Before any measurements are taken (in the case of pre-stratification), areas of the population—such as stands or management units—are assigned to strata using available information.
Measurement: A sample is then drawn within each stratum. At least two observations per stratum are needed to compute within-stratum variance and associated standard errors.
13.3.1 Choosing stratum membership
Stratification begins by subdividing the population into distinct strata based on available information. Assignment can be guided by prior inventory data (e.g., species, age, height, volume, or biomass), expert judgment, or ancillary sources like topography, canopy structure, or remote sensing-based spectral or structural measurements. Strata may also align with management units or areas where a separate estimate is desired.
The key requirement is that strata don’t overlap. For example, if strata are defined using polygons—where one or more polygons comprise a stratum—those polygons must be spatially distinct, i.e., their boundaries don’t overlap. Each location in the population must belong to one and only one stratum.
Useful strata group together areas that are similar to each other and different from the rest. Ideally, variance within a stratum is low, and variance between strata is large. These characteristics improve statistical efficiency.
A stratum can consist of spatially contiguous or disjoint areas. For example, a stratum may comprise spatially adjacent stands or stands scattered across the forest. It’s the internal consistency of each stratum that counts—not whether the areas are next to each other on the map.
In the next section, we introduce the estimators used in stratified sampling.
13.3.2 Estimation using stratified sampling
Throughout this section, and in the subsequent illustration, we assume that sampling within each stratum is done using SRS or SYS, and that SRS estimators are used for the stratum means and variances. Other within-stratum designs are possible but may require different estimators.
We introduce the following notation:
- \(H\) number of strata in the population;
- \(N_h\) population size in the \(h\)-th stratum;
- \(N = \sum_{h = 1}^H N_h\) total population size across all strata;
- \(n_h\) sample size in the \(h\)-th stratum;
- \(n = \sum_{h = 1}^H n_h\) total sample size across all strata.
If we’re working from an areal sampling frame, then \(N\) and \(N_h\) are area units, e.g., acres or hectares.
The stratum population mean \(\mu_h\) is estimated using the sample mean \[\begin{equation} \bar{y}_h = \frac{\sum_{i = 1}^{n_h} y_{hi}}{n_h}, \tag{13.1} \end{equation}\] where \(y_{hi}\) is the \(i\)-th observation in the \(h\)-th stratum.
The overall population mean \(\mu\) (i.e., across all strata) is estimated using \[\begin{equation} \bar{y} = \frac{\sum_{h = 1}^H N_h \bar{y}_h}{N}. \tag{13.2} \end{equation}\] This is a weighted sum of the stratum-specific sample means, with weights given by each stratum’s proportion of the total population size (or area), i.e., \(N_h / N\).
The population total \(\tau\) is estimated using \[\begin{equation} \hat{t} = \sum_{h = 1}^H N_h \bar{y}_h = N \bar{y}. \tag{13.3} \end{equation}\]
The sample variance within each stratum, \(s^2_h\), is estimated using the SRS variance estimator (11.17). If standard errors and confidence intervals are desired for individual strata, use (11.20) or (11.21) and (11.31), respectively.
Assuming a finite population, the standard error of the overall mean \(\bar{y}\) is \[\begin{equation} s_{\bar{y}} = \sqrt{ \frac{1}{N^2} \sum_{h = 1}^H \left[ \frac{N_h^2 s_h^2}{n_h} \left( \frac{N_h - n_h}{N_h} \right) \right] }. \tag{13.4} \end{equation}\] If we’re working from an areal sampling frame (i.e., an infinite population), the FPC is omitted (see Section 12.2), giving \[\begin{equation} s_{\bar{y}} = \sqrt{ \frac{1}{N^2} \sum_{h = 1}^H \frac{N_h^2 s_h^2}{n_h} }. \tag{13.5} \end{equation}\]
The standard error for the total estimate \(\hat{t}\) is \[\begin{equation} s_{\hat{t}} = N s_{\bar{y}}. \tag{13.6} \end{equation}\]
Confidence intervals for the mean follow the usual form \[\begin{equation} \bar{y} \pm t_{df, 1 - \frac{\alpha}{2}} s_{\bar{y}}, \tag{13.7} \end{equation}\] with degrees of freedom computed as \[\begin{equation} df = \sum_{h = 1}^H (n_h - 1) = n - H. \tag{13.8} \end{equation}\]
Confidence intervals for the total \(\hat{t}\) follow the same form using \(s_{\hat{t}}\).
13.3.2.1 Sampling in stratum and independence
In the current setup, we assume \(n_h\) observations within each stratum are drawn independently and at random from the stratum population. In practice, however, SYS is more commonly used. For example, you might place a separate systematic grid within each stratum using a random point start, as described in Section 13.2.2. The grids may differ in spacing or rotation across strata.
The usual concern with SYS remains: with only one random start per stratum, the \(n_h\) observations are not independent. This has implications for inference. For example, the degrees of freedom in (13.8) assume independent observations within each stratum. If a single systematic grid is used to sample the stratum population, that assumption is violated. In such cases, (13.8) may be overstated, and resulting confidence intervals may be misleading.
It’s also common to use a single SYS grid that spans all strata. This is often convenient because the cruiser can follow continuous straight sampling lines that cross stratum boundaries. In this case, only one random point start is used to position the SYS grid over the entire population. Here, we’re even further from the assumption of \(n\) independent units of information.
SYS is widely used in practice and is adopted in our illustration in Section 13.3.6. As discussed in Section 13.2, it’s important to rotate the grid to avoid alignment with periodic features. In most applications, such alignment is rare, and the risk of bias or distortion of standard errors is negligible.
13.3.3 Sample size determination
After identifying strata (in the case of pre-stratification) and selecting a sampling design, the next step is to determine the total sample size \(n\). This sample size can either be fixed in advance based on available resources or determined based on a desired level of precision for estimating a population mean or total.
Stratified sampling supports two common allocation strategies: proportional allocation and optimum allocation. Proportional allocation assigns sample size to each stratum based solely on its relative area. Optimum allocation also considers variability, and if cost is taken into account, balances that against the relative cost of sampling. The goal is to allocate effort where it provides the greatest gain in precision per unit cost.
When the total sample size \(n\) is unknown, it can be estimated using equations similar to those for SRS (see Section 11.3.6). These equations have the same circularity issue as before: \(n\) appears on both sides of the equation. As with SRS, this can be resolved using the iterative algorithm described in Section 11.3.6.
Importantly, these sample size estimates are intended to meet allowable error targets for the overall population, not for individual strata.
13.3.3.1 Total sample size with proportional allocation
As introduced in Section 11.3.6, we can estimate sample size given an allowable error \(E\) that is in the same units as \(\bar{y}\) and an estimate for each stratum’s sample variance \(s^2_h\). Recall, because we’ve not taken the sample, the value for \(s^2_h\) must come from somewhere else, e.g., from prior experience or inventory data from a similar population.
The total sample size required under proportional allocation for a finite population is given by \[\begin{equation} n = \frac{N \sum_{h=1}^H N_h s_h^2}{\frac{N^2 E^2}{t^2} + \sum_{h=1}^H N_h s_h^2}, \tag{13.9} \end{equation}\] where \(t\) is the \(t\)-value corresponding to the desired confidence level (\(1 - \alpha\)).
If the population is assumed infinite (i.e., under an areal sampling frame), the simplified formula is \[\begin{equation} n = \frac{t^2 \sum_{h=1}^H N_h s_h^2}{N E^2}. \tag{13.10} \end{equation}\]
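As a sketch, (13.10) can be solved with the same iterative algorithm, here using the stratum areas and standard deviations given later in Table 13.1 (Section 13.3.6):
# Stratum areas (ha) and standard deviations (Mg/ha) from Table 13.1.
N_h <- c(1.24, 8.93, 2.25, 22.58)
s_h <- c(100, 100, 75, 125)
N <- sum(N_h)
H <- 4
E <- 35       # Allowable error (Mg/ha).
alpha <- 0.05 # For 95% confidence.
n_prev <- 25  # Initial guess for total sample size.
for (i in 1:3) {
  t <- qt(p = 1 - alpha / 2, df = n_prev - H)
  n_prev <- ceiling(t^2 * sum(N_h * s_h^2) / (N * E^2))
}
n_prev        # Converges to 45 with these inputs.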
13.3.3.2 Total sample size with optimum allocation
Optimum allocation minimizes the standard error of the estimate by accounting for both stratum size and variability. The total sample size for a finite population is \[\begin{equation} n = \frac{\left( \sum_{h=1}^H N_h s_h \right)^2}{\frac{N^2 E^2}{t^2} + \sum_{h=1}^H N_h s_h^2}. \tag{13.11} \end{equation}\]
For an infinite population, this becomes \[\begin{equation} n = \frac{t^2 \left( \sum_{h=1}^H N_h s_h \right)^2}{N^2 E^2}. \tag{13.12} \end{equation}\]
13.3.4 Allocation of samples to strata
Given the total sample size \(n\), the next step is to allocate observations among the strata using equations specific to proportional or optimum allocation. Whichever approach was used to determine \(n\) should also be used to allocate samples to the strata.
13.3.4.1 Proportional allocation
In proportional allocation, each stratum receives a share of the total sample based solely on its area. The sample size in stratum \(h\) is given by \[\begin{equation} n_h = n \frac{N_h}{N}. \tag{13.13} \end{equation}\] This approach is easy to apply but ignores differences in variability across strata.
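Continuing the sketch above (reusing N_h and N) with a total of n = 45 plots:
n <- 45
n_h <- ceiling(n * N_h / N)
n_h # 2 12 3 30; rounding up inflates the total slightly.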
13.3.4.2 Optimum allocation
In optimum allocation, sample sizes are chosen to minimize the overall standard error. The number of samples in stratum \(h\) is computed as \[\begin{equation} n_h = n \frac{N_h s_h}{\sum_{h=1}^H N_h s_h}. \tag{13.14} \end{equation}\] This method improves efficiency by assigning more samples to more variable strata.
13.3.4.3 Optimum allocation with cost constraints
If sampling costs differ between strata, we can modify the optimum allocation formula to incorporate cost per sample. Let \(c_h\) denote the cost of sampling a unit in stratum \(h\). The cost-adjusted optimum allocation is \[\begin{equation} n_h = n \frac{N_h s_h / \sqrt{c_h}}{\sum_{h=1}^H N_h s_h / \sqrt{c_h}}. \tag{13.15} \end{equation}\]
This formula assigns fewer observations to strata that are more expensive to sample, while still accounting for area and variability. It’s particularly useful in operational forest inventories where travel cost, terrain, or accessibility varies across the landscape.
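A brief sketch, reusing n, N_h, and s_h from above, with per-plot costs invented purely for illustration:
c_h <- c(1, 1, 4, 2) # Hypothetical relative cost per plot by stratum.
w_h <- N_h * s_h / sqrt(c_h)
n_h <- ceiling(n * w_h / sum(w_h))
n_h # 2 13 2 29; higher-cost strata receive relatively fewer plots.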
13.3.5 Creating a stratified systematic sampling grid
Once strata have been defined and sample sizes determined using either proportional or optimum allocation, we can construct a systematic sampling grid within each stratum using functions tailored to each approach.
One option is to set up a stratum-specific sampling grid using the sys_grid() function described in Section 13.2.3. This approach offers a lot of flexibility in terms of grid pattern, size, and orientation. It may be especially useful when strata are large or not spatially adjacent.
However, working with different grid orientations and spacings across strata can complicate field logistics. If strata are relatively small and spatially adjacent, it might be simpler for cruisers to work from a single SYS grid that covers the entire population. The next two functions create a single grid across all strata under proportional or optimum allocation.
13.3.5.1 Grid generation for proportional allocation
Description
Generates a systematic sampling grid using fixed spacing between lines (D_l) and points (D_p) across the entire area. Points are assigned to strata based on spatial intersection with the input boundary. This is designed for proportional allocation, where grid density is uniform across strata. Built on sys_grid() with added logic to track stratum membership and within-stratum ordering.
Usage
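The function name and signature aren’t shown here; based on the arguments below and by analogy with strat_opt_sys_grid(), it is presumably along the lines of
strat_prop_sys_grid(boundary, D_l, D_p, angle = 0, origin = "se",
                    strat_id_col)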
Arguments
- boundary: An sf polygon object defining the stratum boundaries. Must be projected with linear units.
- D_l: Distance between sampling lines (x-direction) in the same units as boundary.
- D_p: Distance between points along a line (y-direction) in the same units as boundary.
- angle: Optional rotation angle (in degrees). Default is 0.
- origin: Corner from which to order lines and points. One of "se", "sw", "ne", or "nw". Default is "se".
- strat_id_col: Name of the column in boundary that identifies strata.
Details
This function builds a systematic grid across the union of stratum polygons using sys_grid(), then assigns each point to a stratum via spatial intersection. Each sampling location is tagged with its unique point ID as well as its position within each line and stratum. This approach simplifies logistics when a single, uniform grid is acceptable for all strata.
Value
Returns a list with:
- grid: An sf object of all points, with columns for unique point_id, stratum_id, and other useful indexing IDs. Columns in boundary are appended.
- path: An sf LINESTRING connecting grid points in order.
- random_point_start: The initial random point start.
- boundary: The input polygon with stratum attributes.
- type: A string used internally for validation when plotting.
Use plot_sys_grid() to visualize the resulting grid and path.
13.3.5.2 Grid generation for optimum allocation
Description
Creates a stratified systematic sampling grid that supports optimum allocation, assigning different numbers of grid points to each stratum. Builds a dense grid using sys_grid(), then intersects it with stratum polygons and thins the points in each stratum to approximate the target sample sizes. The final number of points per stratum may not match the targets exactly, but the function tries to get as close as possible. Supports both random and sequential thinning.
Usage
strat_opt_sys_grid(boundary, D_l, angle = 0, origin = "se",
n_col, strat_id_col, thinning_method = "random",
D_p = NULL)
Arguments
- boundary: An sf polygon object defining the stratum boundaries. Must be projected with linear units.
- D_l: Distance between sampling lines (x-direction) in the same units as boundary.
- angle: Optional rotation angle (in degrees). Default is 0.
- origin: Corner from which to order lines and points. One of "se", "sw", "ne", or "nw". Default is "se".
- n_col: Name of the column in boundary containing target sample sizes per stratum.
- strat_id_col: Name of the column in boundary that identifies strata.
- thinning_method: Method used to thin grid points within each stratum. Either "random" or "sequential". Default is "random".
- D_p: Optional distance between points along a line, in the same units as boundary. If not supplied, it’s computed automatically from the stratum with the densest sampling.
Details
This function is designed for optimum allocation, where sample size varies across strata. It uses sys_grid() to build an initial grid, assigns each point to a stratum, then selects a subset of points from each stratum using the specified thinning method. Additional columns in the output identify both the full grid and the subset of points selected for measurement.
Value
Returns a list with:
- grid: An sf object of all points, with a logical measure column and columns for unique point_id, stratum_id, and other useful indexing IDs. Columns in boundary are appended.
- measure_path: A LINESTRING showing the path through the selected (measure = TRUE) points.
- path: The full path through all grid points.
- random_point_start: The initial random point start.
- boundary: The original input polygon.
- type: A string used internally for validation when plotting.
Use plot_strat_opt_sys_grid() to visualize the sampled grid and path.
13.3.6 Illustrative analysis
We use the Harvard Forest dataset introduced in Section 1.2.5 to illustrate the application of stratified estimators. While this is a small and somewhat toy-like example, it’s useful because a complete census of trees is available—allowing us to compare our estimates directly to the true population values.
Figure 13.3 shows the extent of the 35 ha forest, partitioned into four strata. Strata 1 and 4 are mixed conifer-hardwood forest, Stratum 2 is conifer-dominated, and Stratum 3 is a sparsely forested wetland.
FIGURE 13.3: Harvard Forest population extent partitioned into four strata.
Our goal is to estimate average aboveground biomass (AGB) per hectare and total AGB for each stratum and for the entire forest. The population of interest includes all stems 12.7 cm DBH and larger. We’ll also break out estimates by species groups of interest.
The cruise will use fixed-area circular plots with a radius of 9 meters. These plots are placed at the selected sample locations and used to tally all trees meeting the DBH threshold. See Section 12.3 for additional details on plot-based sampling.
We’d like our AGB estimates to be within \(\pm\) 35 Mg/ha of the population value with 95% confidence. This allowable error corresponds to roughly 10% of the expected AGB.
The cruise will use a stratified systematic sampling design, with sample sizes determined by optimum allocation as described in the preceding section.
TABLE 13.1: Stratum areas (\(N_h\)) and anticipated stratum standard deviations (\(s_h\)) for the Harvard Forest example.
Stratum ID | \(N_h\) (ha) | \(s_h\) (Mg/ha)
---|---|---
1 | 1.24 | 100
2 | 8.93 | 100
3 | 2.25 | 75
4 | 22.58 | 125
We begin by computing the total sample size using the stratum-specific information in Table 13.1 and the optimum allocation formula in (13.12). As discussed in Section 11.3.6, this formula involves a circular dependency: we’re solving for \(n\) on the left, but also need it to determine the appropriate \(t\)-value on the right, which depends on the degrees of freedom.
To resolve this circular dependency, we use the iterative algorithm described in Section 11.3.6. The code below applies that method using a for loop, as introduced in Section 5.2.3, to iteratively solve for \(n\) until it converges.
H <- 4 # Number of strata.
N_h <- c(1.24, 8.93, 2.25, 22.58) # Stratum area (ha).
N <- sum(N_h) # Total area (ha).
E <- 35 # Allowable error for AGB (Mg/ha).
alpha <- 0.05 # For 95% confidence level.
s_h <- c(100, 100, 75, 125) # Stratum standard deviation (Mg/ha).
n_start <- 25 # Initial guess for total sample size.
n_prev <- n_start
for (i in 1:3) {
t <- qt(p = 1 - alpha / 2, df = n_prev - H) # df = n - H
n_next <- ceiling((t / (N * E))^2 * sum(N_h * s_h)^2)
cat("Iteration", i, ": n =", n_next, "\n")
n_prev <- n_next
}
#> Iteration 1 : n = 47
#> Iteration 2 : n = 44
#> Iteration 3 : n = 44
The algorithm converged quickly to \(n\) = 44. In the code below, we allocate the total sample size among strata using the optimum allocation formula given in (13.14).
n <- n_prev # Converged total sample size from the loop above.
n_h <- n * (N_h * s_h) / sum(N_h * s_h)
n_h <- ceiling(n_h) # Round up to the nearest whole number.
n_h
#> [1] 2 10 2 31
Note that due to rounding, we end up with \(n\) = 45 plots in the final design.
Now, let’s build our sampling grid using strat_opt_sys_grid(). As described in Section 13.3.5, this function requires an sf polygon object that delineates the stratum boundaries. As briefly noted in Section 13.1, sf objects are data frames with a spatial geometry column and can be manipulated using tidyverse functions.
The code below reads in the Harvard Forest stratum boundaries as an sf polygon object, adds a new column for the target stratum-specific sample sizes, and prints the data frame portion of the object so we can inspect its columns. The st_drop_geometry() function removes the geometry column to simplify printed output.
bounds <- st_read("datasets/HF", "HF_stratified", quiet = TRUE)
bounds <- bounds %>% mutate(n_h = n_h, .after = stratum_id)
bounds %>% st_drop_geometry()
#> stratum_id n_h ha
#> 1 1 2 1.2361
#> 2 2 10 8.9315
#> 3 3 2 2.2494
#> 4 4 31 22.5830
With a unique identifier column for each stratum (stratum_id) and the target sample sizes (n_h), we’re ready to pass bounds to strat_opt_sys_grid(). This and other sampling functions are defined in "scripts/utils.R".
Before calling strat_opt_sys_grid() in the code below, we set a random seed to ensure reproducible results.
source("scripts/utils.R")
set.seed(1)
strat_sys <- strat_opt_sys_grid(bounds, D_l = 50, n_col = "n_h",
strat_id_col = "stratum_id",
thinning_method = "random")
The call to strat_opt_sys_grid() above sets the distance between lines to 50 meters and allows the function to compute the sampling interval along lines (D_p) automatically.
FIGURE 13.4: Systematic sampling with stratification for the Harvard Forest example. Illustrates random removal of sampling locations to approximate the target optimum sample size.
Figure 13.4, created using plot_strat_opt_sys_grid(strat_sys, path = "lines"), shows the full systematic grid, the selected sampling locations, and those removed from the grid to meet the target sample sizes (labeled as “Removed sampling location” in the figure legend).
If you only want to visualize the final sampling locations, set the path argument in plot_strat_opt_sys_grid() to "points", as shown below in Figure 13.5.
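The call producing Figure 13.5 is presumably:
plot_strat_opt_sys_grid(strat_sys, path = "points")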
FIGURE 13.5: Systematic sampling with stratification for the Harvard Forest example. Illustrates the path between sampling locations.
We can confirm the final sample size by stratum using the code below, which shows the target sample sizes were achieved. This won’t always be the case, depending on how the sampling grid intersects with each stratum.
strat_sys$grid %>%
filter(measure) %>%
group_by(stratum_id) %>%
summarize(realized_n_h = n()) %>%
st_drop_geometry()
#> # A tibble: 4 × 2
#> stratum_id realized_n_h
#> * <int> <int>
#> 1 1 2
#> 2 2 10
#> 3 3 2
#> 4 4 31
Given the sampling locations, we’re ready to head to the field (virtually, in this case). For our virtual cruise, we’ll use a 9 meter radius plot centered on the sampling locations. These plot areas are shown as gray circles in Figure 13.6.
As always, if any portion of a plot extends beyond a boundary, then a boundary correction method must be applied—this includes stratum boundaries. Figure 13.6 highlights the three plots with boundary slopover and applies the mirage method described in Section 12.3.3.
FIGURE 13.6: Systematic sampling with stratification for the Harvard Forest example. Fixed-area plots are shown at each sampling location, including those that extend beyond stratum boundaries and require boundary slopover correction. The mirage method is used for these plots; see Section 12.3.3 for details.
The cruise tree data are stored in "datasets/HF/HF_strat_sys_n45.csv" and are read into plot_trees below. Each row represents one or more trees. The dataset includes a plot_id column that uniquely identifies plots across all strata, and a tree_id column that identifies individual trees within each plot. The stratum_id column indicates the stratum in which the plot falls. Tree counts are recorded in the tree_count column, which equals 2 when a tree is double-counted using mirage correction. Additional columns include agb_Mg (aboveground biomass in metric tons), dbh_cm (diameter at breast height in centimeters), and species (tree species).
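The read isn’t shown here; a reconstruction consistent with the glimpse() output below is:
plot_trees <- read_csv("datasets/HF/HF_strat_sys_n45.csv")
glimpse(plot_trees)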
#> Rows: 797
#> Columns: 7
#> $ plot_id <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2…
#> $ tree_id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, …
#> $ stratum_id <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4…
#> $ tree_count <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ agb_Mg <dbl> 0.086166, 1.853968, 2.227097, 1.27…
#> $ dbh_cm <dbl> 14.4, 44.6, 48.0, 38.4, 27.0, 41.8…
#> $ species <chr> "Acer rubrum", "Quercus rubra", "Q…
Given these tree data, and following the steps laid out in Section 12.6.1, our first task is to expand tree-level agb_Mg measurements to a per-hectare basis using the appropriate tree factor, then summarize these values at the plot level. These plot-level values correspond to the \(y_{hi}\) terms in (13.1), i.e., the \(i\)-th plot measurement in the \(h\)-th stratum.
To compute these plot-level AGB values, we group by plot_id and use summarize() to calculate y_hi, the plot-level AGB in Mg/ha. We also retain each plot’s stratum_id using first(stratum_id).
# Add tree factor for the given plot area.
r <- 9 # Plot radius in meters.
plot_trees <- plot_trees %>%
mutate(TF = 10000 / (pi * r^2))
# Compute plot-level AGB estimates.
plot_summary <- plot_trees %>%
group_by(plot_id) %>%
summarize(y_hi = sum(TF * tree_count * agb_Mg),
stratum_id = first(stratum_id))
plot_summary
#> # A tibble: 45 × 3
#> plot_id y_hi stratum_id
#> <dbl> <dbl> <dbl>
#> 1 1 468. 4
#> 2 2 197. 4
#> 3 3 253. 4
#> 4 4 419. 4
#> 5 5 174. 4
#> 6 6 388. 4
#> 7 7 579. 4
#> 8 8 346. 4
#> 9 9 364. 4
#> 10 10 158. 4
#> # ℹ 35 more rows
Next, we compute the stratum-level mean \(\bar{y}_h\) and standard deviation \(s_h\) by grouping plot_summary by stratum_id. We also compute the number of plots within each stratum for convenient reference in the next step.
stratum_summary <- plot_summary %>%
group_by(stratum_id) %>%
summarize(
y_bar_h = mean(y_hi),
s_h = sd(y_hi),
n_h = n_distinct(plot_id)
)
stratum_summary
#> # A tibble: 4 × 4
#> stratum_id y_bar_h s_h n_h
#> <dbl> <dbl> <dbl> <int>
#> 1 1 311. 38.3 2
#> 2 2 474. 140. 10
#> 3 3 63.0 51.0 2
#> 4 4 348. 108. 31
These stratum-level estimates are useful when reporting means, totals, and associated uncertainty for individual strata. For now, though, we’re focused on estimating the overall population across strata.
The estimators for the stratified population mean and its standard error are given in (13.2) and (13.5), respectively. Both require \(n_h\), \(N_h\), and the stratum-level summaries. To streamline the workflow, we join the stratum areas to stratum_summary before applying the estimators and computing 95% confidence intervals.
# Join stratum areas.
stratum_area <- tibble(
stratum_id = 1:4,
N_h = c(1.24, 8.93, 2.25, 22.58)
)
stratum_summary <- left_join(stratum_summary, stratum_area,
by = "stratum_id")
# Estimate population mean and 95% CI.
stratum_summary %>%
summarize(
N = sum(N_h),
y_bar = sum(N_h * y_bar_h) / N,
s_y_bar = sqrt(1 / N^2 * sum((N_h^2 * s_h^2) / n_h)),
t = qt(df = sum(n_h - 1), p = 1 - 0.05 / 2),
ci_lower = y_bar - t * s_y_bar,
ci_upper = y_bar + t * s_y_bar
) %>%
select(y_bar, ci_lower, ci_upper)
#> # A tibble: 1 × 3
#> y_bar ci_lower ci_upper
#> <dbl> <dbl> <dbl>
#> 1 360. 326. 395.
To estimate the population total and its confidence interval in the code below, we adapt the code above using the total estimator (13.3) and its standard error (13.6).
stratum_summary %>%
summarize(
N = sum(N_h),
t_hat = N * sum(N_h * y_bar_h) / N,
s_t_hat = N * sqrt(1 / N^2 * sum((N_h^2 * s_h^2) / n_h)),
t = qt(df = sum(n_h - 1), p = 1 - 0.05 / 2),
ci_lower = t_hat - t * s_t_hat,
ci_upper = t_hat + t * s_t_hat
) %>%
select(t_hat, ci_lower, ci_upper)
#> # A tibble: 1 × 3
#> t_hat ci_lower ci_upper
#> <dbl> <dbl> <dbl>
#> 1 12615. 11411. 13819.
13.3.6.1 Checking coverage and allowable error
How well did our estimates perform in terms of uncertainty quantification? The census-based AGB for the Harvard Forest is 341.17 Mg/ha, with a forest-wide total of 11,941 Mg. Our estimated mean and total, with 95% confidence intervals, were 360.44 (326.03, 394.84) Mg/ha and 12,615 (11,411, 13,819) Mg, respectively. Based on this sample, both confidence intervals successfully captured the true values.
If the theory holds, we’d expect about 95% of confidence intervals from repeated samples to cover the true mean. Because we have a census, we can perform repeated sampling to empirically check that the coverage rate matches this expectation. We can also evaluate whether our optimum allocation strategy meets the target allowable error of \(E = 35\) Mg/ha with 95% confidence.
Using the estimated \(n\) and \(n_h\), we drew 1,000 samples from the population using stratified SYS. For each, we computed the mean AGB and its 95% confidence interval. These 1,000 replicate estimates are stored in "datasets/HF/strat_sys_agb_ests_1000reps.csv". The code below reads in these data and computes two key empirical rates: the proportion of intervals that covered the true mean, and the proportion with absolute error less than the allowable error \(E = 35\) Mg/ha. Both rates are at or above the nominal 95%, evidence that the design-based theory is holding up in this example!
agb_ests_reps <- read_csv("datasets/HF/strat_sys_agb_ests_1000reps.csv")
# Compute the true mean AGB (Mg/ha) from the census.
y_true <- read_csv("datasets/HF/HF_trees.csv") %>%
filter(dbh_cm >= 12.7) %>% # Stems 12.7 cm DBH and larger.
summarize(y_true = sum(agb_Mg) / N) %>%
pull()
# Percent of confidence intervals that cover the true mean.
mean(y_true > agb_ests_reps$ci_lower_95 &
y_true < agb_ests_reps$ci_upper_95) * 100
#> [1] 98.1
# Percent of estimates within the allowable error E
mean(abs(agb_ests_reps$y_bar - y_true) < E) * 100
#> [1] 97.3
13.3.6.2 Efficiency gains from stratification
Stratified sampling is especially effective when variation within strata is smaller than variation between them, allowing for more precise estimates than a SRS of the same size. In our Harvard Forest example, we defined four strata. With the exception of the small Stratum 3, the others were fairly similar, so we might not expect dramatic efficiency gains. As a reminder, efficiency here means achieving greater precision—that is, a smaller standard error—for the same number of plots compared to SRS.
To compare estimator performance (i.e., unstratified vs. stratified), we repeated the procedure from Section 13.3.6.1, drawing 1,000 independent samples from the same population. For each sample, we computed the mean AGB and its 95% confidence interval using both unstratified and stratified estimators.
The unstratified replicate estimates are stored in "datasets/HF/sys_agb_ests_1000reps.csv". The stratified replicates are the same as those used in Section 13.3.6.1. For each estimator, we computed the average 95% confidence interval width across replicates. These were 84 and 73 Mg/ha for the unstratified and stratified cases, respectively. The percent gain in precision from stratification is 13%, reflecting the reduction in average interval width achieved by incorporating strata. This confirms that even modest stratification can yield measurable gains in precision, making it a worthwhile method when stratifying variables are available.
13.3.6.3 Group-level estimates from stratified data
So far, our focus has been on estimating AGB within strata and for the overall population. But we’re often also interested in estimates for subgroups of trees within each stratum and across the entire forest. For example, we might want to know how much AGB is tied up in specific species, age classes, forest types, or functional groups.
Let’s walk through one such example.
Eastern hemlock (Tsuga canadensis) is under threat throughout the northeastern U.S. due to the spread of hemlock woolly adelgid (HWA), an invasive insect causing widespread mortality. Since HWA is present in the region of Massachusetts where our data were collected, it’s reasonable to ask how much AGB would be lost if all Eastern hemlock trees in the forest were killed.
We start by creating a new column in plot_trees called species_groups, which assigns each tree to one of two groups of interest:
- "Eastern hemlock" (species == "Tsuga canadensis")
- "Other" (species != "Tsuga canadensis")
Both species_groups and plot_id are converted to factors. This prepares the data for use with complete(), which relies on factor levels to insert rows for combinations of plot and group that may not appear in the data.
plot_trees <- plot_trees %>%
mutate(
species_groups = factor(if_else(species == "Tsuga canadensis",
"Eastern hemlock",
"Other")),
plot_id = factor(plot_id)
)
As noted in Chapter 12, our convention is to use a tree_count of zero to retain plots with no trees. This keeps their plot indexes in the dataset, ensuring they’re included in summaries and downstream calculations. What’s important is that all plot indexes—even those with no trees—are retained when converting plot_id to a factor. We used a similar approach when building stand and stock tables in Section 12.6.3.
Next, we compute AGB at the plot level for each species group.
plot_summary <- plot_trees %>%
group_by(plot_id, species_groups) %>%
summarize(
y_hi = sum(TF * tree_count * agb_Mg),
stratum_id = first(stratum_id),
.groups = "drop"
)
This yields one row per plot per group. If no trees from a group were observed in a plot, it won’t appear yet. We’ll fix that next.
The code below uses complete() to add any missing combinations of plot and species group, inserting explicit zeros. We then use fill() to recover stratum_id for the new rows. To avoid crossing strata, we group by plot_id before filling stratum_id, ensuring correct propagation.
plot_summary_filled <- plot_summary %>%
complete(plot_id, species_groups, fill = list(y_hi = 0)) %>%
group_by(plot_id) %>%
fill(stratum_id, .direction = "downup") %>%
ungroup()
This is conceptually the same as our earlier use of complete() in Section 12.6.3 to fill in missing combinations of species and DBH class. Here, we’re doing the same with species groups. After inserting rows, fill() restores plot-level variables. Grouping by plot_id ensures this is done within the correct plot.
Take a look at plot_summary_filled to verify that it looks right—the number of rows should equal the number of plots times the number of species groups.
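A quick check of the expected dimensions:
nrow(plot_summary_filled) # Expect 45 plots x 2 groups = 90 rows.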
We now summarize to the stratum level, computing the mean AGB per hectare (\(\bar{y}_h\)) with associated 95% confidence intervals.
stratum_summary <- plot_summary_filled %>%
group_by(stratum_id, species_groups) %>%
summarize(
y_bar_h = mean(y_hi),
n_h = n(),
s_h = sd(y_hi),
s_y_bar_h = s_h/sqrt(n_h),
t = qt(df = n_h - 1, p = 1 - 0.05 / 2),
ci_lower_h = y_bar_h - t * s_y_bar_h,
ci_upper_h = y_bar_h + t * s_y_bar_h,
.groups = "drop"
)
stratum_summary %>%
select(stratum_id, species_groups, y_bar_h, ci_lower_h, ci_upper_h)
#> # A tibble: 8 × 5
#> stratum_id species_groups y_bar_h ci_lower_h ci_upper_h
#> <dbl> <fct> <dbl> <dbl> <dbl>
#> 1 1 Eastern hemlock 89.8 -678. 857.
#> 2 1 Other 221. -891. 1333.
#> 3 2 Eastern hemlock 255. 101. 410.
#> 4 2 Other 219. 110. 327.
#> 5 3 Eastern hemlock 0 0 0
#> 6 3 Other 63.0 -396. 522.
#> 7 4 Eastern hemlock 48.3 24.3 72.2
#> 8 4 Other 300. 258. 341.
These estimates show that Stratum 2 carries about 250 Mg/ha of Eastern hemlock, which is substantially more than any of the other strata. Not surprisingly, with a sample size of \(n_h = 2\), Strata 1 and 3 have very low precision (i.e., wide confidence intervals). No Eastern hemlock was observed on the plots in Stratum 3, so its estimates are zero. To better understand these differences across the forest, Figure 13.7 shows the estimated mean AGB for Eastern hemlock in each stratum (i.e., y_bar_h computed above).
FIGURE 13.7: Estimated mean aboveground biomass (AGB) of Eastern hemlock (Tsuga canadensis) per hectare in each stratum. Estimates are based on plot-level measurements and stratified sampling.
To estimate forest-wide AGB by species group, along with standard errors and 95% confidence intervals, we apply the total estimator from (13.3) and its associated standard error from (13.6). The beauty of this approach is that it’s the same code used to estimate total AGB for all species in the previous section—the only difference is the added group_by(species_groups)
. This again illustrates a key strength of the tidyverse: once the logic for a summary is in place, it’s easy to extend it to subgroups with minimal changes.
# Join the stratum areas.
stratum_summary <- stratum_summary %>%
left_join(stratum_area, by = "stratum_id")
stratum_summary %>%
group_by(species_groups) %>%
summarize(
N = sum(N_h),
t_hat = N * sum(N_h * y_bar_h) / N,
s_t_hat = N * sqrt(1 / N^2 * sum((N_h^2 * s_h^2) / n_h)),
t = qt(df = sum(n_h - 1), p = 1 - 0.05 / 2),
ci_lower = t_hat - t * s_t_hat,
ci_upper = t_hat + t * s_t_hat,
.groups = "drop"
) %>%
select(species_groups, t_hat, ci_lower, ci_upper)
#> # A tibble: 2 × 4
#> species_groups t_hat ci_lower ci_upper
#> <fct> <dbl> <dbl> <dbl>
#> 1 Eastern hemlock 3482. 2134. 4831.
#> 2 Other 9133. 7838. 10428.
So, to answer our initial question: how much AGB would be lost if all Eastern hemlock trees in the forest were killed by HWA? We estimate a total loss of 3,482 Mg, with a 95% confidence interval of (2,134, 4,831) Mg, the majority of this loss coming from Stratum 2.
13.3.7 Advantages and disadvantages of stratified sampling
Stratified sampling is widely used in forest inventory, especially when auxiliary information is available to define meaningful strata. Below, we summarize some key advantages and disadvantages of this design.
13.3.7.1 Advantages
Improved precision: When strata are well defined, stratified sampling increases efficiency by grouping similar areas together. This structure reduces variability within each group and leads to more precise overall estimates.
Supports subregion estimation: Stratification enables valid estimation not just for the entire population, but also for each individual stratum. This is useful when separate summaries are needed for forest types, disturbance regimes, management units, ownership classes, or other subdivisions.
Enables efficient allocation: With optimum allocation, more samples can be directed to strata that are larger or more variable, improving efficiency without increasing cost.
Compatible with other designs: Different sampling strategies—such as SRS, SYS, or others—can be used within strata, depending on operational needs or design goals. See, for example, Section 13.4.5.
Robust to poor strata: Even if strata aren’t optimal, stratified sampling rarely performs worse than unstratified sampling. At worst, standard errors may be comparable to SRS of the same sample size.
Supports post-stratification: If stratum membership isn’t known at the time of sampling, post-stratification can still be used. This allows for flexibility when auxiliary data become available after the field campaign.
13.3.7.2 Disadvantages
Area requirements for pre-stratification: A key requirement of pre-stratification is that the area of each stratum must be known prior to sampling in order to support design-based estimation. This can be a limitation if strata are based on incomplete or evolving information.
Limited allocation flexibility in post-stratification: Post-stratified designs don’t allow proportional or optimum allocation, since sample sizes per stratum aren’t controlled in advance. They may also result in too few samples in some strata to estimate variance reliably.
Added bookkeeping: Stratified sampling requires tracking stratum membership, managing stratum-specific calculations, and combining results with appropriate weights.
13.4 Cluster sampling
Cluster sampling is commonly used in forest inventory, especially when it’s impractical or costly to sample widely dispersed locations. In this design, the primary sampling unit (PSU) is a cluster of smaller secondary sampling units (SSUs), where measurements are taken. The SSUs are nested within PSUs, introducing additional design considerations compared to the sampling designs introduced earlier.
Cluster sampling is particularly useful when:
- the cost of reaching widely dispersed sampling locations is high,
- operational constraints require grouping sampling locations for efficient data collection,
- measurement of a sampling location is inexpensive relative to travel costs,
- or the sampling frame is only available at the PSU level.
These conditions are common in forest inventory, where measurement costs are low relative to the cost of moving a field crew to a sampling location. Cluster designs help reduce travel time by grouping nearby SSUs into the same PSU, allowing crews to sample multiple locations with minimal movement. This approach is especially useful when sampling large areas. For example, cluster designs are widely used in national forest inventory (NFI) programs, where travel costs are high and field crews must cover broad regions efficiently.
The tradeoff, however, is a potential loss of statistical efficiency. Due to spatial autocorrelation—the tendency for observations located close together to be more similar than those farther apart—SSUs in the same PSU often exhibit less variability than single locations sampled independently across space, such as the SYS designs illustrated in Section 13.2. As a result, cluster sampling typically produces less precise estimates than a SRS of the same number of SSUs. This is a classic cost-precision tradeoff: cluster sampling may reduce field effort but increase sampling error.
13.4.1 Clusters, strata, and systematic sampling
Although clusters (PSUs) and strata both involve dividing the population into groups, their purpose and implementation differ in important ways. In stratified sampling, the population is partitioned into non-overlapping subgroups (strata) using known auxiliary variables, and sampling is conducted independently within each stratum. The goal is to increase precision by concentrating sampling effort within relatively homogeneous subgroups.
In cluster sampling, the goal is not to improve statistical efficiency directly, but to simplify or reduce the cost of data collection. PSUs are often based on geographic proximity or field logistics, not auxiliary variables. A subset of PSUs is selected (typically via SRS or SYS), and measurements are collected from all or some SSUs within each selected PSU.
The contrast can be summarized as follows.
- Stratification aims to reduce within-group variability to improve estimation precision.
- Clustering often results in within-group similarity, which can reduce precision but improve feasibility.
While their motivations differ, both cluster and stratified designs are valid and often used together—for example, in stratified cluster sampling, where PSUs are selected independently within each stratum.
SYS, as described in Section 13.2, can be viewed as a form of cluster sampling with a single PSU made up of \(n\) evenly spaced SSUs. This perspective reinforces why SYS does not have a design-based variance estimator (Section 13.2.5)—with only one randomly selected PSU, there is no way to estimate between-PSU variability.
13.4.2 Elements of cluster sampling design
When implementing a cluster sampling design, there are several key design elements to consider:
Cluster layout and spacing. PSUs may be naturally defined or imposed by the analyst. In the latter case, PSUs are often laid out in a regular pattern, such as a systematic grid. The distance between PSUs affects both cost and precision. Widely spaced PSUs typically reduce redundancy due to spatial autocorrelation, but increase travel effort.
Cluster shape and SSU configuration. SSUs within a PSU are typically arranged in a fixed geometric pattern around an anchor point. Common layouts include lines, triangles, squares, and L-shapes. This configuration influences how well the PSU captures local variability and affects edge effects, travel efficiency, and measurement consistency. SSU size can be scaled to efficiently measure different variables, and nested configurations (e.g., for regeneration sampling) can be used when needed.
Number of clusters and SSUs per cluster. The total number of observed SSUs is the product of the number of PSUs and the number of SSUs per PSU. Increasing the number of PSUs generally improves statistical efficiency more than increasing the number of SSUs within each PSU, since PSUs are more likely to be spatially independent. However, logistical considerations may favor sampling more SSUs per PSU to reduce costs.
Method of selecting clusters. PSUs can be selected using any valid probability sampling method. SRS and SYS are most common. If auxiliary information is available, methods such as stratified or PPS sampling may be used to improve precision while maintaining logistical feasibility.
These elements interact to determine the balance between statistical precision and operational cost. In forest inventory, the goal is often to design a layout that captures sufficient spatial variability while remaining cost-effective and logistically manageable.
13.4.3 Choosing the number of PSUs and SSUs
Unlike stratified sampling, where well-established formulas exist for allocating sample sizes across strata, sample size planning in cluster designs is more heuristic and context-specific. The main tradeoff is between the number of PSUs and the number of SSUs per PSU. Adding SSUs within a PSU improves precision only up to a point—when observations are spatially correlated, the marginal benefit of additional SSUs diminishes. Increasing the number of PSUs is usually more effective for reducing sampling error, since widely spaced PSUs tend to be more independent of one another than SSUs within a PSU.
In most forest inventory settings, these decisions are shaped by cost and logistics. Analytical expressions for optimal allocation exist but require estimates of within- and between-PSU variability, which are rarely available before sampling begins. Some forest inventory programs use simulation to explore tradeoffs under realistic spatial and operational constraints. Additional discussion on cluster sampling design can be found, e.g., in Freese (1962), S. L. Lohr (2019) Section 5.5, and Cochran (1977) Section 9.6.
There are many variations of cluster sampling in both design and estimation. This section focuses on single-stage cluster sampling, where all SSUs within each selected PSU are measured. In many settings, however, measuring every SSU is cost-prohibitive or unnecessary. In such cases, only a subset of SSUs is observed—a design referred to as two-stage (or multistage) cluster sampling, which is covered in Section 13.5.
Here, we focus on a common and relatively simple form of single-stage cluster sampling: selecting PSUs with equal probability and measuring all SSUs within each selected PSU. This provides a useful baseline for understanding and implementing cluster-based designs in forest inventory.
13.4.4 Estimation using cluster sampling
In forest inventory, the total number of SSUs in the population is often unknown. A natural estimator in this case is the cluster ratio estimator, which divides the total measurement across all sampled SSUs by the total number of SSUs observed. It’s especially useful when SSU counts vary across PSUs. The estimator is flexible, easy to implement, and well suited to field conditions. Some bias may occur if SSU counts are related to the measured variable, but this is usually small in practice.
We use the following notation:
- \(N\) number of PSUs in the population;
- \(n\) number of PSUs in the sample;
- \(m_j\) number of SSUs in the \(j\)-th sampled PSU;
- \(\bar{m} = \frac{1}{n} \sum_{j=1}^n m_j\) mean number of SSUs per PSU;
- \(y_{ji}\) value for the \(i\)-th SSU in the \(j\)-th PSU;
- \(y_j = \sum_{i=1}^{m_j} y_{ji}\) total for the \(j\)-th PSU.
The cluster ratio estimator for the population mean \(\mu\) is
\[\begin{equation}
\bar{y} = \frac{\sum_{j=1}^n y_j}{\sum_{j=1}^n m_j}.
\tag{13.16}
\end{equation}\]
Under a finite population, the standard error of \(\bar{y}\) is
\[\begin{equation}
s_{\bar{y}} = \sqrt{ \left(1- \frac{n}{N} \right) \frac{1}{n \bar{M}^2} \frac{ \sum_{j=1}^n (y_j - \bar{y} m_j)^2 }{n - 1} },
\tag{13.17}
\end{equation}\]
where \(\bar{M}\) is the average number of SSUs per PSU in the population. If \(\bar{M}\) is unknown, we use the sample average \(\bar{m}\).
As in previous sampling sections, when working from an areal sampling frame, the finite population correction is omitted: \[\begin{equation} s_{\bar{y}} = \sqrt{ \left( \frac{1}{n \bar{M}^2} \right) \frac{ \sum_{j=1}^n (y_j - \bar{y} m_j)^2 }{n - 1} }. \tag{13.18} \end{equation}\]
Confidence intervals for the mean follow the usual form: \[\begin{equation} \bar{y} \pm t_{df, 1 - \frac{\alpha}{2}} s_{\bar{y}}, \tag{13.19} \end{equation}\] with degrees of freedom \(df = n - 1\).
The degrees of freedom reflect the number of sampled PSUs, not SSUs, since PSUs are the units selected by the sampling design. SSUs within a PSU are not independent, so inference must be based on the number of independent PSUs.
If the number of SSUs per PSU is constant, these estimators simplify. But it’s usually better to write code for the more general case, which handles varying SSU counts without much additional effort.
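As a sketch of what that general-case code might look like, the helper below packages (13.16), (13.18), and (13.19); it assumes a data frame psu_data with one row per PSU holding the PSU total y_j and SSU count m_j.
cluster_ratio_est <- function(psu_data, alpha = 0.05) {
  psu_data %>%
    summarize(
      y_bar = sum(y_j) / sum(m_j),       # Ratio estimator (13.16).
      n = n(),
      m_bar = mean(m_j),
      s_y_bar = sqrt(1 / (n * m_bar^2) * # Standard error (13.18).
                       sum((y_j - y_bar * m_j)^2) / (n - 1)),
      t = qt(df = n - 1, p = 1 - alpha / 2),
      ci_lower = y_bar - t * s_y_bar,    # Interval (13.19).
      ci_upper = y_bar + t * s_y_bar
    )
}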
13.4.5 Stratified cluster sampling
Cluster sampling can be extended to a stratified framework, resulting in stratified cluster sampling. This is useful when both logistical constraints and known auxiliary information are present. As described in Section 13.3.1, the first step is to divide the population into non-overlapping strata based on variables expected to influence the parameter of interest. A separate cluster sampling design is then implemented independently within each stratum:
- Select \(n_h\) PSUs in each stratum using a probability method such as SRS or SYS. At least two PSUs are needed to compute a sample variance.
- Within each sampled PSU \(j = 1, \dots, n_h\), measure \(m_{hj}\) SSUs and compute the stratum mean \(\bar{y}_h\) using (13.16) and its standard error \(s_{\bar{y}_h}\) using (13.17) or (13.18).
Once stratum-level estimates have been computed, the population mean and its standard error are calculated as \[\begin{equation} \bar{y} = \frac{\sum_{h=1}^H N_h \bar{y}_h}{N} \tag{13.20} \end{equation}\] and \[\begin{equation} s_{\bar{y}} = \sqrt{ \frac{ \sum_{h=1}^H N_h^2 s_{\bar{y}_h}^2 }{N^2} }. \tag{13.21} \end{equation}\]
Here, \(N_h\) is the population size in stratum \(h\), and \(N = \sum_{h=1}^H N_h\). In a traditional finite population setting, these are counts of PSUs. In areal sampling, \(N\) and \(N_h\) may represent area, provided that \(\bar{y}_h\) is expressed per unit area (e.g., per acre or hectare).
Confidence intervals for the population mean should use \(n - H\) degrees of freedom, where \(n\) is the total number of PSUs sampled and \(H\) is the number of strata. As in standard cluster sampling, the degrees of freedom reflect the number of PSUs sampled, not the number of SSUs.
Stratified cluster sampling is widely used in forest inventory because it combines the operational benefits of clustering with the statistical efficiency of stratification. Different strata can use different sampling intensities, SSU layouts, or even PSU configurations—what matters is that estimation is carried out separately within each stratum and then combined using the stratified estimators above.
13.4.6 Illustrative analyses
13.4.6.1 Cluster sampling
To illustrate cluster sampling, we return to the Harvard Forest dataset. This is again a toy-like example, but it allows us to walk through the full workflow using real data and compare sample-based estimates to the true population values.
The objective of the cruise is to estimate average AGB per hectare and total AGB for the entire forest. The population of interest includes all stems with DBH \(\ge\) 12.7 cm.
We use systematic sampling to select the locations of our PSUs. Each PSU consists of three fixed-area circular SSUs, positioned 10 meters from the PSU center at bearings of \(0^\circ\), \(135^\circ\), and \(225^\circ\). Each SSU has a radius of 5 meters and is used to tally all trees meeting the DBH threshold. See Section 12.3 for more on plot-based sampling.
As described in Section 13.2, we use the sys_grid()
function to generate a systematic grid of PSU centers across the forest. This function constructs the grid around a randomly chosen starting point. For this example, the spacing is 100 by 100 meters, yielding \(n = 35\) PSUs. These grid points serve as the anchor locations for each PSU.
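To make the SSU layout concrete, the sketch below derives SSU centers from PSU centers; psu_centers is our stand-in name for the sf point object returned by sys_grid(), assumed to be in a metric CRS.
library(sf)
library(tidyverse)
# Bearings in radians, measured clockwise from north.
bearing <- c(0, 135, 225) * pi / 180
ssu_centers <- psu_centers %>%
  st_coordinates() %>%
  as_tibble() %>%
  mutate(psu_id = row_number()) %>%
  uncount(3, .id = "ssu_id") %>% # Three SSUs per PSU.
  mutate(
    X = X + 10 * sin(bearing[ssu_id]), # 10 m easting offset.
    Y = Y + 10 * cos(bearing[ssu_id])  # 10 m northing offset.
  ) %>%
  st_as_sf(coords = c("X", "Y"), crs = st_crs(psu_centers))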
FIGURE 13.8: Cluster sampling with primary sampling units (PSUs), i.e., clusters, placed using a systematic design in the Harvard Forest example.
The cruise tree data are stored in "datasets/HF/HF_cluster_sys.csv"
and read into ssu_trees
below. Use glimpse()
to review the columns and indexing. Notice the SSU ID is nested within the PSU ID, which has implications for how we group the data in subsequent calculations.
ssu_trees <- read_csv("datasets/HF/HF_cluster_sys.csv")
ssu_trees %>% glimpse()
#> Rows: 586
#> Columns: 7
#> $ psu_id <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2…
#> $ ssu_id <dbl> 1, 1, 2, 3, 3, 3, 3, 3, 1, 1, 2, 2…
#> $ tree_id <dbl> 1, 2, 1, 1, 2, 3, 4, 5, 1, 2, 1, 2…
#> $ tree_count <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ agb_Mg <dbl> 0.416741, 0.079753, 0.000000, 0.10…
#> $ dbh_cm <dbl> 24.9, 13.3, 0.0, 14.8, 24.9, 15.8,…
#> $ species <chr> "rubra", "rubra", "no trees", "gra…
Before proceeding with estimation, it's good practice to confirm how the data are structured. The code below checks the number of PSUs and the number of SSUs per PSU.
# Number of PSUs.
ssu_trees %>%
  summarize(n = n_distinct(psu_id))
#> # A tibble: 1 × 1
#> n
#> <int>
#> 1 35
# Number of SSUs per PSU.
ssu_trees %>%
group_by(psu_id) %>%
summarize(m_j = n_distinct(ssu_id)) %>%
distinct(m_j)
#> # A tibble: 1 × 1
#> m_j
#> <int>
#> 1 3
As expected, each PSU contains three SSUs.
We’ll now walk through the estimation workflow using the estimators from Section 13.4.4. We begin by expanding AGB to a per-hectare basis using the tree factor for the 5-meter radius SSU plots. Then we compute SSU-level summaries. Since the SSU index is only unique within its PSU, we must group by both PSU and SSU in the code below.
# Add tree factor for the given plot area.
r <- 5 # SSU plot radius in meters.
ssu_trees <- ssu_trees %>%
mutate(TF = 10000 / (pi * r^2))
# Compute the SSU plot-level AGB estimates.
ssu_summary <- ssu_trees %>%
group_by(psu_id, ssu_id) %>%
summarize(y_ji = sum(TF * tree_count * agb_Mg))
Next, on our way to computing the overall mean, we summarize to the PSU level by calculating \(y_j\) and \(m_j\) for each PSU. Although ssu_summary
is already grouped by psu_id
, we explicitly redefine the grouping in the code below for clarity.
psu_summary <- ssu_summary %>%
  group_by(psu_id) %>%
  summarize(
    y_j = sum(y_ji),
    m_j = n()
  )
Given psu_summary
, we apply the estimators from (13.16) and (13.18) to compute the estimated population mean and its standard error, along with the associated 95% confidence interval. We then multiply by the total forest area to estimate forest-wide biomass at the end of the piped sequence.
N <- 35 # Forest area in hectares.
psu_summary %>%
summarize(
y_bar = sum(y_j) / sum(m_j),
n = n(),
m_bar = mean(m_j),
s_y_bar = sqrt(1 / (n * m_bar^2) *
sum((y_j - y_bar * m_j)^2) / (n - 1)),
t = qt(df = n - 1, p = 1 - 0.05 / 2),
y_ci_lower = y_bar - t * s_y_bar,
y_ci_upper = y_bar + t * s_y_bar
) %>%
select(y_bar, y_ci_lower, y_ci_upper) %>%
mutate(
t_hat = N * y_bar,
t_ci_lower = N * y_ci_lower, t_ci_upper = N * y_ci_upper
)
#> # A tibble: 1 × 6
#> y_bar y_ci_lower y_ci_upper t_hat t_ci_lower t_ci_upper
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 306. 253. 358. 10693. 8868. 12519.
The census-based AGB for the Harvard Forest is 341.17 Mg/ha, with a forest-wide total of 11,941 Mg. Our estimated mean and total, with 95% confidence intervals, were 305.53 (253.37, 357.69) Mg/ha and 10,693 (8,868, 12,519) Mg, respectively. Based on this sample, both confidence intervals successfully captured the true values.
13.4.6.2 Stratified cluster sampling
Next, let’s extend the cluster design to take advantage of the Harvard Forest strata introduced in Section 13.3.6 and shown in Figure 13.3.
We’ll use the same systematic sample of PSUs from the previous section. Now, with strata imposed, any SSU whose center falls in a different stratum than its PSU is removed. This is visible in Figure 13.9, where some PSUs near stratum boundaries have fewer than three SSUs.
One SSU's plot boundary extends beyond its stratum boundary, so a boundary slopover correction is applied. As in the stratified sampling illustration, we use the mirage correction: if an SSU overlaps the stratum boundary and a tree falls within its mirage correction region, that tree is counted twice by assigning it a tree_count
of two (see Section 12.3.3).
FIGURE 13.9: Stratified cluster sampling in the Harvard Forest example, with primary sampling units (PSUs), i.e., clusters, placed using a systematic design.
The cruise tree data are stored in "datasets/HF/HF_strat_cluster_sys.csv"
and read into ssu_trees
below. Use glimpse()
to review the columns and indexing. Notice that SSU IDs are nested within PSU IDs, which are in turn nested within stratum IDs. This nesting structure has implications for how we group the data in subsequent calculations.
ssu_trees <- read_csv("datasets/HF/HF_strat_cluster_sys.csv")
ssu_trees %>% glimpse()
#> Rows: 556
#> Columns: 8
#> $ psu_id <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2…
#> $ ssu_id <dbl> 1, 1, 2, 3, 3, 3, 3, 3, 1, 1, 2, 2…
#> $ tree_id <dbl> 1, 2, 1, 1, 2, 3, 4, 5, 1, 2, 1, 2…
#> $ stratum_id <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4…
#> $ tree_count <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ agb_Mg <dbl> 0.416741, 0.079753, 0.000000, 0.10…
#> $ dbh_cm <dbl> 24.9, 13.3, 0.0, 14.8, 24.9, 15.8,…
#> $ species <chr> "Quercus rubra", "Quercus rubra", …
Before proceeding with estimation, let’s examine the data structure by summarizing the number of PSUs and the number of SSUs per PSU within each stratum.
# Number of PSUs within each stratum.
ssu_trees %>%
group_by(stratum_id) %>%
summarize(n = n_distinct(psu_id))
#> # A tibble: 4 × 2
#> stratum_id n
#> <dbl> <int>
#> 1 1 2
#> 2 2 9
#> 3 3 3
#> 4 4 21
# Number of SSUs per PSU within each stratum.
ssu_trees %>%
group_by(stratum_id, psu_id) %>%
summarize(m_j = n_distinct(ssu_id)) %>%
distinct(m_j)
#> # A tibble: 8 × 2
#> # Groups: stratum_id [4]
#> stratum_id m_j
#> <dbl> <int>
#> 1 1 2
#> 2 1 3
#> 3 2 2
#> 4 2 3
#> 5 3 2
#> 6 3 3
#> 7 4 3
#> 8 4 2
As seen in Figure 13.9, the number of SSUs per PSU varies within some strata.
We’ll now walk through the estimation workflow using the estimators from Section 13.4.5. We begin by expanding AGB to a per-hectare basis using the tree expansion factor for the 5-meter radius SSU plots. Next, we compute stratum-specific summaries at the SSU and then PSU levels to prepare for applying the stratified mean and standard error estimators from (13.20) and (13.21), respectively. Finally, we use these results to compute a 95% confidence interval for the population mean.
# Add tree factor for the given plot area.
r <- 5 # SSU plot radius in meters.
ssu_trees <- ssu_trees %>%
mutate(TF = 10000 / (pi * r^2))
# Compute the SSU plot-level AGB estimates.
ssu_summary <- ssu_trees %>%
group_by(stratum_id, psu_id, ssu_id) %>%
summarize(y_hji = sum(TF * tree_count * agb_Mg))
Next, for each stratum, we summarize to the PSU level by calculating \(y_{hj}\) and \(m_{hj}\), and then to the stratum level, computing \(\bar{y}_h\) and \(s_{\bar{y}_{h}}\) via (13.16) and (13.18), respectively. Although ssu_summary
is already grouped by stratum_id
and psu_id
, we explicitly redefine the grouping in the code below for clarity.
stratum_summary <- ssu_summary %>%
group_by(stratum_id, psu_id) %>%
summarize(
y_hj = sum(y_hji),
m_hj = n()
) %>%
summarize(
y_bar_h = sum(y_hj) / sum(m_hj),
n_h = n(),
m_bar_h = mean(m_hj),
s_y_bar_h = sqrt(1 / (n_h * m_bar_h^2) *
sum((y_hj - y_bar_h * m_hj)^2) / (n_h - 1))
)
stratum_summary
#> # A tibble: 4 × 5
#> stratum_id y_bar_h n_h m_bar_h s_y_bar_h
#> <dbl> <dbl> <int> <dbl> <dbl>
#> 1 1 223. 2 2.5 3.58
#> 2 2 392. 9 2.89 57.0
#> 3 3 55.4 3 2.33 25.0
#> 4 4 314. 21 2.86 28.3
Given the stratum-specific estimates for the mean and its standard error, we apply the stratified estimators (13.20) and (13.21) in the code below.
stratum_summary <- left_join(stratum_summary, stratum_area,
by = "stratum_id"
)
# Estimate population mean and 95% CI.
stratum_summary %>%
summarize(
N = sum(N_h),
y_bar = sum(N_h * y_bar_h) / N,
s_y_bar = sqrt(sum(N_h^2 * s_y_bar_h^2) / N^2),
t = qt(df = sum(n_h - 1), p = 1 - 0.05 / 2),
ci_lower = y_bar - t * s_y_bar,
ci_upper = y_bar + t * s_y_bar
) %>%
select(y_bar, ci_lower, ci_upper)
#> # A tibble: 1 × 3
#> y_bar ci_lower ci_upper
#> <dbl> <dbl> <dbl>
#> 1 314. 266. 362.
This 95% confidence interval is about 9 Mg/ha narrower than the one estimated using unstratified cluster sampling in Section 13.4.6.1. However, this difference could simply be due to chance, since the comparison is based on a single sample. To more reliably assess whether stratification improves efficiency in this setting, we turn to repeated sampling. That’s the focus of the next section.
13.4.6.3 Comparing stratified and unstratified cluster sampling
Let’s compare coverage rates and average 95% confidence interval widths for the unstratified and stratified cluster sampling examples from the two preceding sections. We performed a similar assessment for stratified versus unstratified sampling in Sections 13.3.6.1 and 13.3.6.2. As before, we repeatedly sample from the Harvard Forest population to empirically evaluate the performance of each estimator.
In this case, we fix the total number of PSUs and use a systematic layout to generate 1,000 independent samples. For each sample, we compute the mean AGB and its 95% confidence interval using both unstratified and stratified cluster estimators. This allows us to compare both coverage properties and the precision of resulting estimates.
We expect roughly 95% of confidence intervals to capture the true population mean, and, if stratification is effective, we should see narrower intervals on average. The results bear this out: the empirical coverage rates were 98.1% and 95.1% for the unstratified and stratified estimators, respectively. The average confidence interval widths were 100 and 89 Mg/ha for the unstratified and stratified cases, respectively. The percent gain in precision from stratification is 11%, reflecting the reduction in average interval width achieved by incorporating strata.
You can reproduce these results using the following replicate estimate files located in the datasets/HF/
directory:
- clust_sys_agb_ests_1000reps.csv (unstratified)
- strat_clust_sys_agb_ests_1000reps.csv (stratified)
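Assuming each file holds one row per replicate with interval endpoints (hypothetical column names ci_lower and ci_upper), coverage and average width can be computed along these lines:
reps <- read_csv("datasets/HF/strat_clust_sys_agb_ests_1000reps.csv")
mu <- 341.17 # Census-based mean AGB (Mg/ha).
reps %>%
  summarize(
    coverage = mean(ci_lower <= mu & mu <= ci_upper),
    mean_width = mean(ci_upper - ci_lower)
  )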
Interestingly, the gains in precision from stratification are similar to those observed in Section 13.3.6.2, despite the use of clustered PSUs with differing plot sizes and configurations. These gains highlight the informational value of effective stratification, independent of the specific sampling design.
13.4.7 Advantages and disadvantages of cluster sampling
Cluster sampling is widely used in forest inventory because it can greatly reduce the cost and complexity of field data collection. It’s especially attractive when measurement costs are low and travel time is high. For these reasons, cluster designs are a central feature of many national forest inventory (NFI) programs around the world.
13.4.7.1 Advantages
Operational efficiency: Cluster sampling reduces travel time by grouping nearby observations, making it easier and more cost-effective to reach remote or difficult-to-access areas.
Simplified logistics: Fieldwork can be streamlined because crews move within a single PSU rather than across dispersed locations.
Feasible under limited information: PSUs can be laid out even in the absence of detailed auxiliary data, unlike stratified designs that depend on known strata boundaries.
Compatible with systematic sampling: PSUs can be established on a regular grid using methods like SYS, allowing for broad spatial coverage without full randomization.
Aligned with operational structure: Some forest monitoring programs are naturally organized around PSUs, making cluster sampling a practical choice.
13.4.7.2 Disadvantages
Reduced statistical efficiency: Because SSUs within a PSU are often spatially autocorrelated, they provide less independent information than the same number of SSUs sampled across the entire population. This correlation inflates the variance of the resulting estimates compared with an SRS of equal size.
Smaller effective sample size: Since the PSU is the unit of randomization, the effective sample size is closer to the number of PSUs than to the total number of SSUs, which can reduce precision.
Despite these limitations, cluster sampling remains a valuable option when logistical constraints dominate and precision requirements are moderate. It often represents the best balance between cost and statistical rigor in large-area forest inventory.
13.5 Multistage sampling
Multistage sampling extends the logic of cluster sampling by incorporating multiple layers of selection. Rather than selecting all plots or points in a single step from across the whole population, a multistage design chooses units in a sequence of stages. In forest inventory, this often mirrors how forests are already organized for management: large areas such as stands or management units serve as primary sampling units (PSUs), and within each selected PSU, a further sample of smaller units (secondary sampling units, or SSUs) is taken.
While multistage designs can have any number of stages, two-stage sampling is by far the most common. In this design, we:
- select a sample of PSUs (e.g., stands, management units, or another areal unit used to partition the population of interest);
- within each sampled PSU, select a further sample of SSUs (e.g., plots, points, or other smaller units of observation).
This two-stage structure offers flexibility. It allows tailoring of sampling intensity within PSUs, accommodates operational constraints, and reduces travel costs by concentrating sampling effort.
13.5.1 Connection to cluster sampling
Two-stage sampling is closely related to cluster sampling. The standard cluster design in Section 13.4 is a special case of two-stage sampling where:
- PSUs are selected using SRS or SYS,
- all SSUs within each selected PSU are measured.
In general two-stage sampling, the second condition is relaxed. Instead of measuring every SSU in each selected PSU, a probability sample of SSUs is taken at the second stage. This distinction is important: in cluster sampling, sampling error reflects only variation among PSUs; in two-stage sampling, both stages involve probabilistic selection and contribute to the total variance, and estimation must account for this hierarchical structure.
13.5.2 Why use multistage sampling?
Multistage sampling is often chosen because it aligns with how forests are managed and how fieldwork is organized. For example, it:
- takes advantage of existing natural, management, or administrative units (e.g., stands, ownerships) as convenient PSUs;
- reduces travel time and cost by allowing crews to work intensively within a few selected PSUs rather than dispersing effort across the entire area;
- stretches budgets by sampling more SSUs within fewer PSUs instead of covering every corner of the population;
- uses auxiliary data, such as stand area or previous inventory estimates, often available at the PSU level, to guide sampling;
- allows strategic selection of PSUs using sampling methods that better capture variability in the population and the contribution of each PSU to the parameter of interest.
This design adapts easily to real-world conditions. The number of SSUs per PSU can be varied, different sampling methods can be used at each stage, and unequal probabilities can be applied to match sampling effort with variation in the population.
13.5.3 Strategies for selecting PSUs
The first step in two-stage sampling is selecting PSUs. Common strategies include the following.
- Simple random sampling (SRS): if no prior information is available, draw a random sample of PSUs, giving each an equal selection probability.
- Stratification: if PSU-level variables are available (e.g., area, expected volume, productivity class), stratify PSUs and sample within each stratum to improve precision (see Section 13.3.1).
- Systematic sampling: if PSU-level variables are known, sort PSUs by one of these variables (e.g., area, expected volume) and apply systematic sampling with a random start to ensure broad representation.
- Probability proportional to size (PPS): if each PSU has a known measure of size or importance (e.g., area, expected volume), select PSUs with probabilities proportional to that measure.
These approaches can be combined. For example, PSUs might be stratified by size, then selected using equal probability, systematic, or PPS methods within each stratum. Stratum-specific two-stage estimates for means, totals, and standard errors can then be combined as described in Section 13.4.5.
Availability of auxiliary data usually guides the choice of PSU selection method and two-stage estimator. Without auxiliary data, equal probability sampling is the default. When auxiliary variables are available, they can be used to reduce within-stratum variance or concentrate sampling in strata that contribute most to the parameter of interest. When a size-related variable is available and correlated with the parameter, PPS methods can improve efficiency by allocating sampling effort in proportion to each PSU’s contribution.
The next section introduces common two-stage estimators. We frame them assuming each SSU yields a per-unit-area measurement and is drawn using point or plot sampling from within the areal extent of its PSU. We also include the more general forms of these estimators, allowing for PSUs of differing sizes and varying numbers of SSUs per PSU. If PSUs are the same size or the number of SSUs is constant across PSUs, simplified versions of these estimators can be found in Cochran (1977), Thompson (2012), S. L. Lohr (2019), or other sample survey texts.
Because two-stage designs involve both between-PSU and within-PSU variation, formulas can look long and a bit daunting. Treated as step-by-step recipes and paired with the tools in Section 10.2, they’re straightforward to implement. In the end, each estimator delivers a single mean and total, along with their standard errors, for use in building confidence intervals.
13.5.4 Estimation using two-stage sampling
We use the following notation:
- \(N\) number of PSUs in the population;
- \(n\) number of PSUs in the sample;
- \(A_j\) area of PSU \(j\);
- \(A = \sum_{j=1}^N A_j\) total population area;
- \(m_j\) number of SSUs in the \(j\)-th sampled PSU;
- \(\bar{m} = \frac{1}{n} \sum_{j=1}^n m_j\) mean number of SSUs per PSU;
- \(y_{ji}\) per-unit-area value for the \(i\)-th SSU in the \(j\)-th PSU;
- \(\bar{y}_j = \frac{1}{m_j} \sum_{i=1}^{m_j} y_{ji}\) per-unit-area sample mean within PSU \(j\);
- \(s_j^2 = \frac{1}{m_j - 1} \sum_{i=1}^{m_j} (y_{ji} - \bar{y}_j)^2\) sample variance within PSU \(j\).
13.5.4.1 Sampling PSUs with equal probability
The population total \(\tau\) is estimated using
\[\begin{equation}
\hat{\tau} = \frac{N}{n}\sum^n_{j=1}A_j\bar{y}_j.
\tag{13.22}
\end{equation}\]
Given \(\hat{\tau}\), the population mean \(\mu\) is estimated as
\[\begin{equation}
\bar{y} = \frac{\hat{\tau}}{A}.
\tag{13.23}
\end{equation}\]
The standard error for \(\hat{\tau}\) is estimated using
\[\begin{equation}
s_{\hat{\tau}} = \sqrt{
\underbrace{\frac{N^2}{n}\left(1 - \frac{n}{N}\right)s^2}_{\text{between-PSU variance}}
+
\underbrace{\frac{N}{n} \sum_{j = 1}^n \frac{A_j^2}{m_j} s^2_j}_{\text{within-PSU variance}}
},
\tag{13.24}
\end{equation}\]
where \(s^2\) is the among-PSU sample variance, estimated as
\[\begin{equation}
s^2 = \frac{1}{n - 1} \sum_{j = 1}^n \left(A_j \bar{y}_j - \frac{\hat{\tau}}{N}\right)^2.
\end{equation}\]
Given \(s_{\hat{\tau}}\), the standard error for \(\bar{y}\) is
\[\begin{equation}
s_{\bar{y}} = \frac{1}{A} s_{\hat{\tau}}.
\tag{13.25}
\end{equation}\]
Equation (13.24) shows how total variance is partitioned into two sources.
- The first term (between-PSU variance) captures variation among PSU totals. It includes a finite population correction (FPC) because PSUs are sampled without replacement from a finite set (i.e., there is a finite number of PSUs in the population); see Section 11.3.
- The second term (within-PSU variance) captures variation among SSUs within each PSU. It does not include an FPC, because the areal frame within each PSU is treated as effectively infinite; see Section 12.2.1.
This estimator is unbiased for the population total and mean under equal-probability two-stage sampling at both stages.
If the parameter of interest is strongly related to PSU size (e.g., area), or if the total PSU area \(A\) is not known, the ratio estimator presented next often provides improved precision. Although this estimator is biased, the bias is often small and is a reasonable tradeoff for the gain in precision. We will quantify this bias and the gain in precision for an example analysis in Section 13.5.5.
The population mean \(\mu\) is estimated using \[\begin{equation} \bar{y} = \frac{\sum^n_{j=1}m_j\bar{y}_j}{\sum^n_{j=1}m_j} \tag{13.26} \end{equation}\] and its estimated standard error is
\[\begin{equation} s_{\bar{y}} = \bar{y}\sqrt{\left(\frac{n}{n-1}\right)\left(\frac{\sum^n_{j=1}m_j^2}{\left(\sum^n_{j=1}m_j\right)^2} + \frac{\sum^n_{j=1}\left(m_j\bar{y}_j\right)^2}{\left(\sum^n_{j=1}m_j\bar{y}_j\right)^2} - \frac{2\sum^n_{j=1}m_j^2\bar{y}_j}{\left(\sum^n_{j=1}m_j\right)\left(\sum^n_{j=1}m_j\bar{y}_j\right)}\right)\left(1-\frac{n}{N}\right)}. \tag{13.27} \end{equation}\]
Given the sample mean and its standard error, the total and its standard error are computed as
\[\begin{equation}
\hat{\tau} = A\bar{y}
\tag{13.28}
\end{equation}\]
and
\[\begin{equation}
s_{\hat{\tau}} = A s_{\bar{y}}.
\tag{13.29}
\end{equation}\]
For both the unbiased and ratio estimators, confidence intervals for the mean and total follow the usual form \[\begin{align*} \bar{y} \pm t_{df, 1 - \frac{\alpha}{2}} s_{\bar{y}}, \\ \hat{\tau} \pm t_{df, 1 - \frac{\alpha}{2}} s_{\hat{\tau}}, \end{align*}\] with degrees of freedom \(df = n - 1\).
13.5.4.2 Sampling PSUs with probability proportional to size
When auxiliary information on PSU size is available prior to sampling—such as the area \(A_j\) of each PSU—efficiency can be improved by selecting PSUs with PPS. In this context, size is often measured as area, but any variable positively correlated with the parameter of interest can be used.
We assume the following design elements.
- Each PSU \(j\) is selected with probability \(\pi_j = A_j / A\), where \(A\) is the total area of all PSUs (a code sketch of this selection step follows the list).
- Sampling is done with replacement, with \(n\) independent draws. If a PSU is selected more than once, a new SSU sample is drawn independently each time.
- Within each selected PSU, a SRS—or more commonly SYS—of \(m_j\) SSUs is taken.
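In code, the first-stage draw might look like the following sketch, assuming all_psu holds one row per PSU with its area in a column A_j. This mirrors how the example PPS sample in Section 13.5.5.2 was drawn.
sampled_psu <- all_psu %>%
  slice_sample(n = 50, weight_by = A_j, replace = TRUE)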
The population total \(\tau\) is estimated using \[\begin{equation} \hat{\tau} = \frac{1}{n} \sum_{j = 1}^n \frac{A_j \bar{y}_j}{\pi_j}, \tag{13.30} \end{equation}\] where the sum is taken over the \(n\) PSUs selected with replacement. Some PSUs may appear more than once, and each appearance contributes an independent term to the sum.
The population mean is estimated as
\[\begin{equation}
\bar{y} = \frac{\hat{\tau}}{A}.
\tag{13.31}
\end{equation}\]
The standard error for \(\hat{\tau}\) is estimated using
\[\begin{equation}
s_{\hat{\tau}} = \sqrt{\frac{1}{n(n - 1)} \sum_{j = 1}^n \left( \frac{A_j \bar{y}_j}{\pi_j} - \hat{\tau} \right)^2 }.
\tag{13.32}
\end{equation}\]
Given \(s_{\hat{\tau}}\), the standard error for \(\bar{y}\) is
\[\begin{equation}
s_{\bar{y}} = \frac{1}{A} s_{\hat{\tau}}.
\tag{13.33}
\end{equation}\]
Confidence intervals for the mean and total follow the usual form \[\begin{align*} \bar{y} \pm t_{df, 1 - \frac{\alpha}{2}} s_{\bar{y}}, \\ \hat{\tau} \pm t_{df, 1 - \frac{\alpha}{2}} s_{\hat{\tau}}, \end{align*}\] with degrees of freedom \(df = n - 1\).
This estimator is unbiased for both the total and mean. PPS sampling is especially useful when there is high variation in PSU size and that size is related to the parameter of interest. By allocating sampling effort in proportion to PSU contribution to the population parameter, PPS sampling can yield lower standard errors compared with equal-probability sampling.
If a PSU is selected more than once, the crew simply treats each selection as a separate within-PSU sample. For example, they might lay out multiple systematic grids in the same PSU, each with its own random start (see Section 13.2.2), and cruise them during a single visit.
13.5.5 Illustrative analysis
We’ll demonstrate the two-stage estimators using a motivating example based on simulated data. While the dataset is synthetic, the scenario mirrors the kind of feasibility analysis that forestry consultants or analysts routinely perform.
Suppose we’re hired by a mass timber mill and fabrication firm exploring the viability of locating a facility in the region shown in Figure 13.10. The firm expects the state’s Department of Natural Resources (DNR) to be its primary supplier and wants to assess how much suitable biomass is available on DNR-managed lands around the proposed facility. This kind of pre-operational analysis is often referred to as a woodbasket study, where the woodbasket refers to the supply of accessible wood within a defined region.
FIGURE 13.10: Conifer and mixed hardwood/softwood stands managed by the Michigan Department of Natural Resources within a one-hour drive of the proposed mass timber mill and fabrication facility. These stands define the population of primary sampling units (PSUs). A simple random sample of 50 PSUs was selected for inventory.
To identify the region of interest, the firm determines that a one-hour travel time is the upper limit for economically feasible wood hauling. We use this threshold to compute an isochrone—a polygon showing the area reachable within a fixed travel time using the road network. In this case, the isochrone includes all locations accessible within one hour of the proposed facility. The fabrication process relies primarily on softwoods, though the firm is experimenting with some hardwood inputs.
Our parameters of interest are the mean per-acre and total dry weight of aboveground biomass (AGB) on all DNR conifer and mixed hardwood/softwood stands within the one-hour isochrone.
The population consists of 8,927 stands, and our cruising budget supports a sample of 50. Stands serve as the PSUs, and the SSUs are inventory plots placed within each one. We’ll use 1/10-acre circular plots laid out on a regular 7-by-7 chain (462 ft) grid—a systematic design commonly used in operational inventory work throughout the region.
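As a quick aside, the grid spacing implies roughly one plot per five acres, which we can verify with a couple lines of arithmetic.
spacing_ft <- 7 * 66 # 7 chains at 66 ft per chain.
spacing_ft^2 / 43560 # Acres represented by each grid cell.
#> [1] 4.9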
Figure 13.10 shows the full PSU population along with 50 PSUs selected using SRS. Figure 13.11 highlights a region containing several selected PSUs, and Figure 13.12 provides a closer view of these PSUs and their SSUs.
FIGURE 13.11: Zoomed-in view from Figure 13.10, highlighting three selected primary sampling units (PSUs).
FIGURE 13.12: The three primary sampling units (PSUs) shown in Figure 13.11. PSU IDs are 1895, 1791, and 1948 for (a), (b), and (c), respectively. Secondary sampling units (SSUs) were placed using systematic sampling. Each circular inventory plot is colored by the observed aboveground biomass (AGB).
In the sections that follow, we’ll apply each two-stage estimator to estimate both mean and total AGB. Because the data are simulated, we know the true population values: \(\mu\) = 30.69 tons/ac and \(\tau\) = 11,100,216 tons. This lets us evaluate and compare each estimator’s performance against known parameters.
As you work through these sections, you’ll notice that point estimates (means and totals) are similar across the estimators. As discussed when the estimators were introduced, the standard error under the equal probability unbiased estimator tends to be larger than that of the ratio estimator. The standard error from the PPS estimator is smaller than either of the equal probability estimators. However, these results are based on just one sample, so it’s not clear whether the observed differences in standard errors are due to estimator performance or sample-to-sample variability. To better understand how the estimators perform under repeated sampling, we turn to a simulation-based comparison in Section 13.5.5.3.
13.5.5.1 Estimation using an equal probability sample
We’ll begin by estimating mean and total AGB using a sample of 50 PSUs selected using SRS. Under this design, each PSU in the population has an equal probability of selection. The following three data files are used in this analysis.
- datasets/mass_timber/all_psu.csv contains the full set of PSUs in the population. Each row is a PSU, and columns hold a unique PSU index and area in acres.
- datasets/mass_timber/sampled_equal_prob_psu.csv contains the sampled PSUs. These were drawn from the full set of PSUs using dplyr::slice_sample() with \(n = 50\) and all arguments set to their default values.
- datasets/mass_timber/sampled_ssu_equal_prob_psu.csv contains the SSUs for each sampled PSU. Each row corresponds to a tree measured on the 1/10-acre circular plot, and columns are: psu_id and ssu_id, which together uniquely identify each tree; tree_count, which indicates how many trees are associated with each row (used to account for zero-tree plots and trees double-counted via the mirage method); and agb_ton_tree, the dry weight AGB in tons for each tree—our variable of interest, the \(y_{ji}\)’s.
The code below reads and takes a quick look at each data file.
all_psu <- read_csv("datasets/mass_timber/all_psu.csv")
all_psu %>% glimpse()
#> Rows: 8,927
#> Columns: 2
#> $ psu_id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,…
#> $ ac <dbl> 44.565, 21.690, 51.851, 131.325, 75.43…
sampled_psu <-
read_csv("datasets/mass_timber/sampled_equal_prob_psu.csv")
sampled_psu %>% glimpse()
#> Rows: 50
#> Columns: 2
#> $ psu_id <dbl> 1017, 8004, 4775, 8462, 4050, 8789, 13…
#> $ ac <dbl> 30.787, 18.785, 11.723, 16.064, 38.443…
ssu_trees <-
read_csv("datasets/mass_timber/sampled_ssu_equal_prob_psu.csv")
ssu_trees %>% glimpse()
#> Rows: 7,750
#> Columns: 4
#> $ psu_id <dbl> 1017, 1017, 1017, 1017, 1017, 10…
#> $ ssu_id <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ tree_count <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ agb_ton_tree <dbl> 0.16786, 0.16999, 0.14626, 0.151…
We’ll now rename a few columns to align with our notation and create variables needed in subsequent steps.
all_psu <- all_psu %>% rename(A_j = ac)
sampled_psu <- sampled_psu %>% rename(A_j = ac)
N <- nrow(all_psu)
n <- nrow(sampled_psu)
A <- sum(all_psu$A_j)
# Add the tree factor for 1/10-acre plot.
r <- sqrt(0.1 * 43560 / pi)
ssu_trees <- ssu_trees %>% mutate(TF = 43560 / (pi * r^2))
13.5.5.1.1 Unbiased estimator
We’ll now apply the unbiased estimator defined in Section 13.5.4.1. The formulas for the total and mean are given in (13.22) and (13.23), with their associated standard errors in (13.24) and (13.25), respectively.
Below we break the process into two steps. First, we compute PSU-level summaries and join to sampled_psu
, which adds the A_j
column (area in acres for each PSU) needed for the second step. Second, we apply the unbiased estimators to compute the total and mean AGB, and compute their associated standard errors.
The psu_summary
output below is the summary over SSUs within each of the \(n\) = 50 PSUs.
psu_summary <- ssu_trees %>%
group_by(psu_id, ssu_id) %>%
summarize(y_ji = sum(TF * tree_count * agb_ton_tree)) %>%
summarize(
y_bar_j = mean(y_ji),
m_j = n(),
s_sq_j = 1 / (m_j - 1) * sum((y_ji - y_bar_j)^2)
) %>%
left_join(sampled_psu, by = join_by(psu_id))
psu_summary
#> # A tibble: 50 × 5
#> psu_id y_bar_j m_j s_sq_j A_j
#> <dbl> <dbl> <int> <dbl> <dbl>
#> 1 465 54.2 29 340. 145.
#> 2 526 56.5 27 169. 126.
#> 3 878 10.9 11 224. 33.8
#> 4 1017 33.0 6 316. 30.8
#> 5 1069 56.4 8 212. 29.6
#> 6 1129 8.76 3 197. 26.2
#> 7 1222 26.3 3 638. 11.1
#> 8 1301 48.7 5 77.2 27.9
#> 9 1530 53.3 7 402. 41.7
#> 10 1791 49.4 34 365. 148.
#> # ℹ 40 more rows
The psu_summary
estimates are then used to form the population estimates below.
pop_summary <- psu_summary %>%
summarize(
tau_hat = N / n * sum(A_j * y_bar_j),
y_bar = tau_hat / A,
s_sq = 1 / (n - 1) * sum((A_j * y_bar_j - tau_hat / N)^2),
s_tau_hat = sqrt((1 - n / N) * N^2 / n * s_sq +
N / n * sum(A_j^2 / m_j * s_sq_j)),
s_y_bar = s_tau_hat / A
)
# Remove the intermediate PSU variance before printing.
pop_summary %>% select(-s_sq)
#> # A tibble: 1 × 4
#> tau_hat y_bar s_tau_hat s_y_bar
#> <dbl> <dbl> <dbl> <dbl>
#> 1 13258617. 36.7 2185319. 6.04
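The chunk above stops at the standard errors; a minimal sketch of the 95% confidence interval step, using \(df = n - 1\), follows.
pop_summary %>%
  mutate(
    t = qt(df = n - 1, p = 1 - 0.05 / 2),
    y_ci_lower = y_bar - t * s_y_bar,
    y_ci_upper = y_bar + t * s_y_bar,
    t_ci_lower = tau_hat - t * s_tau_hat,
    t_ci_upper = tau_hat + t * s_tau_hat
  )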
Our estimated mean and total, with 95% confidence intervals, were 36.66 (24.52, 48.80) tons/ac and 13,258,617 (8,867,053, 17,650,180) tons, respectively. Based on this sample, both confidence intervals captured the true values, which are \(\mu\) = 30.69 tons/ac and \(\tau\) = 11,100,216 tons.
13.5.5.1.2 Ratio estimator
Here we implement the ratio estimator for mean, total, and their associated standard errors, as defined in (13.26), (13.28), (13.27), and (13.29).
As before, we break the process into two steps. First, we summarize SSU-level information within each PSU. Second, we compute the required quantities across all sampled PSUs to produce the population estimates. One note on the code below: because SSUs are laid out on a fixed systematic grid, the number of plots \(m_j\) in a PSU is roughly proportional to its area \(A_j\), so the mean is computed with \(A_j\) in place of the \(m_j\) weights shown in (13.26); the standard error follows (13.27) as written.
psu_summary <- ssu_trees %>%
group_by(psu_id, ssu_id) %>%
summarize(y_ji = sum(TF * tree_count * agb_ton_tree)) %>%
summarize(
y_bar_j = mean(y_ji),
m_j = n()
) %>%
left_join(sampled_psu, by = join_by(psu_id))
pop_summary <- psu_summary %>%
summarize(
y_bar = sum(A_j * y_bar_j) / sum(A_j),
tau_hat = A * y_bar,
s_y_bar = y_bar *
sqrt(n / (n - 1) *
(sum(m_j^2) / sum(m_j)^2 +
sum((m_j * y_bar_j)^2) / sum(m_j * y_bar_j)^2 -
2 * sum(m_j^2 * y_bar_j) / (sum(m_j) * sum(m_j * y_bar_j))
) * (1 - n / N)),
s_tau_hat = A * s_y_bar
)
pop_summary
#> # A tibble: 1 × 4
#> y_bar tau_hat s_y_bar s_tau_hat
#> <dbl> <dbl> <dbl> <dbl>
#> 1 35.6 12860648. 2.88 1040049.
Our estimated mean and total, with 95% confidence intervals, were 35.56 (29.78, 41.34) tons/ac and 12,860,648 (10,770,591, 14,950,706) tons, respectively. Based on this sample, both confidence intervals captured the true values, which are \(\mu\) = 30.69 tons/ac and \(\tau\) = 11,100,216 tons.
13.5.5.2 Estimation using a PPS sample
Here we’ll apply the PPS estimators defined in Section 13.5.4.2. We again use a sample of 50 PSUs; however, unlike under equal probability sampling, these PSUs are selected proportional to their area. In addition to all_psu
from the previous section, the following two data files are used in this analysis.
- datasets/mass_timber/sampled_pps_psu.csv contains the sampled PSUs. Each row is a PSU, and columns hold a unique PSU index and area in acres. These PSUs were selected using dplyr::slice_sample() with the weight_by argument set to each PSU’s area, applying the desired selection probabilities \(\pi_j = A_j / A\). The argument replace = TRUE allows for selection with replacement.
- datasets/mass_timber/sampled_ssu_pps_psu.csv contains the SSUs for each sampled PSU. Each row corresponds to a tree measured on the 1/10-acre circular plot, and columns are: psu_id and ssu_id, which together uniquely identify each tree; tree_count, which indicates how many trees are associated with each record (used to account for zero-tree plots and trees double-counted via the mirage method); and agb_ton_tree, the dry weight AGB in tons for each tree—our variable of interest, the \(y_{ji}\)’s.
The code below reads and takes a quick look at each data file.
sampled_psu <- read_csv("datasets/mass_timber/sampled_pps_psu.csv")
sampled_psu %>% glimpse()
#> Rows: 50
#> Columns: 2
#> $ psu_id <chr> "316_1", "323_1", "358_1", "420_1", "5…
#> $ ac <dbl> 19.939, 67.284, 188.677, 128.584, 120.…
ssu_trees <- read_csv("datasets/mass_timber/sampled_ssu_pps_psu.csv")
ssu_trees %>% glimpse()
#> Rows: 16,586
#> Columns: 4
#> $ psu_id <chr> "316_1", "316_1", "316_1", "316_…
#> $ ssu_id <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2,…
#> $ tree_count <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ agb_ton_tree <dbl> 0.00294279, 0.00183660, 0.005673…
Next, we rename a few columns to align with our notation and create variables needed in subsequent steps. The set of all PSUs all_psu, N, and A are the same as in the previous section.
sampled_psu <- sampled_psu %>% rename(A_j = ac)
n <- nrow(sampled_psu)
# Add the tree factor for 1/10-acre plots
r <- sqrt(0.1 * 43560 / pi)
ssu_trees <- ssu_trees %>% mutate(TF = 43560 / (pi * r^2))
Given the PPS sample, we apply the mean and total estimators defined in (13.31) and (13.30), along with their associated standard errors in (13.33) and (13.32).
As with the equal probability estimators, we break the process into two steps. First, we summarize SSU-level information within each PSU, join to sampled_psu
to add the area column A_j
, and compute the selection probabilities pi_j
, both of which are needed for the second step. Second, we compute the required quantities across all sampled PSUs to produce the overall estimates and standard errors.
psu_summary <- ssu_trees %>%
group_by(psu_id, ssu_id) %>%
summarize(y_ji = sum(TF * tree_count * agb_ton_tree)) %>%
summarize(
y_bar_j = mean(y_ji),
m_j = n(),
s_sq_j = 1 / (m_j - 1) * sum((y_ji - y_bar_j)^2)
) %>%
left_join(sampled_psu, by = join_by(psu_id)) %>%
mutate(pi_j = A_j / A)
psu_summary %>% glimpse()
#> Rows: 50
#> Columns: 6
#> $ psu_id <chr> "1017_1", "1069_1", "1418_1", "1530_1…
#> $ y_bar_j <dbl> 27.0608, 47.5176, 51.1689, 48.8034, 3…
#> $ m_j <int> 5, 7, 39, 10, 23, 24, 43, 43, 9, 3, 5…
#> $ s_sq_j <dbl> 324.535, 139.027, 145.808, 215.842, 5…
#> $ A_j <dbl> 30.787, 29.630, 216.419, 41.660, 128.…
#> $ pi_j <dbl> 0.000085126, 0.000081929, 0.000598408…
pop_summary <- psu_summary %>%
summarize(
tau_hat = 1 / n * sum(A_j * y_bar_j / pi_j),
y_bar = tau_hat / A,
s_tau_hat = sqrt(1 / (n * (n - 1)) *
sum((A_j * y_bar_j / pi_j - tau_hat)^2)
),
s_y_bar = s_tau_hat / A
)
pop_summary
#> # A tibble: 1 × 4
#> tau_hat y_bar s_tau_hat s_y_bar
#> <dbl> <dbl> <dbl> <dbl>
#> 1 11979201. 33.1 672744. 1.86
Our estimated mean and total, with 95% confidence intervals, were 33.12 (29.38, 36.86) tons/ac and 11,979,201 (10,627,272, 13,331,131) tons, respectively. Based on this sample, both confidence intervals captured the true values, which are \(\mu\) = 30.69 tons/ac and \(\tau\) = 11,100,216 tons.
Notes
- Any of these objects can be saved to file using st_write() from the sf package to be read by GIS or GPS software.
- The specific projection is NAD83 / Pennsylvania South (ft), which corresponds to EPSG:2271.
- In plot_sys_grid(), notice how the geom_sf() layers automatically render each sf object based on its geometry type—points, lines, or polygons—making it easy to build clean and consistent maps. You can modify the function to adjust colors, symbols, or other plot features to suit your needs.
- We use a for loop here to save space. Compare with the hardcoded approach in Section 11.3.6.1.
- The bounds object is projected in meters. You can check an sf object’s distance units using st_crs(bounds)$units or units::deparse_unit(st_distance(bounds)).
- This works because plot_id is unique across strata. If that’s not the case, for example if plot IDs are reused within each stratum, we’d need to include stratum_id in the grouping to avoid mixing trees from different strata that share the same plot ID.
- At the time of writing, much of the hemlock at Harvard Forest and across the broader region had already succumbed to HWA (D. A. Orwig et al. 2012).
- This type of road-based travel-time polygon can be computed in R using the osrm package, which provides an interface to the Open Source Routing Machine. The package supports travel-time and routing calculations using OpenStreetMap’s road network data. Other tools and packages are also available for generating isochrones, depending on your needs and data sources.