Chapter 13 Sampling designs, implementation, and analysis
In Chapters 11 and 12, we focused on inference using sample data collected via a SRS design. In this chapter, we introduce several additional probability sampling designs commonly used in forestry. Recall that a sampling design is a set of rules for selecting units from a population and using the resulting data to generate statistically valid estimates for parameters of interest. We focus on four designs: 1) systematic sampling, 2) stratified sampling, 3) cluster sampling, and 4) multistage sampling. These designs differ in how they configure and select observations and in the estimators they prescribe. In most cases, a design is chosen to improve efficiency from a statistical, logistical, or combined standpoint.
As discussed in Section 12.1, forest inventory typically relies on an areal sampling frame, where sampling locations are identified by randomly placing points within the frame’s spatial extent. Under plot sampling, the location might mark the center of a circular plot, a corner of a rectangular plot, or an anchor point for more complex plot shapes. Under point sampling, a location might serve as the point from which a forester projects the discerning angle to identify measurement trees. Unless stated otherwise, the sampling designs described in this chapter can be used with either plot or point sampling—that is, the design governs the selection of sampling locations, while a separate set of rules determines how trees are selected around each location.
After introducing the designs, we shift attention to the use of auxiliary information, which we refer to as sampling with covariates. In forestry, we’re often interested in estimating parameters based on variables that are difficult or expensive to measure. Sampling with covariates improves precision by incorporating other variables that are easier to measure and correlated with the variable of interest.
The remainder of the chapter is organized as follows. Sections 13.2 through 13.5 describe systematic, stratified, cluster, and multistage sampling designs. Section 13.6 covers strategies for incorporating auxiliary information. As in other mensuration texts (Kershaw et al. 2016; Burkhart, Avery, and Bullock 2018), we present estimators associated with each design. Our added value is in showing how to implement these estimators using tidyverse tools and other R packages. Emphasis is placed on building computing workflows for efficient and repeatable analysis.
13.1 A gentle introduction to sf
Because sampling in forestry is inherently spatial—we’re selecting locations within an areal sampling frame—it’s helpful to start working with spatially aware data structures. In R, the primary tool for handling spatial vector data is the sf package.
If you’ve used GIS software before, many of the concepts will feel familiar. Spatial vector data come in three main types: points (e.g., sampling locations), lines (e.g., transects or travel paths), and polygons (e.g., stand boundaries or management units). The sf package represents each of these as a special kind of data frame with an attached geometry column.
You can work with sf objects much like regular data frames, using familiar tidyverse tools such as filter(), select(), and mutate(). We’ll introduce key sf data structures and functions as needed throughout the chapter. For more background, see the sf documentation at https://r-spatial.github.io/sf or the package vignette via vignette("sf1").
We’ll explore sf in more detail in Section 14.1.5.
13.2 Systematic sampling
Systematic sampling (SYS) is one of the most widely used probability sampling designs in forest inventory. Sampling locations are arranged in a fixed pattern across the population. To introduce randomness and ensure unbiased estimation of means and totals, the pattern—typically a rectangular grid—is positioned using a randomly chosen starting point. As illustrated in Figure 1.2, sampling locations are usually aligned along straight lines. The distance between locations on a line, known as the sampling interval, and the distance between lines can differ, creating a rectangular grid. This is commonly referred to as line-plot sampling, which is our focus here.
SYS designs are popular in forestry for several reasons. Compared to SRS, their advantages include:
- Sampling locations are evenly distributed across the population, improving the reliability of mean and total estimates. In contrast, SRS locations may, by chance, miss important spatial features.
- Sampling locations are easier and faster to find, reducing cost. The cruiser can follow straight lines between locations, which minimizes travel time and simplifies relocation during remeasurement.
The grid is typically positioned using GIS or spatial software (e.g., R), and the resulting points are uploaded to a GPS device for navigation. Alternatively, the cruiser may locate the first sampling point in the field and lay out the remainder using the sampling interval and grid orientation.
Key considerations when implementing a SYS design include:
- the pattern and orientation of the grid;
- the method used to randomly position the grid.
These points are covered in the next few sections.
13.2.1 Grid pattern, number of grid points, and orientation
In practice, either a rectangular or square grid is used. A rectangular grid defines the spacing between sampling locations along a line, denoted \(D_p\) (think “plot” or “point”), and the spacing between lines, \(D_l\). A square grid is a special case where \(D_{\text{sq}} = D_p = D_l\). Unless there’s a compelling reason to use unequal spacing, a square grid is a reasonable and efficient default.
Given a sample size \(n\) and total area \(A\), the spacing between locations in a square grid is given by
\[\begin{equation} D_{\text{sq}} = \sqrt{\frac{cA}{n}}, \end{equation}\]
where \(c\) is a conversion factor to express area in square distance units. For example, use \(c = 10,\!000\ \text{m}^2/\text{ha}\) if \(A\) is in hectares and \(D_{\text{sq}}\) is in meters, or \(c = 43,\!560\ \text{ft}^2/\text{ac}\) if \(A\) is in acres and \(D_{\text{sq}}\) is in feet.
As introduced in Section 11.3.6, sample size \(n\) should be based on the level of precision needed to meet the survey’s objectives. However, in practice, it’s common for experienced foresters to adopt a sampling intensity that works well for their forest type and inferential objectives. For example, the forester who conducted the Elk County cruise described in Section 1.2.2 typically uses a seven-by-seven chain SYS grid.
If you do use a sample size calculation, as recommended in Section 11.3.6, it’s likely that the number of grid points won’t exactly match the computed \(n\) due to how the grid falls within the boundary. In such cases, we agree with the following sage words.
My advice is simply to take the number of points that are indicated [by the grid], and live with the slight variability. Sample sizes that are “needed” … are just estimated anyway.
— Kim Iles
If you really must have a specific sample size, then the best approach is to specify a denser grid than needed and randomly or systematically thin points until the target sample size is reached (see, e.g., K. Iles (2003)).
The grid orientation can be adjusted to improve field logistics. For example, aligning lines with a forest boundary may simplify navigation. This was the approach used in the Elk County cruise (Figure 1.2).
Although rare, it’s possible for a population to exhibit a systematic pattern that aligns with the grid’s spacing and orientation, which could introduce bias. For example, a grid aligned with plantation rows might consistently place sampling locations between rows, leading to an unrepresentative sample. In such cases, rotating the grid to avoid alignment with periodic features can mitigate bias.
If a known gradient exists in the population—for example, productivity increasing with elevation—orienting the grid perpendicular to that gradient can improve the representativeness of the sample.
13.2.2 Positioning the grid
To avoid bias, the grid must be randomly placed. This is typically done by selecting a random point start within the areal sampling frame and extending the grid in each direction from that point. If rotation is applied, it should be performed about this random point start.
The random point start can be one of the sampling locations or used as an anchor point from which to extend the grid. In the development that follows, we treat the starting point as a location on the sampling grid.
In the next section, we introduce a function that generates systematic sampling grids in R. This function allows users to specify grid spacing, orientation, and how sampling locations are ordered. The code and examples also help highlight some practical considerations when laying out systematic designs.
13.2.3 Creating a systematic sampling grid with sys_grid()
This section is formatted like a mini R manual page—a style introduced earlier in the book—to briefly document the sys_grid() function. You’re encouraged to read through the function source code in "scripts/utils.R" (see Section 5.4) to deepen your understanding. The comments in the code walk through each step, and many elements should already be familiar.
Description
Creates a systematic grid of points within a polygonal boundary using a random start. The grid is defined by a fixed spacing between lines (D_l) and between points along each line (D_p). The grid can be rotated and the points ordered from a specified origin to facilitate field navigation.
Usage
Arguments
boundary: Ansfpolygon object defining the sampling boundary. Must be projected with linear units.D_l: Distance between sampling lines (x-direction) in the same units asboundary.D_p: Distance between points along a line (y-direction) in map units asboundary.angle: Optional rotation angle (in degrees) applied to the grid. Default is0.origin: Optional character string indicating the origin corner used to order points. One of"se","sw","ne", or"nw"(default is"se").
Details
The sys_grid() function creates a rectangular grid of points with spacing determined by D_l and D_p, starting from a randomly placed point within the boundary polygon. Here’s how it works:
- A random point start is placed within the
boundary. - A full grid is generated around that point, extended enough to accommodate possible rotation.
- Points are ordered in a snaking pattern from the specified
origin. - If specified, the grid is rotated by
angledegrees about the random point start. - The grid is clipped to the polygon, and the resulting points are labeled and returned.
Ordering by origin can help plan field navigation, especially when following lines. The choice of origin affects how line_id, point_id_in_line, and point_id are assigned, but not the design itself.
Value
Returns a list with the following components147:
grid: Ansfpoint object with columns:point_id: unique ID, ordered from the origin;line_id: identifier for each line;point_id_in_line: position along each line.
boundaryare appended.path: AnsfLINESTRINGconnecting the points in sampling order.random_point_start: The initial random point start (ansfpoint object).boundary: The input polygon(s) used to clip the grid.type: A string used internally to verify the object when plotting.
Use plot_sys_grid() to visualize the resulting grid and path.
13.2.4 Systematic sampling grid illustration with sys_grid()
Let’s apply sys_grid() to set up a SYS design for the Elk County cruise described in Section 1.2.2. In that cruise, the forester used a square grid with sampling locations spaced every seven chains (recall that one chain equals 66 feet).
The sys_grid() function expects the boundary polygon to be in a projection that uses the same distance units as D_l and D_p. The Elk County property is stored in the "datasets/PA/Elk_bounds" ESRI shapefile (Environmental Systems Research Institute 1998), which uses a geographic coordinate system (i.e., decimal degrees of latitude and longitude)—which isn’t an intuitive measurement system for field work. Instead, we want the boundary in a projected coordinate system that allows us to specify D_l and D_p in feet or meters.
After reading the shapefile using st_read(), the code below reprojects the boundary to a projection appropriate for northern Pennsylvania and consistent with the cruise protocol. This is done using st_transform().148
library(sf)
library(tidyverse)
source("scripts/utils.R")
# Read in the bounding polygon.
bounds <- st_read("datasets/PA", "Elk_bounds", quiet = TRUE)
bounds <- st_transform(bounds, crs = st_crs(2271))Once reprojected, bounds has units in feet, so we specify D_l and D_p in feet when calling sys_grid(). Because we don’t specify values for angle and origin, the defaults of 0 and "se", respectively, are used. To ensure the same random point start is generated each time the code is run, we set the random seed prior to calling the function.
We can visualize the resulting SYS design by passing the sys object to the plot_sys_grid() function, defined in the "scripts/utils.R" script. The resulting map is shown in Figure 13.1. We encourage you to explore the plot_sys_grid() function to see how it builds the graphic using ggplot(). Compared to the simpler examples in Chapter 9, this plot is a bit more involved, with a customized legend and multiple spatial layers.149
FIGURE 13.1: Systematic sampling design for an Elk County property cruise.
Let’s create a few variations on the design shown in Figure 13.1 by rotating the grid and choosing a different origin—while keeping the same random point start. This isolates the effect of orientation and ordering. To reduce code clutter, we don’t show the calls to plot_sys_grid() or the patchwork functions used to produce the composite graphic shown in Figure 13.2.
# (a) Rotate grid 45 degrees.
set.seed(1)
sys_a <- sys_grid(bounds,
D_l = 7 * 66, D_p = 7 * 66, angle = 45)
# (b) Northwest corner start.
set.seed(1)
sys_b <- sys_grid(bounds,
D_l = 7 * 66, D_p = 7 * 66, origin = "nw")
# (c) Rotation and northwest corner start.
set.seed(1)
sys_c <- sys_grid(bounds,
D_l = 7 * 66, D_p = 7 * 66,
angle = 45, origin = "nw")FIGURE 13.2: Systematic sampling design for an Elk County cruise under different grid rotations and starting origins.
Notice in Figures 13.1 and 13.2 the random point start remains in the same location—it’s the only random component in the systematic sampling design. Once the starting point is fixed, all other sampling locations are determined by the specified spacing and rotation. This highlights a key feature of systematic sampling: although the layout may appear highly structured, the randomness introduced by the initial point ensures the design retains its probabilistic foundation. As a result, estimators derived from such a design can still be used for valid statistical inference under appropriate assumptions.
With a sampling grid in place, we now turn to estimation using data collected under systematic sampling.
13.2.5 Estimation using SYS
The SYS estimators for the population mean \(\mu\) (11.1) and total \(\tau\) (11.6) are the same unbiased sample mean \(\bar{y}\) (11.15) and total \(\hat{t}\) (11.22) used under SRS (Section 11.3). Both are unbiased under SYS.
There is, however, no unbiased estimator for the sample variance (and thus no estimator for the standard error of the mean) under SYS. The most common solution is to apply the SRS estimators for variance (11.17) and standard error (11.21).
In some cases, however, the SRS variance and standard error of the mean estimators can overestimate uncertainty, yielding conservative confidence intervals. In others, they can underestimate it. Unfortunately, it’s not straightforward to know which case you’re in without additional information about the population.
It’s easy to build some intuition about why SYS lacks a valid variance estimator. Recall from Section 11.3 that the sample variance estimator used for SRS assumes the \(n\) observations are drawn independently and at random from the population. This independence is critical: it justifies the use of \(n - 1\) in the denominator of the variance estimator (11.17), which reflects the number of independent units of information available to estimate population variance.
Now contrast that with SYS. Here, a single random point start determines the entire sample. All subsequent locations are fixed by the grid layout. So, while we observe \(n\) units, only one is selected at random. The remaining \(n - 1\) are not independent—they’re determined once the start is fixed.
From this perspective, SYS behaves like a sample of size one. And plugging \(n = 1\) into the SRS variance estimator (11.17) results in division by zero which is undefined. This highlights the core issue: SYS does not produce a sample of independent observations, so the usual SRS variance estimator(s) don’t apply.
Again, despite these limitations, the SRS variance estimator is commonly used as a practical approximation for SYS. In many situations, it provides a reasonable estimate for the sample variance and, hence, the standard error of the mean. Still, understanding how the population is structured—and how that structure interacts with the sampling grid—is key to interpreting results. See Kershaw et al. (2016) and K. Iles (2003) for further discussion, including cautions and justification for why the SRS variance estimator is often considered acceptable for SYS data.
13.2.6 Advantages and disadvantages of systematic sampling
Systematic sampling offers several practical benefits and is widely used in forestry, but like all designs, it involves trade-offs. Here we summarize its main strengths and limitations, particularly in comparison to SRS.
13.2.6.1 Advantages
Efficient field layout: SYS designs are easy to implement in the field. Sampling locations fall on a predictable grid, allowing cruisers to follow straight lines, reducing travel time and simplifying navigation.
Improved spatial coverage: Grid-based sampling ensures sampling locations are evenly distributed across the population, helping to capture spatial heterogeneity and reduce the risk of missing important features—a common concern with SRS.
Simplifies remeasurement: Because sampling locations follow a fixed, repeatable pattern, they are easier to relocate during future inventory cycles.
Cost-effective: The combination of reduced travel time and simplified logistics typically results in lower per-unit sampling cost compared to SRS.
Good performance in many populations: SYS often yields more precise estimates than SRS for the same sample size, particularly when the population lacks strong periodicity that aligns with the grid. When a broad spatial trend is present, careful grid orientation can actually improve precision—for example, by laying the grid across the direction of the trend to ensure full coverage.
13.2.6.2 Disadvantages
No valid variance estimator: SYS lacks a design-based estimator for the sample variance and standard error. While the SRS estimator is commonly used as an approximation, it may over- or underestimate uncertainty, depending on the population structure.
Potential bias from periodicity: If the population exhibits spatial patterns that align with the grid spacing or orientation (e.g., plantation rows, windrow lines, soil type bands), SYS can produce biased estimates.
Requires spatial data infrastructure: Implementing SYS effectively typically relies on GIS tools and spatial data, which may be a barrier in some settings.
Despite these drawbacks, SYS remains a go-to design in operational forestry due to its simplicity, spatial balance, and low implementation cost. When used thoughtfully—with attention to population structure and proper randomization of the starting point—SYS can yield highly effective inventory data.
13.3 Stratified sampling
In forest inventory, it’s common to estimate population parameters not only for the entire area of interest but also for smaller subareas, or strata. Stratification refers to dividing the population into non-overlapping subareas based on characteristics expected to influence the parameter of interest. These strata are internally homogeneous but meaningfully different from one another. Common stratification variables include species, age, productivity, and management unit.
Stratified sampling involves drawing a probability sample independently within each stratum. While sampling within strata is most often done using SRS or SYS, other probability sampling methods can be used. The main benefits of stratification are potential gains in statistical efficiency, i.e., reduced standard errors, and the ability to produce both overall and stratum-specific estimates.
Stratification is often implemented in one of two ways:
Pre-stratification: Strata are delineated before sampling begins, letting us allocate stratum-specific sample sizes based on stratum size, expected variability, and sampling costs. This gives full control over the number of observations placed in each stratum. In this case, the area of each stratum must be known ahead of time.
Post-stratification: Here, sampling proceeds without regard to strata, and strata are assigned after the fact based on information collected during or after sampling. This still supports stratum-specific estimation, but the per-stratum sample sizes aren’t under our control. The area of each stratum must be determined after sampling to allow valid estimation.
Both approaches are valid. The main advantage of pre-stratification is the ability to apply optimum allocation—assigning more observations to strata that are large or exhibit greater variability, and fewer to those that are small, more homogeneous, or costly to sample. This improves statistical efficiency while controlling sampling costs.
There are substantial statistical and cost-saving benefits to stratification. Even a poor choice of strata yields unbiased estimates and, in almost all practical cases, performs no worse than unstratified sampling. At worst, you’ll end up with about the same standard error as you’d get without stratification; at best, you can dramatically improve statistical and cost efficiency.
Stratified sampling proceeds in two steps:
Assignment: Before any measurements are taken (in the case of pre-stratification), areas of the population—such as stands or management units—are assigned to strata using available information.
Measurement: A sample is then drawn within each stratum. At least two observations per stratum are needed to compute within-stratum variance and associated standard errors.
13.3.1 Choosing stratum membership
Stratification begins by subdividing the population into distinct strata based on available information. Assignment can be guided by prior inventory data (e.g., species, age, height, volume, or biomass), expert judgment, or ancillary sources like topography, canopy structure, or remote sensing-based spectral or structural measurements. Strata may also align with management units or areas where a separate estimate is desired.
The key requirement is that strata don’t overlap. For example, if strata are defined using polygons—where one or more polygons comprise a stratum—those polygons must be spatially distinct, i.e., their boundaries don’t overlap. Each location in the population must belong to one and only one stratum.
Good strata group together areas that are similar to each other and different from the rest. Ideally, variance within a stratum is low, and variance between strata is large. These characteristics improve statistical efficiency.
A stratum can consist of spatially contiguous or disjoint areas. For example, a stratum may comprise spatially adjacent stands or stands scattered across the forest. What’s important is the internal similarity within each stratum, not how the areas that comprise a stratum are arranged.
In the next section, we introduce the estimators used in stratified sampling.
13.3.2 Estimation using stratified sampling
Throughout this section, and in the subsequent illustration, we assume that sampling within each stratum is done using SRS or SYS, and that SRS estimators are used for the stratum means and variances. Other within-stratum designs are possible but may require different estimators.
Let \(M\) denote the number of strata in the population. In each stratum \(j\), we have a sample size \(n_j\). Across all strata, the total sample size is \(n = \sum_{j = 1}^M n_j\). The size of stratum \(j\) is \(N_j\), and the total population size is \(N = \sum_{j = 1}^M N_j\). In the cases considered here, \(N_j\) refers to the area of stratum \(j\), and \(N\) is the total area across all strata.
For each stratum \(j = 1, \dots, M\), the sample mean is \[\begin{equation} \bar{y}_j = \frac{\sum_{i = 1}^{n_j} y_{i, j}}{n_j}, \tag{13.1} \end{equation}\] which is used to estimate the stratum population mean \(\mu_j\).
The overall population mean \(\mu\) (i.e., over all strata) is estimated using \[\begin{equation} \bar{y} = \frac{\sum_{j = 1}^M N_j \bar{y}_j}{N}. \tag{13.2} \end{equation}\] Notice that \(\bar{y}\) is a weighted sum of the stratum-specific sample means, with each stratum weighted by its proportion of the total area, i.e., \(N_j/N\).
The estimate for the population total \(\tau\) over all strata is \[\begin{equation} \hat{t} = \sum_{j = 1}^M N_j \bar{y}_j = N \bar{y}. \tag{13.3} \end{equation}\]
The sample variance within each stratum \(s^2_j\) is estimated using the SRS variance estimator (11.17). If standard errors and confidence intervals are desired for each stratum, they can be computed using (11.20) or (11.21) and (11.31), respectively.
The standard error for the sample mean \(\bar{y}\) is \[\begin{equation} s_{\bar{y}} = \sqrt{\frac{1}{N^2} \sum^M_{j=1} \left[ \frac{N_j^2 s_j^2}{n_j} \, \left( \frac{N_j - n_j}{N_j} \right) \right]}. \tag{13.4} \end{equation}\] If we’re working from an areal sampling frame (i.e., an infinite population), the FPC is removed (see Section 12.2 for discussion), giving \[\begin{equation} s_{\bar{y}} = \sqrt{ \frac{1}{N^2} \sum_{j = 1}^M \frac{N_j^2 s_j^2}{n_j} }. \tag{13.5} \end{equation}\]
The standard error for the total estimate \(\hat{t}\) is \[\begin{equation} s_{\hat{t}} = N s_{\bar{y}}. \tag{13.6} \end{equation}\]
Confidence intervals for the mean follow the usual form \[\begin{equation} \bar{y} \pm t_{df, 1 - \frac{\alpha}{2}} s_{\bar{y}}, \tag{13.7} \end{equation}\] with degrees of freedom for the \(t\)-value computed as \[\begin{equation} df = \sum_{j = 1}^M (n_j - 1) = n - M. \tag{13.8} \end{equation}\]
Confidence intervals for the total \(\hat{t}\) follow the same form using \(s_{\hat{t}}\).
13.3.2.1 Sampling in stratum and independence
In the current setup, we assume \(n_j\) observations within each stratum are drawn independently and at random from the stratum population. In practice, however, SYS is more commonly used. For example, you might place a separate systematic grid within each stratum using a random point start, as described in Section 13.2.2. The grids may differ in spacing or rotation across strata.
The usual concern with SYS remains: with only one random start per stratum, the \(n_j\) observations are not independent. This has implications for inference. For example, the degrees of freedom in (13.8) assume independent observations within each stratum. If a single systematic grid is used to sample the stratum population, that assumption is violated. In such cases, (13.8) may be overstated, and resulting confidence intervals may be misleading.
It’s also common to use a single SYS grid that spans all strata. This is often convenient because the cruiser can follow continuous straight sampling lines that cross stratum boundaries. In this case, only one random point start is used to position the SYS grid over the entire population. Here, we’re even further from the assumption of \(n\) independent units of information.
Nonetheless, SYS is widely used in practice and is adopted in our illustration in Section 13.3.6. As discussed in Section 13.2, when using SYS within and across strata, care should be taken to rotate the grid to avoid alignment with periodic features. Also, we expect that in most cases, issues with observation independence and standard errors should be small.
13.3.3 Sample size determination
After identifying strata (in the case of pre-stratification) and selecting a sampling design, the next step is to determine the total sample size \(n\). This sample size can either be fixed in advance based on available resources or determined based on a desired level of precision for estimating a population mean or total.
Stratified sampling supports two common allocation strategies: proportional allocation and optimum allocation. Proportional allocation assigns sample size to each stratum based solely on its relative area. Optimum allocation also considers variability, and if cost is taken into account, balances that against the relative cost of sampling. The goal is to allocate effort where it provides the greatest gain in precision per unit cost.
When the total sample size \(n\) is unknown, it can be estimated using equations similar to those for SRS (see Section 11.3.6). These equations have the same circularity issue as before: \(n\) appears on both sides of the equation. As with SRS, this can be resolved using the iterative algorithm described in Section 11.3.6.
Importantly, these sample size estimates are intended to meet allowable error targets for the overall population, not for individual strata.
13.3.3.1 Total sample size with proportional allocation
As introduced in Section 11.3.6, we can estimate sample size given an allowable error \(E\) that is in the same units as \(\bar{y}\) and an estimate for each stratum’s sample variance \(s^2_j\). Recall, because we’ve not taken the sample, the value for \(s^2_j\) must come from somewhere else, e.g., from prior experience or inventory data from a similar population.
The total sample size required under proportional allocation for a finite population is given by \[\begin{equation} n = \frac{N \sum_{j=1}^M N_j s_j^2}{\frac{N^2 E^2}{t^2} + \sum_{j=1}^M N_j s_j^2}, \tag{13.9} \end{equation}\] where \(t\) is the \(t\)-value corresponding to the desired confidence level (\(1 - \alpha\)).
If the population is assumed infinite (i.e., under an areal sampling frame), the simplified formula is \[\begin{equation} n = \frac{t^2 \sum_{j=1}^M N_j s_j^2}{N E^2}. \tag{13.10} \end{equation}\]
13.3.3.2 Total sample size with optimum allocation
Optimum allocation minimizes the standard error of the estimate by accounting for both stratum size and variability. The total sample size for a finite population is \[\begin{equation} n = \frac{\left( \sum_{j=1}^M N_j s_j \right)^2}{\frac{N^2 E^2}{t^2} + \sum_{j=1}^M N_j s_j^2}. \tag{13.11} \end{equation}\]
For an infinite population, this becomes \[\begin{equation} n = \frac{t^2 \left( \sum_{j=1}^M N_j s_j \right)^2}{N^2 E^2}. \tag{13.12} \end{equation}\]
13.3.4 Allocation of samples to strata
Given the total sample size \(n\), the next step is to allocate observations among the strata using equations specific to proportional or optimum allocation. Whichever approach was used to determine \(n\) should also be used to allocate samples to the strata.
13.3.4.1 Proportional allocation
In proportional allocation, each stratum receives a share of the total sample based solely on its area. The sample size in stratum \(j\) is given by \[\begin{equation} n_j = n \cdot \frac{N_j}{N}. \tag{13.13} \end{equation}\] This approach is easy to apply but ignores differences in variability across strata.
13.3.4.2 Optimum allocation
In optimum allocation, sample sizes are chosen to minimize the overall standard error. The number of samples in stratum \(j\) is computed as \[\begin{equation} n_j = n \cdot \frac{N_j s_j}{\sum_{j=1}^M N_j s_j}. \tag{13.14} \end{equation}\] This method improves efficiency by assigning more samples to more variable strata.
13.3.4.3 Optimum allocation with cost constraints
If sampling costs differ between strata, we can modify the optimum allocation formula to incorporate cost per sample. Let \(c_j\) denote the cost of sampling a unit in stratum \(j\). The cost-adjusted optimum allocation is \[\begin{equation} n_j = n \cdot \frac{N_j s_j / \sqrt{c_j}}{\sum_{j=1}^M N_j s_j / \sqrt{c_j}}. \tag{13.15} \end{equation}\]
This formula assigns fewer observations to strata that are more expensive to sample, while still accounting for area and variability. It’s particularly useful in operational forest inventories where travel cost, terrain, or accessibility varies across the landscape.
13.3.5 Creating a stratified systematic sampling grid
Once strata have been defined and sample sizes determined using either proportional or optimum allocation, we can construct a systematic sampling grid within each stratum using functions tailored to each approach.
One option is to set up a stratum-specific sampling grid using the sys_grid() function described in Section 13.2.3. This approach offers a lot of flexibility in terms of grid pattern, size, and orientation. It may be especially useful when strata are large or not spatially adjacent.
However, working with different grid orientations and spacings across strata can complicate field logistics. If strata are relatively small and spatially adjacent, it might be simpler for cruisers to work from a single SYS grid that covers the entire population. The next two functions create a single grid across all strata under proportional or optimum allocation.
13.3.5.1 Grid generation for proportional allocation
Description
Generates a systematic sampling grid using fixed spacing between lines (D_l) and points (D_p) across the entire area. Points are assigned to strata based on spatial intersection with the input boundary. This is designed for proportional allocation, where grid density is uniform across strata. Built on sys_grid() with added logic to track stratum membership and within-stratum ordering.
Usage
Arguments
boundary: Ansfpolygon object defining the stratum boundaries. Must be projected with linear units.D_l: Distance between sampling lines (x-direction) in the same units asboundary.D_p: Distance between points along a line (y-direction) in the same units asboundary.angle: Optional rotation angle (in degrees). Default is0.origin: Corner from which to order lines and points. One of"se","sw","ne", or"nw". Default is"se".strat_id_col: Name of the column inboundarythat identifies strata.
Details
This function builds a systematic grid across the union of stratum polygons using sys_grid(), then assigns each point to a stratum via spatial intersection. Each sampling location is tagged with its unique point ID as well as its position within each line and stratum. This approach simplifies logistics when a single, uniform grid is acceptable for all strata.
Value
Returns a list with:
grid: Ansfobject of all points, with columns for uniquepoint_ID,stratum_ID, and other useful indexing IDs. Columns inboundaryare appended.path: AnsfLINESTRINGconnecting sampling points in order.random_point_start: The initial random point start.boundary: The input polygon with stratum attributes.type: A string used internally for validation when plotting.
Use plot_sys_grid() to visualize the resulting grid and path.
13.3.5.2 Grid generation for optimum allocation
Description
Creates a stratified systematic sampling grid that supports optimum allocation, assigning different numbers of grid points to each stratum. Builds a dense grid using sys_grid(), then intersects it with stratum polygons and thins the points in each stratum to approximate the target sample sizes. The final number of points per stratum may not match the targets exactly, but the function tries to get as close as possible. Supports both random and sequential thinning.
Usage
strat_opt_sys_grid(boundary, D_l, angle = 0, origin = "se",
n_col, strat_id_col, thinning_method = "random",
D_p = NULL)Arguments
boundary: Ansfpolygon object defining the stratum boundaries. Must be projected with linear units.D_l: Distance between sampling lines (x-direction) in the same units asboundary.angle: Optional rotation angle (in degrees). Default is0.origin: Corner from which to order lines and points. One of"se","sw","ne", or"nw". Default is"se".n_col: Name of the column inboundarycontaining target sample sizes per stratum.strat_id_col: Name of the column inboundarythat identifies strata.thinning_method: Method used to thin grid points within each stratum. Either"random"or"sequential". Default is"random".D_p: Optional distance between points along a line. If not supplied, it’s computed automatically from the stratum with the densest sampling, using the same units asboundary.
Details
This function is designed for optimum allocation, where sample size varies across strata. It uses sys_grid() to build an initial grid, assigns each point to a stratum, then selects a subset of points from each stratum using the specified thinning method. Additional columns in the output identify both the full grid and the subset of points selected for measurement.
Value
Returns a list with:
grid: Ansfobject of all points, with logicalmeasurecolumn and columns for uniquepoint_ID,stratum_ID, and other useful indexing IDs. Columns inboundaryare appended.measure_path: ALINESTRINGshowing the path through the selected (measure = TRUE) points.path: The full path through all grid points.random_point_start: The initial random point start.boundary: The original input polygon.type: A string used internally for validation when plotting.
Use plot_strat_opt_sys_grid() to visualize the sampled grid and path.
13.3.6 Illustrative analysis
We use the Harvard Forest dataset introduced in Section 1.2.5 to illustrate the application of stratified estimators. While this is a small and somewhat toy-like example, it’s useful because a complete census of trees is available—allowing us to compare our estimates directly to the true population values.
Figure 13.3 shows the extent of the 35 ha forest, partitioned into four strata. Strata 1 and 4 are mixed conifer-hardwood forest, Stratum 2 is conifer-dominated, and Stratum 3 is a sparsely forested wetland.
FIGURE 13.3: Harvard Forest population extent partitioned into four strata.
Our goal is to estimate average aboveground biomass (AGB) per hectare and total AGB for each stratum and for the entire forest. The population of interest includes all stems 12.7 cm DBH and larger. We’ll also break out estimates by species groups of interest.
The cruise will use fixed-area circular plots with a radius of 9 meters. These plots are placed at the selected sample locations and used to tally all trees meeting the DBH threshold. See Section 12.3 for additional details on plot-based sampling.
We’d like our AGB estimates to be within \(\pm\) 35 Mg/ha of the population value with 95% confidence. This allowable error corresponds to roughly 10% of the expected AGB.
The cruise will use a stratified systematic sampling design, with sample sizes determined by optimum allocation as described in the preceding section.
| Stratum ID | \(N_j\) (ha) | \(s_j\) (Mg/ha) |
|---|---|---|
| 1 | 1.24 | 100 |
| 2 | 8.93 | 100 |
| 3 | 2.25 | 75 |
| 4 | 22.58 | 125 |
We begin by computing the total sample size using the stratum-specific information in Table 13.1 and the optimum allocation formula in (13.12). As discussed in Section 11.3.6, this formula involves a circular dependency: we’re solving for \(n\) on the left, but also need it to determine the appropriate \(t\)-value on the right, which depends on the degrees of freedom.
To resolve this circular dependency, we use the iterative algorithm described in Section 11.3.6. The code below applies that method using a for loop, as introduced in Section 5.2.3, to iteratively solve for \(n\) until it converges.150
M <- 4 # Number of strata.
N_j <- c(1.24, 8.93, 2.25, 22.58) # Stratum area (ha).
N <- sum(N_j) # Total area (ha).
E <- 35 # Allowable error for AGB (Mg/ha).
alpha <- 0.05 # For 95% confidence level.
s_j <- c(100, 100, 75, 125) # Stratum standard deviation (Mg/ha).
n_start <- 25 # Initial guess for total sample size.
n_prev <- n_start
for (i in 1:3) {
t <- qt(p = 1 - alpha / 2, df = n_prev - M) # df = n - M
n_next <- ceiling((t / (N * E))^2 * sum(N_j * s_j)^2)
cat("Iteration", i, ": n =", n_next, "\n")
n_prev <- n_next
}#> Iteration 1 : n = 47
#> Iteration 2 : n = 44
#> Iteration 3 : n = 44
The algorithm converged quickly to \(n\) = 44. In the code below, we allocate the total sample size among strata using the optimum allocation formula given in (13.14).
n_j <- n * (N_j * s_j) / sum(N_j * s_j)
n_j <- ceiling(n_j) # Round up to the nearest whole number.
n_j#> [1] 2 10 2 31
Note that due to rounding, we end up with \(n\) = 45 plots in the final design.
Now, let’s build our sampling grid using strat_opt_sys_grid(). As described in Section 13.3.5, this function requires an sf polygon object that delineates the stratum boundaries. As briefly noted in Section 13.1, sf objects are data frames with a spatial geometry column and can be manipulated using tidyverse functions.
The code below reads in the Harvard Forest stratum boundaries as an sf polygon object, adds a new column for the target stratum-specific sample sizes, and prints the data frame portion of the object so we can inspect its columns. The st_drop_geometry() function removes the geometry column to simplify printed output.
bounds <- st_read("datasets/HF", "HF_stratified", quiet = TRUE)
bounds <- bounds %>% mutate(n_j = n_j, .after = stratum_id)
bounds %>% st_drop_geometry()#> stratum_id n_j ha
#> 1 1 2 1.2361
#> 2 2 10 8.9315
#> 3 3 2 2.2494
#> 4 4 31 22.5830
With a unique identifier column for each stratum (stratum_id) and the target sample sizes (n_j), we’re ready to pass bounds to strat_opt_sys_grid(). This and other sampling functions are defined in "scripts/utils.R".
Before calling strat_opt_sys_grid() in the code below, we set a random seed to ensure reproducible results.
source("scripts/utils.R")
set.seed(1)
strat_sys <- strat_opt_sys_grid(bounds, D_l = 50, n_col = "n_j",
strat_id_col = "stratum_id",
thinning_method = "random")The call to strat_opt_sys_grid() above sets the distance between lines to 50 meters and allows the function to compute the sampling interval along lines (D_p) automatically.151
FIGURE 13.4: Systematic sampling with stratification for the Harvard Forest example. Illustrates random removal of sampling locations to approximate the target optimum sample size.
Figure 13.4, created using plot_strat_opt_sys_grid(strat_sys, path = "lines"), shows the full systematic grid, the selected sampling points, and those removed from the grid to meet the target sample sizes (labeled as “Removed sampling location” in the figure legend).
If you only want to visualize the final sampling locations, set the path argument in plot_strat_opt_sys_grid() to "points" as shown below in Figure 13.5.
FIGURE 13.5: Systematic sampling with stratification for the Harvard Forest example. Illustrates the path between sampling locations.
We can confirm the final sample size by stratum using the code below, which shows the target sample sizes were achieved. This won’t always be the case, depending on how the sampling grid intersects with each stratum.
strat_sys$grid %>%
filter(measure) %>%
group_by(stratum_id) %>%
summarize(realized_n_j = n()) %>%
st_drop_geometry()#> # A tibble: 4 × 2
#> stratum_id realized_n_j
#> * <int> <int>
#> 1 1 2
#> 2 2 10
#> 3 3 2
#> 4 4 31
Given the sampling locations, we’re ready to head to the field (virtually, in this case). For our virtual cruise, we’ll use a 9 meter radius plot centered on the sampling locations. These plot areas are shown as gray circles in Figure 13.6.
As always, if any portion of a plot extends beyond a boundary, then a boundary correction method must be applied—this includes stratum boundaries. Figure 13.6 highlights the three plots with boundary slopover and applies the mirage method described in Section 12.3.3.
FIGURE 13.6: Systematic sampling with stratification for the Harvard Forest example. Fixed-area plots are shown at each sampling location, including those that extend beyond stratum boundaries and require boundary slopover correction. The mirage method is used for these plots; see Section 12.3.3 for details.
The cruise tree data are stored in "datasets/HF/HF_strat_sys_n45.csv" and are read into plot_trees below. Each row represents one or more trees. The dataset includes a plot_id column that uniquely identifies plots across all strata, and a tree_id column that identifies individual trees within each plot. The stratum_id column indicates the stratum in which the plot falls. Tree counts are recorded in the tree_count column, which equals 2 when a tree is double-counted using mirage correction. Additional columns include agb_Mg (aboveground biomass in metric tons), dbh_cm (diameter at breast height in centimeters), and species (tree species).
#> Rows: 797
#> Columns: 7
#> $ plot_id <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2…
#> $ tree_id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, …
#> $ stratum_id <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4…
#> $ tree_count <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ agb_Mg <dbl> 0.086166, 1.853968, 2.227097, 1.27…
#> $ dbh_cm <dbl> 14.4, 44.6, 48.0, 38.4, 27.0, 41.8…
#> $ species <chr> "Acer rubrum", "Quercus rubra", "Q…
Given these tree data, and following the steps laid out in Section 12.6.1, our first task is to expand tree-level agb_Mg measurements to a per-hectare basis using the appropriate tree factor, then summarize these values at the plot level. These plot-level values correspond to the \(y_{i,j}\) terms in (13.1), i.e., the \(i\)-th plot measurement in the \(j\)-th stratum.
To compute these plot-level AGB values, we group by plot_id and use summarize() to calculate y_ij, the plot-level AGB in Mg/ha. We also retain each plot’s stratum_id using first(stratum_id).152
# Add tree factor for the given plot area.
r <- 9 # Plot radius in meters.
plot_trees <- plot_trees %>%
mutate(TF = 10000 / (pi * r^2))
# Compute plot-level AGB estimates.
plot_summary <- plot_trees %>%
group_by(plot_id) %>%
summarize(y_ij = sum(TF * tree_count * agb_Mg),
stratum_id = first(stratum_id))
plot_summary#> # A tibble: 45 × 3
#> plot_id y_ij stratum_id
#> <dbl> <dbl> <dbl>
#> 1 1 468. 4
#> 2 2 197. 4
#> 3 3 253. 4
#> 4 4 419. 4
#> 5 5 174. 4
#> 6 6 388. 4
#> 7 7 579. 4
#> 8 8 346. 4
#> 9 9 364. 4
#> 10 10 158. 4
#> # ℹ 35 more rows
Next, we compute the stratum-level mean \(\bar{y}_j\) and standard deviation \(s_j\) by grouping plot_summary by stratum_id. We also compute the number of plots within each stratum for convenient reference in the next step.
stratum_summary <- plot_summary %>%
group_by(stratum_id) %>%
summarize(
y_bar_j = mean(y_ij),
s_j = sd(y_ij),
n_j = n_distinct(plot_id)
)
stratum_summary#> # A tibble: 4 × 4
#> stratum_id y_bar_j s_j n_j
#> <dbl> <dbl> <dbl> <int>
#> 1 1 311. 38.3 2
#> 2 2 474. 140. 10
#> 3 3 63.0 51.0 2
#> 4 4 348. 108. 31
These stratum-level estimates are used when reporting means, totals, and associated standard errors and confidence intervals for individual strata. For now, we’re after estimates for the overall population across strata.
The estimators for the stratified population mean and its standard error are given in (13.2) and (13.5), respectively. Both require \(n_j\), \(N_j\), and the stratum-level summaries. To streamline the workflow, we join the stratum areas to stratum_summary before applying the estimators and computing 95% confidence intervals.
# Join stratum areas.
stratum_area <- tibble(
stratum_id = 1:4,
N_j = c(1.24, 8.93, 2.25, 22.58)
)
stratum_summary <- left_join(stratum_summary, stratum_area,
by = "stratum_id")
# Estimate population mean and 95% CI.
stratum_summary %>%
summarize(
N = sum(N_j),
y_bar = sum(N_j * y_bar_j) / N,
s_y_bar = sqrt(1 / N^2 * sum((N_j^2 * s_j^2) / n_j)),
t = qt(df = sum(n_j - 1), p = 1 - 0.05 / 2),
ci_lower = y_bar - t * s_y_bar,
ci_upper = y_bar + t * s_y_bar
) %>%
select(y_bar, ci_lower, ci_upper)#> # A tibble: 1 × 3
#> y_bar ci_lower ci_upper
#> <dbl> <dbl> <dbl>
#> 1 360. 326. 395.
To estimate the population total and its confidence interval in the code below, we adapt the code above using the total estimator (13.3) and its standard error (13.6).
stratum_summary %>%
summarize(
N = sum(N_j),
t_hat = N * sum(N_j * y_bar_j) / N,
s_t_hat = N * sqrt(1 / N^2 * sum((N_j^2 * s_j^2) / n_j)),
t = qt(df = sum(n_j - 1), p = 1 - 0.05 / 2),
ci_lower = t_hat - t * s_t_hat,
ci_upper = t_hat + t * s_t_hat
) %>%
select(t_hat, ci_lower, ci_upper)#> # A tibble: 1 × 3
#> t_hat ci_lower ci_upper
#> <dbl> <dbl> <dbl>
#> 1 12615. 11411. 13819.
13.3.6.1 Checking coverage and allowable error
How well did our estimates perform in terms of uncertainty quantification? The census-based AGB for the Harvard Forest is 341.17 Mg/ha, with a forest-wide total of 11,941 Mg. Our estimated mean and total, with 95% confidence intervals, were 360.44 (326.03, 394.84) Mg/ha and 12,615 (11,411, 13,819) Mg, respectively. Based on this sample, both confidence intervals successfully captured the true values.
If the theory holds, we’d expect about 95% of confidence intervals from repeated samples to cover the true mean. Because we have a census, we can perform repeated sampling to empirically check that the coverage rate matches this expectation. We can also evaluate whether our optimum allocation strategy meets the target allowable error of \(E = 35\) Mg/ha with 95% confidence.
Using the estimated \(n\) and \(n_j\), we drew 100 samples from the population. For each, we computed the mean AGB and its 95% confidence interval. These 100 replicate estimates are stored in "datasets/HF/strat_sys_agb_ests_100reps.csv". The code below reads in these data and computes two key empirical rates: the proportion of intervals that covered the true mean, and the proportion with absolute error less than the allowable error \(E = 35\) Mg/ha. Both rates are very close to 95%, which is strong evidence that the design-based theory is holding up in this example!
agb_ests_reps <- read_csv("datasets/HF/strat_sys_agb_ests_100reps.csv")
# Read in the census tree data and filter to trees > 12.7 cm DBH.
trees <- read_csv("datasets/HF/HF_trees.csv") %>%
filter(dbh_cm > 12.7)
# Compute the true mean AGB (Mg/ha) from the census.
y_true <- trees %>%
summarize(y_true = sum(agb_Mg) / N) %>%
pull()
# Percent of confidence intervals that cover the true mean.
mean(y_true > agb_ests_reps$ci_lower_95 &
y_true < agb_ests_reps$ci_upper_95) * 100#> [1] 96
# Percent of estimates within the allowable error E
mean(abs(agb_ests_reps$y_bar - y_true) < E) * 100#> [1] 96
13.3.6.2 Group-level estimates from stratified data
So far, our focus has been on estimating AGB within strata and for the overall population. But we’re often also interested in estimates for subgroups of trees within each stratum and across the entire forest. For example, we might want to know how much AGB is tied up in specific species, age classes, forest types, or functional groups.
Let’s walk through one such example.
Eastern hemlock (Tsuga canadensis) is under threat throughout the northeastern U.S. due to the spread of hemlock woolly adelgid (HWA), an invasive insect causing widespread mortality. Since HWA is present in the region of Massachusetts where our data were collected, it’s reasonable to ask how much AGB would be lost if all Eastern hemlock trees in the forest were killed.153
We start by creating a new column in plot_trees called species_groups, which assigns each tree to one of two groups of interest:
"Eastern hemlock"(species == "Tsuga canadensis")"Other"(species != "Tsuga canadensis")
Both species_groups and plot_id are converted to factors. This prepares the data for use with complete(), which relies on factor levels to insert rows for combinations of plot and group that may not appear in the data.
plot_trees <- plot_trees %>%
mutate(
species_groups = factor(if_else(species == "Tsuga canadensis",
"Eastern hemlock",
"Other")),
plot_id = factor(plot_id)
)As noted in Chapter 12, our convention is to use a tree_count of zero to retain plots with no trees. This keeps their plot indexes in the dataset, ensuring they’re included in summaries and downstream calculations. What’s important is that all plot indexes—even those with no trees—are retained when converting plot_id to a factor. We used a similar approach when building stand and stock tables in Section ??.
Next, we compute AGB at the plot level for each species group.
plot_summary <- plot_trees %>%
group_by(plot_id, species_groups) %>%
summarize(
y_ij = sum(TF * tree_count * agb_Mg),
stratum_id = first(stratum_id),
.groups = "drop"
)This yields one row per plot per group. If no trees from a group were observed in a plot, it won’t appear yet. We’ll fix that next.
The code below uses complete() to add any missing combinations of plot and species group, inserting explicit zeros. We then use fill() to recover stratum_id for the new rows. To avoid crossing strata, we group by plot_id before filling, ensuring correct propagation.
plot_summary_filled <- plot_summary %>%
complete(plot_id, species_groups, fill = list(y_ij = 0)) %>%
group_by(plot_id) %>%
fill(stratum_id, .direction = "downup") %>%
ungroup()This is conceptually the same as our earlier use of complete() in Section ?? to fill in missing combinations of species and DBH class. Here, we’re doing the same with species groups. After inserting rows, fill() restores plot-level variables. Grouping by plot_id ensures this is done within the correct plot.
Take a look at plot_summary_filled to verify that it looks right—the number of rows should equal the number of plots times the number of species groups.
We now summarize to the stratum level, computing the mean AGB per hectare (\(\bar{y}_j\)) with associated 95% confidence intervals.
stratum_summary <- plot_summary_filled %>%
group_by(stratum_id, species_groups) %>%
summarize(
y_bar_j = mean(y_ij),
n_j = n(),
s_y_bar_j = sd(y_ij)/sqrt(n_j),
t = qt(df = n_j - 1, p = 1 - 0.05 / 2),
ci_lower = y_bar_j - t * s_y_bar_j,
ci_upper = y_bar_j + t * s_y_bar_j,
.groups = "drop"
) %>% select(stratum_id, species_groups, y_bar_j, ci_lower, ci_upper)
stratum_summary#> # A tibble: 8 × 5
#> stratum_id species_groups y_bar_j ci_lower ci_upper
#> <dbl> <fct> <dbl> <dbl> <dbl>
#> 1 1 Eastern hemlock 89.8 -678. 857.
#> 2 1 Other 221. -891. 1333.
#> 3 2 Eastern hemlock 255. 101. 410.
#> 4 2 Other 219. 110. 327.
#> 5 3 Eastern hemlock 0 0 0
#> 6 3 Other 63.0 -396. 522.
#> 7 4 Eastern hemlock 48.3 24.3 72.2
#> 8 4 Other 300. 258. 341.
These estimates show that Stratum 2 carries about 250 Mg per hectare of Eastern hemlock, which is substantially more than any of the other strata. Not surprisingly, with a sample size of \(n_j = 2\), Strata 1 and 3 have very low precision (i.e., wide confidence intervals). No Eastern hemlock was observed on the plots in Stratum 3, so its estimates are zero. To better understand these differences across the forest, Figure 13.7 shows the estimated mean AGB for Eastern hemlock in each stratum (i.e., y_bar_j computed above).
FIGURE 13.7: Estimated mean aboveground biomass (AGB) of Eastern hemlock (Tsuga canadensis) per hectare in each stratum. Estimates are based on plot-level measurements and stratified sampling.
To estimate forest-wide AGB by species group, along with standard errors and 95% confidence intervals, we apply the total estimator from (13.3) and its associated standard error from (13.6). The beauty of this approach is that it’s the same code used to estimate total AGB for all species in the previous section—the only difference is the added group_by(species_groups). This again illustrates a key strength of the tidyverse: once the logic for a summary is in place, it’s easy to extend it to subgroups with minimal changes.
# Join the stratum areas.
stratum_summary <- stratum_summary %>%
left_join(stratum_area, by = "stratum_id")
stratum_summary %>%
group_by(species_groups) %>%
summarize(
N = sum(N_j),
t_hat = N * sum(N_j * y_bar_j) / N,
s_t_hat = N * sqrt(1 / N^2 * sum((N_j^2 * s_j^2) / n_j)),
t = qt(df = sum(n_j - 1), p = 1 - 0.05 / 2),
ci_lower = t_hat - t * s_t_hat,
ci_upper = t_hat + t * s_t_hat,
.groups = "drop"
) %>%
select(species_groups, t_hat, ci_lower, ci_upper)#> # A tibble: 2 × 4
#> species_groups t_hat ci_lower ci_upper
#> <fct> <dbl> <dbl> <dbl>
#> 1 Eastern hemlock 3482. 2273. 4692.
#> 2 Other 9133. 7923. 10342.
So, to answer our initial question, how much AGB would be lost if all Eastern hemlock trees in the forest were killed by HWA? We estimate a total loss of 3,482 Mg, with a 95% confidence interval of (2,273, 4,692) Mg, with the majority of this loss coming from Stratum 2.
13.3.7 Advantages and disadvantages of stratified sampling
Stratified sampling is widely used in forest inventory, particularly when auxiliary information is available to define meaningful strata. When variation within strata is smaller than variation between strata, stratification can substantially reduce standard errors compared to a simple random sample of the same size.
13.3.7.1 Advantages
Improved precision: When strata are well defined, stratified sampling increases efficiency by grouping similar areas together. This structure reduces variability within each group and leads to more precise overall estimates.
Supports subregion estimation: Stratification enables valid estimation not just for the entire population, but also for each individual stratum. This is useful when separate summaries are needed for forest types, disturbance regimes, management units, ownership classes, or other subdivisions.
Enables efficient allocation: With optimum allocation, more samples can be directed to strata that are larger or more variable, improving efficiency without increasing cost.
Compatible with other designs: Different sampling strategies—SRS, SYS, or even model-assisted approaches—can be applied within different strata depending on operational needs or design goals.
Robust to poor strata: Even if strata aren’t optimal, stratified sampling rarely performs worse than unstratified sampling. At worst, standard errors may be comparable to SRS of the same sample size.
Supports post-stratification: If stratum membership isn’t known at the time of sampling, post-stratification can still be used. This allows for flexibility when auxiliary data become available after the field campaign.
13.3.7.2 Disadvantages
Area requirements for pre-stratification: A key requirement of pre-stratification is that the area of each stratum must be known prior to sampling in order to support design-based estimation. This can be a limitation if strata are based on incomplete or evolving information.
Limited allocation flexibility in post-stratification: Post-stratified designs don’t allow proportional or optimum allocation, since sample sizes per stratum aren’t controlled in advance. They may also result in too few samples in some strata to estimate variance reliably.
Added bookkeeping: Stratified sampling requires tracking stratum membership, managing stratum-specific calculations, and combining results with appropriate weights.
References
Any of these objects can be saved to file using
st_write()from thesfpackage to be read by GIS or GPS software.↩︎The specific projection is NAD83 / Pennsylvania South (ft), which corresponds to EPSG:2271.↩︎
In
plot_sys_grid(), notice how thegeom_sf()layers automatically render eachsfobject based on its geometry type—points, lines, or polygons—making it easy to build clean and consistent maps. You can modify the function to adjust colors, symbols, or other plot features to suit your needs.↩︎We use a
forloop here to save space. Compare with the hardcoded approach in Section 11.3.6.1.↩︎The
boundsobject is projected in meters. You can check ansfobject’s distance units usingst_crs(bounds)$unitsorunits::deparse_unit(st_distance(bounds)).↩︎This works because
plot_idis unique across strata. If that’s not the case, for example if plot IDs are reused within each stratum, we’d need to includestratum_idin the grouping to avoid mixing trees from different strata that share the same plot ID.↩︎At the time of writing, much of the hemlock at Harvard Forest and across the broader region had already succumbed to HWA (D. A. Orwig et al. 2012).↩︎