Chapter 15 Stratified sampling
In forest inventory, it’s common to estimate population parameters not only for the entire area of interest but also for smaller subareas, or strata. Stratification refers to dividing the population into non-overlapping subareas, either to improve statistical efficiency or to ensure estimates are available for specific subareas of interest. Ideally, strata are defined using characteristics expected to influence the parameter being estimated, making them internally homogeneous but meaningfully different from one another. Common stratification variables include species, age, productivity, and management unit.
Stratified sampling involves drawing a probability sample independently within each stratum. While sampling within strata is most often done using SRS or SYS, other probability sampling methods can be used. The main benefits of stratification are potential gains in statistical efficiency, i.e., reduced standard errors, and the ability to produce both overall and stratum-specific estimates.
Stratification is implemented in one of two ways.
Pre-stratification: Strata are delineated before sampling begins, letting us allocate stratum-specific sample sizes based on stratum size, expected variability, and sampling costs. This gives full control over the number of observations placed in each stratum. In this case, the area of each stratum must be known ahead of time.
Post-stratification: Here, sampling proceeds without regard to strata, and strata are assigned after the fact based on information collected during or after sampling. This still supports stratum-specific estimation, but the per-stratum sample sizes aren’t under our control. The area of each stratum may be determined after sampling.
The main advantage of pre-stratification is the ability to apply optimum allocation—assigning more observations to strata that are large or exhibit greater variability, and fewer to those that are small, more homogeneous, or costly to sample. This improves statistical efficiency while controlling sampling costs.
There are substantial statistical and cost-saving benefits to stratification. Even a poor choice of strata yields unbiased estimates and, in almost all practical cases, performs no worse than unstratified sampling. At worst, you’ll end up with about the same standard error as you’d get without stratification; at best, you can dramatically improve statistical and cost efficiency.
Our focus for the remainder of this chapter is on pre-stratification, where strata are defined in advance and sample sizes can be controlled by design.
Stratified sampling proceeds in two steps.
Assignment: Before any measurements are taken, areas of the population—such as stands or management units—are assigned to strata using available information.
Measurement: A sample is then drawn within each stratum. At least two observations per stratum are needed to compute within-stratum variance.
15.1 Choosing stratum membership
Stratification begins by subdividing the population into distinct strata based on available information. Assignment can be guided by prior inventory data (e.g., species, age, height, volume, or biomass), expert judgment, or ancillary information like topography, canopy structure, or remote sensing-based spectral or structural measurements. Strata may also align with management units or areas where a separate estimate is desired.
The key requirement is that strata don’t overlap. For example, if strata are defined using polygons—where one or more polygons comprise a stratum—those polygons must be spatially distinct, i.e., their boundaries don’t overlap. Each location in the population must belong to one and only one stratum.
Useful strata group together areas that are similar to each other and different from the rest. Ideally, variance within a stratum is low, and variance between strata is large. These characteristics improve statistical efficiency.
A stratum can consist of spatially contiguous or disjoint areas. For example, a stratum may comprise spatially adjacent stands or stands scattered across the forest. It’s the internal consistency of each stratum that counts—not whether the areas are next to each other on the map.
Stratum assignment doesn’t introduce bias in the estimate, even if the grouping doesn’t reflect real differences in the population. As long as every part of the population belongs to one and only one stratum, and sampling within each follows the intended design, the overall estimate remains unbiased. A poor choice of strata mainly affects estimate precision rather than accuracy—at worst giving a standard error about the same as you’d get without stratification, and at best providing a clear gain in efficiency.
In the next section, we introduce the estimators used in stratified sampling.
15.2 Estimation using stratified sampling
Throughout this section, and in the subsequent illustration, we assume that sampling within each stratum is done using SRS or SYS, and that SRS estimators are used for the stratum means and variances. A key feature of stratified sampling is that these samples are drawn independently within each stratum. A separate probability sample is selected in each stratum, and the resulting stratum-specific estimators are treated as independent by design. Other within-stratum designs are possible but may require different estimators; see, e.g., Section 17.6.
We introduce the following notation:
- \(H\) number of strata in the population;
- \(N_h\) population size in the \(h\)-th stratum;
- \(N = \sum_{h = 1}^H N_h\) total population size across all strata;
- \(n_h\) sample size in the \(h\)-th stratum;
- \(n = \sum_{h = 1}^H n_h\) total sample size across all strata.
This notation and the estimators that follow assume a finite population. Section 15.5 describes the adjustments needed when working from an areal sampling frame. We present estimators both with and without the FPC, which lets us introduce the finite-population results in a logical order and later makes it easier to see how these same estimators change under an areal (effectively infinite) sampling frame.
The stratum population mean \(\mu_h\) is estimated using the sample mean \[\begin{equation} \bar{y}_h = \frac{\sum_{i = 1}^{n_h} y_{hi}}{n_h}, \tag{15.1} \end{equation}\] where \(y_{hi}\) is the \(i\)-th observation in the \(h\)-th stratum.
The overall population mean \(\mu\) (i.e., across all strata) is estimated using \[\begin{equation} \bar{y} = \frac{\sum_{h = 1}^H N_h \bar{y}_h}{N}. \tag{15.2} \end{equation}\] This is a weighted sum of the stratum-specific sample means, with weights given by each stratum’s proportion of the total population size, i.e., \(N_h / N\).
The population total \(\tau\) is estimated using \[\begin{equation} \hat{t} = \sum_{h = 1}^H N_h \bar{y}_h = N \bar{y}. \tag{15.3} \end{equation}\]
The sample variance within each stratum, \(s^2_h\), is estimated using the SRS variance estimator (11.17). If standard errors and confidence intervals are desired for individual strata, use (11.20) or (11.21) and (11.31), respectively.
The standard error of the overall mean \(\bar{y}\) is \[\begin{equation} s_{\bar{y}} = \sqrt{ \frac{1}{N^2} \sum_{h = 1}^H \left[ \frac{N_h^2 s_h^2}{n_h}\left(1-\frac{n_h}{N_h}\right)\right] }, \tag{15.4} \end{equation}\] and, if the FPC is omitted, \[\begin{equation} s_{\bar{y}} = \sqrt{ \frac{1}{N^2} \sum_{h = 1}^H \frac{N_h^2 s_h^2}{n_h} }. \tag{15.5} \end{equation}\]
The standard error for the total estimate \(\hat{t}\) is \[\begin{equation} s_{\hat{t}} = N s_{\bar{y}}. \tag{15.6} \end{equation}\]
Confidence intervals for the mean follow the usual form \[\begin{equation} \bar{y} \pm \tval_{\text{df}, 1 - \frac{\alpha}{2}} s_{\bar{y}}, \tag{15.7} \end{equation}\] with degrees of freedom computed as \[\begin{equation} \text{df} = \sum_{h = 1}^H (n_h - 1) = n - H. \tag{15.8} \end{equation}\]
Confidence intervals for \(\hat{t}\) follow the same form using \(s_{\hat{t}}\).
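To make these estimators concrete, here's a minimal sketch in R, assuming SRS within strata and omitting the FPC as in (15.5). The function name strat_mean_est is ours for illustration only; it isn't one of the utilities used later in this chapter.

```r
# Stratified mean, standard error (no FPC), and confidence interval,
# following (15.2), (15.5), (15.7), and (15.8). Inputs are
# stratum-level vectors of length H.
strat_mean_est <- function(N_h, n_h, y_bar_h, s_h, alpha = 0.05) {
  N <- sum(N_h)
  y_bar <- sum(N_h * y_bar_h) / N                  # (15.2)
  s_y_bar <- sqrt(sum(N_h^2 * s_h^2 / n_h) / N^2)  # (15.5)
  t <- qt(1 - alpha / 2, df = sum(n_h - 1))        # df = n - H, (15.8)
  c(y_bar = y_bar, s_y_bar = s_y_bar,
    ci_lower = y_bar - t * s_y_bar,
    ci_upper = y_bar + t * s_y_bar)
}
```

Because the weights \(N_h / N\) sum to one, a stratum's contribution to both the mean and its variance scales with its share of the population, which is what drives the allocation results in Section 15.4.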
Later sections introduce estimators that work directly with totals rather than means. If the total is the quantity of interest, it’s often convenient to express the stratified estimators in terms of the stratum totals. Nothing new is happening here—the total estimator simply uses the same stratum weights already introduced for the mean estimator, but applied to the stratum-level totals. In particular, \[\begin{equation} \hat{t} = \sum_{h = 1}^H \hat{t}_h, \tag{15.9} \end{equation}\] where \(\hat{t}_h\) is the estimated total in stratum \(h\). Because the strata are treated independently in both design and estimation, we can compute the standard error of the total using a direct sum of their estimated sampling variances \[\begin{equation} s_{\hat{t}} = \sqrt{\sum_{h = 1}^H s^2_{\hat{t}_h}}, \tag{15.10} \end{equation}\] where \(s_{\hat{t}_h}\) is the estimated standard error of the total in stratum \(h\).
15.3 Sample size determination
After identifying strata and selecting a sampling design, the next step is to determine the total sample size \(n\). This sample size can either be fixed in advance based on available resources or determined based on a desired level of precision for estimating a population mean or total.
Stratified sampling supports two common allocation strategies: proportional allocation and optimum allocation. Proportional allocation assigns sample size to each stratum based solely on its relative area. Optimum allocation also considers variability and, if cost is taken into account, balances that against the relative cost of sampling. The goal is to allocate effort where it provides the greatest reduction in the variance of the estimated mean (and hence its standard error) per unit cost.
When the total sample size \(n\) is unknown, it can be estimated using equations similar to those for SRS (see Section 11.3.6). These equations have the same circularity issue as before: \(n\) appears on both sides of the equation. As with SRS, this can be resolved using the iterative algorithm described in Section 11.3.6.
Importantly, these sample size estimates are intended to meet allowable error targets for the overall population, not for individual strata.
15.3.1 Total sample size with proportional allocation
As introduced in Section 11.3.6, we can estimate sample size given an allowable error \(E\) that is in the same units as \(\bar{y}\) and an estimate for each stratum’s sample variance \(s^2_h\). Because we haven’t yet taken the sample, the values for \(s^2_h\) must come from somewhere else, e.g., from prior experience or inventory data from a similar population.
The total sample size required under proportional allocation is given by \[\begin{equation} n = \frac{N \sum_{h=1}^H N_h s_h^2}{\frac{N^2 E^2}{\tval^2} + \sum_{h=1}^H N_h s_h^2}, \tag{15.11} \end{equation}\] where \(\tval\) is the \(\tval\)-value corresponding to the desired confidence level (\(1 - \alpha\)).
The sample size (15.11) without the FPC is \[\begin{equation} n = \frac{\tval^2 \sum_{h=1}^H N_h s_h^2}{N E^2}. \tag{15.12} \end{equation}\]
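As a quick numerical sketch of (15.12), the snippet below uses hypothetical stratum sizes and anticipated standard deviations, with a fixed \(\tval\)-value of 2 (roughly the 95% value for moderate degrees of freedom). In practice \(\tval\) depends on \(n\) through the degrees of freedom, so the computation would be iterated as in Section 11.3.6.

```r
# Hypothetical example of (15.12): total sample size under
# proportional allocation, without the FPC.
N_h <- c(120, 300, 80)  # Stratum sizes (hypothetical).
s_h <- c(40, 55, 30)    # Anticipated stratum standard deviations.
N <- sum(N_h)
E <- 10                 # Allowable error, same units as the mean.
t <- 2                  # Approximate t-value for 95% confidence.
n <- ceiling(t^2 * sum(N_h * s_h^2) / (N * E^2))
n
#> [1] 94
```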
15.3.2 Total sample size with optimum allocation
Optimum allocation minimizes the variance of the stratified mean by accounting for both stratum size and variability. The total sample size is \[\begin{equation} n = \frac{\left( \sum_{h=1}^H N_h s_h \right)^2}{\frac{N^2 E^2}{\tval^2} + \sum_{h=1}^H N_h s_h^2}. \tag{15.13} \end{equation}\]
The sample size (15.13) without the FPC is \[\begin{equation} n = \frac{\tval^2 \left( \sum_{h=1}^H N_h s_h \right)^2}{N^2 E^2}. \tag{15.14} \end{equation}\]
15.4 Allocation of samples to strata
Given the total sample size \(n\), the next step is to allocate observations among the strata using equations specific to proportional or optimum allocation. Whichever approach was used to determine \(n\) should also be used to allocate samples to the strata.
15.4.1 Proportional allocation
In proportional allocation, each stratum receives a share of the total sample based solely on its size. The sample size in stratum \(h\) is given by \[\begin{equation} n_h = n \frac{N_h}{N}. \tag{15.15} \end{equation}\] This approach is easy to apply but ignores differences in variability across strata.
15.4.2 Optimum allocation
In optimum allocation, sample sizes are chosen to minimize the variance of the stratified mean. The number of samples in stratum \(h\) is computed as \[\begin{equation} n_h = n \frac{N_h s_h}{\sum_{h=1}^H N_h s_h}. \tag{15.16} \end{equation}\] This method improves efficiency by assigning more samples to strata that are larger or more variable.
15.4.3 Optimum allocation with cost constraints
If sampling costs differ between strata, we can modify the optimum allocation formula to incorporate cost. Let \(c_h\) denote the per-unit cost of sampling a single observation in stratum \(h\). The cost-adjusted optimum allocation is \[\begin{equation} n_h = n \frac{N_h s_h / \sqrt{c_h}}{\sum_{h=1}^H N_h s_h / \sqrt{c_h}}. \tag{15.17} \end{equation}\]
This formula reduces sample sizes in strata that are more expensive to survey, while still accounting for size and variability. It’s particularly useful in operational forest inventories where travel cost, terrain, or accessibility varies across the landscape.
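The three allocation rules are easy to compare side by side. The snippet below uses the same hypothetical strata as before, with stratum 3 assumed four times as costly to sample as the others.

```r
# Comparing allocations (15.15)-(15.17) for a hypothetical population
# with three strata and a fixed total sample size n.
n <- 50
N_h <- c(120, 300, 80)  # Stratum sizes (hypothetical).
s_h <- c(40, 55, 30)    # Anticipated stratum standard deviations.
c_h <- c(1, 1, 4)       # Per-observation sampling cost.
round(n * N_h / sum(N_h))               # (15.15) proportional
round(n * N_h * s_h / sum(N_h * s_h))   # (15.16) optimum
w <- N_h * s_h / sqrt(c_h)
round(n * w / sum(w))                   # (15.17) cost-adjusted optimum
```

Here proportional allocation gives 12, 30, and 8 plots; optimum allocation shifts effort toward the large, variable stratum 2 (10, 35, and 5); and the cost adjustment pulls plots out of the expensive stratum 3 (11, 37, and 3). Note that rounding can make the allocated sizes sum to slightly more or less than \(n\).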
15.5 Stratified sampling under an areal sampling frame
In most forest inventory applications, stratified sampling is implemented under an areal sampling frame. The population has a geographic population extent of known total area \(A\), partitioned into \(H\) non-overlapping strata with known areas \(A_h\), such that \(A = \sum_{h=1}^H A_h\). Within each stratum, sampling locations are selected from the continuous set of point locations contained in that stratum, rather than from a finite list of population elements (see Section 12.2).
Under an areal sampling frame, the interpretation changes but the estimator structure doesn’t. Observations \(y_{hi}\) are expressed on a per-unit-area basis (e.g., tons/ha), so \(\bar{y}_h\) estimates the stratum mean per unit area. Stratified estimators therefore target a population mean per unit area, with the corresponding population total obtained by multiplying by the total area \(A\).
Because the population within each stratum is treated as effectively infinite, the finite-population quantities \(N_h\) and \(N\) are replaced by \(A_h\) and \(A\). In particular, the stratified mean estimator in (15.2) applies directly after making this substitution.
Variance estimation likewise proceeds using the stratified estimators without the FPC. The standard error of the stratified mean follows (15.5), and the standard error of the total follows (15.6) after replacing \(N\) with \(A\).
The same logic applies when working in terms of stratum totals. The total estimator in (15.9) and its standard error in (15.10) carry over after interpreting \(\hat{t}_h\) as the contribution from stratum \(h\) based on its area \(A_h\) and per-unit-area estimate \(\bar{y}_h\). As in the finite-population case, stratum-level contributions combine because strata are treated independently by design.
Sample size determination under an areal sampling frame also follows directly from earlier results. In practice, this means using the stratified sample size formulas without the FPC—namely (15.12) or (15.14)—with stratum areas \(A_h\) replacing \(N_h\) and total area \(A\) replacing \(N\). As before, these expressions are intended to meet allowable error targets for the overall population mean, not for individual strata.
Overall, the differences relative to the finite-population setting are largely interpretive: population sizes are replaced by areas, the FPC is omitted, and all quantities are understood on a per-unit-area basis.
With this areal interpretation in mind, we now turn to the practical problem of laying out systematic sampling grids within strata.
15.5.1 Sampling in strata and systematic designs
In the development above, we assumed that the \(n_h\) observations within each stratum are drawn according to an independent probability sampling design, such as SRS or SYS. In practice, stratified sampling is often implemented using SYS within strata rather than SRS (see Section 14).
A common approach is to place a separate systematic grid within each stratum, using a random point start for each grid, as described in Section 14.2. Grid spacing or orientation may differ across strata, reflecting differences in stratum size, shape, or field logistics. This approach preserves the space-filling advantages of SYS while retaining the basic stratified structure.
It’s also common to use a single systematic grid that spans all strata. This is often attractive operationally, since cruisers can follow continuous straight sampling lines that cross stratum boundaries without interruption. Under this design, only one random point start is used to position the grid over the entire population, and stratum membership is assigned based on each plot’s location.
When a single systematic grid is used across all strata, sampling is no longer independent across strata in the strict design-based sense, since all strata share a common random point start. In practice, however, this dependence is typically weak and is rarely, if ever, of practical concern. As with systematic sampling more generally, variance estimation proceeds using standard SRS-based approximations, and the stratified estimators presented above remain the standard working estimators.
In both cases, the inferential considerations associated with SYS apply directly. As discussed in Section 14, systematic samples are generated from a small number of random start points (one per stratum in the first case, one overall in the second), and variance estimation typically relies on SRS-based approximations. These issues aren’t unique to stratified designs, and we don’t revisit them here. Instead, we proceed under the same working assumptions used elsewhere with SYS, focusing on how stratification affects estimation and allocation rather than on the finer details of SYS variance behavior.
15.6 Creating a stratified systematic sampling grid
Once strata have been defined and sample sizes allocated (either proportionally or optimally), the next step is to lay out a systematic sampling grid within each stratum. This can be done using functions that extend the sys_grid() approach to stratified designs.
One option is to build stratum-specific grids with the sys_grid() function described in Section 14.3. This gives full flexibility in grid spacing, orientation, and pattern, and can be useful when strata are large or widely separated.
In many applications, however, a separate grid for each stratum complicates field logistics. When strata are small or adjacent, it’s often easier for field crews to work from a single systematic grid that covers the entire population. The two functions below support this approach by generating a unified grid across all strata under either proportional or optimum allocation. As in Section 14.3, the documentation is presented in the style of condensed R help pages.
15.6.1 Grid generation for proportional allocation strat_prop_sys_grid()
Description
Generates a systematic sampling grid using fixed spacing between lines (D_l) and points (D_p) across the entire area. Points are assigned to strata based on spatial intersection with the input boundary. This is designed for proportional allocation, where grid density is uniform across strata. Built on sys_grid() with added logic to track stratum membership and within-stratum ordering.
Usage

strat_prop_sys_grid(boundary, D_l, D_p, angle = 0, origin = "se",
                    strat_id_col)

Arguments

- `boundary`: An `sf` polygon object defining the stratum boundaries. Must be projected with linear units.
- `D_l`: Distance between sampling lines (x-direction) in the same units as `boundary`.
- `D_p`: Distance between points along a line (y-direction) in the same units as `boundary`.
- `angle`: Optional rotation angle (in degrees). Default is `0`.
- `origin`: Corner from which to order lines and points. One of `"se"`, `"sw"`, `"ne"`, or `"nw"`. Default is `"se"`.
- `strat_id_col`: Name of the column in `boundary` that identifies strata.
Details
This function builds a systematic grid across the union of stratum polygons using sys_grid(), then assigns each point to a stratum via spatial intersection. Each sampling location is tagged with its unique point ID as well as its position within each line and stratum. This approach simplifies logistics when a single, uniform grid is acceptable for all strata.
Value
Returns a list with the following components.
- `grid`: An `sf` object of all points, with columns for unique `point_id`, `stratum_id`, and other useful indexing IDs. Columns in `boundary` are appended.
- `path`: An `sf` `LINESTRING` connecting grid points in order.
- `random_point_start`: The initial random point start.
- `boundary`: The input polygon with stratum attributes.
- `type`: A string used internally for validation when plotting.
Use plot_sys_grid() to visualize the resulting grid and path.
15.6.2 Grid generation for optimum allocation strat_opt_sys_grid()
Description
Creates a stratified systematic sampling grid that supports optimum allocation, assigning different numbers of grid points to each stratum. Builds a dense grid using sys_grid(), then intersects it with stratum polygons and thins the points in each stratum to approximate the target sample sizes. The final number of points per stratum may not match the targets exactly, but the function tries to get as close as possible. Supports both random and sequential thinning.
Usage

strat_opt_sys_grid(boundary, D_l, angle = 0, origin = "se",
                   n_col, strat_id_col, thinning_method = "random",
                   D_p = NULL)

Arguments

- `boundary`: An `sf` polygon object defining the stratum boundaries. Must be projected with linear units.
- `D_l`: Distance between sampling lines (x-direction) in the same units as `boundary`.
- `angle`: Optional rotation angle (in degrees). Default is `0`.
- `origin`: Corner from which to order lines and points. One of `"se"`, `"sw"`, `"ne"`, or `"nw"`. Default is `"se"`.
- `n_col`: Name of the column in `boundary` containing target sample sizes per stratum.
- `strat_id_col`: Name of the column in `boundary` that identifies strata.
- `thinning_method`: Method used to thin grid points within each stratum. Either `"random"` or `"sequential"`. Default is `"random"`.
- `D_p`: Optional distance between points along a line. If not supplied, it's computed automatically from the stratum with the densest sampling, using the same units as `boundary`.
Details
This function is designed for optimum allocation, where sample size varies across strata. It uses sys_grid() to build an initial grid, assigns each point to a stratum, then selects a subset of points from each stratum using the specified thinning method. Additional columns in the output identify both the full grid and the subset of points selected for measurement.
Value
Returns a list with the following components.
- `grid`: An `sf` object of all points, with a logical `measure` column and columns for unique `point_id`, `stratum_id`, and other useful indexing IDs. Columns in `boundary` are appended.
- `measure_path`: A `LINESTRING` showing the path through the selected (`measure = TRUE`) points.
- `path`: The full path through all grid points.
- `random_point_start`: The initial random point start.
- `boundary`: The original input polygon.
- `type`: A string used internally for validation when plotting.
Use plot_strat_opt_sys_grid() to visualize the sampled grid and path.
15.7 Illustrative analysis
We use the Harvard Forest dataset introduced in Section 1.2.5 to walk through an example of stratified estimation. The dataset is small and a bit toy-like, but a complete census of trees is available, which lets us compare our stratified estimates directly to the true population values.
As discussed in Section 15.5, this illustration adopts an areal sampling frame interpretation. Observations are expressed on a per-unit-area basis, stratum areas replace finite population sizes, and inference follows the stratified estimators without the FPC.
Figure 15.1 shows the extent of the 35 ha forest, partitioned into four strata. Strata 1 and 4 are mixed conifer-hardwood forest, Stratum 2 is conifer-dominated, and Stratum 3 is a sparsely forested wetland.
FIGURE 15.1: Harvard Forest population extent partitioned into four strata.
Our goal is to estimate average AGB per hectare and total AGB for each stratum and for the entire forest. The population of interest includes all stems 12.7 cm DBH and larger. We’ll also break out estimates by species groups of interest.
The cruise will use fixed-area circular plots with a radius of 9 meters. At each selected sampling location, all trees meeting the DBH threshold will be measured. See Section 12.3 for additional details on plot-based sampling.
We’d like our AGB estimates to be within \(\pm\) 35 Mg/ha of the population value with 95% confidence. This allowable error corresponds to roughly 10% of the expected AGB.
The cruise will use a stratified systematic sampling design, with sample sizes determined by optimum allocation as described in the preceding sections.
TABLE 15.1: Stratum areas and anticipated AGB standard deviations for the Harvard Forest example.

| Stratum ID | \(A_h\) (ha) | \(s_h\) (Mg/ha) |
|---|---|---|
| 1 | 1.24 | 100 |
| 2 | 8.93 | 100 |
| 3 | 2.25 | 75 |
| 4 | 22.58 | 125 |
We begin by computing the total sample size using the stratum-specific information in Table 15.1 and the optimum allocation formula (15.14). As discussed in Section 11.3.6, this formula involves a circular dependency: we’re solving for \(n\) on the left, but also need it to determine the appropriate \(\tval\)-value on the right, which depends on the degrees of freedom.
To resolve this circular dependency, we use the iterative algorithm described in Section 11.3.6. The code below applies that method using a for loop, as introduced in Section 5.2.3, to iteratively solve for \(n\) until it converges.155
H <- 4 # Number of strata.
A_h <- c(1.24, 8.93, 2.25, 22.58) # Stratum area (ha).
A <- sum(A_h) # Total area (ha).
E <- 35 # Allowable error for AGB (Mg/ha).
alpha <- 0.05 # For 95% confidence level.
s_h <- c(100, 100, 75, 125) # Stratum standard deviation (Mg/ha).
n_start <- 25 # Initial guess for total sample size.
n_prev <- n_start
for (i in 1:3) {
t <- qt(p = 1 - alpha / 2, df = n_prev - H) # df = n - H
n_next <- ceiling((t / (A * E))^2 * sum(A_h * s_h)^2)
cat("Iteration", i, ": n =", n_next, "\n")
n_prev <- n_next
}
#> Iteration 1 : n = 47
#> Iteration 2 : n = 44
#> Iteration 3 : n = 44
The algorithm converged quickly to \(n\) = 44. In the code below, we allocate the total sample size among strata using the optimum allocation formula (15.16).
n <- n_prev # Converged total sample size from above.
n_h <- n * (A_h * s_h) / sum(A_h * s_h)
n_h <- ceiling(n_h) # Round up to the nearest whole number.
n_h
#> [1] 2 10 2 31
Note that due to rounding, we end up with \(n\) = 45 plots in the final design.
Now, let’s build our sampling grid using strat_opt_sys_grid(). As described in Section 15.6, this function requires an sf polygon object that delineates the stratum boundaries. As briefly noted in Section 13.1, sf objects are data frames with a spatial geometry column and can be manipulated using tidyverse functions.
The code below reads in the Harvard Forest stratum boundaries as an sf polygon object, adds a new column for the target stratum-specific sample sizes, and prints the data frame portion of the object so we can inspect its columns. st_drop_geometry() removes the spatial geometry column to simplify printed output.
bounds <- st_read("datasets/HF", "HF_stratified", quiet = TRUE)
bounds <- bounds %>% mutate(n_h = n_h, .after = stratum_id)
bounds %>% st_drop_geometry()
#> stratum_id n_h ha
#> 1 1 2 1.2361
#> 2 2 10 8.9315
#> 3 3 2 2.2494
#> 4 4 31 22.5830
With a unique identifier column for each stratum (stratum_id) and the target sample sizes (n_h), we’re ready to pass bounds to strat_opt_sys_grid(). This and other sampling functions are defined in "scripts/utils.R".
Before calling strat_opt_sys_grid() in the code below, we set a random seed to ensure reproducible results.
source("scripts/utils.R")
set.seed(1)
strat_sys <- strat_opt_sys_grid(bounds, D_l = 50, n_col = "n_h",
strat_id_col = "stratum_id",
                                thinning_method = "random")

The call to strat_opt_sys_grid() above sets the distance between lines to 50 meters and allows the function to compute the sampling interval along lines (D_p) automatically.156
FIGURE 15.2: Systematic sampling with stratification for the Harvard Forest example. Illustrates random removal of sampling locations to approximate the target optimum sample size.
Figure 15.2, created using plot_strat_opt_sys_grid(strat_sys, path = "lines"), shows the full systematic grid, the selected sampling locations, and those removed from the grid to meet the target sample sizes (labeled as “Removed sampling location” in the figure legend).
If you only want to visualize the final sampling locations, set the path argument in plot_strat_opt_sys_grid() to "points" as shown below in Figure 15.3.
FIGURE 15.3: Systematic sampling with stratification for the Harvard Forest example. Illustrates the path between sampling locations.
We can confirm the final sample size by stratum using the code below, which shows the target sample sizes were achieved. This won’t always be the case, depending on how the sampling grid intersects with each stratum.
strat_sys$grid %>%
filter(measure) %>%
group_by(stratum_id) %>%
summarize(realized_n_h = n()) %>%
st_drop_geometry()
#> # A tibble: 4 × 2
#> stratum_id realized_n_h
#> * <int> <int>
#> 1 1 2
#> 2 2 10
#> 3 3 2
#> 4 4 31
Given the sampling locations, we’re ready to head to the field (virtually, in this case). For our virtual cruise, we’ll use a 9 meter radius plot centered on the sampling locations. These plot areas are shown as gray circles in Figure 15.4.
FIGURE 15.4: Systematic sampling with stratification for the Harvard Forest example. Fixed-area plots are shown at each sampling location, including those that extend beyond stratum boundaries and require boundary slopover correction. The mirage method is used for these plots; see Section 12.3.3 for details.
As always, if any portion of a plot extends beyond a boundary, then a boundary correction method must be applied—this includes stratum boundaries. Figure 15.4 highlights the three plots with boundary slopover and applies the mirage method described in Section 12.3.3.
The cruise tree data are stored in "datasets/HF/HF_strat_sys_n45.csv" and are read into plot_trees below. Each row represents one or more trees. The dataset includes a plot_id column that uniquely identifies plots across all strata, and a tree_id column that identifies individual trees within each plot. The stratum_id column indicates the stratum in which the plot falls. Tree counts are recorded in the tree_count column, which equals 2 when a tree is double-counted using mirage correction. Additional columns include agb_Mg (aboveground biomass in metric tons), dbh_cm (diameter at breast height in centimeters), and species (tree species).
#> Rows: 797
#> Columns: 7
#> $ plot_id <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2…
#> $ tree_id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, …
#> $ stratum_id <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4…
#> $ tree_count <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ agb_Mg <dbl> 0.086166, 1.853968, 2.227097, 1.27…
#> $ dbh_cm <dbl> 14.4, 44.6, 48.0, 38.4, 27.0, 41.8…
#> $ species <chr> "Acer rubrum", "Quercus rubra", "Q…
Given these tree data, and following the steps laid out in Section 12.6.1, our first task is to expand tree-level agb_Mg measurements to a per-hectare basis using the appropriate tree factor, then summarize these values at the plot level. These plot-level values correspond to the \(y_{hi}\) terms in (15.1), i.e., the \(i\)-th plot measurement in the \(h\)-th stratum.
To compute these plot-level AGB values, we group by plot_id and use summarize() to calculate y_hi, the plot-level AGB in Mg/ha. We also retain each plot’s stratum_id using first(stratum_id).157
# Add tree factor for the given plot area.
r <- 9 # Plot radius (m).

plot_trees <- plot_trees %>%
  mutate(TF = 10000 / (pi * r^2))

# Compute plot-level AGB estimates.
plot_summary <- plot_trees %>%
  group_by(plot_id) %>%
  summarize(y_hi = sum(TF * tree_count * agb_Mg),
            stratum_id = first(stratum_id))

plot_summary
#> # A tibble: 45 × 3
#> plot_id y_hi stratum_id
#> <dbl> <dbl> <dbl>
#> 1 1 468. 4
#> 2 2 197. 4
#> 3 3 253. 4
#> 4 4 419. 4
#> 5 5 174. 4
#> 6 6 388. 4
#> 7 7 579. 4
#> 8 8 346. 4
#> 9 9 364. 4
#> 10 10 158. 4
#> # ℹ 35 more rows
Next, we compute the stratum-level mean \(\bar{y}_h\) and standard deviation \(s_h\) by grouping plot_summary by stratum_id. We also compute the number of plots within each stratum for convenient reference in the next step.
stratum_summary <- plot_summary %>%
  group_by(stratum_id) %>%
  summarize(
    n_h = n(),
    y_bar_h = mean(y_hi),
    s_h = sd(y_hi)
  )

stratum_summary
#> # A tibble: 4 × 4
#> stratum_id n_h y_bar_h s_h
#> <dbl> <int> <dbl> <dbl>
#> 1 1 2 311. 38.3
#> 2 2 10 474. 140.
#> 3 3 2 63.0 51.0
#> 4 4 31 348. 108.
These stratum-level estimates are useful when reporting means, totals, and associated uncertainty for individual strata. For now, however, we focus on estimating the overall population across strata.
The estimators for the stratified population mean and its standard error are given in (15.2) and (15.5), respectively. Both require \(n_h\), \(A_h\), and the stratum-level summaries. To streamline the workflow, we first join the stratum areas to stratum_summary and then apply the estimators to compute 95% confidence intervals.
# Join stratum areas.
stratum_area <- tibble(
  stratum_id = 1:4,
  A_h = c(1.24, 8.93, 2.25, 22.58)
)

stratum_summary <- stratum_summary %>%
  left_join(
    stratum_area,
    by = "stratum_id"
  )

# Estimate population mean and 95% CI.
pop_summary <- stratum_summary %>%
  summarize(
    A = sum(A_h),
    y_bar = sum(A_h * y_bar_h) / A,
    s_y_bar = sqrt(1 / A^2 * sum((A_h^2 * s_h^2) / n_h)),
    t = qt(df = sum(n_h - 1), p = 1 - 0.05 / 2),
    y_ci_lower = y_bar - t * s_y_bar,
    y_ci_upper = y_bar + t * s_y_bar
  )

pop_summary %>%
  select(y_bar, y_ci_lower, y_ci_upper)
#> # A tibble: 1 × 3
#> y_bar y_ci_lower y_ci_upper
#> <dbl> <dbl> <dbl>
#> 1 360. 326. 395.
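As a quick hand check, substituting the (rounded) stratum_summary means and the stratum areas into (15.2) reproduces the estimate:

\[
\bar{y} = \frac{1.24(311) + 8.93(474) + 2.25(63.0) + 22.58(348)}{35} \approx \frac{12{,}618}{35} \approx 360 \;\text{Mg/ha}.
\]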
To estimate the population total and its confidence interval, we can simply scale the per-hectare estimates in pop_summary by the population’s total area of 35 ha.
pop_summary %>%
  transmute(
    t_hat = A * y_bar,
    t_ci_lower = A * y_ci_lower,
    t_ci_upper = A * y_ci_upper
  )
#> # A tibble: 1 × 3
#> t_hat t_ci_lower t_ci_upper
#> <dbl> <dbl> <dbl>
#> 1 12615. 11411. 13819.
Alternatively, and mathematically equivalently, we can build the population total and its confidence interval directly from stratum-level total estimates. In the code below, the mutate() computes \(\hat{t}_h\) and \(s_{\hat{t}_h}\), and then, following (15.9) and (15.10), the summarize() computes the population total estimate \(\hat{t}\), its standard error \(s_{\hat{t}}\), and the corresponding confidence interval.
stratum_summary %>%
  mutate(
    t_hat_h = A_h * y_bar_h,
    s_t_hat_h = A_h * s_h / sqrt(n_h)
  ) %>%
  summarize(
    H = n(),
    n = sum(n_h),
    t_hat = sum(t_hat_h),
    s_t_hat = sqrt(sum(s_t_hat_h^2)),
    t = qt(df = n - H, p = 1 - 0.05 / 2),
    t_ci_lower = t_hat - t * s_t_hat,
    t_ci_upper = t_hat + t * s_t_hat
  ) %>%
  select(t_hat, t_ci_lower, t_ci_upper)
#> # A tibble: 1 × 3
#> t_hat t_ci_lower t_ci_upper
#> <dbl> <dbl> <dbl>
#> 1 12615. 11411. 13819.
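The two routes agree by construction: multiplying the stratified mean (15.2) by the total area collapses to the sum of stratum totals,

\[
\hat{t} = A \, \bar{y} = A \cdot \frac{\sum_{h=1}^{H} A_h \bar{y}_h}{A} = \sum_{h=1}^{H} A_h \bar{y}_h = \sum_{h=1}^{H} \hat{t}_h,
\]

and likewise \(A \, s_{\bar{y}} = \sqrt{\sum_{h=1}^{H} \left(A_h s_h / \sqrt{n_h}\right)^2} = s_{\hat{t}}\). Because both confidence intervals use \(t\) with \(n - H\) degrees of freedom, the intervals match as well.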
15.7.1 Checking coverage and allowable error
How well did our estimates perform in terms of uncertainty quantification? The census-based AGB for the Harvard Forest is 341.17 Mg/ha, with a forest-wide total of 11,941 Mg. Our estimated mean and total, with 95% confidence intervals, were 360.44 (326.03, 394.84) Mg/ha and 12,615 (11,411, 13,819) Mg, respectively. Based on this sample, both confidence intervals successfully captured the true values.
If the theory holds, we’d expect about 95% of confidence intervals from repeated samples to cover the true mean. Because we have a census, we can perform repeated sampling to empirically check that the coverage rate matches this expectation. We can also evaluate whether our optimum allocation strategy meets the target allowable error of \(E = 35\) Mg/ha with 95% confidence.
Using the estimated \(n\) and \(n_h\), we drew 1,000 samples from the population using stratified SYS. For each, we computed the mean AGB and its 95% confidence interval. These 1,000 replicate estimates are stored in "datasets/HF/strat_sys_agb_ests_1000reps.csv". The code below reads in these estimates and computes two key empirical rates: the proportion of intervals that covered the true mean, and the proportion with absolute error less than the allowable error \(E = 35\) Mg/ha. Both rates come in at or slightly above 95%, good evidence that the design-based theory is holding up in this example.
agb_ests_reps <- read_csv("datasets/HF/strat_sys_agb_ests_1000reps.csv")

# Compute the true mean AGB (Mg/ha) from the census.
y_true <- read_csv("datasets/HF/HF_trees.csv") %>%
  filter(dbh_cm > 12.7) %>%
  summarize(y_true = sum(agb_Mg) / A) %>%
  pull()

# Percent of confidence intervals that cover the true mean.
mean(y_true > agb_ests_reps$ci_lower_95 &
       y_true < agb_ests_reps$ci_upper_95) * 100
#> [1] 98.1

# Percent of estimates within the allowable error E.
mean(abs(agb_ests_reps$y_bar - y_true) < E) * 100
#> [1] 97.3
15.7.2 Efficiency gains from stratification
Stratified sampling is especially effective when variation within strata is smaller than variation between them, allowing for more precise estimates than an unstratified sample of the same size. In our Harvard Forest example, we defined four strata. With the exception of the small Stratum 3, the others were fairly similar, so we might not expect dramatic efficiency gains. As a reminder, efficiency here means achieving greater precision—that is, a smaller standard error—for the same number of plots compared to an unstratified design.
To compare estimator performance (i.e., unstratified vs. stratified), we repeated the procedure from Section 15.7.1, drawing 1,000 independent samples from the same population. For each sample, we computed the mean AGB and its 95% confidence interval using both unstratified and stratified estimators.
The unstratified replicate estimates are stored in "datasets/HF/sys_agb_ests_1000reps.csv". The stratified replicates are the same as those used in Section 15.7.1. For each estimator, we computed the average 95% confidence interval width across replicates. These were 84 and 73 (Mg/ha) for the unstratified and stratified cases, respectively.
To quantify the gain in precision from stratification, we applied the same logic used in (11.39), replacing variances with average confidence interval widths. This substitution is reasonable because CI width is directly proportional to the standard error and provides an intuitive, interpretable measure of estimator precision. The resulting percent gain in precision was 13%, reflecting the reduction in average interval width achieved by incorporating strata.
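Concretely, using the average interval widths reported above,

\[
\text{percent gain} = 100 \times \frac{84 - 73}{84} \approx 13\%.
\]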
This confirms that even modest stratification can yield measurable gains in precision, making it a worthwhile method when stratifying variables are available.
15.7.3 Group-level estimates from stratified data
So far, we’ve focused on estimating AGB within strata and for the overall population. In many applications, however, we’re also interested in estimates for subgroups of trees within each stratum and across the entire forest. Common examples include estimates by species, age class, forest type, product type, or functional group.
Let’s walk through one such example.
Eastern hemlock (Tsuga canadensis) is under threat throughout the eastern US due to the spread of hemlock woolly adelgid (HWA), an invasive insect that has caused widespread mortality. Because HWA is present in the region of Massachusetts where our data were collected, it’s reasonable to ask how much AGB would be lost if all eastern hemlock trees in the forest were killed.158
We begin by creating a new column in plot_trees called species_groups, which assigns each tree to one of two groups of interest:
- "Eastern hemlock" (species == "Tsuga canadensis");
- "Other" (species != "Tsuga canadensis").
Both species_groups and plot_id are converted to factors. This prepares the data for use with complete(), which relies on factor levels to insert rows for combinations of plot and group that may not appear in the observed data.
plot_trees <- plot_trees %>%
  mutate(
    species_groups = factor(if_else(species == "Tsuga canadensis",
                                    "Eastern hemlock",
                                    "Other")),
    plot_id = factor(plot_id)
  )

As noted in Chapter 12, our convention is to retain plots with no trees by recording a tree_count of zero. This ensures that all sampled plots—including those with zero trees of a given group—are preserved in summaries and downstream calculations. The key point is that all sampled plots contribute an observation for each species group, even if that observation is zero. We used a similar approach when building stand and stock tables in Section 12.6.3.
Next, we compute plot-level AGB for each species group.
plot_summary <- plot_trees %>%
  group_by(plot_id, species_groups) %>%
  summarize(
    y_hi = sum(TF * tree_count * agb_Mg),
    stratum_id = first(stratum_id),
    .groups = "drop"
  )

This yields one row per plot per group when trees from that group are present. If no trees from a group were observed on a given plot, that combination doesn't appear yet.
The code below uses complete() to add the missing plot-group combinations, inserting explicit zeros for plots where a species group wasn’t observed. We then use tidyr::fill() to restore stratum_id for the newly created rows. fill() restores (i.e., fills) missing values in selected columns using the next or previous entry, which is specified by the .direction argument. Grouping by plot_id before calling fill() ensures values are only propagated within plots and not across strata.
plot_summary_filled <- plot_summary %>%
  complete(plot_id, species_groups, fill = list(y_hi = 0)) %>%
  group_by(plot_id) %>%
  fill(stratum_id, .direction = "downup") %>%
  ungroup()

At this point, the dataset contains one observation per sampled plot per species group. Plots with zero biomass for a given group are represented by valid zero-valued observations, which still count toward the stratum-specific sample size and subsequent estimates. This distinction matters: zeros are data, not missing values, and they reflect true absence of a species on a plot.
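To see complete() and fill() in isolation, here's a small self-contained sketch with made-up values (the toy tibble mirrors plot_summary's structure but is hypothetical): plot 2 has no group "B" row, so complete() inserts one with y_hi = 0, and fill() recovers its stratum_id from the other row on that plot.

```r
library(tidyverse)

toy <- tibble(
  plot_id = factor(c(1, 1, 2)),
  species_groups = factor(c("A", "B", "A")),
  y_hi = c(10, 5, 7),
  stratum_id = c(1, 1, 2)
)

toy %>%
  complete(plot_id, species_groups, fill = list(y_hi = 0)) %>%  # Adds plot 2, group "B" with y_hi = 0.
  group_by(plot_id) %>%
  fill(stratum_id, .direction = "downup") %>%  # Propagates stratum_id within each plot only.
  ungroup()
```

The grouping before fill() is what keeps a stratum ID from leaking across plots.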
We now summarize to the stratum level, computing the mean AGB per hectare, \(\bar{y}_h\), along with standard errors and 95% confidence intervals for each species group.
stratum_summary <- plot_summary_filled %>%
  group_by(stratum_id, species_groups) %>%
  summarize(
    n_h = n(),
    y_bar_h = mean(y_hi),
    s_h = sd(y_hi),
    s_y_bar_h = s_h / sqrt(n_h),
    t = qt(df = n_h - 1, p = 1 - 0.05 / 2),
    y_ci_lower_h = y_bar_h - t * s_y_bar_h,
    y_ci_upper_h = y_bar_h + t * s_y_bar_h,
    .groups = "drop"
  )

stratum_summary %>%
  select(stratum_id, species_groups, y_bar_h, y_ci_lower_h, y_ci_upper_h)
#> # A tibble: 8 × 5
#> stratum_id species_groups y_bar_h y_ci_lower_h y_ci_upper_h
#> <dbl> <fct> <dbl> <dbl> <dbl>
#> 1 1 Eastern hemlock 89.8 -678. 857.
#> 2 1 Other 221. -891. 1333.
#> 3 2 Eastern hemlock 255. 101. 410.
#> 4 2 Other 219. 110. 327.
#> 5 3 Eastern hemlock 0 0 0
#> 6 3 Other 63.0 -396. 522.
#> 7 4 Eastern hemlock 48.3 24.3 72.2
#> 8 4 Other 300. 258. 341.
These results show that Stratum 2 carries about 250 Mg/ha of eastern hemlock, which is substantially more than any of the other strata. Not surprisingly, with small sample sizes of \(n_h = 2\), Strata 1 and 3 have low precision, resulting in wide confidence intervals (and negative lower bounds; see Section 12.5 for interpretation and discussion). No eastern hemlock was observed on the plots in Stratum 3, so its estimated mean is zero. To help visualize these stratum-level differences, Figure 15.5 maps the estimated mean eastern hemlock AGB per hectare in each stratum.
FIGURE 15.5: Estimated mean aboveground biomass (AGB) of eastern hemlock (Tsuga canadensis) per hectare in each stratum. Estimates are based on plot-level measurements and stratified sampling.
To estimate forest-wide AGB by species group, we follow the same steps used for the combined-species estimates. As before, we can either scale the forest-wide mean AGB per hectare by the total forest area or work directly from stratum-level total estimates. The estimator logic is unchanged; the only substantive modification to the code is the addition of group_by(species_groups). This illustrates an important advantage of the workflow: once the estimators are written correctly, extending them to subgroups requires no new statistical machinery—only a change in how the data are grouped.
We compute species-group-specific forest-wide mean AGB per hectare and its standard error using the stratified mean estimators in (15.2) and (15.5), respectively. Again, the only change from before is the addition of group_by(species_groups).
# Join the stratum areas.
stratum_summary <- stratum_summary %>%
  left_join(stratum_area, by = "stratum_id")

# Estimate population mean and 95% CI.
pop_summary <- stratum_summary %>%
  group_by(species_groups) %>%
  summarize(
    A = sum(A_h),
    y_bar = sum(A_h * y_bar_h) / A,
    s_y_bar = sqrt(1 / A^2 * sum((A_h^2 * s_h^2) / n_h)),
    t = qt(df = sum(n_h - 1), p = 1 - 0.05 / 2),
    y_ci_lower = y_bar - t * s_y_bar,
    y_ci_upper = y_bar + t * s_y_bar
  )

pop_summary %>%
  select(y_bar, y_ci_lower, y_ci_upper)
#> # A tibble: 2 × 3
#> y_bar y_ci_lower y_ci_upper
#> <dbl> <dbl> <dbl>
#> 1 99.5 61.0 138.
#> 2 261. 224. 298.
Scaling these mean estimates by the total forest area yields equivalent forest-wide total estimates.
pop_summary %>%
  transmute(
    t_hat = A * y_bar,
    t_ci_lower = A * y_ci_lower,
    t_ci_upper = A * y_ci_upper
  )
#> # A tibble: 2 × 3
#> t_hat t_ci_lower t_ci_upper
#> <dbl> <dbl> <dbl>
#> 1 3482. 2134. 4831.
#> 2 9133. 7838. 10428.
Alternatively, we can work the population totals up directly from stratum-level totals. In the code below, mutate() computes \(\hat{t}_h\) and \(s_{\hat{t}_h}\), and then, following (15.9) and (15.10), summarize() computes the population total estimate \(\hat{t}\), its standard error \(s_{\hat{t}}\), and the corresponding confidence interval. Again, the only difference from the combined-species estimates is the addition of group_by(species_groups).
stratum_summary %>%
  mutate(
    t_hat_h = A_h * y_bar_h,
    s_t_hat_h = A_h * s_h / sqrt(n_h)
  ) %>%
  group_by(species_groups) %>%
  summarize(
    H = n(),
    n = sum(n_h),
    t_hat = sum(t_hat_h),
    s_t_hat = sqrt(sum(s_t_hat_h^2)),
    t = qt(df = n - H, p = 1 - 0.05 / 2),
    t_ci_lower = t_hat - t * s_t_hat,
    t_ci_upper = t_hat + t * s_t_hat
  ) %>%
  select(t_hat, t_ci_lower, t_ci_upper)
#> # A tibble: 2 × 3
#> t_hat t_ci_lower t_ci_upper
#> <dbl> <dbl> <dbl>
#> 1 3482. 2134. 4831.
#> 2 9133. 7838. 10428.
So, to answer our initial question, how much AGB would be lost if all eastern hemlock trees in the forest were killed by HWA? We estimate a total loss of 3,482 Mg, with a 95% confidence interval of (2,134, 4,831) Mg. The majority of this biomass loss comes from Stratum 2.
15.8 Advantages and disadvantages of stratified sampling
Stratified sampling is widely used in forest inventory, especially when auxiliary information is available to define meaningful strata. Below, we summarize some key advantages and disadvantages of this design.
15.8.1 Advantages
Improved precision: Stratification improves precision when units are relatively homogeneous within strata (low within-stratum variance) and heterogeneous across strata. By reducing within-stratum variability (and with sensible allocation), stratified sampling reduces the overall variance of the estimator relative to unstratified designs.
Supports subregion estimation: Stratification enables valid estimation not just for the entire population, but also for each individual stratum. This is useful when separate summaries are needed for forest types, disturbance regimes, management units, ownership classes, or other subgroups.
Enables efficient allocation: With optimum allocation, more samples can be directed to strata that are larger or more variable, improving efficiency without increasing total cost.
Compatible with other designs: Different sampling strategies—such as SRS, SYS, or others—can be used within strata, depending on operational needs or design goals. Design-based stratified estimators remain valid provided the within-stratum estimators are valid for the chosen design and sampling is independent across strata.
Robust to poor strata choices: Even with poor strata assignment, stratified sampling often performs comparably to unstratified sampling and is unlikely to perform much worse, particularly when allocation is roughly proportional or near-optimal.
Supports post-stratification: If stratum membership isn’t known at the time of sampling, post-stratification can still be used. When post-strata explain variation in the parameter of interest and stratum weights are known, post-stratification can improve precision and supports domain-level summaries.
15.8.2 Disadvantages
Area requirements for pre-stratification: A key requirement of pre-stratification is that the size (or area) of each stratum must be known to support design-based estimation. This can be a limitation if strata are based on incomplete or evolving information.
Limited allocation flexibility in post-stratification: Post-stratified designs don’t allow proportional or optimum allocation, since sample sizes per stratum aren’t controlled in advance. They may also result in too few (or zero) samples in some strata to estimate variance reliably.
Added bookkeeping: Stratified sampling requires tracking stratum membership, managing stratum-specific calculations, and combining results with appropriate weights.
15.9 Exercises
15.9.1 Mountain Creek tract analysis
These data come from managed lands in the southern Oregon Coast Range owned by the Cow Creek Band of Umpqua Tribe of Indians. The project area is the Mountain Creek tract, a one-square-mile section of the Public Land Survey System (see Figure 15.6). The aerial image is the Exhibit B map for Mountain Creek, the standard map used in project documents to show unit boundaries and other features.
The Mountain Creek tract is divided into discrete management units, which serve as the strata for sampling. Some units are intended for overstory harvest, while others are designated for regeneration or other management purposes. Estimates were needed for each unit as well as for the entire tract, including overstory summaries and regeneration counts to support future silvicultural planning.
There are 13 units, ranging in area from 3.16 acres to 91.41 acres. The total area across all units is 412.58 acres. The inventory followed a systematic (SYS) design, with \(n\) = 73 sampling locations—about one point every 6 acres.
At each location, overstory trees (trees greater than 5 inches DBH) were tallied using a 20 BAF (English units). In addition, a 1/100-th acre fixed-area subplot was established to measure regeneration (trees less than or equal to 5 inches DBH).
While the intent was to implement a standard systematic design, a few minor deviations appear to have occurred during field layout. In operational inventory settings, it’s not uncommon to encounter such departures from the intended sampling plan. Although these modifications represent a shift from the original design and should be made with care, we’ll assume for this exercise that any changes were implemented in a way consistent with systematic sampling and remain valid for estimation. This lets us proceed with design-based inference under the assumption that the overall sampling plan remains approximately systematic. Situations like this reflect the realities of working with field data and provide a useful opportunity to think critically about how design assumptions align with the data we actually receive as analysts.
Figure 15.6 shows the unit boundaries and sampling locations for the Mountain Creek tract. A copy of the inventory map is also available as a PDF in "datasets/CCBUTI/CCBUTI_inventory_map.pdf".
FIGURE 15.6: Management unit boundaries and systematic sample layout for timber cruise and regeneration survey conducted on lands owned by the Cow Creek Band of Umpqua Tribe of Indians in the southern Oregon Coast Range. Overstory trees were tallied using a 20 BAF point sample, and regeneration was measured on a 1/100-th acre fixed-area subplot at each location.
The data are provided in two separate files located in the "datasets/CCBUTI" directory: "CCBUTI_trees.csv" contains tree-level overstory and regeneration measurements, and "CCBUTI_units.csv" contains unit information, including unit identifiers and areas. The code below reads each file and takes a quick look at their structure.
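A minimal version of that read step, assuming the working directory sits above "datasets/", is:

```r
library(tidyverse)

# Tree-level overstory and regeneration measurements.
ccbuti_trees <- read_csv("datasets/CCBUTI/CCBUTI_trees.csv")

# Unit identifiers and areas.
ccbuti_units <- read_csv("datasets/CCBUTI/CCBUTI_units.csv")

glimpse(ccbuti_trees)
glimpse(ccbuti_units)
```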
#> Rows: 288
#> Columns: 8
#> $ unit_id <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3…
#> $ point_id <dbl> 6, 6, 6, 6, 6, 7, 7, 8, 8, 8, 20, …
#> $ type <chr> "Overstory", "Regeneration", "Rege…
#> $ tree_count <dbl> 0, 6, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1…
#> $ spp_code <chr> NA, "df", "df", "df", "df", NA, "d…
#> $ DBH_in <dbl> NA, 2.0, 3.0, 4.0, 1.0, NA, 2.0, 6…
#> $ height_ft <dbl> NA, 15, 17, 20, 15, NA, 15, 29, 16…
#> $ vol_cuft <dbl> NA, NA, NA, NA, NA, NA, NA, 1.3461…
#> Rows: 13
#> Columns: 2
#> $ unit_id <dbl> 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1…
#> $ acres <dbl> 20.73, 56.33, 33.30, 91.41, 3.16, 35.…
Rows in ccbuti_trees correspond to measured trees (both overstory and regeneration), or placeholders for sampling locations without trees. Sampling locations (point_id) are labeled as shown in Figure 15.6; note that point_id is not nested within unit_id. Columns in ccbuti_trees are defined as follows.
- unit_id: unit identifier in which the sampling point falls.
- point_id: unique identifier for each sampling point (not nested within unit_id).
- type: tree type, either "Overstory" (BAF 20 at each point) or "Regeneration" (1/100-th acre fixed-area subplot co-located at each point).
- tree_count: number of trees a given row represents. A value of zero is a placeholder and means no trees of that type were observed. When the count is zero, all subsequent columns have an NA value.
- spp_code: species code matching the Code column in Table 15.2.
- DBH_in: tree diameter at breast height (DBH) in inches. Overstory trees are those with DBH \(>\) 5 inches, and regeneration trees are those with DBH \(\leq\) 5 inches.
- height_ft: tree height in feet.
- vol_cuft: overstory tree merchantable volume (ft\(^3\)) to a 4-in top. Regeneration trees have NA. Volume estimates are based on species, DBH, and height using the US National Scale Volume and Biomass Estimators (NSVB) (Westfall et al. 2024).
| Code | Species name |
|---|---|
| wp | western white pine |
| wh | western hemlock |
| df | Douglas-fir |
Rows in ccbuti_units correspond to harvest units and columns are defined as follows.
- unit_id: unit identifier (common to both the tree and unit files).
- acres: unit area in acres.
In the following exercises, you’ll use these datasets to compute overstory tree basal area, density (trees per unit area), and merchantable volume estimates, as well as regeneration density summaries at both the unit- and tract-level.
For most exercises, we provide the desired output—your task is to write the tidyverse code to reproduce it.
Exercise 15.1 What is the total number of points across all units? Confirm your answer using Figure 15.6 or the inventory map "datasets/CCBUTI/CCBUTI_inventory_map.pdf".
As you check the map, notice that points are not labeled consecutively. This is not an issue, just something to be aware of—we only care that each point has a unique ID value.159
#> # A tibble: 1 × 1
#> n
#> <int>
#> 1 73
Exercise 15.2 What are the distinct unit_id values and how many units are there? Confirm your answer using Figure 15.6 or the inventory map "datasets/CCBUTI/CCBUTI_inventory_map.pdf".
Note that there is no unit ID 1 in these data, but the units are consecutively numbered starting from ID 2.
#> # A tibble: 13 × 1
#> unit_id
#> <dbl>
#> 1 2
#> 2 3
#> 3 4
#> 4 5
#> 5 6
#> 6 7
#> 7 8
#> 8 9
#> 9 10
#> 10 11
#> 11 12
#> 12 13
#> 13 14
Exercise 15.3 What is the total number of points in each unit? Arrange the results in ascending order by unit_id. Confirm your answer using Figure 15.6.
When we apply the stratified estimator using units as strata, remember that we need at least two observations in each stratum to compute its sample variance. Do all units meet this requirement?
Hint, start with a pipe from ccbuti_trees and use a sequence of functions like group_by(), summarize(), and arrange(). You will also need a function in summarize() that counts the number of distinct points within each unit.
#> # A tibble: 13 × 2
#> unit_id n_points
#> <dbl> <int>
#> 1 2 3
#> 2 3 10
#> 3 4 7
#> 4 5 15
#> 5 6 3
#> 6 7 5
#> 7 8 3
#> 8 9 10
#> 9 10 3
#> 10 11 3
#> 11 12 3
#> 12 13 4
#> 13 14 4
Exercise 15.4 Of the total number of sampling locations, how many overstory points and regeneration plots have no trees (i.e., tree_count of zero)?
Hint, one approach is to sum the occurrences of tree_count == 0 by type, since a tree_count of zero indicates no trees were measured at that location.
#> # A tibble: 2 × 2
#> type zero_trees
#> <chr> <int>
#> 1 Overstory 30
#> 2 Regeneration 22
Exercise 15.5 Using ccbuti_trees, compute overstory species abundance as a percent of total overstory trees, arranged in descending order. What is the most abundant overstory species?
Species percent abundance is defined as
\[\begin{equation}
p_j = 100 \frac{m_j}{\sum_{j=1}^{J} m_j},
\tag{15.18}
\end{equation}\]
where \(m_j\) is the overstory stem count for species \(j\) (the sum of tree_count for overstory trees only) and \(J\) is the number of overstory species.
Getting the right answer is a bit of a process. We suggest the following steps.
1. Filter for overstory trees only.
2. Filter out rows with missing species codes (these are placeholders for points with no trees).
3. Group the data by species code.
4. Compute m_j = sum(tree_count) via summarize().
5. Compute the total number of trees M = sum(m_j) and p_j = 100 * m_j / M via mutate().
6. Arrange species in descending order of p_j.
7. Print a small table with species code, m_j, and p_j. As an internal check, make sure the p_j values sum to 100.
We suggest starting with a pipe from ccbuti_trees and using filter(), group_by(), summarize(), mutate(), and arrange(). Alternatively, count(spp_code, wt = tree_count, name = "m_j") can replace steps 3-4.160
You should get the following output.
#> # A tibble: 3 × 4
#> spp_code m_j M p_j
#> <chr> <dbl> <dbl> <dbl>
#> 1 df 104 113 92.0
#> 2 wh 8 113 7.08
#> 3 wp 1 113 0.885
Note, in the output above M is an intermediate value and is the denominator of (15.18), i.e., the total number of trees across all species.
Exercise 15.6 Modify your code from Exercise 15.5 to compute regeneration species abundance as a percent of total regeneration trees, arranged in descending order. What is the most abundant regeneration species?
You should get the following output.
#> # A tibble: 2 × 4
#> spp_code m_j M p_j
#> <chr> <dbl> <dbl> <dbl>
#> 1 df 134 144 93.1
#> 2 wh 10 144 6.94
Exercise 15.7 The next several exercises work toward stratified estimates for unit- and tract-level overstory basal area, density, and volume.
Begin by creating a subset of overstory trees from ccbuti_trees. Then add a tree factor column so that each tree has its own tree factor. Recall from Section 12.4 that, for point sampling, the tree factor is a function of the BAF and the tree’s DBH.
Following the methods in Section 12.4 and the specific computing steps in Section 12.6.2, we called our overstory subset o_ccbuti_trees and our tree factor column TF. When adding the TF values, use an ifelse() within your mutate() call to ensure a TF value of zero when tree_count is zero.
For visualization, we placed the new TF column immediately after tree_count. Part of the resulting tibble is shown below.
#> # A tibble: 141 × 9
#> unit_id point_id type tree_count TF spp_code
#> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
#> 1 2 6 Overstory 0 0 <NA>
#> 2 2 7 Overstory 0 0 <NA>
#> 3 2 8 Overstory 1 102. df
#> 4 3 20 Overstory 0 0 <NA>
#> 5 3 21 Overstory 0 0 <NA>
#> 6 3 22 Overstory 0 0 <NA>
#> 7 3 23 Overstory 1 89.5 df
#> 8 3 24 Overstory 0 0 <NA>
#> 9 3 25 Overstory 0 0 <NA>
#> 10 3 26 Overstory 0 0 <NA>
#> # ℹ 131 more rows
#> # ℹ 3 more variables: DBH_in <dbl>, height_ft <dbl>,
#> # vol_cuft <dbl>
Exercise 15.8 Using o_ccbuti_trees from Exercise 15.7, compute point-level summaries for per-acre basal area, tree density, and volume. Call the resulting tibble point_summary.
In our output below, BA_hi, trees_hi, and vol_hi represent per-acre basal area, density, and volume, respectively. The _hi suffix indicates that these are values for the \(i\)-th point within the \(h\)-th unit (stratum). We grouped by both unit_id and point_id. While it would have been sufficient to group by point_id alone (because point IDs are unique across all units), we included unit_id so it’s available later when computing unit-level estimates in Exercise 15.9.
Hint, the general steps for generating these point-level summaries are illustrated in Section 12.6.2. Some care is needed when computing vol_hi because vol_cuft values associated with points that have no trees (i.e., tree_count == 0) are NA. We’d suggest using an ifelse() within mutate() in your piped sequence to change those vol_cuft NA values to 0; otherwise, the computation vol_hi = sum(TF * tree_count * vol_cuft) will include NA values and the result will be NA.
#> # A tibble: 73 × 5
#> # Groups: unit_id [13]
#> unit_id point_id BA_hi trees_hi vol_hi
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2 6 0 0 0
#> 2 2 7 0 0 0
#> 3 2 8 20 102. 137.
#> 4 3 20 0 0 0
#> 5 3 21 0 0 0
#> 6 3 22 0 0 0
#> 7 3 23 20 89.5 175.
#> 8 3 24 0 0 0
#> 9 3 25 0 0 0
#> 10 3 26 0 0 0
#> # ℹ 63 more rows
Exercise 15.9 Using point_summary from Exercise 15.8, compute unit-level estimates for the per-acre sample mean (\(\bar{y}_h\)) and standard deviation (\(s_h\)) for basal area, density, and volume. Also compute the sample size within each unit (number of points \(n_h\)).
We’ll call these unit-level summaries unit_summary and add the _h suffix to indicate that these are the unit estimates. In our output, the unit-level sample means are BA_bar_h, tree_bar_h, and vol_bar_h, and the sample standard deviations are s_BA_h, s_tree_h, and s_vol_h.
Note, the estimation methods are laid out in Sections 15.2 and 15.5, and the illustration in Section 15.7 can be adapted to this setting; for example, the stratum_summary in Section 15.7 is analogous to the unit_summary you’re computing here.
#> Rows: 13
#> Columns: 8
#> $ unit_id <dbl> 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 1…
#> $ n_h <int> 3, 10, 7, 15, 3, 5, 3, 10, 3, 3, …
#> $ BA_bar_h <dbl> 6.6667, 8.0000, 0.0000, 20.0000, …
#> $ s_BA_h <dbl> 11.547, 13.984, 0.000, 21.381, 0.…
#> $ trees_bar_h <dbl> 33.953, 41.612, 0.000, 94.422, 0.…
#> $ s_trees_h <dbl> 58.808, 71.630, 0.000, 100.517, 0…
#> $ vol_bar_h <dbl> 45.705, 74.946, 0.000, 184.072, 0…
#> $ s_vol_h <dbl> 79.163, 133.023, 0.000, 231.089, …
Exercise 15.10 Given the unit-level sample estimates in unit_summary from Exercise 15.9, you can compute unit-specific standard errors and confidence intervals for both means and totals. This involves applying the standard error estimators for the mean and total, then calculating their corresponding confidence intervals as detailed in Section 11.3. We’ll return to that in Exercise 15.12. For now, let’s pause and build some intuition about this tract based on the unit-level estimates in unit_summary.
In addition to basal area, density, and volume estimates, compute the quadratic mean diameter (QMD) for each unit. In forestry, when we talk about mean or average tree diameter, we’re almost always referring to the quadratic mean, not the arithmetic mean.161 Because it weights larger trees more heavily, QMD is preferred over the arithmetic mean diameter for describing stand structure and development. QMD represents the DBH of a tree with the stand’s average basal area.
Unlike basal area, density, or volume, which can be expressed on a per-unit-area basis, QMD is a tree-level mean summarizing the average size of individual trees in a stand.
Foresters use QMD, together with tree density, to understand stand structure—how large the trees are, how many occupy a given area, and where a stand might be in its development using stand management diagrams. These measures help evaluate growing space and competition, describe stand conditions, and guide management actions such as thinning.
Following from the quadratic mean in (11.13), we define the QMD as \[\begin{equation} \sqrt{\frac{\sum^m_{i = 1}DBH_i^2}{m}}, \tag{15.19} \end{equation}\] where \(m\) is the number of trees.
If we have basal area and tree density expressed on a per-unit-area basis, then an equivalent expression to (15.19) is \[\begin{equation} \sqrt{\frac{\text{BA}}{(c\cdot m)}}, \tag{15.20} \end{equation}\] where BA is basal area, \(m\) is tree density, and \(c\) is a unit conversion constant (if BA is in square feet then \(c=\pi/(144\cdot4)\approx 0.005454\), and if BA is in square meters then \(c=\pi/(10000\cdot4)\approx 0.0000785\)). See Curtis and Marshall (2000) and Ducey and Kershaw (2023) for details on QMD’s long history of use in forestry, its relationship with stand parameters like volume, and alternative formulations.
For this exercise, compute each unit’s QMD using the basal area and density estimates from Exercise 15.9. So following (15.20) we’re computing
\[\begin{equation}
\text{QMD}_h = \sqrt{\frac{\overline{\text{BA}}_{h}}{c\cdot\overline{\text{trees}}_{h}}},
\tag{15.21}
\end{equation}\]
where \(\overline{\text{BA}}_{h}\) and \(\overline{\text{trees}}_{h}\) are the per-acre mean basal area (ft\(^2\)/ac) and number of trees within the \(h\)-th unit, and \(c\) is defined in (15.20). In unit_summary, \(\overline{\text{BA}}_{h}\) and \(\overline{\text{trees}}_{h}\) correspond to BA_bar_h and trees_bar_h, respectively.
Write a piped sequence to:
- compute QMD_h using (15.21) (use mutate() and wrap the formula in ifelse() to avoid dividing by zero when trees_bar_h == 0);
- select only unit_id, BA_bar_h, trees_bar_h, vol_bar_h, and the computed QMD_h;
- print the results arranged in descending order of QMD_h.
Do not add QMD_h back into unit_summary—we only want to inspect it for this exercise. After running your code, compare your results to the output provided to check your work.
Based on density, volume, and QMD, what general patterns of stand development do you observe across the units?
#> # A tibble: 13 × 5
#> unit_id BA_bar_h trees_bar_h vol_bar_h QMD_h
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 11 127. 127. 3523. 13.5
#> 2 10 46.7 58.3 932. 12.1
#> 3 13 55 83.3 1092. 11.0
#> 4 14 65 123. 1771. 9.84
#> 5 9 66 125. 1313. 9.83
#> 6 12 40 103. 643. 8.44
#> 7 7 16 64.9 186. 6.72
#> 8 5 20 94.4 184. 6.23
#> 9 2 6.67 34.0 45.7 6
#> 10 3 8 41.6 74.9 5.94
#> 11 4 0 0 0 0
#> 12 6 0 0 0 0
#> 13 8 0 0 0 0
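One possible piped sequence that reproduces the output above is sketched below, following (15.21). We name the conversion constant c_ft rather than c to avoid masking base R's c() function.

```r
# A sketch: QMD_h from the unit-level basal area and density means.
c_ft <- pi / (4 * 144)  # conversion constant from (15.20) for ft^2 basal area

unit_summary %>%
  mutate(
    # Guard against division by zero in units with no overstory trees.
    QMD_h = ifelse(trees_bar_h == 0, 0,
                   sqrt(BA_bar_h / (c_ft * trees_bar_h)))
  ) %>%
  select(unit_id, BA_bar_h, trees_bar_h, vol_bar_h, QMD_h) %>%
  arrange(desc(QMD_h))
```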
Exercise 15.11 Let’s push on with computing the tract-level estimates using unit_summary from Exercise 15.9. Join the unit area stored in ccbuti_units to unit_summary. Then, to stay consistent with our notation, rename the area column acres to A_h.
Here’s the result after the join and rename (we also relocated A_h to immediately follow unit_id, just for visualization).
#> Rows: 13
#> Columns: 9
#> $ unit_id <dbl> 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 1…
#> $ A_h <dbl> 20.73, 56.33, 33.30, 91.41, 3.16,…
#> $ n_h <int> 3, 10, 7, 15, 3, 5, 3, 10, 3, 3, …
#> $ BA_bar_h <dbl> 6.6667, 8.0000, 0.0000, 20.0000, …
#> $ s_BA_h <dbl> 11.547, 13.984, 0.000, 21.381, 0.…
#> $ trees_bar_h <dbl> 33.953, 41.612, 0.000, 94.422, 0.…
#> $ s_trees_h <dbl> 58.808, 71.630, 0.000, 100.517, 0…
#> $ vol_bar_h <dbl> 45.705, 74.946, 0.000, 184.072, 0…
#> $ s_vol_h <dbl> 79.163, 133.023, 0.000, 231.089, …
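One way to produce this result is sketched below. It assumes ccbuti_units has columns unit_id and acres; if ccbuti_units is an sf object, drop its geometry first (e.g., with sf::st_drop_geometry()) before the join.

```r
# A sketch: join unit areas and rename to match our notation.
unit_summary <- unit_summary %>%
  left_join(ccbuti_units, by = "unit_id") %>%
  rename(A_h = acres) %>%
  relocate(A_h, .after = unit_id)  # just for visualization
```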
Exercise 15.12 Following Exercise 15.11, unit_summary has everything needed to apply the stratified estimators for means, standard errors, and confidence intervals for basal area, density, and volume.
To keep things simple and stay focused as we work toward the tract-level estimates, we’ll narrow our attention to volume.
From unit_summary, select the columns unit_id, A_h, n_h, vol_bar_h, and s_vol_h, and call the resulting subset unit_summary_vol.
#> # A tibble: 13 × 5
#> unit_id A_h n_h vol_bar_h s_vol_h
#> <dbl> <dbl> <int> <dbl> <dbl>
#> 1 2 20.7 3 45.7 79.2
#> 2 3 56.3 10 74.9 133.
#> 3 4 33.3 7 0 0
#> 4 5 91.4 15 184. 231.
#> 5 6 3.16 3 0 0
#> 6 7 35.9 5 186. 219.
#> 7 8 16.4 3 0 0
#> 8 9 58.1 10 1313. 809.
#> 9 10 21.0 3 932. 463.
#> 10 11 13.9 3 3523. 3294.
#> 11 12 18.6 3 643. 423.
#> 12 13 13.2 4 1092. 718.
#> 13 14 30.6 4 1771. 1516.
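The subsetting step is a single select():

```r
# Keep only the columns needed for the volume estimates.
unit_summary_vol <- unit_summary %>%
  select(unit_id, A_h, n_h, vol_bar_h, s_vol_h)
```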
Exercise 15.13 Using unit_summary_vol from Exercise 15.12, apply the stratified mean and standard error estimators (15.2) and (15.5), respectively, and compute 95% confidence intervals for mean volume. These are per-acre estimates (ft\(^3\)/ac).
In the output below, we left the total tract area (A), standard error, and \(t\) value so you can check your steps. The estimates show a mean volume of 615.72 ft\(^3\)/ac with a 95% confidence interval of (422.98, 808.46) ft\(^3\)/ac. Given that many units have little or no overstory, such a low tract-level estimate should not be surprising.
#> Rows: 1
#> Columns: 6
#> $ A <dbl> 412.58
#> $ vol_bar <dbl> 615.72
#> $ s_vol_bar <dbl> 96.355
#> $ t <dbl> 2.0003
#> $ vol_ci_lower <dbl> 422.98
#> $ vol_ci_upper <dbl> 808.46
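A sketch of one possible solution follows, using the standard area-weighted stratified estimators for the mean and its standard error. The degrees of freedom for the \(t\) value are assumed to be \(n - H\) (total sample size minus number of strata), which is consistent with the t = 2.0003 shown above.

```r
# A sketch: stratified mean volume, standard error, and 95% CI,
# following (15.2) and (15.5). df = n - H is an assumption here.
unit_summary_vol %>%
  summarize(
    A = sum(A_h),
    vol_bar = sum(A_h / A * vol_bar_h),
    s_vol_bar = sqrt(sum((A_h / A)^2 * s_vol_h^2 / n_h)),
    t = qt(0.975, df = sum(n_h) - n()),
    vol_ci_lower = vol_bar - t * s_vol_bar,
    vol_ci_upper = vol_bar + t * s_vol_bar
  )
```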
Exercise 15.14 Make a few modifications to your code from Exercise 15.13 to apply the stratified total and standard error estimators in (15.3) and (15.6), respectively, and compute 95% confidence intervals for total volume. These are tract-level totals (ft\(^3\)).
#> Rows: 1
#> Columns: 6
#> $ A <dbl> 412.58
#> $ vol_hat <dbl> 254033
#> $ s_vol_hat <dbl> 39754
#> $ t <dbl> 2.0003
#> $ vol_ci_lower <dbl> 174514
#> $ vol_ci_upper <dbl> 333553
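The modifications are small: only the stratum weights change, with A_h replacing A_h / A. A sketch, under the same df = n - H assumption as before:

```r
# A sketch: stratified total volume, standard error, and 95% CI,
# following (15.3) and (15.6).
unit_summary_vol %>%
  summarize(
    A = sum(A_h),
    vol_hat = sum(A_h * vol_bar_h),
    s_vol_hat = sqrt(sum(A_h^2 * s_vol_h^2 / n_h)),
    t = qt(0.975, df = sum(n_h) - n()),
    vol_ci_lower = vol_hat - t * s_vol_hat,
    vol_ci_upper = vol_hat + t * s_vol_hat
  )
```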
Exercise 15.15 For an internal check, assume there is no stratification and compute the tract-level estimate of mean volume with its standard error. Starting from o_ccbuti_trees, apply the point-sampling estimators from Section 12.4, using the computing steps in Section 12.6.2. Your output should match the output below. Recall, these are per-acre estimates (ft\(^3\)/ac).
How does this mean and its standard error compare with the stratified results from Exercise 15.13? In particular, how do the standard errors (i.e., s_vol_bar) compare, and why are there differences? Does stratification improve statistical efficiency here?
#> Rows: 1
#> Columns: 6
#> $ n <int> 73
#> $ vol_bar <dbl> 608.94
#> $ s_vol_bar <dbl> 128.84
#> $ t <dbl> 1.9935
#> $ vol_ci_lower <dbl> 352.11
#> $ vol_ci_upper <dbl> 865.78
Exercise 15.16 Let’s turn to the regeneration data. Recall, this survey was conducted on fixed-area 1/100-acre plots that were co-located with the overstory sampling locations. Our focus here is on regeneration within individual units, not on tract-level estimates, so we’ll stop after generating unit-level summaries and will not apply tract-level stratified estimators.
Begin by creating a subset of regeneration trees from ccbuti_trees. Then add a tree factor column. Recall from Section 12.3 that under plot sampling the tree factor is a function of the plot area only, and therefore will be the same for all trees.
Following methods in Section 12.3 and the computing steps in Section 12.6.1, we called our regeneration subset r_ccbuti_trees and our tree factor column TF. For visualization, we placed the new TF column immediately after tree_count. Part of the resulting tibble is shown below.
#> Rows: 147
#> Columns: 9
#> $ unit_id <dbl> 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3…
#> $ point_id <dbl> 6, 6, 6, 6, 7, 8, 8, 20, 20, 21, 2…
#> $ type <chr> "Regeneration", "Regeneration", "R…
#> $ tree_count <dbl> 6, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ TF <dbl> 100, 100, 100, 100, 100, 100, 100,…
#> $ spp_code <chr> "df", "df", "df", "df", "df", "df"…
#> $ DBH_in <dbl> 2.0, 3.0, 4.0, 1.0, 2.0, 1.0, 5.0,…
#> $ height_ft <dbl> 15, 17, 20, 15, 15, 16, 26, 26, 26…
#> $ vol_cuft <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
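One way to build this subset is sketched below; the "Regeneration" type label follows the tibble shown above.

```r
# A sketch: regeneration subset with a constant fixed-area tree factor.
r_ccbuti_trees <- ccbuti_trees %>%
  filter(type == "Regeneration") %>%
  mutate(TF = 1 / (1 / 100)) %>%  # TF = 1 / plot area = 100 for 1/100-ac plots
  relocate(TF, .after = tree_count)  # just for visualization
```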
Exercise 15.17 Using r_ccbuti_trees from Exercise 15.16, compute plot-level summaries for per-acre tree density. Call the resulting tibble plot_summary.
In our output below, trees_hi represents per-acre tree density. The _hi suffix indicates that these are values for the \(i\)-th plot within the \(h\)-th unit. We grouped by both unit_id and point_id. While grouping by point_id alone would have been sufficient (because point IDs are unique across all units), we included unit_id so it will be available later when computing unit-level estimates in Exercise 15.18.
#> # A tibble: 73 × 3
#> # Groups: unit_id [13]
#> unit_id point_id trees_hi
#> <dbl> <dbl> <dbl>
#> 1 2 6 900
#> 2 2 7 100
#> 3 2 8 200
#> 4 3 20 200
#> 5 3 21 400
#> 6 3 22 200
#> 7 3 23 100
#> 8 3 24 200
#> 9 3 25 300
#> 10 3 26 200
#> # ℹ 63 more rows
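A sketch of the plot-level summary follows. It assumes plots with no regeneration appear in r_ccbuti_trees as zero-count rows (as in Section 12.6.1), so that they contribute trees_hi = 0 rather than being dropped.

```r
# A sketch: per-acre regeneration density for each plot.
plot_summary <- r_ccbuti_trees %>%
  group_by(unit_id, point_id) %>%
  summarize(trees_hi = sum(tree_count * TF))
```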
Exercise 15.18 Using plot_summary from Exercise 15.17, compute the unit-level sample size and mean number of regeneration trees per acre, along with the associated standard error. Assign your output to unit_summary. It should match our output below (which we arranged in decreasing order of trees_bar_h).
You can perform some simple checks on your output here. For example, there is a single regeneration tree measured in Unit 13 across 4 plots, hence the estimated 25 trees per acre (i.e., 100/4). Similarly, there are no regeneration trees in Units 10 and 11, which is why those units have estimates of zero.
While the survey data do provide an estimate of regeneration within each unit, the sample sizes are quite small, and the resulting standard errors can be large. If more precise estimates are needed for management decisions, one might consider ways to increase the sample size without incurring substantial additional cost. For example, you could combine similar units to increase sample size. Alternatively, cluster designs like those discussed in Section 17 could be used to place more regeneration subplots at each sampling location.
#> # A tibble: 13 × 4
#> unit_id n_h trees_bar_h s_trees_bar_h
#> <dbl> <int> <dbl> <dbl>
#> 1 7 5 620 193.
#> 2 8 3 567. 88.2
#> 3 2 3 400 252.
#> 4 12 3 267. 66.7
#> 5 4 7 229. 60.6
#> 6 3 10 210 27.7
#> 7 6 3 200 0
#> 8 5 15 173. 38.4
#> 9 14 4 125 62.9
#> 10 13 4 25 25
#> 11 9 10 10 10
#> 12 10 3 0 0
#> 13 11 3 0 0
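One way to compute these unit-level estimates:

```r
# A sketch: unit-level sample size, mean regeneration density, and
# standard error of the mean.
unit_summary <- plot_summary %>%
  group_by(unit_id) %>%
  summarize(
    n_h = n(),
    trees_bar_h = mean(trees_hi),
    s_trees_bar_h = sd(trees_hi) / sqrt(n_h)
  ) %>%
  arrange(desc(trees_bar_h))
```

As a check, Unit 13's four plots yield values of 100, 0, 0, and 0 trees/ac, giving a mean of 25 and a standard error of 50/\(\sqrt{4}\) = 25, matching the output above.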
Exercise 15.19 It’s interesting to examine the relationship between density and mean DBH for these data. We generally expect density to decrease as mean DBH increases, because stands in early successional stages lose individual trees to growth-based crowding and other sources of mortality.
The code below calculates the quadratic mean DBH for each unit using the individual tree measurements. You’ll note the code uses the quadratic mean DBH formula (15.19). We use weighted.mean() to appropriately weight each DBH measurement; the weight effectively repeats the given value according to the tree’s count and tree factor.162
After computing these means, we join them to unit_summary from Exercise 15.18 and arrange by trees_bar_h. Your job is to follow the code and decide whether there is any apparent relationship between mean DBH and density.
tree_level <- r_ccbuti_trees %>%
group_by(unit_id) %>%
summarize(
QMD_h = sqrt(weighted.mean(DBH_in^2, w = tree_count * TF))
)
unit_summary %>%
left_join(tree_level) %>%
  arrange(desc(trees_bar_h))
#> Joining with `by = join_by(unit_id)`
#> # A tibble: 13 × 5
#> unit_id n_h trees_bar_h s_trees_bar_h QMD_h
#> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 7 5 620 193. 2.69
#> 2 8 3 567. 88.2 1.61
#> 3 2 3 400 252. 2.58
#> 4 12 3 267. 66.7 2.69
#> 5 4 7 229. 60.6 1.97
#> 6 3 10 210 27.7 3.65
#> 7 6 3 200 0 3.13
#> 8 5 15 173. 38.4 3.13
#> 9 14 4 125 62.9 3.20
#> 10 13 4 25 25 0
#> 11 9 10 10 10 2
#> 12 10 3 0 0 NaN
#> 13 11 3 0 0 NaN
References
We use a for loop here to save space. Compare with the hardcoded approach in Section 11.3.6.1.↩︎
The bounds object is projected in meters. You can check an sf object's distance units using st_crs(bounds)$units or units::deparse_unit(st_distance(bounds)).↩︎
This works because plot_id is unique across strata. If that's not the case, for example if plot IDs are reused within each stratum, we'd need to include stratum_id in the grouping to avoid mixing trees from different strata that share the same plot ID.↩︎
At the time of writing, much of the hemlock at Harvard Forest and across the broader region had already succumbed to HWA (D. A. Orwig et al. 2012).↩︎
You can confirm this by running ccbuti_trees %>% distinct(point_id) %>% arrange(point_id) %>% print(n = Inf) to see the distinct point IDs, and setdiff(min(ccbuti_trees$point_id):max(ccbuti_trees$point_id), point_ids) if you'd like to see which points are missing in what would otherwise be a consecutive ordering.↩︎
Take a look at the count() manual page to understand the suggested argument values.↩︎
While quadratic mean diameter is commonly abbreviated QMD in the US, it's often written as \(D_g\) elsewhere, where \(D\) stands for diameter and the \(g\) subscript comes from Grundfläche, German for basal area.↩︎
This demonstrates an alternative approach to the one used in Exercise 15.10, which started from stand-level estimates. Both approaches will produce the same QMD.↩︎