This book is a work in progress and is Open Review. We want your feedback to make the book better for you and other readers. To add your annotation, select some text and then click the on the pop-up menu. To see the annotations of others, click the in the upper right hand corner of the page

Chapter 18 Multistage sampling

Multistage sampling is a generalization of cluster sampling in which selection occurs in two or more successive stages. Rather than drawing a sample directly from the population, units are chosen in stages: larger primary sampling units (PSUs) are selected first, and smaller secondary sampling units (SSUs) are then sampled within them.

While multistage designs can involve any number of stages, two-stage sampling is by far the most common. In a typical two-stage design:

a sample of PSUs is selected (e.g., stands, management units, compartments, or ownerships that partition the population of interest);
within each sampled PSU, a sample of SSUs is selected.

This structure offers considerable flexibility. It allows sampling intensity to vary across PSUs, accommodates operational constraints, and reduces travel effort by concentrating field measurements within a subset of selected PSUs.

18.1 Connection to cluster sampling

Two-stage sampling is a generalization of cluster sampling. The cluster design introduced in Section 17 corresponds to a special case of two-stage sampling in which all SSUs within each selected PSU are measured.

Two-stage sampling relaxes this requirement. Instead of measuring every SSU within a selected PSU, only a subset of SSUs is sampled. This distinction is important. In single-stage cluster sampling, sampling variability reflects only differences among PSUs, because SSU-level variation is fully observed within each cluster. In two-stage sampling, variability arises from both stages: between-PSU variation due to first-stage selection and within-PSU variation due to second-stage sampling. As a result, estimators and variance expressions must explicitly account for uncertainty introduced at both stages.

18.2 Why use multistage sampling?

Multistage sampling is widely used in forest inventory because it matches how forests are organized and offers a practical way to balance statistical efficiency, cost, and field logistics. By selecting PSUs first and then sampling SSUs within them, it lets you concentrate effort where it has the biggest impact. In particular, it can:

use ecological or management units (e.g., stands, compartments, ownership blocks) as convenient PSUs;
reduce navigation and travel by allowing crews to work intensively within a smaller number of selected PSUs;
incorporate auxiliary information available at the PSU level (e.g., stand area, prior estimates, remote-sensing predictors) to improve precision;
align sampling effort with expected variability or importance of different PSUs.

The number of SSUs per PSU can vary, different sampling methods can be used at each stage, and unequal probabilities can be introduced to target PSUs or SSUs expected to contribute more to the population total or mean.

18.3 Strategies for selecting PSUs

The first stage of two-stage sampling involves selecting PSUs. Common strategies include the following.

Simple random sampling (SRS): select PSUs with equal probability when no auxiliary information is available.
Stratification: if PSU-level variables are known (e.g., stand area, stand development stage, forest type), stratify PSUs and sample within strata to improve estimates (see Section 15.1).
Systematic sampling: if PSUs can be arranged spatially or sorted by auxiliary information related to the parameter of interest, select PSUs systematically with a random start to ensure broad coverage.
Probability proportional to size (PPS): if each PSU has a size-related measure (e.g., area or predicted biomass), select PSUs with probabilities proportional to that measure.

These approaches can be combined. For example, PSUs might be stratified by forest type and then selected using SRS, SYS, or PPS within each stratum. Stratum-specific two-stage estimates for means, totals, and standard errors can then be combined as described in Section 17.6.

Availability of auxiliary information usually guides the choice of first-stage design. Without auxiliary data, equal-probability sampling at the PSU stage is the default. When size- or variability-related variables are available and correlated with the parameter of interest, PPS designs can improve efficiency by allocating sampling effort in proportion to each PSU’s contribution.

The appropriate second-stage estimator depends on how SSUs are conceptualized within each PSU. In some applications, SSUs form a finite, list-based set of units. In others, SSUs represent sampling locations placed within a continuous area, as in point or plot sampling under an areal sampling frame. To accommodate these cases, we present estimator forms suited to both interpretations.

We present the estimators in their most general forms, allowing for PSUs of differing sizes and varying numbers of SSUs per PSU. When PSU sizes are equal or the number of SSUs per PSU is constant, the estimators simplify. These special cases are treated in standard survey sampling texts such as Cochran (1977), Thompson (2012), S. L. Lohr (2019), and Särndal, Swensson, and Wretman (1992).

While two-stage estimators can look lengthy at first glance, they follow clear, step-by-step calculations. With the tools in Section 10.2, they’re straightforward to implement, and each yields a population mean and total with associated standard errors.

18.4 Estimation using two-stage sampling

We use the following notation:

\(N\) number of PSUs in the population;
\(M_i\) number of SSUs in the \(i\)-th PSU;
\(M = \sum_{i=1}^N M_i\) total number of SSUs in the population;
\(n\) number of PSUs in the sample;
\(m_i\) number of sampled SSUs in the \(i\)-th sampled PSU;
\(y_{ij}\) value of the \(j\)-th SSU in the \(i\)-th PSU;
\(\bar{y}_i = \frac{1}{m_i}\sum_{j=1}^{m_i} y_{ij}\) sample mean within PSU \(i\);
\(s_i^2 = \frac{1}{m_i - 1} \sum_{j=1}^{m_i} (y_{ij} - \bar{y}_i)^2\) sample variance within PSU \(i\).

This notation and the estimators that follow assume a finite population with a fixed set of \(N\) PSUs (and their associated SSUs). Section 18.5 describes how closely related estimators are used when working from an areal sampling frame.

18.4.1 Sampling PSUs with equal probability

18.4.1.1 Unbiased estimators

An unbiased estimator for the population total \(\tau\) is \[\begin{equation} \hat{t} = \frac{N}{n}\sum_{i=1}^n M_i \bar{y}_i. \tag{18.1} \end{equation}\]

Given \(\hat{t}\), the population mean \(\mu\) is estimated as \[\begin{equation} \bar{y} = \frac{\hat{t}}{M}. \tag{18.2} \end{equation}\]

The standard error for \(\hat{t}\) combines variation among PSU totals and variation among SSUs within PSUs \[\begin{equation} s_{\hat{t}} = \sqrt{ \underbrace{\frac{N^2}{n}\left(1 - \frac{n}{N}\right)s^2}_{\text{between-PSU variance}} + \underbrace{\frac{N}{n} \sum_{i = 1}^n \frac{M_i^2}{m_i}\left(1 - \frac{m_i}{M_i}\right)s_i^2}_{\text{within-PSU variance}} }, \tag{18.3} \end{equation}\] where \[\begin{equation} s^2 = \frac{1}{n - 1} \sum_{i = 1}^n \left(M_i \bar{y}_i - \frac{\hat{t}}{N}\right)^2 \end{equation}\] is the sample variance among PSU totals.

Given \(s_{\hat{t}}\), the standard error for \(\bar{y}\) is \[\begin{equation} s_{\bar{y}} = \frac{s_{\hat{t}}}{M}. \tag{18.4} \end{equation}\]

As shown in (18.3), total variance is partitioned into two components.

The first term (between-PSU variance) captures variation among PSU totals and includes a FPC because PSUs are sampled without replacement.
The second term (within-PSU variance) captures variation among SSUs within each PSU. SSUs are also sampled without replacement, so this term includes its own FPC.

This estimator is unbiased for both the population total and mean, as well as their associated variance estimators.

Confidence intervals for the population mean and total follow \[\begin{equation} \bar{y} \pm \tval_{\text{df},1-\frac{\alpha}{2}}\, s_{\bar{y}} \tag{18.5} \end{equation}\] and \[\begin{equation} \hat{t} \pm \tval_{\text{df},1-\frac{\alpha}{2}}\, s_{\hat{t}}. \tag{18.6} \end{equation}\] Here, \(\text{df} = n - 1\), reflecting that PSUs—not SSUs—are the independent sampling units.

18.4.1.2 Ratio estimators

The ratio estimator provides an alternative to the unbiased estimator and often yields more stable or efficient estimates of the population mean and total, particularly when PSU sizes vary. Although biased, the ratio estimator often represents a reasonable tradeoff for gains in precision. We quantify this bias in Section 18.6.2 using a forest inventory dataset

The population mean \(\mu\) is estimated using \[\begin{equation} \bar{y} = \frac{\sum_{i=1}^n M_i\bar{y}_i}{\sum_{i=1}^n M_i}. \tag{18.7} \end{equation}\]

The standard error of \(\bar{y}\) is estimated as \[\begin{equation} s_{\bar{y}} = \frac{1}{M}\sqrt{\underbrace{\frac{N^2}{n}\left(1-\frac{n}{N}\right)\frac{1}{n-1}\sum_{i=1}^n\left(M_i\bar{y}_i - M_i\bar{y}\right)^2}_{\text{between-PSU variance}} + \underbrace{\frac{N}{n}\sum_{i=1}^n \frac{M_i^2}{m_i}\left(1-\frac{m_i}{M_i}\right)s_i^2}_{\text{within-PSU variance}}}. \tag{18.8} \end{equation}\]

Given the sample mean and its standard error, the population total and its standard error are \[\begin{equation} \hat{t} = M\bar{y} \tag{18.9} \end{equation}\] and \[\begin{equation} s_{\hat{t}} = M s_{\bar{y}}. \tag{18.10} \end{equation}\]

Confidence intervals follow (18.5) and (18.6), with \(\text{df}=n-1\).

The ratio estimator for the population mean in (18.7) doesn’t require knowledge of \(M\). However, \(M\) is required to compute the standard error of \(\bar{y}\) and to estimate the population total and its standard error. When \(M\) is unknown, it may be estimated by \[\begin{equation} \hat{M} = \frac{N}{n}\sum_{i=1}^n M_i, \end{equation}\] and \(\hat{M}\) substituted for \(M\) in (18.8), (18.9), and (18.10).

18.4.2 Sampling PSUs with probability proportional to size

The PPS two-stage estimator extends the two-stage framework introduced above by allowing PSUs to be selected with unequal probabilities while incorporating SSU sampling within each selected PSU. When auxiliary information describing PSU size is available prior to sampling, efficiency can often be improved by selecting PSUs with PPS. Here, “size” refers to any nonnegative quantity correlated with the PSU total.

We assume the following design elements.

Each PSU has a known per-draw selection probability \(p_i\), with \(p_i \propto x_i\) for some nonnegative size measure \(x_i\) and \(\sum_{i=1}^N p_i = 1\). A common choice is \(x_i = M_i\), giving \(p_i = M_i / M\) where \(M = \sum_{i=1}^N M_i\).
PSU sampling is conducted with replacement, producing \(n\) independent draws.
- If a PSU is selected more than once, a new SSU sample is drawn each time, and each draw yields its own estimate \(\bar{y}_i\).
- Because selection is with replacement, draws are independent and no finite population correction is applied.

Within each selected PSU, SSUs are sampled so that the resulting sample mean \(\bar{y}_i\) is an unbiased estimate of the PSU mean, for example using SRS or SYS within the PSU.

The population total \(\tau\) is estimated using \[\begin{equation} \hat{t} = \frac{1}{n} \sum_{i=1}^n \frac{M_i \bar{y}_i}{p_i}, \tag{18.11} \end{equation}\] where the sum is taken over the \(n\) PSUs selected with replacement. A PSU may therefore appear multiple times in the sample, and each appearance contributes its own independently estimated \(\bar{y}_i\).

The population mean is \[\begin{equation} \bar{y} = \frac{\hat{t}}{M}. \tag{18.12} \end{equation}\]

An unbiased estimator of the standard error of \(\hat{t}\) is \[\begin{equation} s_{\hat{t}} = \sqrt{\frac{1}{n(n - 1)} \sum_{i = 1}^n \left( \frac{M_i \bar{y}_i}{p_i} - \hat{t} \right)^2 }. \tag{18.13} \end{equation}\]

Given \(s_{\hat{t}}\), the standard error of the mean is \[\begin{equation} s_{\bar{y}} = \frac{s_{\hat{t}}}{M}. \tag{18.14} \end{equation}\]

Confidence intervals for the mean and total follow (18.5) and (18.6).

PPS sampling is particularly useful when PSU sizes vary widely and are correlated with the parameter of interest. Allocating sampling effort proportional to PSU size often yields smaller standard errors than equal-probability sampling.

18.5 Two-stage sampling under an areal sampling frame

Two-stage sampling is commonly used in forest inventory under an areal sampling frame. Here we focus on a setting that arises in many forestry applications: a finite set of PSUs (e.g., stands, management units, or ownerships), with SSUs selected within each sampled PSU using point or plot sampling to characterize local conditions.

The population of interest is defined by the collection of PSUs and their combined spatial extent, with total area \(A\). The PSUs need not form a single continuous region; they may be spatially disjoint. As discussed in Section 12.2, an areal sampling frame implies an effectively infinite set of possible sampling locations. Under two-stage sampling, this interpretation applies at the SSU stage: SSUs are treated as points or plots sampled from the areal extent of each PSU. In contrast, the PSUs themselves constitute a finite set of discrete units. Because only the first stage samples from a finite population, the FPC applies at the PSU stage but not at the SSU stage.

Let \(A_i\) be the area of PSU \(i\), with total population area \(A = \sum_{i=1}^N A_i\). Suppose \(m_i\) SSUs are observed in PSU \(i\), and define the SSU sample mean within PSU \(i\) as \[\begin{equation} \bar{y}_i = \frac{1}{m_i}\sum_{j=1}^{m_i} y_{ij}. \end{equation}\] Each SSU measurement \(y_{ij}\) is expressed on a per-unit-area basis (e.g., tons per hectare), so \(\bar{y}_i\) represents a per-unit-area estimate for PSU \(i\). This assumes that SSUs are sampled so that \(\bar{y}_i\) is an unbiased estimator of the PSU mean, for example using SRS or SYS within the PSU. Because the SSU population is treated as effectively infinite, the number of SSUs \(m_i\) can vary across PSUs without changing the interpretation of \(\bar{y}_i\).

The goal is to estimate the population mean per unit area, \(\mu\), and the corresponding total \(\tau = A\mu\). Estimators in Section 18.4 extend directly to this setting with only a change of notation. We illustrate this by adapting the two-stage ratio and PPS estimators to the areal sampling frame.

18.5.1 Ratio estimator

The two-stage ratio estimator for the population mean per unit area carries over directly from the finite-population setting. Replacing PSU size \(M_i\) with PSU area \(A_i\) and the population total \(M\) with the total area \(A\), the estimator for \(\mu\) is \[\begin{equation} \bar{y} = \frac{\sum_{i=1}^n A_i\bar{y}_i}{\sum_{i=1}^n A_i}. \end{equation}\] The corresponding total is \(\hat{t} = A\bar{y}\).

The standard error is obtained by modifying the two-stage ratio variance in (18.8) to reflect the areal sampling frame. Because SSUs are sampled from an effectively infinite population within each PSU, the SSU-level FPC is omitted, while the PSU-level FPC is retained. The resulting estimator is \[\begin{equation} s_{\bar{y}} = \frac{1}{A}\sqrt{\underbrace{\frac{N^2}{n}\left(1-\frac{n}{N}\right)\frac{1}{n-1}\sum_{i=1}^n\left(A_i\bar{y}_i - A_i\bar{y}\right)^2}_{\text{between-PSU variance}} + \underbrace{\frac{N}{n}\sum_{i=1}^n \frac{A_i^2}{m_i}s_i^2}_{\text{within-PSU variance}}}. \end{equation}\]

The first term reflects between-PSU variation, while the second term reflects within-PSU variation arising from second-stage sampling.

The standard error of the total is \(s_{\hat{t}} = A s_{\bar{y}}\) and confidence intervals for the mean and total follow (18.5) and (18.6), with \(\text{df}=n-1\).

18.5.2 PPS estimator

The PPS estimators in Section 18.4.2 extend in the same way. We again replace \(M_i\) with \(A_i\) and \(M\) with \(A\) in (18.11), (18.12), (18.13), and (18.14), interpreting \(A_i \bar{y}_i\) as the contribution of PSU \(i\) to the population total.

A natural choice of size measure for PPS is the PSU area, so that \(p_i = A_i/A\), assigning per-draw selection probabilities proportional to area under the assumption that larger PSUs tend to contribute more to the total. Other choices of \(p_i\) are possible when auxiliary variables correlated with the parameter of interest are available (e.g., predicted biomass or historical volume).

As before, PSUs are selected with replacement, and each time PSU \(i\) is selected it yields an independent estimate \(\bar{y}_i\) based on a new second-stage sample of SSUs.

18.6 Illustrative analyses

18.6.1 Portland park tree inventory

To illustrate two-stage estimators under a finite first and second stage, we return to the Portland parks dataset introduced in Section 17.7.1, where the goal is to estimate total tree carbon across the city’s parks. Because we have a complete census of all parks and all trees, we can directly compare sample-based estimates with the true population total, \(\tau\) = 26,349 tons (2,000 lbs).

In the single-stage cluster sampling analysis in Section 17.7.1.3, the PPS estimator outperformed the unstratified and stratified equal-probability estimators in terms of accuracy, precision, and confidence interval coverage. In that setting, stratification provided little additional benefit once PPS was used.

To maintain continuity with those results and highlight how PPS behaves under a two-stage design, we focus this illustration on the PPS estimator. We begin by demonstrating the estimator for a single sample, and then evaluate its performance under repeated sampling so the results can be compared directly with the single-stage cluster sampling analysis.

Recall from Section 17.7.1 that parks serve as PSUs and trees within parks are SSUs. Following the same setup, we assume the number of trees within each park is known (i.e., \(M_i\) is known for \(i = 1,\ldots,N\)), and define the PSU selection probabilities as \(p_i = M_i / M\), where \(M = \sum_{i=1}^N M_i\).

To match the PSU sample size used in Section 17.7.1, we select \(n = 25\) parks using PPS with replacement. Then, specific to the two-stage design, for each selected park we use SRS without replacement to select \(m = 20\) trees for measurement. If a selected park contains fewer than \(20\) trees, all trees in that park are measured (i.e., \(m_i = \min(20, M_i)\)).¹⁸⁰

18.6.1.1 PPS two-stage estimator

For each sampled park, the \(m_i\) measured trees are used to compute the sample mean \(\bar{y}_i\). These park-level quantities along with park identifiers, values of \(M_i\), and selection probabilities \(p_i\) are provided in "datasets/PDX/two_stage_pps_park_sample_n25_m20.csv" and read in below.

ppd_parks <- 
  read_csv("datasets/PDX/two_stage_pps_park_sample_n25_m20.csv")
ppd_parks %>% glimpse()

#> Rows: 25
#> Columns: 4
#> $ park_id <dbl> 87, 94, 58, 10, 70, 109, 92, 11, 36, …
#> $ M_i     <dbl> 565, 403, 261, 66, 647, 71, 48, 181, …
#> $ p_i     <dbl> 0.0223700, 0.0159560, 0.0103338, 0.00…
#> $ y_bar_i <dbl> 0.46042, 0.90021, 1.05700, 1.89893, 0…

Two features of the design are worth emphasizing. First, PSUs are selected with replacement using PPS, so the same park can—and often does—appear multiple times in the sample. Second, each appearance of a park triggers a new, independent second-stage sample of trees, yielding a different realized value of \(\bar{y}_i\). Both features are evident below, where parks 81, 87, and 94 each appear twice in the sample but with different \(\bar{y}_i\) values due to independent SSU sampling.

ppd_parks %>% arrange(desc(p_i))

#> # A tibble: 25 × 4
#>    park_id   M_i    p_i y_bar_i
#>      <dbl> <dbl>  <dbl>   <dbl>
#>  1      79  1016 0.0402   1.40 
#>  2     141   804 0.0318   0.890
#>  3      83   668 0.0264   0.416
#>  4      70   647 0.0256   0.918
#>  5      81   633 0.0251   0.718
#>  6      81   633 0.0251   0.730
#>  7      87   565 0.0224   0.460
#>  8      87   565 0.0224   0.505
#>  9      94   403 0.0160   0.900
#> 10      94   403 0.0160   0.876
#> # ℹ 15 more rows

The PPS estimator and its standard error follow (18.11) and (18.13), respectively, and are implemented below.

est_pps <- pps_parks %>%
  summarize(
    n = n(),
    t_hat = mean(M_i * y_bar_i / p_i),
    s_t_hat = sqrt(1 / (n * (n - 1)) *
      sum((M_i * y_bar_i / p_i - t_hat)^2)),
    t = qt(df = n - 1, p = 1 - 0.05 / 2),
    t_ci_lower = t_hat - t * s_t_hat,
    t_ci_upper = t_hat + t * s_t_hat
  )

est_pps %>%
  select(t_hat, t_ci_lower, t_ci_upper)

#> # A tibble: 1 × 3
#>    t_hat t_ci_lower t_ci_upper
#>    <dbl>      <dbl>      <dbl>
#> 1 25885.     21411.     30360.

The census-based total tree carbon across all parks is \(\tau\) = 26,349 tons. For this sample, the two-stage PPS estimate and its 95% confidence interval are 25,885 (21,411, 30,360) tons, which contains the true value.

To assess the estimator’s performance more generally—and to compare directly with the single-stage cluster results—we next turn to a repeated-sampling analysis.

18.6.1.2 Repeated-sampling comparison

We evaluate the two-stage PPS estimator under repeated sampling using the same general setup as in Section 17.7.1.3, drawing 500,000 samples from the Portland parks population. As before, each sample yields an estimate of the same fixed population total, allowing us to assess estimator performance across many realizations rather than from a single sample.

Across repeated samples, the two-stage PPS estimator for total tree carbon (tons) is centered very close to the true population total. The mean relative difference across samples is -0.018%, consistent with the expectation that the PPS estimator is centered on the true population total under the sampling design. The estimator shows a mean relative standard error (RSE) of 11.3% and achieves 95% confidence interval coverage of 94.6%.

These results can be compared directly with the single-stage PPS cluster estimator reported in the third row of Table 17.1. Because the single-stage design measures all SSUs within each selected PSU, whereas the two-stage design measures only a subset, some loss of precision is expected under two-stage sampling. In this case, however, the loss is modest: the RSE increases only slightly, from 10.1% under single-stage cluster sampling to 11.3% under the two-stage design.

Many of the sampled PSUs contain hundreds of trees, so restricting measurement to just 20 trees per PSU at the second stage might initially feel like a substantial loss of information. For this population, however, that level of subsampling is sufficient to produce stable estimates of PSU-level mean tree carbon. The resulting gains in operational efficiency therefore come with only a small increase in variance.

Taken together, these results point to a practical takeaway that parallels the single-stage cluster-sampling case. When PSU size is strongly associated with the population total, most efficiency gains come from PPS selection at the PSU stage. Once PSUs are selected efficiently, only a modest number of SSUs are needed within each PSU to obtain stable PSU-level means and reliable estimates of population totals.

18.6.2 Mass timber feedstock analysis

In this example, we apply the two-stage ratio and PPS estimators from Sections 18.4.1.2 and 18.4.2 to an areal-frame setting, with a finite first stage and an areal second stage as described in Section 18.5. The data are simulated, but the structure reflects analyses commonly encountered in applied forestry.

Suppose a mass timber mill and fabrication firm is exploring the feasibility of locating a facility in the region shown in Figure 18.1. The firm expects the state’s Department of Natural Resources (DNR) to be its primary supplier and wants to assess how much suitable biomass is available on DNR-managed lands within a practical hauling distance of the proposed site.

This type of pre-operational assessment is often referred to as a feedstock study, where feedstock refers to the volume of accessible wood available to support sustained operation of a mill or processing facility. Consultants and analysts routinely conduct such studies using limited field data, combined with spatial and stand-level auxiliary information, to estimate both total available biomass and its spatial distribution.

Conifer and mixed hardwood/softwood stands managed by the Michigan Department of Natural Resources within a one-hour drive of the proposed mass timber mill and fabrication facility. These stands define the population of primary sampling units (PSUs). A simple random sample of 50 PSUs was selected for inventory.

FIGURE 18.1: Conifer and mixed hardwood/softwood stands managed by the Michigan Department of Natural Resources within a one-hour drive of the proposed mass timber mill and fabrication facility. These stands define the population of primary sampling units (PSUs). A simple random sample of 50 PSUs was selected for inventory.

To identify the region of interest, the firm determines that a one-hour travel time is the upper limit for economically feasible wood hauling. We use this threshold to compute an isochrone—a polygon showing the area reachable within a fixed travel time using the road network. In this case, the isochrone includes all locations accessible within one hour of the proposed facility.¹⁸¹

The fabrication process relies primarily on softwoods, though the firm is experimenting with some hardwood inputs. Our parameters of interest are the mean per-acre and total dry weight of aboveground biomass (AGB) on all DNR conifer and mixed hardwood/softwood stands within the one-hour isochrone.

The population consists of \(N\) = 8,927 stands, and the cruising budget supports a sample of \(n\) = 50. Stands serve as the PSUs, and the SSUs are inventory plots placed within each stand. We’ll use 1/10-acre circular plots laid out on a regular 7-by-7 chain (462 ft) grid—a systematic design commonly used in operational inventory work throughout the region.

Figure 18.1 shows the full PSU population along with 50 PSUs selected using SRS. Figure 18.2 highlights a region containing several selected PSUs, and Figure 18.3 provides a closer view of these PSUs and their SSUs.

FIGURE 18.2: Zoomed-in view from Figure 18.1, highlighting three selected primary sampling units (PSUs).

The three primary sampling units (PSUs) shown in Figure 18.2. PSU IDs are 1895, 1791, and 1948 for (a), (b), and (c), respectively. Secondary sampling units (SSUs) were placed using systematic sampling. Each inventory plot is colored by the observed aboveground biomass (AGB).

FIGURE 18.3: The three primary sampling units (PSUs) shown in Figure 18.2. PSU IDs are 1895, 1791, and 1948 for (a), (b), and (c), respectively. Secondary sampling units (SSUs) were placed using systematic sampling. Each inventory plot is colored by the observed aboveground biomass (AGB).

In the sections that follow, we’ll apply the two-stage equal-probability ratio and PPS estimators to estimate both mean and total AGB. Because the data are simulated, we know the true population values \(\mu\) = 30.69 tons/ac and \(\tau\) = 11,100,216 tons. This lets us evaluate and compare each estimator’s performance against known parameters.

As you work through these sections, you’ll see that point estimates (means and totals) are similar across estimators, while the PPS estimator produces a smaller estimated standard error than the equal-probability ratio estimator. Because these results come from a single sample, however, it’s unclear whether the observed differences reflect true estimator performance or simply sample-to-sample variability. To explore this question, we turn to a repeated-sampling analysis in Section 18.6.2.3.

18.6.2.1 Estimation using an equal probability sampling

Under this design, each PSU in the population has an equal probability of selection. The following data files are used in this analysis.

datasets/mass_timber/all_psu.csv contains the full set of PSUs in the population. Each row is a PSU, and columns hold a unique PSU index and area in acres.
datasets/mass_timber/sampled_equal_prob_psu.csv contains the sampled PSUs. These were drawn from the full set of PSUs using dplyr::slice_sample() with \(n\)=50 and all arguments set to their default values.
datasets/mass_timber/sampled_ssu_equal_prob_psu.csv contains the SSUs for each sampled PSU. Each row corresponds to a tree measured on the 1/10-acre circular plot, and columns are:
- psu_id and ssu_id, which together uniquely identify the trees measured on a given plot;
- tree_count, which indicates how many trees are associated with each row (used to account for zero-tree plots and trees double-counted via the mirage method);
- agb_ton_tree, the dry weight AGB in tons for each tree, used to compute the \(y_{ij}\)’s.

The code below reads and takes a quick look at each data file.

all_psu <- read_csv("datasets/mass_timber/all_psu.csv")
all_psu %>% glimpse()

#> Rows: 8,927
#> Columns: 2
#> $ psu_id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,…
#> $ ac     <dbl> 44.565, 21.690, 51.851, 131.325, 75.43…

sampled_psu <- 
  read_csv("datasets/mass_timber/sampled_equal_prob_psu.csv")
sampled_psu %>% glimpse()

#> Rows: 50
#> Columns: 2
#> $ psu_id <dbl> 1017, 8004, 4775, 8462, 4050, 8789, 13…
#> $ ac     <dbl> 30.787, 18.785, 11.723, 16.064, 38.443…

ssu_trees <- 
  read_csv("datasets/mass_timber/sampled_ssu_equal_prob_psu.csv")
ssu_trees %>% glimpse()

#> Rows: 7,750
#> Columns: 4
#> $ psu_id       <dbl> 1017, 1017, 1017, 1017, 1017, 10…
#> $ ssu_id       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ tree_count   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ agb_ton_tree <dbl> 0.16786, 0.16999, 0.14626, 0.151…

We’ll now rename a few columns to align with our notation and create variables needed in subsequent steps.

all_psu <- all_psu %>% rename(A_i = ac)
sampled_psu <- sampled_psu %>% rename(A_i = ac)

N <- nrow(all_psu)
n <- nrow(sampled_psu)

A <- sum(all_psu$A_i)

# Add the tree factor for 1/10-acre plot.
r <- sqrt(0.1 * 43560 / pi) # Plot radius (ft).
ssu_trees <- ssu_trees %>% mutate(TF = 43560 / (pi * r^2))

We implement the ratio estimator for the population mean and total, along with their standard errors, as defined in (18.7), (18.9), (18.8), and (18.10). The estimators are applied in two steps, using the areal second-stage setup described in Section 18.5.

First, SSU-level information is summarized within each sampled PSU, including the number of SSUs \(m_i\), the sample mean \(\bar{y}_i\), and the sample variance \(s_i^2\). Second, quantities are aggregated across sampled PSUs to produce the population estimates. In this application, PSU size is measured by area, \(A_i\), which is joined to psu_summary in the final piped expression below.

psu_summary <- ssu_trees %>%
  group_by(psu_id, ssu_id) %>%
  summarize(
    y_ij = sum(TF * tree_count * agb_ton_tree),
  ) %>%
  group_by(psu_id) %>%
  summarize(
    y_bar_i = mean(y_ij),
    m_i = n(),
    s_sq_i = 1 / (m_i - 1) * sum((y_ij - y_bar_i)^2) #Or var(y_ij).
  ) %>%
  left_join(sampled_psu, by = join_by(psu_id))

pop_summary <- psu_summary %>%
  summarize(
    ## Ratio estimator for the mean (per unit area).
    y_bar = sum(A_i * y_bar_i) / sum(A_i),

    ## Standard error of the mean.
    s_y_bar = (1 / A) * sqrt(
      ## Between-PSU variance (with PSU-level FPC).
      (N^2 / n) * (1 - n / N) *
        (1 / (n - 1)) *
        sum((A_i * y_bar_i - A_i * y_bar)^2) +

      ## Within-PSU variance (no SSU-level FPC under areal frame).
      (N / n) *
        sum(A_i^2 / m_i * s_sq_i)
    ),

    ## Total and its standard error.
    t_hat   = A * y_bar,
    s_t_hat = A * s_y_bar
  )

pop_summary

#> # A tibble: 1 × 4
#>   y_bar s_y_bar     t_hat  s_t_hat
#>   <dbl>   <dbl>     <dbl>    <dbl>
#> 1  35.6    3.02 12860648. 1092949.

Our estimated mean and total, with 95% confidence intervals, were 35.56 (29.49, 41.63) tons/ac and 12,860,648 (10,664,284, 15,057,013) tons, respectively. Based on this sample, both confidence intervals captured the true values, which are \(\mu\) = 30.69 tons/ac and \(\tau\) = 11,100,216 tons.

18.6.2.2 Estimation using a PPS sample

Here we apply the PPS estimators defined in Section 18.4.2. We again use a sample of 50 PSUs; however, instead of equal-probability selection, PSUs are sampled proportional to their area. In addition to all_psu from the previous section, the following two data files are used in this analysis.

Because PSU sampling is conducted with replacement under PPS, the same PSU may be selected more than once. Each selection is independent and is associated with its own second-stage sample of SSUs (and, therefore, its own set of tree measurements). As a result, some additional care is needed when organizing the data: we append a sample index to the psu_id values so that tree measurements from different selections of the same PSU are treated as distinct groups in subsequent workflows.¹⁸²

datasets/mass_timber/sampled_pps_psu.csv contains the sampled PSUs. Each row is a PSU, and columns hold a unique PSU index and area in acres. These PSUs were selected using dplyr::slice_sample() with the weight_by argument set to each PSU’s area, applying the desired selection probabilities \(p_i = A_i / A\). The argument replace = TRUE allows for selection with replacement.
datasets/mass_timber/sampled_ssu_pps_psu.csv contains the SSUs for each sampled PSU. Each row corresponds to a tree measured on the 1/10-acre circular plot, and columns are:
- psu_id and ssu_id, which together uniquely identify the trees measured on a given plot and sample;
- tree_count, which indicates how many trees are associated with each row (used to account for zero-tree plots and trees double-counted via the mirage method);
- agb_ton_tree, the dry weight AGB in tons for each tree, used to compute the \(y_{ij}\)’s.

The code below reads and takes a quick look at each data file.

sampled_psu <- read_csv("datasets/mass_timber/sampled_pps_psu.csv")
sampled_psu %>% glimpse()

#> Rows: 50
#> Columns: 2
#> $ psu_id <chr> "316_1", "323_1", "358_1", "420_1", "5…
#> $ ac     <dbl> 19.939, 67.284, 188.677, 128.584, 120.…

ssu_trees <- read_csv("datasets/mass_timber/sampled_ssu_pps_psu.csv")
ssu_trees %>% glimpse()

#> Rows: 16,586
#> Columns: 4
#> $ psu_id       <chr> "316_1", "316_1", "316_1", "316_…
#> $ ssu_id       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2,…
#> $ tree_count   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ agb_ton_tree <dbl> 0.00294279, 0.00183660, 0.005673…

Rename a few columns to align with our notation and creating variables needed in subsequent steps. The set of all PSUs all_psu, N, and A are the same as the previous section.

sampled_psu <- sampled_psu %>% 
  rename(A_i = ac)

# Add the tree factor for 1/10-acre plots
r <- sqrt(0.1 * 43560 / pi)
ssu_trees <- ssu_trees %>% mutate(TF = 43560 / (pi * r^2))

Given the PPS sample, we apply the mean and total estimators defined in (18.11) and (18.12), along with their associated standard errors in (18.13) and (18.14).

As with the equal probability estimators, we break the process into two steps. First, we summarize SSU information within each PSU, join to sampled_psu to add the area column A_i, and compute the selection probabilities p_i, both of which are needed for the second step. Second, we compute the required quantities across all sampled PSUs to produce the overall estimates and standard errors.

psu_summary <- ssu_trees %>%
  group_by(psu_id, ssu_id) %>%
  summarize(y_ij = sum(TF * tree_count * agb_ton_tree)) %>%
  summarize(
    y_bar_i = mean(y_ij),
    m_i = n(),
    s_sq_i = 1 / (m_i - 1) * sum((y_ij - y_bar_i)^2) #Or var(y_ij).
  ) %>%
  left_join(sampled_psu, by = join_by(psu_id)) %>%
  mutate(p_i = A_i / A)

psu_summary %>% glimpse()

#> Rows: 50
#> Columns: 6
#> $ psu_id  <chr> "1017_1", "1069_1", "1418_1", "1530_1…
#> $ y_bar_i <dbl> 27.0608, 47.5176, 51.1689, 48.8034, 3…
#> $ m_i     <int> 5, 7, 39, 10, 23, 24, 43, 43, 9, 3, 5…
#> $ s_sq_i  <dbl> 324.535, 139.027, 145.808, 215.842, 5…
#> $ A_i     <dbl> 30.787, 29.630, 216.419, 41.660, 128.…
#> $ p_i     <dbl> 0.000085126, 0.000081929, 0.000598408…

pop_summary <- psu_summary %>%
  summarize(
    n = n(),
    t_hat = 1 / n * sum(A_i * y_bar_i / p_i),
    s_t_hat = sqrt(1 / (n * (n - 1)) * 
                       sum((A_i * y_bar_i / p_i - t_hat)^2)
                     ),
    y_bar = t_hat / A,
    s_y_bar = s_t_hat / A
  )

pop_summary

#> # A tibble: 1 × 5
#>       n     t_hat s_t_hat y_bar s_y_bar
#>   <int>     <dbl>   <dbl> <dbl>   <dbl>
#> 1    50 11979201. 672744.  33.1    1.86

Our estimated mean and total, with 95% confidence intervals, were 33.12 (29.38, 36.86) tons/ac and 11,979,201 (10,627,272, 13,331,131) tons, respectively. Based on this sample, both confidence intervals captured the true values, which are \(\mu\) = 30.69 tons/ac and \(\tau\) = 11,100,216 tons.

18.6.2.3 Repeated-sampling comparison

Here we assess the performance of the two-stage ratio and PPS estimators using 1,000 samples drawn under the sampling approach described above, where we select \(n = 50\) PSUs at the first stage and then sample SSUs within each PSU using a systematic grid of fixed-area plots. For each estimator, we record point estimates, standard errors, and confidence intervals. Performance is evaluated under repeated sampling using the following summaries.

Mean difference — the average difference between the estimate from each sample and the true population mean (see Section 11.2.3 for the formal definition). Under repeated sampling, this quantity is an empirical estimate of bias; values closer to zero indicate that the estimator is centered near the true mean.
RMSE — the root mean squared error, which combines both the mean difference and sampling variability into a single measure of estimator accuracy. Smaller values indicate better overall performance.
Confidence interval width — the average width of the 95% confidence interval, which reflects estimator precision. Narrower intervals are preferred, provided that confidence interval coverage remains close to its nominal level.
Confidence interval coverage — the proportion of confidence intervals that contain the true population mean, expressed as a percentage. This measures how well the variance estimator is calibrated and should be close to the nominal value of 95%.

TABLE 18.1: Comparison of the two-stage equal-probability ratio and PPS estimators for feedstock mean aboveground biomass under repeated sampling from the Michigan feedstock population. Columns show mean difference, root mean squared error (RMSE), and 95% confidence interval width and coverage, based on 1,000 PSU samples each of size \(n\) = 50 and a SYS second-stage sample.
				95% Confidence interval
PSU design	Estimator	Mean difference (tons/ac)	RMSE (tons/ac)	Width (tons/ac)	Coverage (%)
Equal prob.	Ratio	-0.39	2.6	9.2	92
PPS	PPS	-0.35	1.8	7.0	94

Summaries across replicates are given in Table 18.1. The results are consistent with patterns seen earlier in this chapter for PPS designs.

The mean difference is very small for both estimators relative to the population mean. Under equal-probability selection, the average difference is -0.39 tons/ac, while under PPS it’s -0.35 tons/ac. At this scale, these values are negligible, indicating that both estimators are effectively centered on the true mean for this problem. For PPS, any residual departure from zero is expected to diminish as the number of repeated samples increases.

The primary differences between the estimators appear in accuracy and precision. The RMSE decreases from 2.6 tons/ac under equal-probability sampling to 1.8 tons/ac under PPS. This reduction reflects the efficiency gains achieved by selecting PSUs with probability proportional to area, which concentrates sampling effort on stands that contribute more to the population total. The average 95% confidence interval width declines from 9.2 tons/ac under equal-probability sampling to 7 tons/ac under PPS, indicating higher precision under the PPS design.

Confidence interval coverage is closer to the nominal 95% level under PPS as well, although both estimators achieve coverage rates that are near the target. This suggests that uncertainty is reasonably well characterized in both cases, with PPS providing slightly better calibration.

Overall, while both estimators perform well, PPS provides clear advantages in accuracy, precision, and interval performance. These results reinforce the role of PPS as an effective design choice for two-stage inventories when reliable auxiliary information is available at the PSU level.

18.7 Advantages and disadvantages of two-stage sampling

Two-stage sampling is a flexible design that balances statistical efficiency with operational practicality. It’s especially useful when the population is naturally grouped into larger units (PSUs) that can be subdivided into smaller units (SSUs).

18.7.1 Advantages

Fieldwork can be concentrated within a manageable set of PSUs, reducing travel time and logistical complexity.
The first-stage design is flexible: PSUs can be selected with equal probability or with PPS when reliable auxiliary size information is available.
When SSUs within a PSU are relatively homogeneous, measuring only a subset of SSUs can yield results similar to single-stage cluster sampling, while avoiding the cost of measuring all SSUs. This can lead to substantial gains in efficiency with little loss in precision.
Well suited for inventories over large, heterogeneous regions where direct sampling of all SSUs is impractical.

18.7.2 Disadvantages

If within-PSU variability is high and only a small number of SSUs are measured within each PSU, precision can be lower than for single-stage designs in which all SSUs are measured.
When PSU sizes vary widely and the parameter of interest is strongly related to size, equal-probability PSU selection can be inefficient, producing larger standard errors than ratio or PPS designs.

References

Cochran, William G. 1977. Sampling Techniques. Third Edition. John Wiley.

Giraud, Timothée. 2022. “osrm: Interface Between R and the OpenStreetMap-Based Routing Service OSRM.” Journal of Open Source Software 7 (78): 4574.

Lohr, Sharon L. 2019. Sampling: Design and Analysis. Chapman & Hall/CRC Texts in Statistical Science. CRC Press.

Särndal, C. E., B. Swensson, and J. H. Wretman. 1992. Model Assisted Survey Sampling. Springer Series in Statistics. Springer-Verlag.

Thompson, Steven K. 2012. Sampling. Third Edition. Wiley Series in Probability and Statistics. Hoboken, NJ: John Wiley & Sons.

We could imagine a potentially even more efficient design that applies 3P sampling (Section 16.6) to estimate mean carbon within each selected park.↩︎
This type of road-based travel-time polygon can be computed in R using the osrm package, which provides an interface to the Open Source Routing Machine. The package supports travel-time and routing calculations using OpenStreetMap’s road network data. Other tools and packages are also available for generating isochrones, depending on your needs and data sources (Giraud 2022).↩︎
For this particular PSU sample, there happen to be no repeat selections, so all psu_id values have a _1 suffix. This is not true for all samples used in the repeated-sampling analysis in Section 18.6.2.3.↩︎

Want to know when the book is for sale? Enter your email so we can let you know.