This book is a work in progress and is Open Review. We want your feedback to make the book better for you and other readers. To add your annotation, select some text and then click the on the pop-up menu. To see the annotations of others, click the in the upper right hand corner of the page

Chapter 19 Multistage sampling

Multistage sampling generalizes cluster sampling by allowing selection to occur in two or more successive stages. Rather than drawing a sample directly from the population, units are selected hierarchically: larger units are sampled first, followed by progressively smaller, nested units. In principle, this process can involve any number of stages, depending on how the population is organized and how data collection is implemented.

While multistage designs can involve more than two stages, two-stage sampling is by far the most common in practice. In a typical two-stage design:

  1. a sample of PSUs is selected (e.g., stands, management units, compartments, or ownerships that partition the population of interest);

  2. within each sampled PSU, a sample of SSUs is selected.

This structure offers considerable flexibility. It allows sampling intensity to vary across PSUs, accommodates operational constraints, and reduces travel effort by concentrating field measurements within a subset of selected PSUs.

19.1 Connection to cluster sampling

Two-stage sampling is a generalization of cluster sampling. The cluster design introduced in Chapter 18 corresponds to a special case of two-stage sampling in which all SSUs within each selected PSU are measured.

Two-stage sampling relaxes this requirement. Instead of measuring every SSU within a selected PSU, only a subset of SSUs is sampled. This distinction is important. In single-stage cluster sampling, sampling variability reflects only differences among PSUs, because SSU-level variation is fully observed within each cluster. In two-stage sampling, variability arises from both stages: between-PSU variation due to first-stage selection and within-PSU variation due to second-stage sampling. As a result, estimators and variance expressions must explicitly account for uncertainty introduced at both stages.

19.2 When multistage sampling is useful

Multistage sampling is widely used in forest inventory because it matches how forests are organized and offers a practical way to balance statistical efficiency, cost, and field logistics. By selecting PSUs first and then sampling SSUs within them, it lets you concentrate effort where it has the biggest impact. In particular, it can:

  • use ecological or management units (e.g., stands, compartments, ownership blocks) as convenient PSUs;

  • reduce navigation and travel by allowing crews to work intensively within a smaller number of selected PSUs;

  • incorporate auxiliary information available at the PSU level (e.g., stand area, prior estimates, remote-sensing predictors) to improve precision;

  • align sampling effort with expected variability or importance of different PSUs.

The number of SSUs per PSU can vary, different sampling methods can be used at each stage, and unequal probabilities can be introduced to target PSUs or SSUs expected to contribute more to the population total or mean.

19.3 Strategies for selecting PSUs

The first stage of two-stage sampling involves selecting PSUs. Common strategies include the following.

  1. Simple random sampling (SRS): select PSUs with equal probability when no auxiliary information is available.

  2. Stratification: if PSU-level variables are known (e.g., stand area, stand development stage, forest type), stratify PSUs and sample within strata to improve estimates.

  3. Systematic sampling: if PSUs can be arranged spatially or sorted by auxiliary information related to the parameter of interest, select PSUs systematically with a random start to ensure broad coverage.

  4. Probability proportional to size (PPS): if each PSU has a size measure that’s related to the parameter of interest (e.g., area or predicted biomass), select PSUs with probabilities proportional to that measure.

These approaches can be combined. For example, PSUs might be stratified by forest type and then selected using SRS, SYS, or PPS within each stratum. Stratum-specific two-stage estimates for means, totals, and standard errors can then be combined as described in Section 18.6.

Availability of auxiliary information usually guides the choice of first-stage design. Without auxiliary data, equal-probability sampling at the PSU stage is the default. When size- or variability-related variables are available and correlated with the parameter of interest, PPS designs can improve efficiency by allocating sampling effort in proportion to each PSU’s contribution.

The appropriate second-stage estimator depends on how SSUs are conceptualized within each PSU. In some applications, SSUs form a finite, list-based set of units. In others, SSUs represent sampling locations placed within a continuous area, as in point or plot sampling under an areal sampling frame. To accommodate these cases, we present estimator forms suited to both interpretations.

We present the estimators in their most general forms, allowing for PSUs of differing sizes and varying numbers of SSUs per PSU. When PSU sizes are equal or the number of SSUs per PSU is constant, the estimators simplify. These special cases are treated in standard survey sampling texts such as Cochran (1977), Thompson (2012), Lohr (2019), and Särndal et al. (1992).

While two-stage estimators can look lengthy at first glance, they follow clear, step-by-step calculations. With the tools in Section 11.2, they’re straightforward to implement, and each yields a population mean and total with associated standard errors.

19.4 Estimation using two-stage sampling

We use the following notation:

  • \(N\) number of PSUs in the population;
  • \(M_i\) number of SSUs in the \(i\)-th PSU;
  • \(M = \sum_{i=1}^N M_i\) total number of SSUs in the population;
  • \(n\) number of PSUs in the sample;
  • \(m_i\) number of sampled SSUs in the \(i\)-th sampled PSU;
  • \(y_{ij}\) value of the \(j\)-th SSU in the \(i\)-th PSU;
  • \(\bar{y}_i = \frac{1}{m_i}\sum_{j=1}^{m_i} y_{ij}\) sample mean within PSU \(i\);
  • \(s_i^2 = \frac{1}{m_i - 1} \sum_{j=1}^{m_i} (y_{ij} - \bar{y}_i)^2\) sample variance within PSU \(i\).

This notation and the estimators that follow assume a finite population with a fixed set of \(N\) PSUs (and their associated SSUs). Section 19.5 describes how closely related estimators are used when working from an areal sampling frame.

19.4.1 Sampling PSUs with equal probability

19.4.1.1 Unbiased estimators

An unbiased estimator for the population total \(\tau\) is \[\begin{equation} \hat{t} = \frac{N}{n}\sum_{i=1}^n M_i \bar{y}_i. \tag{19.1} \end{equation}\]

The standard error of \(\hat{t}\) combines variation among PSU totals and variation among SSUs within PSUs and is estimated as \[\begin{equation} s_{\hat{t}} = \sqrt{ \underbrace{\frac{N^2}{n}\left(1 - \frac{n}{N}\right)s^2}_{\text{between-PSU variance}} + \underbrace{\frac{N}{n} \sum_{i = 1}^n \frac{M_i^2}{m_i}\left(1 - \frac{m_i}{M_i}\right)s_i^2}_{\text{within-PSU variance}} }, \tag{19.2} \end{equation}\] where \[\begin{equation} s^2 = \frac{1}{n - 1} \sum_{i = 1}^n \left(M_i \bar{y}_i - \frac{\hat{t}}{N}\right)^2 \end{equation}\] is the sample variance among PSU totals.

Given \(\hat{t}\), the population mean \(\mu\) is estimated as \[\begin{equation} \bar{y} = \frac{\hat{t}}{M}, \tag{19.3} \end{equation}\] with standard error \[\begin{equation} s_{\bar{y}} = \frac{s_{\hat{t}}}{M}. \tag{19.4} \end{equation}\]

As shown in (19.2), total variance is partitioned into two components.

  • The first term (between-PSU variance) captures variation among PSU totals and includes a FPC because PSUs are sampled without replacement.

  • The second term (within-PSU variance) captures variation among SSUs within each PSU. SSUs are also sampled without replacement, so this term includes its own FPC.

This estimator is unbiased for both the population total and mean, as well as their associated variance estimators.

Confidence intervals for the population mean and total follow \[\begin{equation} \bar{y} \pm \tval_{\text{df},1-\frac{\alpha}{2}}s_{\bar{y}} \tag{19.5} \end{equation}\] and \[\begin{equation} \hat{t} \pm \tval_{\text{df},1-\frac{\alpha}{2}}s_{\hat{t}}. \tag{19.6} \end{equation}\] Here, \(\text{df} = n - 1\), reflecting that PSUs—not SSUs—are the independent sampling units.

19.4.1.2 Ratio estimators

The ratio estimator provides an alternative to the unbiased estimator and often yields more stable or efficient estimates of the population mean and total, particularly when PSU sizes vary. Although biased, the ratio estimator often represents a reasonable tradeoff for gains in precision. We quantify this bias in Section 19.6.2 using a forest inventory dataset.

The population mean \(\mu\) is estimated using \[\begin{equation} \bar{y} = \frac{\sum_{i=1}^n M_i\bar{y}_i}{\sum_{i=1}^n M_i}. \tag{19.7} \end{equation}\]

The standard error of \(\bar{y}\) is estimated as \[\begin{equation} s_{\bar{y}} = \frac{1}{M}\sqrt{\underbrace{\frac{N^2}{n}\left(1-\frac{n}{N}\right)\frac{1}{n-1}\sum_{i=1}^n\left(M_i\bar{y}_i - M_i\bar{y}\right)^2}_{\text{between-PSU variance}} + \underbrace{\frac{N}{n}\sum_{i=1}^n \frac{M_i^2}{m_i}\left(1-\frac{m_i}{M_i}\right)s_i^2}_{\text{within-PSU variance}}}. \tag{19.8} \end{equation}\]

Given the sample mean \(\bar{y}\) and its standard error \(s_{\bar{y}}\), the population total and its standard error are estimated as \[\begin{equation} \hat{t} = M\bar{y} \tag{19.9} \end{equation}\] and \[\begin{equation} s_{\hat{t}} = M s_{\bar{y}}. \tag{19.10} \end{equation}\]

Confidence intervals follow (19.5) and (19.6), with \(\text{df}=n-1\).

The ratio estimator for the population mean in (19.7) doesn’t require knowledge of \(M\). However, \(M\) is required to compute the standard error of \(\bar{y}\) and to estimate the population total and its standard error. When \(M\) is unknown, it may be estimated by \[\begin{equation} \hat{M} = \frac{N}{n}\sum_{i=1}^n M_i, \end{equation}\] and \(\hat{M}\) substituted for \(M\) in (19.8), (19.9), and (19.10).

19.4.2 Sampling PSUs with probability proportional to size

The PPS two-stage estimator extends the two-stage framework introduced above by allowing PSUs to be selected with unequal probabilities while incorporating SSU sampling within each selected PSU. When auxiliary information describing PSU size is available prior to sampling, efficiency can often be improved by selecting PSUs with PPS. Here, “size” refers to any nonnegative quantity correlated with the PSU total.

We assume the following design elements.

  • Each PSU has a known per-draw selection probability \(p_i\), with \(p_i \propto x_i\) for some nonnegative size measure \(x_i\) and \(\sum_{i=1}^N p_i = 1\). A common choice is \(x_i = M_i\), giving \(p_i = M_i / M\) where \(M = \sum_{i=1}^N M_i\).

  • PSU sampling is conducted with replacement, producing \(n\) independent draws.

    • If a PSU is selected more than once, a new SSU sample is drawn each time, and each draw yields its own estimate \(\bar{y}_i\).
    • Because selection is with replacement, draws are independent and no finite population correction is applied.

Within each selected PSU, SSUs are sampled so that the resulting sample mean \(\bar{y}_i\) is an unbiased estimate of the PSU mean, for example using SRS or SYS within the PSU.

The population total \(\tau\) is estimated using \[\begin{equation} \hat{t} = \frac{1}{n} \sum_{i=1}^n \frac{M_i \bar{y}_i}{p_i}, \tag{19.11} \end{equation}\] where the sum is taken over the \(n\) PSUs selected with replacement. A PSU may therefore appear multiple times in the sample, and each appearance contributes its own independently estimated \(\bar{y}_i\).

The standard error of \(\hat{t}\) is estimated as \[\begin{equation} s_{\hat{t}} = \sqrt{\frac{1}{n(n - 1)} \sum_{i = 1}^n \left( \frac{M_i \bar{y}_i}{p_i} - \hat{t} \right)^2 }. \tag{19.12} \end{equation}\]

Given \(\hat{t}\), the estimator for the population mean is \[\begin{equation} \bar{y} = \frac{\hat{t}}{M}, \tag{19.13} \end{equation}\] with standard error \[\begin{equation} s_{\bar{y}} = \frac{s_{\hat{t}}}{M}. \tag{19.14} \end{equation}\]

Confidence intervals for the mean and total follow (19.5) and (19.6).

PPS sampling is particularly useful when PSU sizes vary widely and are correlated with the parameter of interest. Allocating sampling effort proportional to PSU size often yields smaller standard errors than equal-probability sampling.

19.5 Two-stage sampling under an areal sampling frame

Two-stage sampling is commonly used in forest inventory under an areal sampling frame. Here we focus on a setting that arises in many forestry applications: a finite set of PSUs (e.g., stands, management units, or ownerships), with SSUs selected within each sampled PSU using point or plot sampling to characterize local conditions.

The population of interest is defined by the collection of PSUs and their combined spatial extent, with total area \(A\). The PSUs need not form a single continuous region; they may be spatially disjoint. As discussed in Section 13.2, an areal sampling frame implies an effectively infinite set of possible sampling locations. Under two-stage sampling, this interpretation applies at the SSU stage: SSUs are treated as points or plots sampled from the areal extent of each PSU. In contrast, the PSUs themselves constitute a finite set of discrete units.

Let \(A_i\) be the area of PSU \(i\), with total population area \(A = \sum_{i=1}^N A_i\). Suppose \(m_i\) SSUs are observed in PSU \(i\), and define the SSU sample mean within PSU \(i\) as \[\begin{equation} \bar{y}_i = \frac{1}{m_i}\sum_{j=1}^{m_i} y_{ij}. \end{equation}\] Each SSU measurement \(y_{ij}\) is expressed on a per-unit-area basis (e.g., tons per hectare), so \(\bar{y}_i\) represents a per-unit-area estimate for PSU \(i\). This assumes that SSUs are sampled so that \(\bar{y}_i\) is an unbiased estimator of the PSU mean, for example using SRS or SYS within the PSU. Because the SSU population is treated as effectively infinite, the number of SSUs \(m_i\) can vary across PSUs without changing the interpretation of \(\bar{y}_i\).

The goal is to estimate the population mean per unit area, \(\mu\), and the corresponding total \(\tau = A\mu\). Estimators in Section 19.4 extend directly to this setting with only a change of notation. We illustrate this by adapting the two-stage ratio and PPS estimators to the areal sampling frame.

19.5.1 Ratio estimator

The two-stage ratio estimator for the population mean per unit area carries over directly from the finite-population setting. Replacing PSU size \(M_i\) with PSU area \(A_i\) and the population total \(M\) with the total area \(A\), the estimator for \(\mu\) is \[\begin{equation} \bar{y} = \frac{\sum_{i=1}^n A_i\bar{y}_i}{\sum_{i=1}^n A_i}. \end{equation}\] The corresponding total is \(\hat{t} = A\bar{y}\).

The standard error is obtained by modifying the two-stage ratio variance in (19.8) to reflect the areal sampling frame. Because SSUs are sampled from an effectively infinite population within each PSU, the SSU-level FPC is omitted, while the PSU-level FPC is retained. The resulting estimator is \[\begin{equation} s_{\bar{y}} = \frac{1}{A}\sqrt{\underbrace{\frac{N^2}{n}\left(1-\frac{n}{N}\right)\frac{1}{n-1}\sum_{i=1}^n\left(A_i\bar{y}_i - A_i\bar{y}\right)^2}_{\text{between-PSU variance}} + \underbrace{\frac{N}{n}\sum_{i=1}^n \frac{A_i^2}{m_i}s_i^2}_{\text{within-PSU variance}}}. \end{equation}\]

The first term reflects between-PSU variation, while the second term reflects within-PSU variation arising from second-stage sampling.

The standard error of the total is \(s_{\hat{t}} = A s_{\bar{y}}\) and confidence intervals for the mean and total follow (19.5) and (19.6), with \(\text{df}=n-1\).

19.5.2 PPS estimator

The PPS estimators in Section 19.4.2 extend in the same way. We again replace \(M_i\) with \(A_i\) and \(M\) with \(A\) in (19.11), (19.13), (19.12), and (19.14), interpreting \(A_i \bar{y}_i\) as the contribution of PSU \(i\) to the population total.

A natural choice of size measure for PPS is the PSU area, so that \(p_i = A_i/A\), assigning per-draw selection probabilities proportional to area under the assumption that larger PSUs tend to contribute more to the total. Other choices of \(p_i\) are possible when auxiliary variables correlated with the parameter of interest are available (e.g., predicted biomass or historical volume).

As before, PSUs are selected with replacement, and each time PSU \(i\) is selected it yields an independent estimate \(\bar{y}_i\) based on a new second-stage sample of SSUs.

19.6 Illustrative analyses

19.6.1 Portland park tree inventory

To illustrate two-stage estimators under a finite first and second stage, we return to the Portland parks dataset introduced in Section 18.7.1, where the goal is to estimate total tree carbon across the city’s parks. Because we have a complete census of all parks and all trees, we can directly compare sample-based estimates with the true population total, \(\tau\) = 26,349 tons.

In the single-stage cluster sampling analysis in Section 18.7.1.3, the PPS estimator outperformed the unstratified and stratified equal-probability estimators in terms of accuracy, precision, and confidence interval coverage. In that setting, stratification provided little additional benefit once PPS was used.

To maintain continuity with those results and highlight how PPS behaves under a two-stage design, we focus this illustration on the PPS estimator. We begin by demonstrating the estimator for a single sample, and then evaluate its performance under repeated sampling so the results can be compared directly with the single-stage cluster sampling analysis.

Recall from Section 18.7.1 that parks serve as PSUs and trees within parks are SSUs. Following the same setup, we assume the number of trees within each park is known (i.e., \(M_i\) is known for \(i = 1,\ldots,N\)), and define the PSU selection probabilities as \(p_i = M_i / M\), where \(M = \sum_{i=1}^N M_i\).

To match the PSU sample size used in Section 18.7.1, we select \(n = 25\) parks using PPS with replacement. Then, specific to the two-stage design, for each selected park we use SRS without replacement to select \(m = 20\) trees for measurement. If a selected park contains fewer than \(20\) trees, all trees in that park are measured (i.e., \(m_i = \min(20, M_i)\)).184

19.6.1.1 PPS two-stage estimator

For each sampled park, the \(m_i\) measured trees are used to compute the sample mean \(\bar{y}_i\). These park-level quantities along with park identifiers, values of \(M_i\), and selection probabilities \(p_i\) are provided in "datasets/PDX/two_stage_pps_park_sample_n25_m20.csv" and read in below.

ppd_parks <- 
  read_csv("datasets/PDX/two_stage_pps_park_sample_n25_m20.csv")
ppd_parks %>% glimpse()
#> Rows: 25
#> Columns: 4
#> $ park_id <dbl> 87, 94, 58, 10, 70, 109, 92, 11, 36, …
#> $ M_i     <dbl> 565, 403, 261, 66, 647, 71, 48, 181, …
#> $ p_i     <dbl> 0.0223700, 0.0159560, 0.0103338, 0.00…
#> $ y_bar_i <dbl> 0.46042, 0.90021, 1.05700, 1.89893, 0…

Two features of the design are worth emphasizing. First, PSUs are selected with replacement using PPS, so the same park can—and often does—appear multiple times in the sample. Second, each appearance of a park triggers a new, independent second-stage sample of trees, yielding a different realized value of \(\bar{y}_i\). Both features are evident below, where parks 81, 87, and 94 each appear twice in the sample but with different \(\bar{y}_i\) values due to independent SSU sampling.

ppd_parks %>% arrange(desc(p_i))
#> # A tibble: 25 × 4
#>    park_id   M_i    p_i y_bar_i
#>      <dbl> <dbl>  <dbl>   <dbl>
#>  1      79  1016 0.0402   1.40 
#>  2     141   804 0.0318   0.890
#>  3      83   668 0.0264   0.416
#>  4      70   647 0.0256   0.918
#>  5      81   633 0.0251   0.718
#>  6      81   633 0.0251   0.730
#>  7      87   565 0.0224   0.460
#>  8      87   565 0.0224   0.505
#>  9      94   403 0.0160   0.900
#> 10      94   403 0.0160   0.876
#> # ℹ 15 more rows

The PPS estimator and its standard error follow (19.11) and (19.12), respectively, and are implemented below.

est_pps <- pps_parks %>%
  summarize(
    n = n(),
    t_hat = mean(M_i * y_bar_i / p_i),
    s_t_hat = sqrt(1 / (n * (n - 1)) *
      sum((M_i * y_bar_i / p_i - t_hat)^2)),
    t = qt(df = n - 1, p = 1 - 0.05 / 2),
    t_ci_lower = t_hat - t * s_t_hat,
    t_ci_upper = t_hat + t * s_t_hat
  )

est_pps %>%
  select(t_hat, t_ci_lower, t_ci_upper)
#> # A tibble: 1 × 3
#>    t_hat t_ci_lower t_ci_upper
#>    <dbl>      <dbl>      <dbl>
#> 1 25885.     21411.     30360.

The census-based total tree carbon across all parks is \(\tau\) = 26,349 tons. For this sample, the two-stage PPS estimate and its 95% confidence interval are 25,885 (21,411, 30,360) tons, which contains the true value.

To assess the estimator’s performance more generally—and to compare directly with the single-stage cluster results—we next turn to a repeated-sampling analysis.

19.6.1.2 Repeated-sampling comparison

We evaluate the two-stage PPS estimator under repeated sampling using the same general setup as in Section 18.7.1.3, drawing 500,000 samples from the Portland parks population. As before, each sample yields an estimate of the same fixed population total, allowing us to assess estimator performance across many realizations rather than from a single sample.

Across repeated samples, the two-stage PPS estimator for total tree carbon (tons) is centered very close to the true population total. The mean relative difference across samples is -0.018%, consistent with the expectation that the PPS estimator is centered on the true population total under the sampling design. The estimator shows a mean relative standard error (RSE) of 11.3% and achieves 95% confidence interval coverage of 94.6%.

These results can be compared directly with the single-stage PPS cluster estimator reported in the third row of Table 18.1. Because the single-stage design measures all SSUs within each selected PSU, whereas the two-stage design measures only a subset, some loss of precision is expected under two-stage sampling. In this case, however, the loss is modest: the RSE increases only slightly, from 10.1% under single-stage cluster sampling to 11.3% under the two-stage design.

Many of the sampled PSUs contain hundreds of trees, so restricting measurement to just 20 trees per PSU at the second stage might initially feel like a substantial loss of information. For this population, however, that level of subsampling is sufficient to produce stable estimates of PSU-level mean tree carbon. The resulting gains in operational efficiency therefore come with only a small increase in variance.

Taken together, these results point to a practical takeaway that parallels the single-stage cluster-sampling case. When PSU size is strongly associated with the population total, most efficiency gains come from PPS selection at the PSU stage. Once PSUs are selected efficiently, only a modest number of SSUs are needed within each PSU to obtain stable PSU-level means and reliable estimates of population totals.

19.6.2 Mass timber feedstock analysis

In this example, we apply the two-stage ratio and PPS estimators from Sections 19.4.1.2 and 19.4.2 to an areal-frame setting, with a finite first stage and an areal second stage as described in Section 19.5. The data are simulated, but the structure reflects analyses commonly encountered in applied forestry.

Suppose a mass timber mill and fabrication firm is exploring the feasibility of locating a facility in the region shown in Figure 19.1. The firm expects the state’s Department of Natural Resources (DNR) to be its primary supplier and wants to assess how much suitable biomass is available on DNR-managed lands within a practical hauling distance of the proposed site.

This type of pre-operational assessment is often referred to as a feedstock study, where feedstock refers to the volume of accessible wood available to support sustained operation of a mill or processing facility. Consultants and analysts routinely conduct such studies using limited field data, combined with spatial and stand-level auxiliary information, to estimate both total available biomass and its spatial distribution.

Conifer and mixed hardwood/softwood stands managed by the Michigan Department of Natural Resources within a one-hour drive of the proposed mass timber mill and fabrication facility. These stands define the population of primary sampling units (PSUs). A simple random sample of 50 PSUs was selected for inventory.

FIGURE 19.1: Conifer and mixed hardwood/softwood stands managed by the Michigan Department of Natural Resources within a one-hour drive of the proposed mass timber mill and fabrication facility. These stands define the population of primary sampling units (PSUs). A simple random sample of 50 PSUs was selected for inventory.

To identify the region of interest, the firm determines that a one-hour travel time is the upper limit for economically feasible wood hauling. We use this threshold to compute an isochrone—a polygon showing the area reachable within a fixed travel time using the road network. In this case, the isochrone includes all locations accessible within one hour of the proposed facility.185

The fabrication process relies primarily on softwoods, though the firm is experimenting with some hardwood inputs. Our parameters of interest are the mean per-acre and total dry weight of AGB on all DNR conifer and mixed hardwood/softwood stands within the one-hour isochrone.

The population consists of \(N\) = 8,927 stands, and the cruising budget supports a sample of \(n\) = 50. Stands serve as the PSUs, and the SSUs are inventory plots placed within each stand. We’ll use 1/10-acre circular plots laid out on a regular 7-by-7 chain (462 ft) grid—a systematic design commonly used in operational inventory work throughout the region.

Figure 19.1 shows the full PSU population along with 50 PSUs selected using SRS. Figure 19.2 highlights a region containing several selected PSUs, and Figure 19.3 provides a closer view of these PSUs and their SSUs.

Zoomed-in view from Figure 19.1, highlighting three selected primary sampling units (PSUs).

FIGURE 19.2: Zoomed-in view from Figure 19.1, highlighting three selected primary sampling units (PSUs).

The three primary sampling units (PSUs) shown in Figure 19.2. PSU IDs are 1895, 1791, and 1948 for (a), (b), and (c), respectively. Secondary sampling units (SSUs) were placed using systematic sampling. Each inventory plot is colored by the observed aboveground biomass (AGB).

FIGURE 19.3: The three primary sampling units (PSUs) shown in Figure 19.2. PSU IDs are 1895, 1791, and 1948 for (a), (b), and (c), respectively. Secondary sampling units (SSUs) were placed using systematic sampling. Each inventory plot is colored by the observed aboveground biomass (AGB).

In the sections that follow, we’ll apply the two-stage equal-probability ratio and PPS estimators to estimate both mean and total AGB. Because the data are simulated, we know the true population values \(\mu\) = 30.69 tons/ac and \(\tau\) = 11,100,216 tons. This lets us evaluate and compare each estimator’s performance against known parameters.

As you work through these sections, you’ll see that point estimates (means and totals) are similar across estimators, while the PPS estimator produces a smaller estimated standard error than the equal-probability ratio estimator. Because these results come from a single sample, however, it’s unclear whether the observed differences reflect true estimator performance or simply sample-to-sample variability. To explore this question, we turn to a repeated-sampling analysis in Section 19.6.2.3.

19.6.2.1 Estimation using an equal probability sampling

Under this design, each PSU in the population has an equal probability of selection. The following data files are used in this analysis.

  • datasets/mass_timber/all_psu.csv contains the full set of PSUs in the population. Each row is a PSU, and columns hold a unique PSU index and area in acres.

  • datasets/mass_timber/sampled_equal_prob_psu.csv contains the sampled PSUs. These were drawn from the full set of PSUs using dplyr::slice_sample() with \(n\)=50 and all arguments set to their default values.

  • datasets/mass_timber/sampled_ssu_equal_prob_psu.csv contains the SSUs for each sampled PSU. Each row corresponds to a tree measured on the 1/10-acre circular plot, and columns are:

    • psu_id and ssu_id, which together uniquely identify the trees measured on a given plot;

    • tree_count, which indicates how many trees are associated with each row (used to account for zero-tree plots and trees double-counted via the mirage method);

    • agb_ton_tree, the dry weight AGB in tons for each tree, used to compute the \(y_{ij}\)’s.

The code below reads and takes a quick look at each data file.

all_psu <- read_csv("datasets/mass_timber/all_psu.csv")
all_psu %>% glimpse()
#> Rows: 8,927
#> Columns: 2
#> $ psu_id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,…
#> $ ac     <dbl> 44.565, 21.690, 51.851, 131.325, 75.43…
sampled_psu <- 
  read_csv("datasets/mass_timber/sampled_equal_prob_psu.csv")
sampled_psu %>% glimpse()
#> Rows: 50
#> Columns: 2
#> $ psu_id <dbl> 1017, 8004, 4775, 8462, 4050, 8789, 13…
#> $ ac     <dbl> 30.787, 18.785, 11.723, 16.064, 38.443…
ssu_trees <- 
  read_csv("datasets/mass_timber/sampled_ssu_equal_prob_psu.csv")
ssu_trees %>% glimpse()
#> Rows: 7,750
#> Columns: 4
#> $ psu_id       <dbl> 1017, 1017, 1017, 1017, 1017, 10…
#> $ ssu_id       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ tree_count   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ agb_ton_tree <dbl> 0.16786, 0.16999, 0.14626, 0.151…

We’ll now rename a few columns to align with our notation and create variables needed in subsequent steps.

all_psu <- all_psu %>% rename(A_i = ac)
sampled_psu <- sampled_psu %>% rename(A_i = ac)

N <- nrow(all_psu)
n <- nrow(sampled_psu)

A <- sum(all_psu$A_i)

# Add the tree factor for 1/10-acre plot.
ssu_trees <- ssu_trees %>% mutate(TF = 10)

We implement the ratio estimator for the population mean and total, along with their standard errors, as defined in (19.7), (19.9), (19.8), and (19.10). The estimators are applied in two steps, using the areal second-stage setup described in Section 19.5.

First, SSU-level information is summarized within each sampled PSU, including the number of SSUs \(m_i\), the sample mean \(\bar{y}_i\), and the sample variance \(s_i^2\). The PSU area \(A_i\) is then joined to psu_summary. Second, these quantities are aggregated across sampled PSUs to produce the population estimates.

psu_summary <- ssu_trees %>%
  group_by(psu_id, ssu_id) %>%
  summarize(
    y_ij = sum(TF * tree_count * agb_ton_tree),
  ) %>%
  group_by(psu_id) %>%
  summarize(
    y_bar_i = mean(y_ij),
    m_i = n(),
    s_sq_i = 1 / (m_i - 1) * sum((y_ij - y_bar_i)^2) #Or var(y_ij).
  ) %>% # Join PSU areas.
  left_join(sampled_psu, by = join_by(psu_id))
pop_summary <- psu_summary %>%
  summarize(
    # Ratio estimator for the mean (per unit area).
    y_bar = sum(A_i * y_bar_i) / sum(A_i),

    # Standard error of the mean.
    s_y_bar = (1 / A) * sqrt(
      ## Between-PSU variance (with PSU-level FPC).
      (N^2 / n) * (1 - n / N) *
        (1 / (n - 1)) *
        sum((A_i * y_bar_i - A_i * y_bar)^2) +

      # Within-PSU variance (no SSU-level FPC under areal frame).
      (N / n) *
        sum(A_i^2 / m_i * s_sq_i)
    ),

    # Total and its standard error.
    t_hat   = A * y_bar,
    s_t_hat = A * s_y_bar
  )

pop_summary
#> # A tibble: 1 × 4
#>   y_bar s_y_bar     t_hat  s_t_hat
#>   <dbl>   <dbl>     <dbl>    <dbl>
#> 1  35.6    3.02 12860648. 1092949.

Our estimated mean and total, with 95% confidence intervals, were 35.56 (29.49, 41.63) tons/ac and 12,860,648 (10,664,284, 15,057,013) tons, respectively. Based on this sample, both confidence intervals captured the true values, which are \(\mu\) = 30.69 tons/ac and \(\tau\) = 11,100,216 tons.

19.6.2.2 Estimation using a PPS sample

Here we apply the PPS estimators defined in Section 19.4.2. We again use a sample of 50 PSUs; however, instead of equal-probability selection, PSUs are sampled proportional to their area. In addition to all_psu from the previous section, the following two data files are used in this analysis.

  • datasets/mass_timber/sampled_pps_psu.csv contains the sampled PSUs. Each row is a PSU, and columns hold a unique PSU index and area in acres. These PSUs were selected using dplyr::slice_sample() with the weight_by argument set to each PSU’s area, applying the desired selection probabilities \(p_i = A_i / A\). The argument replace = TRUE allows for selection with replacement.

  • datasets/mass_timber/sampled_ssu_pps_psu.csv contains the SSUs for each sampled PSU. Each row corresponds to a tree measured on the 1/10-acre circular plot, and columns are:

    • psu_id and ssu_id, which together uniquely identify the trees measured on a given plot and sample;

    • tree_count, which indicates how many trees are associated with each row (used to account for zero-tree plots and trees double-counted via the mirage method);

    • agb_ton_tree, the dry weight AGB in tons for each tree, used to compute the \(y_{ij}\)’s.

Because PSU sampling is conducted with replacement under PPS, the same PSU may be selected more than once. Each selection is independent and is associated with its own second-stage sample of SSUs (and, therefore, its own set of tree measurements). As a result, some additional care is needed when organizing the data: we append a sample index to the psu_id values so that tree measurements from different selections of the same PSU are treated as distinct groups in subsequent workflows.186

The code below reads and takes a quick look at each data file.

sampled_psu <- read_csv("datasets/mass_timber/sampled_pps_psu.csv")
sampled_psu %>% glimpse()
#> Rows: 50
#> Columns: 2
#> $ psu_id <chr> "316_1", "323_1", "358_1", "420_1", "5…
#> $ ac     <dbl> 19.939, 67.284, 188.677, 128.584, 120.…
ssu_trees <- read_csv("datasets/mass_timber/sampled_ssu_pps_psu.csv")
ssu_trees %>% glimpse()
#> Rows: 16,586
#> Columns: 4
#> $ psu_id       <chr> "316_1", "316_1", "316_1", "316_…
#> $ ssu_id       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2,…
#> $ tree_count   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ agb_ton_tree <dbl> 0.00294279, 0.00183660, 0.005673…

Rename a few columns to align with our notation and creating variables needed in subsequent steps. The set of all PSUs all_psu, N, and A are the same as the previous section.

sampled_psu <- sampled_psu %>% 
  rename(A_i = ac)

# Add the tree factor for 1/10-acre plots
ssu_trees <- ssu_trees %>% mutate(TF = 10)

Given the PPS sample, we apply the mean and total estimators defined in (19.11) and (19.13), along with their associated standard errors in (19.12) and (19.14).

As with the equal probability estimators, we break the process into two steps. First, we summarize SSU information within each PSU, join to sampled_psu to add the area column A_i, and compute the selection probabilities p_i, both of which are needed for the second step. Second, we compute the required quantities across all sampled PSUs to produce the overall estimates and standard errors.

psu_summary <- ssu_trees %>%
  group_by(psu_id, ssu_id) %>%
  summarize(y_ij = sum(TF * tree_count * agb_ton_tree)) %>%
  summarize(
    y_bar_i = mean(y_ij),
    m_i = n(),
    s_sq_i = 1 / (m_i - 1) * sum((y_ij - y_bar_i)^2) #Or var(y_ij).
  ) %>%
  left_join(sampled_psu, by = join_by(psu_id)) %>%
  mutate(p_i = A_i / A)

psu_summary %>% glimpse()
#> Rows: 50
#> Columns: 6
#> $ psu_id  <chr> "1017_1", "1069_1", "1418_1", "1530_1…
#> $ y_bar_i <dbl> 27.0608, 47.5176, 51.1689, 48.8034, 3…
#> $ m_i     <int> 5, 7, 39, 10, 23, 24, 43, 43, 9, 3, 5…
#> $ s_sq_i  <dbl> 324.535, 139.027, 145.808, 215.842, 5…
#> $ A_i     <dbl> 30.787, 29.630, 216.419, 41.660, 128.…
#> $ p_i     <dbl> 0.000085126, 0.000081929, 0.000598408…
pop_summary <- psu_summary %>%
  summarize(
    n = n(),
    t_hat = 1 / n * sum(A_i * y_bar_i / p_i),
    s_t_hat = sqrt(1 / (n * (n - 1)) * 
                       sum((A_i * y_bar_i / p_i - t_hat)^2)
                     ),
    y_bar = t_hat / A,
    s_y_bar = s_t_hat / A
  )

pop_summary
#> # A tibble: 1 × 5
#>       n     t_hat s_t_hat y_bar s_y_bar
#>   <int>     <dbl>   <dbl> <dbl>   <dbl>
#> 1    50 11979201. 672744.  33.1    1.86

Our estimated mean and total, with 95% confidence intervals, were 33.12 (29.38, 36.86) tons/ac and 11,979,201 (10,627,272, 13,331,131) tons, respectively. Based on this sample, both confidence intervals captured the true values, which are \(\mu\) = 30.69 tons/ac and \(\tau\) = 11,100,216 tons.

19.6.2.3 Repeated-sampling comparison

Here we assess the performance of the two-stage ratio and PPS estimators using 1,000 samples drawn under the sampling approach described above, where we select \(n = 50\) PSUs at the first stage and then sample SSUs within each PSU using a systematic grid of fixed-area plots. For each estimator, we record point estimates, standard errors, and confidence intervals. Performance is evaluated under repeated sampling using the following summaries.

  • Mean difference — the average difference between the estimate from each sample and the true population mean (see Section 12.2.3 for the formal definition). Under repeated sampling, this quantity is an empirical estimate of bias; values closer to zero indicate that the estimator is centered near the true mean.

  • RMSE — the root mean squared error, which combines both the mean difference and sampling variability into a single measure of estimator accuracy. Smaller values indicate better overall performance.

  • Confidence interval width — the average width of the 95% confidence interval, which reflects estimator precision. Narrower intervals are preferred, provided that confidence interval coverage remains close to its nominal level.

  • Confidence interval coverage — the proportion of confidence intervals that contain the true population mean, expressed as a percentage. This measures how well the variance estimator is calibrated and should be close to the nominal value of 95%.

TABLE 19.1: Comparison of the two-stage equal-probability ratio and PPS estimators for feedstock mean aboveground biomass under repeated sampling from the Michigan feedstock population. Columns show mean difference, root mean squared error (RMSE), and 95% confidence interval width and coverage, based on 1,000 PSU samples each of size \(n\) = 50 and a SYS second-stage sample.
95% Confidence interval
PSU design Estimator Mean difference (tons/ac) RMSE (tons/ac) Width (tons/ac) Coverage (%)
Equal prob. Ratio -0.39 2.6 9.2 92
PPS PPS -0.35 1.8 7.0 94

Summaries across replicates are given in Table 19.1. The results are consistent with patterns seen earlier in this chapter for PPS designs.

The mean difference is very small for both estimators relative to the population mean. Under equal-probability selection, the mean difference is -0.39 tons/ac, while under PPS it’s -0.35 tons/ac. At this scale, these values are negligible, indicating that both estimators are effectively centered on the true mean for this problem. For PPS, any residual departure from zero is expected to diminish as the number of repeated samples increases.

The primary differences between the estimators appear in accuracy and precision. The RMSE decreases from 2.6 tons/ac under equal-probability sampling to 1.8 tons/ac under PPS. This reduction reflects the efficiency gains achieved by selecting PSUs with probability proportional to area, which concentrates sampling effort on stands that contribute more to the population total. The average 95% confidence interval width declines from 9.2 tons/ac under equal-probability sampling to 7 tons/ac under PPS, indicating higher precision under the PPS design.

Confidence interval coverage is closer to the nominal 95% level under PPS as well, although both estimators achieve rates near the target. This suggests that uncertainty is reasonably well characterized in both cases, with PPS performing slightly better.

Overall, while both estimators perform well, PPS provides clear advantages in accuracy, precision, and interval performance. These results reinforce the role of PPS as an effective design choice for two-stage inventories when reliable auxiliary information is available at the PSU level.

19.7 Advantages and disadvantages of two-stage sampling

Two-stage sampling is a flexible design that balances statistical efficiency with operational practicality. It’s especially useful when the population is naturally grouped into larger units (PSUs) that can be subdivided into smaller units (SSUs).

19.7.1 Advantages

  • Fieldwork can be concentrated within a manageable set of PSUs, reducing travel time and logistical complexity.

  • The first-stage design is flexible: PSUs can be selected with equal probability or with PPS when reliable auxiliary size information is available.

  • When SSUs within a PSU are relatively homogeneous, measuring only a subset of SSUs can yield results similar to single-stage cluster sampling, while avoiding the cost of measuring all SSUs. This can lead to substantial gains in efficiency with little loss in precision.

  • Well suited for inventories over large, heterogeneous regions where direct sampling of all SSUs is impractical.

19.7.2 Disadvantages

  • If within-PSU variability is high and only a small number of SSUs are measured within each PSU, precision can be lower than for single-stage designs in which all SSUs are measured.

  • When PSU sizes vary widely and the parameter of interest is strongly related to size, equal-probability PSU selection can be inefficient, producing larger standard errors than ratio or PPS designs.

19.8 Exercises

19.8.1 Two-stage sampling with equal-probability PSUs

In this exercise, you’ll work through the two-stage unbiased and ratio estimators in Section 19.4.1 using a small simulated dataset. The goal is to practice building a two-stage estimation workflow and to compare the unbiased and ratio estimators in a simple setting.

The two-stage sample is stored in the following files.

  • datasets/two_stage_sim_sampled_psu.csv contains the PSUs selected in the first stage using SRS. Each row is a sampled PSU, with columns:
    • psu_id PSU identifier;
    • M_i number of SSUs in that PSU.
  • datasets/two_stage_sim_sampled_ssu.csv contains the SSUs selected in the second stage from each sampled PSU. Each row is an observed SSU, with columns:
    • psu_id PSU identifier;
    • ssu_id SSU identifier within PSU;
    • y_ij measured SSU values.

For this population, the number of PSUs is \(N\) = 30 and the total number of SSUs is \(M\) = 1,510.

The code below reads the two-stage sample and shows the file structure.

sampled_psu <- read_csv("datasets/two_stage_sim_sampled_psu.csv")
ssu_sample <- read_csv("datasets/two_stage_sim_sampled_ssu.csv")

sampled_psu %>% glimpse()
#> Rows: 8
#> Columns: 2
#> $ psu_id <dbl> 17, 10, 27, 28, 19, 2, 21, 8
#> $ M_i    <dbl> 73, 37, 65, 56, 29, 23, 28, 62
ssu_sample %>% glimpse()
#> Rows: 80
#> Columns: 3
#> $ psu_id <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 8, 8, 8,…
#> $ ssu_id <dbl> 16, 15, 19, 13, 1, 23, 10, 11, 18, 12,…
#> $ y_ij   <dbl> 13.8246, 18.9466, 18.4690, 11.6913, 16…

Exercise 19.1 Using ssu_sample, summarize SSU measurements within each sampled PSU to compute:

  • \(m_i\) the number of sampled SSUs in PSU \(i\);
  • \(\bar{y}_i\) the sample mean within PSU \(i\);
  • \(s_i^2\) the sample variance within PSU \(i\).

Call your summarized output psu_summary. Then join the PSU sizes M_i from sampled_psu to psu_summary. Your resulting psu_summary should match ours below.

psu_summary %>% glimpse()
#> Rows: 8
#> Columns: 5
#> $ psu_id  <dbl> 2, 8, 10, 17, 19, 21, 27, 28
#> $ m_i     <int> 10, 10, 10, 10, 10, 10, 10, 10
#> $ y_bar_i <dbl> 12.967, 13.396, 14.635, 14.696, 10.66…
#> $ s_sq_i  <dbl> 14.3316, 18.4955, 88.2193, 6.7483, 5.…
#> $ M_i     <dbl> 23, 62, 37, 73, 29, 28, 65, 56

Exercise 19.2 Using (19.1), (19.2), (19.3), and (19.4), compute the two-stage unbiased estimators for the population total and mean, along with their standard errors and 95% confidence intervals.

Store your results in a tibble called unb_ests.

unb_ests %>% glimpse()
#> Rows: 1
#> Columns: 8
#> $ t_hat       <dbl> 18600
#> $ s_t_hat     <dbl> 2631.4
#> $ y_bar       <dbl> 12.318
#> $ s_y_bar     <dbl> 1.7426
#> $ y_bar_lower <dbl> 8.1973
#> $ y_bar_upper <dbl> 16.439
#> $ t_hat_lower <dbl> 12378
#> $ t_hat_upper <dbl> 24822

Exercise 19.3 Using (19.7), (19.8), (19.9), and (19.10), compute the two-stage ratio estimators for the population mean and total, along with their standard errors and 95% confidence intervals.

Store your results in a tibble called ratio_ests.

ratio_ests %>% glimpse()
#> Rows: 1
#> Columns: 8
#> $ y_bar       <dbl> 13.298
#> $ s_y_bar     <dbl> 0.39009
#> $ t_hat       <dbl> 20079
#> $ s_t_hat     <dbl> 589.03
#> $ y_bar_lower <dbl> 12.375
#> $ y_bar_upper <dbl> 14.22
#> $ t_hat_lower <dbl> 18687
#> $ t_hat_upper <dbl> 21472

Exercise 19.4 The data used for this exercise are simulated, so we know the population values:

  • \(\mu\) = 13.381;
  • \(\tau\) = 20,206.

Given your estimates from the unbiased and ratio estimators, consider the following questions.

  1. Which estimator is closer to the truth for this sample?
  2. Which estimator provides the more precise estimate for this sample (i.e., has the smaller standard error or narrower confidence interval)?
  3. Why is it not appropriate to infer something about an estimator’s overall performance based on a single sample?

19.8.2 Mass timber feedstock reanalysis

In Section 19.6.2.2, we used two-stage PPS to estimate the population mean and total AGB for the feedstock analysis. In this exercise, you’ll repeat that analysis using a different PPS sample of PSUs (i.e., different but equally likely). The goal is to reinforce the two-stage PPS estimation workflow and highlight some important features of PPS sampling.

As described in Section 19.6.2, the population is comprised of \(N\) = 8,927 stands with a total population area of \(A\) = 361,658 acres.

The following files are used in this exercise:

  • datasets/mass_timber/all_psu.csv contains the full PSU population and PSU areas (acres).
  • datasets/mass_timber/sampled_pps_psu_exercise.csv contains a PPS sample of PSUs drawn with replacement and per-draw selection probability \(p_i = A_i / A\).
  • datasets/mass_timber/sampled_ssu_pps_psu_exercise.csv contains the SSUs (inventory plot) measured within each selected PSU.

To get things started, we read in the data and rename some columns to align with the notation in Section 19.5.

all_psu <- read_csv("datasets/mass_timber/all_psu.csv")
sampled_psu <- read_csv("datasets/mass_timber/sampled_pps_psu_exercise.csv")
ssu_trees <- read_csv("datasets/mass_timber/sampled_ssu_pps_psu_exercise.csv")

all_psu <- all_psu %>% rename(A_i = ac)
sampled_psu <- sampled_psu %>% rename(A_i = ac)

Exercise 19.5 Because PSUs are sampled with replacement under PPS, the same PSU can be selected more than once. This happens in your current sample—in fact, one PSU is selected three times. Recall from Section 19.6.2.2 that we appended a sample index to the psu_id values in sampled_psu and ssu_trees so that measurements from different selections of the same PSU can be treated as distinct samples.

When you inspect sampled_psu, you’ll see that PSU ID 3718 was selected three times (e.g., run print(sampled_psu, n = Inf); sampled_psu %>% filter(str_detect(psu_id, "3718"))). To better understand why this PSU was selected so often, use its acreage to compute its per-draw selection probability. Then, using all_psu, create a histogram (or another plot) showing either the distribution of PSU areas or their per-draw selection probabilities. Where does PSU ID 3718 fall in this distribution (approximately—just describe its position)?

Exercise 19.6 Now start into the estimation workflow presented in Section 19.6.2.2. Begin by adding the tree factor for the 1/10-acre plots to ssu_trees, then form psu_summary, join the stand areas from sampled_psu, and create a new column that holds the selection probabilities \(p_i = A_i / A\).

Your psu_summary should look like the following.

psu_summary %>% glimpse()
#> Rows: 50
#> Columns: 6
#> $ psu_id  <chr> "1069_1", "148_1", "1670_1", "1712_1"…
#> $ y_bar_i <dbl> 50.95682, 28.77426, 39.54716, 0.48815…
#> $ m_i     <int> 5, 46, 122, 9, 64, 30, 45, 36, 28, 17…
#> $ s_sq_i  <dbl> 34.905752, 190.906975, 202.657507, 0.…
#> $ A_i     <dbl> 29.630, 243.296, 606.799, 54.829, 306…
#> $ p_i     <dbl> 0.000081929, 0.000672723, 0.001677823…

Exercise 19.7 Continuing with the workflow in Section 19.6.2.2, use psu_summary to form pop_summary, implementing (19.11), (19.12), (19.13), and (19.14) to compute:

  • the PPS estimators for the population total and mean;
  • their standard errors;
  • 95% confidence intervals for both quantities.

Store your results in pop_summary, which should match ours below.

Compare your estimates and confidence intervals to the true population values \(\mu\) = 30.69 tons/ac and \(\tau\) = 11,100,216 tons. Do the 95% confidence intervals from this PPS sample cover the true mean and total?

pop_summary %>% glimpse()
#> Rows: 1
#> Columns: 9
#> $ n           <int> 50
#> $ t_hat       <dbl> 12151220
#> $ s_t_hat     <dbl> 606451
#> $ y_bar       <dbl> 33.599
#> $ s_y_bar     <dbl> 1.6769
#> $ y_bar_lower <dbl> 30.229
#> $ y_bar_upper <dbl> 36.968
#> $ t_hat_lower <dbl> 10932512
#> $ t_hat_upper <dbl> 13369928

References

Cochran, William G. 1977. Sampling Techniques. Third Edition. John Wiley.
Giraud, Timothée. 2022. osrm: Interface Between R and the OpenStreetMap-Based Routing Service OSRM.” Journal of Open Source Software 7 (78): 4574.
Lohr, Sharon L. 2019. Sampling: Design and Analysis. Chapman & Hall/CRC Texts in Statistical Science. CRC Press.
Särndal, C. E., B. Swensson, and J. H. Wretman. 1992. Model Assisted Survey Sampling. Springer Series in Statistics. Springer-Verlag.
Thompson, Steven K. 2012. Sampling. Third Edition. Wiley Series in Probability and Statistics. John Wiley & Sons.

  1. We could imagine a potentially even more efficient design that applies 3P sampling (Section 17.6) to estimate mean carbon within each selected park.↩︎

  2. This type of road-based travel-time polygon can be computed in R using the osrm package, which provides an interface to the Open Source Routing Machine. The package supports travel-time and routing calculations using OpenStreetMap’s road network data. Other tools and packages are also available for generating isochrones, depending on your needs and data sources (Giraud 2022).↩︎

  3. For this particular PSU sample, there happen to be no repeat selections, so all psu_id values have a _1 suffix. This is not true for all samples used in the repeated-sampling analysis in Section 19.6.2.3.↩︎

Want to know when the book is for sale? Enter your email so we can let you know.