Chapter 17 Cluster sampling
Cluster sampling is commonly used when it’s impractical or costly to sample widely dispersed locations. In this design, the primary sampling unit (PSU) is a cluster of smaller secondary sampling units (SSUs), where measurements are taken. SSUs are nested within PSUs, introducing additional design considerations compared to the sampling designs introduced earlier.
Cluster sampling is particularly useful when:
- the cost of reaching widely dispersed sampling locations is high;
- operational constraints require grouping sampling locations for efficient data collection;
- measurement of a sampling location is inexpensive relative to travel costs;
- or the sampling frame is only available at the PSU level.
These conditions are common in forest inventory, where measurement costs are low relative to the cost of moving a field crew. Cluster designs reduce travel time by grouping nearby SSUs within the same PSU, allowing crews to measure multiple locations with minimal movement. This approach is especially useful when sampling large areas. For example, cluster designs are widely used in NFI programs, where travel costs are high and field crews must cover broad regions efficiently.
The tradeoff, however, is a potential loss of statistical efficiency. Because of spatial autocorrelation—the tendency for observations located close together to be more similar than those farther apart—SSUs in the same PSU often exhibit less variability than single locations sampled independently across space, such as in the SYS designs shown in Section 14. As a result, cluster sampling typically produces less precise estimates than a SRS with the same number of SSUs. This is a classic cost-precision tradeoff: cluster sampling reduces field effort but can increase sampling error.
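To see the cost of within-cluster similarity concretely, the short simulation below builds a hypothetical population with strong within-PSU correlation (the numbers are invented and unrelated to the chapter’s datasets) and compares the sample-to-sample variability of a cluster-sample mean with that of an SRS using the same total number of SSUs.

library(dplyr)

set.seed(42)

# Hypothetical population: 500 PSUs of 10 SSUs each; a shared PSU effect
# induces within-PSU correlation, mimicking spatial autocorrelation.
pop <- tibble(psu = rep(1:500, each = 10)) %>%
  group_by(psu) %>%
  mutate(y = rnorm(1, mean = 100, sd = 15) + rnorm(n(), sd = 5)) %>%
  ungroup()

# Repeated sampling: 20 whole PSUs (200 SSUs) versus an SRS of 200 SSUs.
cluster_means <- replicate(1000, {
  psus <- sample(unique(pop$psu), 20)
  mean(pop$y[pop$psu %in% psus])
})
srs_means <- replicate(1000, mean(sample(pop$y, 200)))

sd(cluster_means)  # Larger: SSUs within a PSU carry redundant information.
sd(srs_means)      # Smaller: all 200 SSUs are independent draws.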
17.1 Clusters, strata, and systematic sampling
Although clusters (PSUs) and strata both divide the population into groups, their motivation differs. In stratified sampling, the population is partitioned into non-overlapping subgroups (strata) using auxiliary variables, and sampling is conducted independently within each stratum. The goal is to increase precision by concentrating sampling effort within relatively homogeneous subgroups.
In cluster sampling, the goal is not to improve statistical efficiency directly, but to simplify or reduce the cost of data collection. PSUs are often based on geographic proximity or field logistics. A subset of PSUs is selected (e.g., via SRS, SYS, or another probability sampling method), and measurements are collected from all or some SSUs within each selected PSU.
The contrast can be summarized as follows.
- Stratification aims to reduce within-group variability and improve precision.
- Clustering groups nearby SSUs into PSUs to reduce travel effort, but because those SSUs tend to be similar, precision is often lower than if the same number of SSUs were sampled independently.
Both approaches are valid and often used together—for example, in stratified cluster sampling, where PSUs are selected independently within each stratum.
SYS, as described in Section 14, can be viewed conceptually as a form of cluster sampling with a single PSU consisting of \(n\) evenly spaced SSUs. This reinforces why SYS doesn’t have a design-based variance estimator (Section 14.5)—with only one randomly selected PSU, between-PSU variation cannot be estimated.
17.2 Elements of cluster sampling design
When implementing a cluster sampling design, several key elements must be considered.
Cluster layout and spacing. PSUs may be naturally defined or imposed by the analyst and selected using a probability sampling method. Spacing affects both cost and precision: widely spaced PSUs reduce redundancy due to spatial autocorrelation but increase travel effort, while closely spaced PSUs simplify fieldwork but provide less independent information. Under an areal sampling frame, PSUs act as anchor points around which SSUs are arranged.
Cluster shape and SSU configuration. If not naturally defined, SSUs are arranged in a fixed geometric pattern around an anchor point. This configuration shapes how local variability is represented, as well as field efficiency and measurement consistency. SSU size can be scaled to efficiently measure different variables, and nested configurations (e.g., for regeneration sampling) can be used when needed.
Number of clusters and SSUs per cluster. In a finite-population setting, each PSU contains a fixed and known set of SSUs. Under an areal sampling frame, the number, size, and arrangement of SSUs are part of the design. In either case, increasing the number of PSUs generally improves precision more than increasing the number of SSUs per PSU, since PSUs tend to provide more independent information.
Method of selecting clusters. PSUs can be selected using any valid probability sampling method. For finite populations, SRS and SYS are common. When auxiliary information is available, stratified sampling or PPS sampling may improve precision while maintaining logistical feasibility. Approaches for selecting PSUs under an areal sampling frame are discussed in Section 17.5.
These elements interact to determine the balance between statistical precision and operational cost. In forest inventory, the goal is often to design a layout that captures sufficient spatial variability while remaining cost-effective and logistically manageable.
17.3 Choosing the number of PSUs and SSUs
Unlike stratified sampling, where well-established formulas exist for allocating sample sizes across strata, sample size planning for clusters is more heuristic and context-specific. The main tradeoff is between the number of PSUs and the number of SSUs within each PSU. Adding SSUs improves precision only up to a point—when observations are spatially correlated, the marginal benefit of additional SSUs declines. Increasing the number of PSUs is usually more effective for reducing sampling error.
In most forest inventory settings, these decisions are shaped by cost and logistics. Analytical expressions for optimal allocation exist but require estimates of within- and between-PSU variability, which are rarely available before sampling begins. Some survey efforts use simulation to explore tradeoffs under realistic spatial and operational constraints. Additional discussion can be found in Freese (1962), S. L. Lohr (2019) Section 5.5, Cochran (1977) Section 9.6, and Brus (2022) Chapter 6.
There are many variations of cluster sampling. This section focuses on single-stage cluster sampling, where all SSUs within each selected PSU are measured. When observing every SSU is impractical, a subset of SSUs is sampled instead—a design known as two-stage (or multistage) sampling, covered in Section 18.
17.4 Estimation using cluster sampling
Here, we focus on single-stage cluster sampling, in which every SSU within a selected PSU is measured, and we introduce estimators for both equal- and unequal-probability selection of PSUs.
We use the following notation:
- \(N\) number of PSUs in the population;
- \(m_i\) number of SSUs in the \(i\)-th PSU;
- \(M = \sum_{i=1}^N m_i\) number of SSUs in the population;
- \(\bar{M} = M/N\) population mean number of SSUs per PSU;
- \(n\) number of PSUs in the sample;
- \(\bar{m} = \frac{1}{n}\sum_{i=1}^n m_i\) mean number of SSUs in the sampled PSUs;
- \(y_{ij}\) value of the \(j\)-th SSU in the \(i\)-th PSU;
- \(y_i = \sum_{j=1}^{m_i} y_{ij}\) total for PSU \(i\).
This notation and the estimators that follow assume a finite population with a fixed set of \(N\) PSUs (and their associated SSUs). Section 17.5 describes how closely related estimators are used when working from an areal sampling frame.
17.4.1 Sampling PSUs with equal probability
17.4.1.1 Unbiased estimators
An unbiased estimator for the population total \(\tau\) is \[\begin{equation} \hat{t} = \frac{N}{n}\sum_{i=1}^n y_i. \tag{17.1} \end{equation}\]
The corresponding unbiased estimator for the population mean \(\mu\) is \[\begin{equation} \bar{y} = \frac{\hat{t}}{M}. \tag{17.2} \end{equation}\]
The sample variance among PSU totals is \[\begin{equation} s^2 = \frac{1}{n - 1}\sum_{i=1}^n\left(y_i - \frac{\hat{t}}{N}\right)^2 \tag{17.3} \end{equation}\] and the standard error of the estimated total is \[\begin{equation} s_{\hat{t}} = N\sqrt{\left(1 - \frac{n}{N}\right)\frac{s^2}{n}}. \tag{17.4} \end{equation}\]
The corresponding standard error of the mean is \[\begin{equation} s_{\bar{y}} = \frac{s_{\hat{t}}}{M}. \tag{17.5} \end{equation}\]
Confidence intervals for the population mean and total are \[\begin{equation} \bar{y} \pm \tval_{\text{df},1-\frac{\alpha}{2}}\, s_{\bar{y}} \tag{17.6} \end{equation}\] and \[\begin{equation} \hat{t} \pm \tval_{\text{df},1-\frac{\alpha}{2}}\, s_{\hat{t}}. \tag{17.7} \end{equation}\] Here, \(\text{df} = n - 1\), reflecting that PSUs—not SSUs—are the independent sampling units.
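As a quick numerical illustration of (17.1) through (17.7), the sketch below applies the unbiased estimators to a small hypothetical sample of PSU totals; all values are invented.

# Hypothetical setting: N = 40 PSUs and M = 1,200 SSUs in the population;
# n = 5 sampled PSUs with totals y_i.
N <- 40; M <- 1200; n <- 5
y_i <- c(12.4, 8.1, 15.0, 9.6, 11.3)

t_hat <- N / n * sum(y_i)                      # (17.1)
y_bar <- t_hat / M                             # (17.2)
s_sq <- sum((y_i - t_hat / N)^2) / (n - 1)     # (17.3)
s_t_hat <- N * sqrt((1 - n / N) * s_sq / n)    # (17.4)
s_y_bar <- s_t_hat / M                         # (17.5)
t_crit <- qt(1 - 0.05 / 2, df = n - 1)
c(t_hat - t_crit * s_t_hat,
  t_hat + t_crit * s_t_hat)                    # 95% CI for the total (17.7)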
17.4.1.2 Ratio estimators
Using (17.2) requires knowledge of \(M\) (the total number of SSUs). If \(M\) is unknown, the ratio estimator below can be used. Even when \(M\) is known, the ratio estimator often provides more stable or efficient estimates than the unbiased estimator.
The cluster ratio estimator for the population mean is \[\begin{equation} \bar{y} = \frac{\sum_{i=1}^n y_i}{\sum_{i=1}^n m_i}. \tag{17.8} \end{equation}\]
The standard error of \(\bar{y}\) is \[\begin{equation} s_{\bar{y}} = \sqrt{\left(1 - \frac{n}{N}\right)\frac{1}{n \bar{M}^2} \frac{\sum_{i=1}^n (y_i - \bar{y} m_i)^2}{n - 1}}, \tag{17.9} \end{equation}\] where \(\bar{M}\) is replaced by \(\bar{m}\) when \(\bar{M}\) is unknown.
The total is estimated as \[\begin{equation} \hat{t} = M\bar{y}, \tag{17.10} \end{equation}\] with standard error \[\begin{equation} s_{\hat{t}} = Ms_{\bar{y}}. \tag{17.11} \end{equation}\]
Confidence intervals for the mean and total follow (17.6) and (17.7), respectively.
17.4.2 Sampling PSUs with probability proportional to size
When auxiliary information describing cluster size is available prior to sampling, efficiency can be improved by selecting PSUs with probability proportional to size (PPS). Here, “size” refers to any nonnegative quantity correlated with the PSU total.173
We assume the following design elements.
Each PSU has a known per-draw selection probability \(p_i\), with \(p_i \propto x_i\) for some nonnegative size measure \(x_i\) (where \(\propto\) means “proportional to”), and \(\sum_{i=1}^N p_i = 1\). A common choice is \(x_i = m_i\), giving \(p_i = m_i / M\) where \(M = \sum_{i=1}^N m_i\).
Sampling is conducted with replacement, producing \(n\) independent draws. If a PSU is selected more than once, a new PSU-level measurement \(y_i\) is obtained. Because PSUs are selected with replacement, the draws are independent and no FPC is applied.
The estimator for the population total is \[\begin{equation} \hat{t} = \frac{1}{n} \sum_{i=1}^n \frac{y_i}{p_i}. \tag{17.12} \end{equation}\] The sum is taken over the \(n\) PSUs selected with replacement. Again, if the same PSU appears multiple times, its contributions are treated as independent terms.
The population mean is \[\begin{equation} \bar{y} = \frac{\hat{t}}{M}. \tag{17.13} \end{equation}\]
The standard error is \[\begin{equation} s_{\hat{t}} = \sqrt{\frac{1}{n(n - 1)} \sum_{i=1}^n \left( \frac{y_i}{p_i} - \hat{t} \right)^2 }. \tag{17.14} \end{equation}\]
Given \(s_{\hat{t}}\), the standard error for the mean is \[\begin{equation} s_{\bar{y}} = \frac{s_{\hat{t}}}{M}. \tag{17.15} \end{equation}\]
Confidence intervals for the mean and total follow (17.6) and (17.7), respectively.
This estimator is unbiased for both the total and the mean, and the associated variance estimators are unbiased as well. PPS sampling is particularly useful when cluster sizes vary substantially and are correlated with the parameter of interest.
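A similar numerical sketch of (17.12) through (17.15), again with invented values, is:

# Hypothetical PPS-with-replacement sample: n = 4 draws, per-draw
# probabilities p_i, and PSU totals y_i for the drawn PSUs.
p_i <- c(0.10, 0.25, 0.05, 0.20)
y_i <- c(30, 82, 14, 61)
n <- length(y_i)

t_hat <- mean(y_i / p_i)                                     # (17.12)
s_t_hat <- sqrt(sum((y_i / p_i - t_hat)^2) / (n * (n - 1)))  # (17.14)
c(t_hat - qt(0.975, df = n - 1) * s_t_hat,
  t_hat + qt(0.975, df = n - 1) * s_t_hat)                   # 95% CI (17.7)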
17.5 Cluster sampling under an areal sampling frame
Cluster sampling under an areal sampling frame arises most often in large-area forest inventories, particularly NFIs and other long-term monitoring programs. In these settings, PSUs are located over a continuous geographic region using probability-based spatial designs—most commonly SYS or spatially balanced variants—and each PSU anchors a fixed configuration of SSUs, such as fixed-area plots or point samples.
SSUs within a cluster follow a simple, repeatable geometric layout relative to the PSU anchor. Common configurations include squares, rectangles, lines, or L-shaped arrangements. These layouts are often chosen for field efficiency, interpretability, and consistency over time. However, in some NFIs the orientation, spacing, and overall extent of the SSU configuration are also informed by anticipated environmental gradients and spatial autocorrelation, with the goal of improving information capture while remaining operationally efficient; see, e.g., the Swedish and Finnish designs described in Tomppo et al. (2010). Once defined, the SSU configuration is treated as part of the sampling design and applied uniformly across all PSUs.
This setting differs fundamentally from the finite-population cluster framework introduced earlier. PSUs aren’t selected from a finite list, and SSUs don’t correspond to a fixed population of elements. Instead, inference targets population means or totals per unit area over a geographic region of known area \(A\). Finite-population quantities such as \(N\) and \(M\) don’t appear explicitly, and finite population corrections aren’t used.
Because cluster layouts are fixed in space, SSUs inevitably interact with population and stratum boundaries. Operational NFIs address these boundary effects using a variety of design-based strategies that emphasize robustness and operational simplicity rather than exact local geometric correction. A useful conceptual contrast is between geometry-explicit inclusion-zone methods (e.g., Valentine et al. (2006) or Gregoire and Valentine (2008), Section 7.5.6) and the ratio-based estimators used in many operational inventories, including those employed by several NFIs (Tomppo et al. 2010). These approaches typically avoid plot-level boundary corrections (e.g., mirage or walkthrough methods) by handling boundary effects implicitly through design choices and estimation. An example of one such NFI approach is introduced in the next section and illustrated in Section 17.7.2.
17.5.1 Ratio estimation using SSU centers
Here, we describe a ratio estimator used in several operational NFIs, described by Tomppo (2006) (see Section 11.3.3.3), and consistent with the ratio-estimation framework introduced in Section 16.2.
A key requirement for this estimator is that PSUs are selected over a region that extends beyond the population (or stratum) boundary by a sufficient buffer distance. The buffer must be at least as wide as the reach of the SSU configuration, so that any cluster whose SSUs could fall inside the population is eligible for selection; equivalently, a cluster anchored at the outer edge of the buffered region lies entirely outside the population boundary. In other words, PSU locations are generated over a spatial region larger than the population of inferential interest, while estimation is restricted to the population itself. This design allows SSU center indicators to provide a consistent measure of sampling effort near boundaries and eliminates the need for explicit geometric boundary corrections (e.g., mirage or walkthrough methods). The role of this buffer, and its interaction with population boundaries, is illustrated in Section 17.7.2.
Each PSU contains a fixed configuration of SSUs determined by the sampling design. Tree-level expansion is defined at the SSU level, exactly as in plot and point sampling (Sections 12.3.1 and 12.4.1). Expansion factors aren’t modified for proximity to population or stratum boundaries.
The population mean per unit area is estimated using the cluster ratio estimator \[\begin{equation} \bar{y} = \frac{\sum_{i=1}^n y_i}{\sum_{i=1}^n z_i}, \tag{17.16} \end{equation}\] where \(n\) is the number of sampled PSUs. This is a ratio-of-means estimator in the sense of Section 16.2.1174, with PSU-level contributions aggregated across clusters and normalized by the total number of SSU centers falling within the population or stratum boundary.
The PSU-level numerator in (17.16) is \[\begin{equation} y_i = \sum_{j=1}^{m_i} y_{ij}, \tag{17.17} \end{equation}\] where \(m_i\) is the number of SSUs in PSU \(i\), as defined by the cluster layout, and \(y_{ij}\) is the per-unit-area contribution from SSU \(j\) in PSU \(i\) assigned to the population or stratum of interest. This quantity is computed as the sum of expanded tree-level measurements for trees that are both measured on SSU \(j\) and belong to the population or stratum, based on each tree’s location or condition.
Tree expansion factors are defined in the same way as in plot or point sampling and reflect each tree’s inclusion probability under the SSU design. Trees that fall outside the population or stratum do not contribute to \(y_{ij}\), even if they’re measured on the SSU.
The PSU-level denominator in (17.16) is \[\begin{equation} z_i = \sum_{j=1}^{m_i} z_{ij}, \tag{17.18} \end{equation}\] where \(z_{ij}\) is an SSU-level center indicator \[\begin{equation*} z_{ij} = \begin{cases} 1, & \text{if the center of SSU $j$ lies within the population (or stratum),} \\ 0, & \text{otherwise.} \end{cases} \end{equation*}\]
Trees may contribute to the numerator even when they occur on SSUs whose centers lie outside the population or stratum, while the denominator is defined entirely by center-point inclusion.
The standard error of \(\bar{y}\) is estimated as \[\begin{equation} s_{\bar{y}} = \sqrt{\frac{1}{n \bar{z}^2} \frac{\sum_{i=1}^n (y_i - \bar{y} z_i)^2}{n - 1}}, \tag{17.19} \end{equation}\] with \(\bar{z} = \frac{1}{n}\sum_{i=1}^n z_i\).
The population total is estimated as \[\begin{equation} \hat{t} = A \bar{y}, \tag{17.20} \end{equation}\] with standard error \[\begin{equation} s_{\hat{t}} = A s_{\bar{y}}. \tag{17.21} \end{equation}\]
The independent sampling units here are PSUs, not SSUs. All variability is assessed at the PSU level, and the ratio form of the estimator is essential whenever the number of SSU centers contributing to the denominator varies across PSUs, as commonly occurs near population or stratum boundaries.
This estimator provides a practical and elegant solution to boundary problems in large-area cluster sampling. Rather than redefining plots or performing explicit geometric corrections, boundary effects are handled implicitly by assigning expanded tree-level contributions to computation units and using center-point counts to represent sampling effort. Under systematic or spatially balanced sampling, over- and under-coverage near boundaries cancel in expectation, yielding unbiased estimates for large areas.
Two limitations are worth noting. First, because plots aren’t subdivided, per-plot means within a population or stratum aren’t available for divided plots. Second, for very small strata, it’s possible for trees to contribute to the numerator while no SSU centers fall within the stratum, leading to an undefined mean. These situations are rarely consequential in routine NFI applications focused on large-area reporting but matter for small-area estimation.
The Finnish NFI implements (17.16) using the SSU center indicator \(z_{ij}\), a simple binary measure of whether each SSU center falls inside the population or stratum. This approach is easy for field crews to apply and keeps data collection efficient, but it carries the limitations noted above. The Swedish NFI uses the same ratio-estimation framework, but defines the denominator as the area of each SSU that falls within the population or stratum (Fridman et al. 2014). While this requires additional effort to determine partial SSU areas, it avoids zero-denominator cases and supports plot-level means for divided plots. Together, these two approaches highlight different, defensible tradeoffs between operational simplicity and statistical detail.
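To make the Swedish-style denominator concrete, the sketch below computes partial SSU areas with sf. The objects ssu_polys (SSU footprint polygons with psu_id and ssu_id columns) and stratum_poly (the stratum boundary) are hypothetical, and a projected CRS in meters is assumed. With area-based \(z\) values, the numerator would be built from unexpanded SSU totals so the ratio remains in per-unit-area terms.

library(sf)
library(dplyr)

# Partial area of each SSU footprint that falls inside the stratum.
# SSUs that don't intersect the stratum drop out here; they would
# contribute z = 0.
ssu_in_stratum <- st_intersection(ssu_polys, stratum_poly)

ssu_partial_area <- ssu_in_stratum %>%
  mutate(z_ij = as.numeric(st_area(ssu_in_stratum)) / 1e4) %>%  # m^2 to ha
  st_drop_geometry() %>%
  select(psu_id, ssu_id, z_ij)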
17.6 Stratified cluster sampling
Stratified cluster sampling extends the ideas introduced in stratified sampling (Section 15.2) to settings where clusters are used as the primary sampling units. The population is first divided into \(H\) non-overlapping strata, and a cluster sampling design is then applied independently within each stratum.
Conceptually, the stratification step is unchanged. The key difference is that, within each stratum, estimation is based on PSU-level quantities rather than individual observations. As a result, the stratum-level inputs to the stratified estimators are cluster-based means, totals, and standard errors.
Within each stratum \(h = 1, \ldots, H\):
- select \(n_h\) PSUs using an appropriate probability sampling method (e.g., SRS, SYS, or PPS);
- within each sampled PSU, measure all SSUs and compute PSU-level quantities (e.g., \(y_{hi}\) and any associated denominator terms);
- use the appropriate cluster estimator to obtain a stratum-level mean \(\bar{y}_h\) and its standard error \(s_{\bar{y}_h}\).
The choice of estimator in Step 3 depends on the sampling frame. For finite populations, unbiased, ratio, or PPS cluster estimators may be used (Sections 17.4.1.1, 17.4.1.2, and 17.4.2). Under an areal sampling frame, ratio estimators based on SSU center inclusion or related proportional-weighting approaches are typically used (Section 17.5).
Once stratum-level estimates are available, they are combined exactly as in standard stratified sampling. The overall mean is \[\begin{equation} \bar{y} = \frac{\sum_{h = 1}^H M_h \bar{y}_h}{M}, \tag{17.22} \end{equation}\] where \(M_h\) is the size of stratum \(h\) and \(M = \sum_{h = 1}^H M_h\).
In a finite-population setting, \(M_h\) and \(M\) correspond to counts of SSUs (or other appropriate size measures). Under an areal sampling frame, they correspond to stratum area \(A_h\) and total area \(A\), respectively (see Section 15.5). In practice, this amounts to replacing counts with areas while retaining the same estimator structure.
The standard error of the stratified mean is \[\begin{equation} s_{\bar{y}} = \sqrt{\frac{\sum_{h = 1}^H M_h^2 s_{\bar{y}_h}^2}{M^2}}, \tag{17.23} \end{equation}\] which follows directly from combining independent stratum-level estimates.
If stratum-level totals \(\hat{t}_h\) are estimated directly, the overall total and its standard error are \[\begin{equation} \hat{t} = \sum_{h = 1}^H \hat{t}_h \tag{17.24} \end{equation}\] and \[\begin{equation} s_{\hat{t}} = \sqrt{\sum_{h = 1}^H s_{\hat{t}_h}^2}. \tag{17.25} \end{equation}\] Confidence intervals for stratified cluster estimators follow the same form as in Section 15.2, with degrees of freedom \[\begin{equation} \text{df} = n - H, \end{equation}\] where \(n = \sum_{h = 1}^H n_h\) is the total number of sampled PSUs. This reflects that PSUs—not SSUs—are the independent sampling units.
17.7 Illustrative analyses
The following two examples demonstrate how the cluster sampling estimators introduced earlier behave in practice. To facilitate performance comparisons, both examples use datasets for which a complete census is available, allowing us to assess bias, precision, and interval coverage under repeated sampling.
The first example uses a Portland, Oregon park tree inventory to estimate total tree carbon across the city’s parks. This is a finite-population setting, and we examine the unbiased, ratio, and PPS estimators. The second example revisits the Harvard Forest dataset and illustrates the ratio estimator under an areal (infinite-population) sampling frame using fixed-area subplots arranged in PSU-level clusters.
These examples highlight implementation using efficient tidyverse workflows and show how estimator behavior differs under realistic forest inventory conditions.
17.7.1 Portland park tree inventory
The Portland Parks and Recreation Urban Forestry division undertook a major effort to collect a complete census of all trees in the city’s parks and streets, producing one of the most detailed municipal inventories in the US (DiSalvo, Fukuda, and Ramsey 2017).175 Here we focus on the park portion of that census. Each park was surveyed to record tree locations, species, and structural measurements. A primary objective of the census was to estimate environmental benefits for the city’s trees, including carbon storage. To do so, tree-level allometric equations from the i-Tree software suite (Nowak 2021) were used to derive carbon estimates.
For our analysis, we treat this census as the underlying population from which samples could be drawn. Mapping back to the notation of Section 17.4, there are \(N\) = 177 parks (PSUs) containing a total of \(M\) = 25,257 trees (SSUs). The true total carbon is \(\tau\) = 26,349 tons (1 ton = 2,000 pounds). Each park \(i\) contains \(m_i\) trees, each with a carbon value \(y_{ij}\).
FIGURE 17.1: Portland parks cluster sampling example with parks as primary sampling units (PSUs) and trees as secondary sampling units (SSUs). PSU selection uses probability proportional to size (PPS) based on the number of trees in each park. Close-up views of the parks labeled (a), (b), and (c) are provided in Figure 17.2.
Figure 17.1 shows an overview of the city’s parks, with parks treated as PSUs and trees as SSUs. Close-up views of three example parks are shown in Figure 17.2, illustrating the variation in park size and tree density that motivates the sampling approaches considered below.
FIGURE 17.2: Close-up views of the parks (PSUs) labeled (a), (b), and (c) in Figure 17.1. These correspond to park_id values 12, 79, and 170, which are Alberta Park, Laurelhurst Park, and Tanner Springs Park, respectively.
To make the example more realistic, we assume we do not have access to the full set of tree-level measurements, but we do know the total number of trees in each park. This assumption is increasingly plausible due to advances in remote sensing and individual tree detection, where identifying crowns and estimating tree counts from imagery has become routine; see, e.g., Ventura et al. (2024) and Lauri Mehtätalo et al. (2022). Under this setup, the sampling problem reduces to selecting which parks (PSUs) to visit in order to measure carbon on their trees (SSUs).
Before proceeding, it’s useful to explore the relationship between park size and total carbon. In practice we wouldn’t know carbon for every park, but we would know \(m_i\), and it’s reasonable to expect that parks with more trees store more carbon. Understanding this relationship helps evaluate which estimators are appropriate and whether their variance estimates are reliable. For exploratory purposes, we temporarily use the census to examine this relationship.
Figure 17.3 shows a strong association between tree count (\(m_i\)) and total carbon (\(y_i\)). The marginal histograms emphasize the heavy right-skew in both variables and the near-linear increase in \(y_i\) with \(m_i\).176
FIGURE 17.3: Relationship between the number of trees per park (\(m_i\)) and total carbon (\(y_i\)) across all Portland parks. Marginal histograms show the distribution of each variable.
The heavy right-tail in both park size and total carbon means that a small number of parks contribute a disproportionately large share of the population total. As discussed in Section 11.3.4, standard confidence intervals rely on the estimator’s sampling distribution being roughly normal. This can happen even when the underlying data distribution is skewed. But when the population is highly skewed and the sample size is modest, the sampling distribution may remain skewed as well. In these cases, normal-based confidence intervals (like the \(\tval\)-based ones we use here) can miss their intended coverage.177
One way to help alleviate this problem is to use a stratified design (Section 15) that ensures both large, high-carbon parks and smaller parks are represented in the sample. By forcing the sample to cover the full range of park sizes, stratification reduces the chance of missing influential large parks and limits their impact on the overall estimator. This often stabilizes the sampling distribution and improves precision, leading to confidence intervals with coverage much closer to the intended (i.e., nominal) level.
To explore these estimators and the effect of stratification, we implement:
- equal-probability PSU sampling with the unbiased estimator (Section 17.4.1.1);
- equal-probability PSU sampling with the ratio estimator (Section 17.4.1.2);
- PPS sampling with the PPS estimator (Section 17.4.2).
We then stratify the parks into two groups based on tree count (small versus large) and repeat these same estimators under a stratified cluster sampling design. Finally, using the full census, we carry out a repeated sampling analysis to compare bias, precision, and confidence interval coverage across all six estimators—the unstratified and stratified versions of the unbiased, ratio, and PPS estimators.
17.7.1.1 Applying the estimators without stratification
We begin by estimating total carbon using a sample of \(n = 25\) PSUs (parks) selected using SRS and PPS. Three data files are used in this analysis.
"datasets/PDX/all_parks.csv"contains the full set of PSUs in the population. Each row corresponds to a park, with columns for a unique park identifier (park_id), the number of trees in that park (\(m_i\)), and the corresponding total carbon (\(y_i\)), computed as the sum of that park’s \(y_{ij}\) tree values."datasets/PDX/cluster_equal_prob_park_sample_n25.csv"stores the 25-park sample drawn using SRS from"all_parks.csv". The code below generated this sample using the tibbleall_parksread from"all_parks.csv"."datasets/PDX/cluster_pps_park_sample_n25.csv"contains the 25-park sample drawn using PPS, with selection probabilities proportional to each park’s tree count (\(m_i\)). This sample was generated using the code below and selected parks are shown in Figure 17.1.Under this PPS-with-replacement design, the per-draw selection probability for park \(i\) is \(p_i = m_i / M\), consistent with the estimator in Section 17.4.2.
The code below reads and inspects each data file.
all_parks <-
  read_csv("datasets/PDX/all_parks.csv")
all_parks %>% glimpse()
#> Rows: 177
#> Columns: 3
#> $ park_id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12…
#> $ m_i <dbl> 26, 253, 100, 75, 110, 60, 318, 921, …
#> $ y_i <dbl> 7.85925, 143.06605, 160.04115, 7.3109…
equal_prob_parks <-
read_csv("datasets/PDX/cluster_equal_prob_park_sample_n25.csv")
equal_prob_parks %>% glimpse()
#> Rows: 25
#> Columns: 3
#> $ park_id <dbl> 68, 167, 129, 162, 43, 14, 51, 85, 21…
#> $ m_i <dbl> 49, 108, 4, 96, 1, 90, 35, 35, 338, 9…
#> $ y_i <dbl> 37.60115, 30.29090, 2.20830, 43.03720…
pps_parks <-
  read_csv("datasets/PDX/cluster_pps_park_sample_n25.csv")
pps_parks %>% glimpse()
#> Rows: 25
#> Columns: 3
#> $ park_id <dbl> 87, 94, 58, 10, 70, 109, 92, 11, 36, …
#> $ m_i <dbl> 565, 403, 261, 66, 647, 71, 48, 181, …
#> $ y_i <dbl> 308.2177, 382.0469, 421.2472, 108.513…
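For reference, comparable samples can be drawn from all_parks with a few lines of dplyr. The sketch below is illustrative only; the seed is arbitrary, so it won’t reproduce the stored sample files exactly.

set.seed(511)  # Arbitrary seed; the stored samples came from a different draw.

# Equal-probability (SRS) sample of 25 parks.
equal_prob_draw <- all_parks %>%
  slice_sample(n = 25)

# PPS sample of 25 parks drawn with replacement, with per-draw
# probabilities proportional to each park's tree count m_i.
pps_draw <- all_parks %>%
  slice_sample(n = 25, weight_by = m_i, replace = TRUE)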
17.7.1.1.1 Equal-probability sampling: unbiased estimator
An unbiased estimator for the total and its standard error follow (17.1) and (17.4).
N <- nrow(all_parks)
# Apply the unbiased estimator.
est_unbiased <- equal_prob_parks %>%
summarize(
n = n(),
t_hat = N / n * sum(y_i),
s_sq = 1 / (n - 1) * sum((y_i - t_hat / N)^2),
s_t_hat = N * sqrt((1 - n / N) * s_sq / n),
t = qt(df = n - 1, p = 1 - 0.05 / 2),
t_ci_lower = t_hat - t * s_t_hat,
t_ci_upper = t_hat + t * s_t_hat
)
est_unbiased %>%
  select(t_hat, t_ci_lower, t_ci_upper)
#> # A tibble: 1 × 3
#> t_hat t_ci_lower t_ci_upper
#> <dbl> <dbl> <dbl>
#> 1 32277. 10724. 53830.
The census-based total carbon across all parks is \(\tau\) = 26,349 tons. The unbiased estimate for the total, with corresponding 95% confidence interval, was 32,277 (10,724, 53,830) tons. Based on this sample, the confidence interval captured the true value.
17.7.1.1.2 Equal-probability sampling: ratio estimator
The ratio estimator for the mean and total, and their corresponding standard errors, follow (17.8), (17.10), (17.9), and (17.11).
N <- nrow(all_parks)
M <- sum(all_parks$m_i)
M_bar <- M / N
# Apply the ratio estimator.
est_ratio <- equal_prob_parks %>%
summarize(
n = n(),
y_bar = sum(y_i) / sum(m_i),
s_y_bar = sqrt((1 - n / N) * (1 / (n * M_bar^2)) *
sum((y_i - y_bar * m_i)^2) / (n - 1)),
t = qt(df = n - 1, p = 1 - 0.05 / 2),
y_ci_lower = y_bar - t * s_y_bar,
y_ci_upper = y_bar + t * s_y_bar
) %>%
mutate(
t_hat = M * y_bar,
t_ci_lower = M * y_ci_lower,
t_ci_upper = M * y_ci_upper
) %>%
select(t_hat, t_ci_lower, t_ci_upper)
est_ratio
#> # A tibble: 1 × 3
#> t_hat t_ci_lower t_ci_upper
#> <dbl> <dbl> <dbl>
#> 1 28098. 19754. 36441.
Again, the census-based total carbon across all parks is \(\tau\) = 26,349 tons. The ratio estimate for the total, with corresponding 95% confidence interval, was 28,098 (19,754, 36,441) tons. Based on this sample, the confidence interval captured the true value.
17.7.1.1.3 PPS sampling estimator
The PPS estimator for the total and its standard error follow (17.12) and (17.14).
M <- sum(all_parks$m_i)
# Apply the PPS estimator.
est_pps <- pps_parks %>%
mutate(p_i = m_i / M) %>%
summarize(
n = n(),
t_hat = (1 / n) * sum(y_i / p_i),
s_t_hat = sqrt(1 / (n * (n - 1)) * sum((y_i / p_i - t_hat)^2)),
t = qt(df = n - 1, p = 1 - 0.05 / 2),
t_ci_lower = t_hat - t * s_t_hat,
t_ci_upper = t_hat + t * s_t_hat
) %>%
select(t_hat, t_ci_lower, t_ci_upper)
est_pps
#> # A tibble: 1 × 3
#> t_hat t_ci_lower t_ci_upper
#> <dbl> <dbl> <dbl>
#> 1 28029. 22735. 33323.
Again, the census-based total carbon across all parks is \(\tau\) = 26,349 tons. The PPS estimate for the total, with corresponding 95% confidence interval, was 28,029 (22,735, 33,323) tons. Based on this sample, the confidence interval captured the true value.
As with the other designs, a single sample doesn’t tell us much about estimator performance. We can, however, see the CI narrowing as we move from the unbiased estimator to the ratio estimator and then to the PPS estimator. In Section 17.7.1.3, we’ll check whether this pattern holds across repeated samples or if it was just a quirk of this one sample.
17.7.1.2 Applying the estimators with stratification
Following the steps in Section 15.1, we begin by defining the strata. Because park-level tree counts (\(m_i\)) are known for all parks, they offer a natural basis for stratification. Figure 17.4 shows histograms of tree count (\(m_i\)) and total carbon (\(y_i\)), with colors indicating the two proposed strata. The strata were created by splitting the parks based on the cumulative proportion of total tree counts, separating parks with relatively few trees from those with many. The cutoff is subjective—the goal was simply to isolate the handful of parks with exceptionally high tree counts.
This figure also shows that using \(m_i\) for stratification produces nearly the same cutoff for total carbon. We can confirm this here only because the full population data are available. In most applications, we’d rely on the \(m_i\) values alone and would assume the strong positive relationship with total carbon seen earlier in Figure 17.3.
FIGURE 17.4: Histograms showing how Portland parks were divided into low- and high-tree-count strata based on the number of trees per park (\(m_i\)). The total carbon (\(y_i\)) distribution shows a nearly identical separation, suggesting that tree counts can serve as an effective stratifying variable.
The strong right-skew in both variables highlights that a small number of large parks account for a disproportionate share of the population total. Stratifying on \(m_i\) helps ensure these influential parks are represented in the sample, reducing extreme sample-to-sample variability in the resulting estimates. This approach aims to address the issues raised in Section 11.3.4, where strong skew can cause equal-probability confidence intervals to have unreliable coverage.
To keep the stratified and unstratified designs comparable, we select a total of \(n\) = 25 parks, allocating a higher sampling fraction (50%) to the high-tree-count stratum and a lower fraction (8%) to the low-tree-count stratum. In practice, such allocation might be driven by cost or logistics.
Sampling is carried out independently within each stratum. Under the equal-probability design, parks are selected with equal probability within their stratum. Under the PPS design, selection probabilities are proportional to each park’s number of trees (\(m_{hi}\)), so that \(p_{hi} = m_{hi}/M_h\).
The PSU-level sample data are provided in the following files. In addition to the columns park_id, m_i, and y_i, these files also include a column stratum indicating whether each park falls into the high- or low-tree-count stratum.
"datasets/PDX/strat_all_parks.csv"is the same as"all_parks.csv"from Section 17.7.1.1, but with the addition of thestratumcolumn."datasets/PDX/cluster_strat_equal_prob_park_sample_n25.csv"contains the park-level data for the stratified equal-probability sample."datasets/PDX/cluster_strat_pps_park_sample_n25.csv"contains the park-level data for the stratified PPS sample.
Following Section 17.6, we next apply the three estimators—unbiased, ratio, and PPS—within this stratified framework. Each estimator is applied within strata and then combined to obtain an overall estimate of the population total \(\tau\). As before, the results reflect a single realization of the sampling process and are intended for illustration.
17.7.1.2.1 Equal-probability stratified sampling: unbiased estimator
We first apply the unbiased cluster estimator within each stratum and then combine the stratum-level results to obtain the stratified estimate for the population total and its standard error, using (17.24) and (17.25), respectively.
# Read the stratified equal-probability sample.
strat_all_parks <-
read_csv("datasets/PDX/strat_all_parks.csv")
strat_equal_prob_parks <-
read_csv("datasets/PDX/cluster_strat_equal_prob_park_sample_n25.csv")
strat_size <- strat_all_parks %>%
group_by(stratum) %>%
summarize(N_h = n())
# Apply the unbiased estimator.
est_unbiased <- strat_equal_prob_parks %>%
left_join(strat_size, by = "stratum") %>%
group_by(stratum) %>%
summarize(
N_h = first(N_h),
n_h = n(),
t_hat_h = N_h / n_h * sum(y_i),
s_sq_h = 1 / (n_h - 1) * sum((y_i - t_hat_h / N_h)^2),
s_t_hat_h = N_h * sqrt((1 - n_h / N_h) * s_sq_h / n_h)
) %>% summarise(
t_hat = sum(t_hat_h),
s_t_hat = sqrt(sum(s_t_hat_h^2)),
H = n_distinct(stratum),
n = sum(n_h),
t = qt(df = n - H, p = 1 - 0.05 / 2),
t_ci_lower = t_hat - t * s_t_hat,
t_ci_upper = t_hat + t * s_t_hat
)
est_unbiased %>%
  select(t_hat, t_ci_lower, t_ci_upper)
#> # A tibble: 1 × 3
#> t_hat t_ci_lower t_ci_upper
#> <dbl> <dbl> <dbl>
#> 1 24379. 16333. 32425.
The census-based total carbon across all parks is \(\tau\) = 26,349 tons. The stratified unbiased estimate for the total, with corresponding 95% confidence interval, was 24,379 (16,333, 32,425) tons. Based on this sample, the confidence interval captured the true value.
17.7.1.2.2 Equal-probability stratified sampling: ratio estimator
We apply the ratio estimator within each stratum and then combine the stratum-level estimates to obtain the stratified ratio estimate for the population total and its standard error, following (17.24) and (17.25).
strat_size <- strat_all_parks %>%
group_by(stratum) %>%
summarize(
N_h = n(),
M_h = sum(m_i),
M_bar_h = M_h / N_h
)
# Apply the ratio estimator.
est_ratio <- strat_equal_prob_parks %>%
left_join(strat_size, by = "stratum") %>%
group_by(stratum) %>%
summarize(
N_h = first(N_h),
M_h = first(M_h),
M_bar_h = first(M_bar_h),
n_h = n(),
y_bar_h = sum(y_i) / sum(m_i),
s_y_bar_h = sqrt((1 - n_h / N_h) * (1 / (n_h * M_bar_h^2)) *
sum((y_i - y_bar_h * m_i)^2) / (n_h - 1)),
t_hat_h = M_h * y_bar_h,
s_t_hat_h = M_h * s_y_bar_h
) %>% summarise(
t_hat = sum(t_hat_h),
s_t_hat = sqrt(sum(s_t_hat_h^2)),
H = n_distinct(stratum),
n = sum(n_h),
t = qt(df = n - H, p = 1 - 0.05 / 2),
t_ci_lower = t_hat - t * s_t_hat,
t_ci_upper = t_hat + t * s_t_hat
)
est_ratio %>%
  select(t_hat, t_ci_lower, t_ci_upper)
#> # A tibble: 1 × 3
#> t_hat t_ci_lower t_ci_upper
#> <dbl> <dbl> <dbl>
#> 1 26818. 21117. 32519.
Again, the census-based total carbon across all parks is \(\tau\) = 26,349 tons. The stratified ratio estimate for the total, with corresponding 95% confidence interval, was 26,818 (21,117, 32,519) tons. Based on this sample, the confidence interval captured the true value.
17.7.1.2.3 PPS stratified sampling estimator
We apply the PPS estimator within each stratum and then combine the stratum-level total estimates and their associated standard errors to obtain the stratified PPS estimate for the population total and its standard error, following (17.24) and (17.25).
pps_parks <-
read_csv("datasets/PDX/cluster_strat_pps_park_sample_n25.csv")
strat_size <- strat_all_parks %>%
group_by(stratum) %>%
summarize(M_h = sum(m_i))
# Apply the PPS estimator.
est_pps <- pps_parks %>%
left_join(strat_size, by = "stratum") %>%
mutate(p_i = m_i / M_h) %>%
group_by(stratum) %>%
summarize(
n_h = n(),
t_hat_h = (1 / n_h) * sum(y_i / p_i),
s_t_hat_h = sqrt(1 / (n_h * (n_h - 1)) *
sum((y_i / p_i - t_hat_h)^2))
) %>% summarise(
t_hat = sum(t_hat_h),
s_t_hat = sqrt(sum(s_t_hat_h^2)),
H = n_distinct(stratum),
n = sum(n_h),
t = qt(df = n - H, p = 1 - 0.05 / 2),
t_ci_lower = t_hat - t * s_t_hat,
t_ci_upper = t_hat + t * s_t_hat
)
est_pps %>%
  select(t_hat, t_ci_lower, t_ci_upper)
#> # A tibble: 1 × 3
#> t_hat t_ci_lower t_ci_upper
#> <dbl> <dbl> <dbl>
#> 1 23794. 19759. 27829.
Again, the census-based total carbon across all parks is \(\tau\) = 26,349 tons. The stratified PPS estimate for the total, with corresponding 95% confidence interval, was 23,794 (19,759, 27,829) tons. Based on this sample, the confidence interval captured the true value.
Taken together, these single-sample results illustrate how each estimator behaves for one particular draw from the Portland population. They also show how stratification and PPS sampling aim to stabilize estimates when a few large parks dominate the total. To see how these patterns play out over many possible samples, and to compare the designs on equal footing, the next section presents a repeated-sampling study based on the full census.
17.7.1.3 Repeated-sampling comparison
In the previous sections, the estimators provided single-sample estimates of the population total and associated 95% confidence intervals. Those results are useful for illustration, but they reflect only one realization of the sampling process. Here, we examine how the estimators behave under repeated sampling by drawing many samples from the same finite population and summarizing their performance across repetitions. This allows us to assess whether the apparent gains from stratification and PPS sampling persist across many samples or were simply artifacts of the earlier single-sample examples.
Under repeated sampling, each sample yields an estimate of the same fixed population total. By examining how these estimates behave across many samples—how they vary from sample to sample, how often their confidence intervals contain the true total, and whether they tend to be centered near the true total—we can assess estimator performance more reliably than from a single sample.
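We don’t reproduce the full simulation here; as a sketch, the loop below implements the repeated-sampling evaluation for the unstratified equal-probability unbiased estimator, assuming all_parks as read in Section 17.7.1.1. The number of repetitions and the summaries are illustrative, and the other five designs follow the same pattern.

set.seed(1)

N <- nrow(all_parks)
n_psu <- 25
tau <- sum(all_parks$y_i)

reps <- bind_rows(lapply(seq_len(2000), function(i) {
  all_parks %>%
    slice_sample(n = n_psu) %>%
    summarize(
      t_hat = N / n_psu * sum(y_i),
      s_sq = 1 / (n_psu - 1) * sum((y_i - t_hat / N)^2),
      s_t_hat = N * sqrt((1 - n_psu / N) * s_sq / n_psu),
      t_crit = qt(df = n_psu - 1, p = 1 - 0.05 / 2),
      covered = (t_hat - t_crit * s_t_hat <= tau) &
        (tau <= t_hat + t_crit * s_t_hat)
    )
}))

reps %>%
  summarize(
    mean_diff = mean(t_hat - tau),        # average difference from the truth
    rse_pct = 100 * sd(t_hat) / mean(t_hat),
    coverage_pct = 100 * mean(covered)    # empirical 95% CI coverage
  )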
TABLE 17.1: Repeated-sampling comparison of the unstratified and stratified unbiased, ratio, and PPS estimators for the Portland parks population.

| Design | Estimator | Mean difference (tons) | Mean relative difference (%) | RSE (%) | 95% CI coverage (%) |
|---|---|---|---|---|---|
| Equal prob. | Unbiased | -8.9 | -0.034 | 26.8 | 87.6 |
| Equal prob. | Ratio | -92.9 | -0.353 | 12.0 | 86.9 |
| PPS | PPS | 8.3 | 0.031 | 10.1 | 94.9 |
| Strat. equal prob. | Unbiased | 1.1 | 0.004 | 16.3 | 91.6 |
| Strat. equal prob. | Ratio | -56.3 | -0.214 | 9.9 | 89.0 |
| Strat. PPS | PPS | 2.3 | 0.009 | 10.1 | 94.8 |
The results in Table 17.1 summarize how the six estimators behave under repeated sampling from the Portland parks population. As seen earlier in Figure 17.3, total carbon per park is strongly right-skewed, with a small number of large parks contributing a disproportionate share of the population total. Strong skewness, coupled with modest sample sizes, can leave the sampling distribution skewed as well, which in turn affects estimator performance—especially confidence interval coverage.
The first two columns of Table 17.1 show the mean difference between the estimated total and the true population total across many repeated samples, along with the corresponding mean relative difference. These quantities indicate whether an estimator tends to overshoot or undershoot the true total on average. Under repeated sampling, this mean difference corresponds to the estimation bias defined in Section 11.2.3, but we report it here directly for descriptive clarity.
For the equal-probability unbiased estimator, we expect this mean difference to be zero under repeated sampling, and the results are consistent with that expectation. The PPS estimator shows similar expected behavior. The ratio estimator, which is not exactly unbiased, exhibits a very small average relative difference for both the unstratified and stratified versions (i.e., -0.353% and -0.214%, respectively), consistent with its repeated-sampling properties at this sample size.
The key differences among the estimators appear in the relative standard error (RSE %) and 95% coverage columns. Under unstratified equal-probability sampling, the unbiased estimator has a large RSE of about 26.8%, reflecting substantial sample-to-sample variability when a draw happens to miss one or more of the largest parks. Additionally, the unbiased estimator of the population total is known to be inefficient when cluster sizes are unequal and cluster totals are roughly proportional to \(m_i\) (see S. L. Lohr (2019), Section 5.3). The unstratified ratio estimator improves precision considerably, reducing the RSE to about 12%. Both unstratified equal-probability estimators have lower-than-nominal coverage (87.6% and 86.9%), a consequence of applying normal-based confidence intervals to a highly skewed population.
Stratifying on tree count helps stabilize both equal-probability estimators. For the unbiased estimator, the RSE drops from 26.8% to 16.3%, and for the ratio estimator it drops from 12% to 9.9%. Average confidence interval coverage improves modestly as well (91.6% and 89%). This pattern illustrates how stratification can reduce variability when a small number of large units dominate the population total.
The PPS estimators stand out most clearly. Both the unstratified and stratified PPS designs achieve low RSE (about 10.1%) and near-nominal coverage (94.9% and 94.8%). Like the other designs, their estimates are centered close to the true total, but their precision and interval performance are markedly better. For a population with the level of skewness seen in Figure 17.3, PPS sampling uses the tree counts to reliably locate the large, influential parks in nearly every sample, making it the most stable design among those considered.
We’ll return to this dataset in Section 18.6.1, where a two-stage design will be used to subsample trees for measurement within selected parks.
17.7.2 Harvard Forest biomass
To illustrate cluster sampling under an areal sampling frame, we return to the Harvard Forest dataset. We first apply the cluster sampling ratio estimator defined in Section 17.5 to a single unstratified sample, and then extend the same logic to a stratified setting using the estimators introduced in Section 17.6.
Our goal is to estimate average aboveground biomass (AGB) per hectare and total AGB across the forest, considering all stems with DBH \(\ge\) 12.7 cm.
We use a systematic design to place the PSUs. Each PSU is centered on a SYS grid point, and around each center we lay out three fixed-area circular SSUs (subplots). The subplots are positioned 15 meters from the PSU center at azimuths of \(90^\circ\), \(330^\circ\), and \(210^\circ\), each with a 5-meter radius. For estimation, tree-level contributions are assigned to the population based on tree location (i.e., whether trees fall within the subplot and within the boundary), while SSU centers determine the denominator of the cluster ratio estimator, as described in Section 17.5.1.
As described in Section 14, the sys_grid() function generates the systematic grid of PSU centers using a randomly chosen starting point. For this example, the spacing is 100-by-100 meters, which yields \(n = 42\) PSUs. Figure 17.5 shows the resulting layout.
FIGURE 17.5: Cluster sampling under an areal sampling frame at Harvard Forest. Primary sampling units (PSUs) are placed using a systematic design over a buffered population extent so that complete clusters can be installed even when they fall near, or entirely outside, the forest boundary. Secondary sampling units (SSUs) are fixed-area circular plots arranged in a triangle-shaped configuration around each PSU center. SSUs are clipped to the population boundary for tree inclusion, while SSU centers falling inside the boundary define the denominator of the ratio estimator.
The cluster layout shown in Figure 17.5 highlights several features that are important for estimation under an areal sampling frame. PSUs are placed using a systematic design over a buffered version of the Harvard Forest boundary. This buffer allows complete clusters to be installed even when their centers lie close to the forest edge. As a result, some PSUs appear outside the forest boundary itself.
Each PSU anchors a fixed configuration of three circular SSUs arranged in a triangle-shaped pattern. Near the boundary, some SSUs extend partially outside the forest and are therefore clipped to the boundary. Trees are included in estimation only if they fall within the portion of an SSU that lies inside the forest. In contrast, whether an SSU contributes to the denominator of the estimator depends solely on whether its center point lies within the forest.
The tree-level data used for estimation are stored in "datasets/HF/HF_cluster_sys.csv" and read into ssu_trees below.
ssu_trees <-
  read_csv("datasets/HF/HF_cluster_sys.csv")
ssu_trees %>% glimpse()
#> Rows: 588
#> Columns: 4
#> $ psu_id <dbl> 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
#> $ ssu_id <dbl> 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3,…
#> $ z_ij <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ agb_Mg <dbl> 1.083639, 1.041079, 2.192626, 1.970485…
Each row of ssu_trees corresponds to a single tree measurement collected on the portion of an SSU that lies within the forest boundary. The identifiers psu_id and ssu_id indicate the PSU and SSU to which each tree belongs, reflecting the nested structure of the cluster design.
The column agb_Mg gives the tree’s aboveground biomass in metric tons. The indicator variable z_ij records whether the center of the SSU lies within the forest boundary (z_ij = 1) or outside (z_ij = 0). This variable plays a key role in defining the denominator of the cluster ratio estimator under an areal sampling frame.
Before proceeding with estimation, we verify the number of sampled PSUs and the number of SSUs associated with each PSU.
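The code used to produce the summaries below isn’t shown; one way to obtain them from ssu_trees is sketched here, assuming each psu_id and ssu_id pair identifies a distinct SSU.

# Number of sampled PSUs appearing in the tree data (a PSU or SSU with
# no measured trees wouldn't appear in ssu_trees).
ssu_trees %>%
  summarize(n = n_distinct(psu_id))

# Distinct numbers of SSUs per PSU.
ssu_trees %>%
  group_by(psu_id) %>%
  summarize(m_i = n_distinct(ssu_id)) %>%
  distinct(m_i)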
#> # A tibble: 1 × 1
#> n
#> <int>
#> 1 42
#> # A tibble: 2 × 1
#> m_i
#> <int>
#> 1 1
#> 2 3
In this realization, PSUs have either one or three SSUs that fall at least partially within the forest boundary.
We begin by computing SSU-level contributions. Tree-level AGB is expanded to a per-hectare basis using the standard tree factor for 5-meter-radius circular plots. Expanded tree values are then summed within each SSU to obtain \(y_{ij}\).
r <- 5 # Plot radius (m).
ssu_summary <- ssu_trees %>%
mutate(TF = 10000 / (pi * r^2)) %>%
group_by(psu_id, ssu_id) %>%
summarize(
y_ij = sum(TF * agb_Mg),
z_ij = first(z_ij),
.groups = "drop"
)
ssu_summary
#> # A tibble: 112 × 4
#> psu_id ssu_id y_ij z_ij
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 138. 1
#> 2 2 1 891. 1
#> 3 2 2 564. 1
#> 4 2 3 371. 1
#> 5 3 1 42.9 1
#> 6 3 2 544. 1
#> 7 3 3 403. 1
#> 8 4 1 724. 1
#> 9 4 2 202. 1
#> 10 4 3 308. 1
#> # ℹ 102 more rows
Next, we aggregate to the PSU level. The PSU-level numerator contribution \(y_i\) is the sum of SSU contributions within the cluster. The corresponding denominator contribution \(z_i\) is the number of SSU centers that lie within the forest.
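The PSU-level aggregation isn’t shown explicitly; a minimal sketch consistent with (17.17) and (17.18), starting from the ssu_summary tibble created above, is:

psu_summary <- ssu_summary %>%
  group_by(psu_id) %>%
  summarize(
    y_i = sum(y_ij),  # PSU-level numerator, as in (17.17)
    z_i = sum(z_ij)   # number of SSU centers inside the forest, as in (17.18)
  )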
Using the cluster ratio estimator in (17.16) and its variance estimator in (17.19), we estimate the mean aboveground biomass per hectare. The forest-wide total is obtained by multiplying the mean by the known forest area \(A\).
A <- 35 # Total area (ha).
psu_summary %>%
summarize(
y_bar = sum(y_i) / sum(z_i),
n = n(),
z_bar = mean(z_i),
s_y_bar = sqrt(
1 / (n * z_bar^2) *
sum((y_i - y_bar * z_i)^2) / (n - 1)
),
t = qt(1 - 0.05 / 2, df = n - 1),
y_ci_lower = y_bar - t * s_y_bar,
y_ci_upper = y_bar + t * s_y_bar
) %>%
select(y_bar, y_ci_lower, y_ci_upper) %>%
mutate(
t_hat = A * y_bar,
t_ci_lower = A * y_ci_lower,
t_ci_upper = A * y_ci_upper
  )
#> # A tibble: 1 × 6
#> y_bar y_ci_lower y_ci_upper t_hat t_ci_lower t_ci_upper
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 355. 302. 409. 12441. 10579. 14302.
For reference, the census-based aboveground biomass for the Harvard Forest is 341.46 Mg/ha, with a forest-wide total of 11,951 Mg.
For this realization, the estimated mean and total, along with their 95% confidence intervals, successfully capture the census values.
17.7.2.1 Stratified cluster sampling
We now extend the same cluster design to a stratified setting using the Harvard Forest strata introduced in Section 15.7 and shown in Figure 15.1. As before, inference targets stratum-specific and overall means per unit area under an areal sampling frame, and estimation proceeds by applying the cluster ratio estimator separately within each stratum.178
The same systematic sample of PSUs used in the unstratified example is retained. For estimation within a given stratum, tree-level contributions are assigned to that stratum based on tree location. This means an SSU can contribute to a stratum’s numerator whenever part of the SSU overlaps the stratum and trees from that overlapped portion belong to it, even if the SSU center lies outside the stratum. In contrast, the denominator is based only on SSU center points: an SSU contributes one unit of sampling effort to stratum \(h\) if and only if its center lies in stratum \(h\).
As a result, PSUs near stratum boundaries may have numerator contributions coming from SSUs on either side of a boundary, while denominator contributions depend strictly on where SSU centers fall. This is handled naturally by the ratio form of the estimator described in Section 17.5.1.
FIGURE 17.6: Stratified cluster sampling under an areal sampling frame at Harvard Forest. The same systematic PSU layout is used across all strata. SSUs are clipped to stratum boundaries for tree inclusion, while SSU centers falling inside a stratum define the denominator of the cluster ratio estimator for that stratum.
The stratified cruise data are read below. As the glimpse() output shows, SSU IDs are nested within PSU IDs, which are in turn nested within stratum IDs. This nesting structure determines how summaries and estimators are computed.179
#> Rows: 590
#> Columns: 5
#> $ psu_id <dbl> 41, 41, 41, 41, 41, 41, 41, 41, 41…
#> $ ssu_id <dbl> 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3…
#> $ stratum_id <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ z_ij <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ agb_Mg <dbl> 0.071185, 0.194714, 0.263038, 0.70…
We now walk through the estimation workflow described in Section 17.6. As before, AGB is expanded to a per-hectare basis at the SSU level, then aggregated to the PSU and stratum levels. For this stratified example we follow the same logic as Section 17.5.1, so the SSU-level numerator is based on tree locations while the SSU-center indicator z_ij defines the denominator contributions.
r <- 5 # Plot radius (m).
ssu_summary <- ssu_trees %>%
mutate(TF = 10000 / (pi * r^2)) %>%
group_by(stratum_id, psu_id, ssu_id) %>%
summarize(
y_hij = sum(TF * agb_Mg),
z_hij = first(z_ij)
  )

Next, for each stratum we compute PSU-level numerator and denominator contributions, then obtain stratum-level means and standard errors using the cluster ratio estimator in (17.16) and variance estimator in (17.19).
stratum_summary <- ssu_summary %>%
group_by(stratum_id, psu_id) %>%
summarize(
y_hi = sum(y_hij),
z_hi = sum(z_hij)
) %>%
group_by(stratum_id) %>%
summarize(
y_bar_h = sum(y_hi) / sum(z_hi),
n_h = n(),
z_bar_h = mean(z_hi),
s_y_bar_h = sqrt(
1 / (n_h * z_bar_h^2) *
sum((y_hi - y_bar_h * z_hi)^2) / (n_h - 1)
)
)
stratum_summary
#> # A tibble: 4 × 5
#> stratum_id y_bar_h n_h z_bar_h s_y_bar_h
#> <dbl> <dbl> <int> <dbl> <dbl>
#> 1 1 252. 2 2.5 89.2
#> 2 2 387. 12 2.17 36.4
#> 3 3 34.9 4 2 5.88
#> 4 4 390. 28 2.36 29.4
Given stratum-level estimates, we combine them using the stratified estimators (17.22) and (17.23). Under an areal sampling frame, stratum areas \(A_h\) play the same role that \(M_h\) does in the finite-population stratified estimators, so we weight stratum means by area and combine stratum variances accordingly.
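The code below assumes a tibble stratum_area with columns stratum_id and A_h (stratum area in hectares), which isn’t created in this chapter. If the strata were available as an sf polygon layer, here hypothetically named strata_sf with a stratum_id column and a projected CRS in meters, such a table could be built along these lines.

library(sf)
library(dplyr)

# Hypothetical construction of stratum_area; strata_sf is assumed and is
# not part of the chapter's data files.
stratum_area <- tibble(
  stratum_id = strata_sf$stratum_id,
  A_h = as.numeric(st_area(strata_sf)) / 1e4  # m^2 to ha
) %>%
  group_by(stratum_id) %>%
  summarize(A_h = sum(A_h))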
stratum_summary <- stratum_summary %>%
  left_join(stratum_area, by = "stratum_id")

stratum_summary %>%
  summarize(
    A = sum(A_h),
    y_bar = sum(A_h * y_bar_h) / A,
    s_y_bar = sqrt(sum(A_h^2 * s_y_bar_h^2) / A^2),
    df = sum(n_h - 1),
    t = qt(1 - 0.05 / 2, df = df),
    ci_lower = y_bar - t * s_y_bar,
    ci_upper = y_bar + t * s_y_bar
  ) %>%
  select(y_bar, ci_lower, ci_upper)
#> # A tibble: 1 × 3
#> y_bar ci_lower ci_upper
#> <dbl> <dbl> <dbl>
#> 1 361. 318. 404.
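Written out, the area-weighted combination implemented above is
\[
\bar{y} = \frac{\sum_{h=1}^{H} A_h \bar{y}_h}{A}
\qquad \text{and} \qquad
s_{\bar{y}} = \sqrt{\frac{\sum_{h=1}^{H} A_h^2\, s_{\bar{y}_h}^2}{A^2}},
\qquad A = \sum_{h=1}^{H} A_h,
\]
with a confidence interval formed using a \(t\) quantile on \(\sum_{h=1}^{H} (n_h - 1)\) degrees of freedom. This mirrors the stratified estimators (17.22) and (17.23), with stratum areas \(A_h\) serving as the weights.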
For this realization, the stratified cluster estimate yields a slightly narrower confidence interval than the unstratified estimate. Because this comparison is based on a single systematic sample, we defer interpretation and turn next to a repeated-sampling analysis to more reliably assess gains in precision attributable to stratification.
17.7.2.2 Comparing stratified and unstratified cluster sampling
In this section, we compare the performance of unstratified and stratified cluster sampling under an areal sampling frame using repeated systematic samples from the Harvard Forest population. In both cases, clusters are treated as collections of fixed-area SSUs, and estimation follows the SSU-center–based cluster ratio estimator described in Section 17.5.1, consistent with practice in the Finnish NFI.
For each design, we repeatedly generate a new systematic sample of PSUs over the buffered population extent, construct identical cluster layouts around each PSU center, and apply the same ratio estimator used in the single-sample examples above. Each iteration produces an estimate of mean AGB per hectare along with a 95% confidence interval. Empirical coverage and average interval width are then compared across designs.
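The bookkeeping for this comparison reduces to a group-and-summarize step over the simulation output. The sketch below is illustrative only: the tibble sims, the true mean mu, and the standard errors are synthetic stand-ins for the actual repeated-sampling results, and normal-based intervals are used for simplicity.
library(dplyr)

set.seed(42)
mu <- 350 # Hypothetical true mean AGB (Mg/ha), standing in for the population value.
n_sim <- 1000

# Synthetic stand-in for the repeated-sampling output: one row per iteration
# and design, holding a point estimate and its standard error.
sims <- tibble(
  design = rep(c("unstratified", "stratified"), each = n_sim),
  s_est = rep(c(25, 22), each = n_sim),
  est = rnorm(2 * n_sim, mean = mu, sd = s_est)
) %>%
  mutate(
    ci_lower = est - 1.96 * s_est, # Normal-based 95% interval for simplicity.
    ci_upper = est + 1.96 * s_est
  )

# Empirical coverage and average 95% CI width by design.
sims %>%
  group_by(design) %>%
  summarize(
    coverage = mean(ci_lower <= mu & mu <= ci_upper),
    mean_width = mean(ci_upper - ci_lower)
  )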
The empirical coverage rates were 97.86% and 96.09% for the unstratified and stratified cluster estimators, respectively. The corresponding average 95% confidence interval widths were 97 and 85 Mg/ha. The percent reduction in average interval width due to stratification was 13%.
These results mirror the efficiency gains observed earlier in Section 15.7.2 for stratified sampling with single plots and show that effective stratification continues to improve precision when clusters are used as the primary sampling units under an areal sampling frame.
17.8 Advantages and disadvantages of cluster sampling
Cluster sampling is widely used in forest inventory because it can substantially reduce the cost and complexity of field data collection. It’s especially useful when measurement is inexpensive but travel between sampling locations is costly—a common situation in large-area surveys. For these reasons, cluster designs are a core feature of many NFI programs.
17.8.1 Advantages
- Operational efficiency: By grouping nearby SSUs into PSUs, cluster sampling reduces travel time and limits the need to move frequently between dispersed sampling locations.
- Simplified logistics: Field crews can complete several measurements within a single PSU, minimizing the time and effort required to reposition across the landscape.
- Aligned with operational structure: Many forest monitoring programs are built around geographic or administrative PSUs, making cluster sampling a natural choice.
17.8.2 Disadvantages
- Reduced statistical efficiency: SSUs in the same PSU are often spatially autocorrelated and provide less independent information than SSUs sampled across the full population.
- Smaller effective sample size: Because randomization occurs at the PSU level, the effective sample size reflects the number of PSUs rather than the total number of SSUs, which can reduce precision.
- Complex inclusion probabilities under areal frames: When clusters overlap boundaries or strata, tree-level inclusion probabilities can become difficult to express analytically. Several principled approaches exist for handling these situations, but each involves tradeoffs between geometric exactness, computational burden, and operational simplicity.
Despite these limitations, cluster sampling remains a practical and effective choice when logistical constraints are high and moderate precision is acceptable. In many large-area inventory settings, it provides a reasonable balance between field efficiency and statistical performance.
17.9 Exercises
17.9.1 Portland park stormwater filtration analysis
Take a moment to reread the description of the Portland park tree inventory in Section 17.7.1. The i-Tree software provides estimates for several tree-level quantities, which appear as columns in "datasets/PDX/pdx_trees.csv":
- carbon_storage_ton: total carbon stored (tons; the variable analyzed in Section 17.7.1);
- carbon_sequestered_ton: annual carbon sequestered (tons);
- stormwater_cuft: annual volume of stormwater filtered (cubic feet);
- pollution_removal_oz: annual amount of air pollution removed (ounces);
- structural_value: replacement cost for a comparable tree (US dollars).
The data are read in below. In addition to the tree-level i-Tree variables listed above, the dataset includes a unique park and tree ID, inventory date, DBH (in), height (ft), and a species code. Each row represents a single tree in the park census.
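A minimal read-in sketch, assuming the tidyverse is loaded and the working directory contains the datasets/ folder referenced above:
library(tidyverse)

pdx_trees <- read_csv("datasets/PDX/pdx_trees.csv")
glimpse(pdx_trees)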
#> Rows: 25,257
#> Columns: 11
#> $ park_id <dbl> 1, 1, 1, 1, 1, 1, 1, 1…
#> $ tree_id <dbl> 1, 2, 3, 4, 5, 6, 7, 8…
#> $ inv_date <dttm> 2017-09-16, 2017-09-1…
#> $ dbh_in <dbl> 15.5, 17.4, 14.0, 5.9,…
#> $ height_ft <dbl> 24, 65, 20, 21, 24, 36…
#> $ species <chr> "PRSE", "ACSA3", "PRSR…
#> $ carbon_storage_ton <dbl> 0.57615, 0.79335, 0.44…
#> $ carbon_sequestered_ton <dbl> 0.01735, 0.02005, 0.01…
#> $ stormwater_cuft <dbl> 38.1, 49.5, 28.7, 7.2,…
#> $ pollution_removal_oz <dbl> 10.2, 13.3, 7.7, 1.9, …
#> $ structural_value <dbl> 2566.67, 3300.09, 2125…
In Section 17.7.1, we used carbon_storage_ton (i.e., the \(y_{ij}\) values) to estimate total tree carbon across Portland parks using cluster sampling. The repeated sampling analysis in Section 17.7.1.3 showed that the PPS estimator consistently outperformed the equal-probability unbiased and ratio estimators.
In the exercises below, you’ll apply the PPS cluster sampling estimator to compute a population total using the stormwater_cuft i-Tree variable. The exercises walk through the estimation workflow starting from individual tree-level i-Tree estimates.
Here, pdx_trees represents the entire population of trees. That gives us the unusual luxury of knowing the truth, so we can directly check how well our estimators perform.
We’ll begin by computing the required population-level quantities, then take a PPS sample, explore the sample a bit, apply the estimator, and finally compare the result to the true total.
Exercise 17.1 Using pdx_trees, compute a park-level summary (i.e., PSU-level summary) that includes park_id, the number of trees (i.e., SSUs) in each park (\(m_i\)), and the park-level sum of stormwater_cuft, denoted \(y_i\) to match the PPS notation in Section 17.4.2.
Call this park-level summary all_parks. It should match the output shown below.
Hint, group by park_id before calling summarize() to compute \(m_i\) and \(y_i\).
#> # A tibble: 177 × 3
#> park_id m_i y_i
#> <dbl> <int> <dbl>
#> 1 1 26 807.
#> 2 2 253 14890.
#> 3 3 100 6620.
#> 4 4 75 982.
#> 5 5 110 8486.
#> 6 6 60 4510.
#> 7 7 318 15365.
#> 8 8 921 83556.
#> 9 9 5 32.5
#> 10 10 66 4803.
#> # ℹ 167 more rows
Exercise 17.2 We’ll use the number of trees in each park, \(m_i\), to define the selection probability \(p_i = m_i / M\), where \(M = \sum_{i=1}^N m_i\). In this way, parks with more trees—and hence a larger expected contribution to total stormwater filtration—are more likely to be selected.
Add a p_i column to all_parks via mutate(). You can compute \(M = \sum_{i=1}^N m_i\) directly from all_parks. Your modified all_parks should match the output below.
Hint, compute \(M\) first, then use it inside mutate().
#> # A tibble: 177 × 4
#> park_id m_i y_i p_i
#> <dbl> <int> <dbl> <dbl>
#> 1 1 26 807. 0.00103
#> 2 2 253 14890. 0.0100
#> 3 3 100 6620. 0.00396
#> 4 4 75 982. 0.00297
#> 5 5 110 8486. 0.00436
#> 6 6 60 4510. 0.00238
#> 7 7 318 15365. 0.0126
#> 8 8 921 83556. 0.0365
#> 9 9 5 32.5 0.000198
#> 10 10 66 4803. 0.00261
#> # ℹ 167 more rows
Exercise 17.3 What is the sum of the p_i column in all_parks? Is this consistent with the definition of selection probabilities given in Section 17.4.2?
Exercise 17.4 Use the code below to select a PPS sample of \(n = 25\) parks with replacement. This sample will be used for all subsequent estimation.
Read the help page for dplyr::slice_sample(), paying particular attention to the weight_by argument. This argument defines each park’s selection probability. Because slice_sample() internally standardizes the supplied weights to sum to one, we could specify weight_by = m_i. Here, we instead pass weight_by = p_i; the result is the same.
n <- 25
set.seed(1)
pps_parks <- all_parks %>%
  slice_sample(n = n, weight_by = p_i, replace = TRUE)

pps_parks
#> # A tibble: 25 × 4
#> park_id m_i y_i p_i
#> <dbl> <int> <dbl> <dbl>
#> 1 87 565 27030. 0.0224
#> 2 94 403 32376. 0.0160
#> 3 58 261 25910. 0.0103
#> 4 10 66 4803. 0.00261
#> 5 70 647 38606. 0.0256
#> 6 109 71 5612. 0.00281
#> 7 92 48 4074. 0.00190
#> 8 11 181 15970. 0.00717
#> 9 36 211 15458. 0.00835
#> 10 79 1016 84078. 0.0402
#> # ℹ 15 more rows
Exercise 17.5 Because pps_parks was sampled with replacement, some parks may appear multiple times in the sample.
Write code to count the number of occurrences of each park_id in the sample; call this count n_rep. Also include each park’s m_i and p_i values, and arrange the output in decreasing order of n_rep.
Does the output suggest that parks with more trees are more likely to be selected?
Hint, start with group_by(park_id) and then use summarize(). You can extract m_i and p_i using first().
pps_parks %>%
  group_by(park_id) %>%
  summarize(
    n_rep = n(),
    m_i = first(m_i),
    p_i = first(p_i)
  ) %>%
  arrange(desc(n_rep))
#> # A tibble: 22 × 4
#> park_id n_rep m_i p_i
#> <dbl> <int> <int> <dbl>
#> 1 81 2 633 0.0251
#> 2 87 2 565 0.0224
#> 3 94 2 403 0.0160
#> 4 10 1 66 0.00261
#> 5 11 1 181 0.00717
#> 6 19 1 115 0.00455
#> 7 21 1 338 0.0134
#> 8 27 1 116 0.00459
#> 9 30 1 159 0.00630
#> 10 36 1 211 0.00835
#> # ℹ 12 more rows
Exercise 17.6 Finally, implement the PPS estimator for total stormwater filtered by all Portland park trees, along with its standard error and a 95% confidence interval, using (17.12), (17.14), and (17.7), respectively. Units are cubic feet per year.
#> # A tibble: 1 × 4
#> t_hat s_t_hat t_ci_lower t_ci_upper
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1830639. 114850. 1593600. 2067679.
This is the same idea introduced in point sampling (Section 12.4), where trees were selected with probability proportional to basal area, and in 3P sampling (Section 16.6), where trees were selected with probability proportional to prediction. Here, we apply the same concept to PSUs, selecting them with probability proportional to a measure of their size.↩︎
Strictly speaking, this estimator could be described as a ratio-of-totals, since it’s equivalent to the ratio of summed numerator and denominator values rather than the ratio of two sample means.↩︎
An R package, pdxTrees (McConville and Caldwell 2020), archives these data and provides helper functions for extraction and visualization.↩︎
The figure uses ggMarginal() from the ggExtra package to add marginal histograms to a scatterplot.↩︎
Undercoverage is common when a few extreme units dominate the estimator’s variability, though overcoverage can also occur if the variance estimator is overly conservative.↩︎
Strictly speaking, stratified estimators assume independent sampling within strata. Here, a single systematic design is applied across all strata for simplicity.↩︎
Notice there are more rows in the stratified ssu_trees than in the unstratified ssu_trees. This is because a few SSUs straddle stratum boundaries.↩︎