This book is a work in progress and is Open Review. We want your feedback to make the book better for you and other readers. To add your annotation, select some text and then click the on the pop-up menu. To see the annotations of others, click the in the upper right hand corner of the page

Chapter 13 Sampling designs, implementation, and analysis

In Chapters 11 and 12, we focused on inference using sample data collected via a simple random sampling (SRS) design. In this chapter, we introduce several additional probability sampling designs commonly used in forestry. Recall that a sampling design is a set of rules for selecting units from a population and using the resulting data to generate statistically valid estimates for parameters of interest. We focus on four designs: 1) systematic sampling, 2) stratified sampling, 3) cluster sampling, and 4) multistage sampling. These designs differ in how they configure and select sampling units and in the estimators they prescribe. In most cases, a design is chosen to improve efficiency from both a statistical and field logistics standpoint.

As discussed in Section 12.1, forest inventory typically relies on an areal sampling frame, where sampling locations are identified by randomly placing points within the frame’s spatial extent. Under point sampling, a location might serve as the point from which a forester projects the discerning angle to identify measurement trees. Under plot sampling, the location might mark the center of a circular plot, a corner of a rectangular plot, or an anchor point for more complex plot shapes. Unless stated otherwise, the sampling designs described in this chapter can be used with either plot or point sampling—that is, the design governs the selection of sampling locations, while a separate set of rules determines how trees are selected around each location.

After introducing the designs, we shift attention to the use of auxiliary information, which we refer to as sampling with covariates. In forestry, we’re often interested in estimating parameters based on variables that are difficult or expensive to measure. Sampling with covariates improves precision by incorporating other variables that are easier to measure and correlated with the variable of interest.

The remainder of the chapter is organized as follows. Sections 13.2 through 13.5 describe systematic, stratified, cluster, and multistage sampling designs. Section 13.6 covers strategies for incorporating auxiliary information. As in other mensuration texts (Kershaw et al. 2016; Burkhart, Avery, and Bullock 2018), we present estimators associated with each design. Our added value is in showing how to implement these estimators using tidyverse tools and other R packages. Emphasis is placed on building computing workflows for efficient and repeatable analysis.

13.1 A gentle introduction to sf

Because sampling in forestry is inherently spatial—we’re selecting locations within an areal sampling frame—it’s helpful to start working with spatially aware data structures. In R, the primary tool for handling spatial vector data is the sf package.

If you’ve used GIS software before, many of the concepts will feel familiar. Spatial vector data come in three main types: points (e.g., sampling locations), lines (e.g., transects or travel paths), and polygons (e.g., stand boundaries or management units). The sf package represents each of these as a special kind of data frame with an attached geometry column.

You can work with sf objects much like regular data frames, using familiar tidyverse tools such as mutate() and filter(). We’ll introduce key sf data structures and functions as needed throughout the chapter. For more background, see the sf documentation at https://r-spatial.github.io/sf or the package vignette via vignette("sf1").

We’ll explore sf in more detail in Section 14.1.5.

13.2 Systematic sampling

Systematic sampling (SYS) is one of the most widely used probability sampling designs in forest inventory. Sampling locations are arranged in a fixed pattern across the population. To introduce randomness and ensure unbiased estimation of means and totals, the pattern—typically a rectangular grid—is positioned using a randomly chosen starting point. As illustrated in Figure 1.2, sampling locations are usually aligned along straight lines. The distance between locations on a line, known as the sampling interval, and the distance between lines can differ, creating a rectangular grid. This is commonly referred to as line-plot sampling, which is our focus here.

SYS designs are popular in forestry for several reasons. Compared to SRS, their advantages include:

  1. Sampling locations are evenly distributed across the population, improving the reliability of mean and total estimates. In contrast, SRS locations may, by chance, miss important spatial features.
  2. Sampling locations are easier and faster to find, reducing cost. The cruiser can follow straight lines between locations, which minimizes travel time and simplifies relocation during remeasurement.

The grid is typically positioned using GIS or spatial software (e.g., R), and the resulting points are uploaded to a GPS device for navigation. Alternatively, the cruiser may locate the first sampling point in the field and lay out the remainder using the sampling interval and grid orientation.

Key considerations when implementing a SYS design include:

  1. the pattern and orientation of the grid;
  2. the method used to randomly position the grid.

These points are covered in the next few sections.

13.2.1 Grid pattern, number of grid points, and orientation

In practice, either a rectangular or square grid is used. A rectangular grid defines the spacing between sampling locations along a line, denoted \(D_p\) (think “plot” or “point”), and the spacing between lines, \(D_l\). A square grid is a special case where \(D_{\text{sq}} = D_p = D_l\). Unless there’s a compelling reason to use unequal spacing, a square grid is a reasonable and efficient default.

Given a desired sample size \(n\) and total area \(A\), the spacing between locations in a square grid is given by

\[\begin{equation} D_{\text{sq}} = \sqrt{\frac{cA}{n}}, \end{equation}\]

where \(c\) is a conversion factor to express area in square distance units. For example, use \(c = 10,\!000\ \text{m}^2/\text{ha}\) if \(A\) is in hectares and \(D_{\text{sq}}\) is in meters, or \(c = 43,\!560\ \text{ft}^2/\text{ac}\) if \(A\) is in acres and \(D_{\text{sq}}\) is in feet.

As introduced in Section 11.3.6, sample size \(n\) should be based on the level of precision needed to meet the survey’s objectives. However, in practice, it’s common for experienced foresters to adopt a sampling intensity that works well for their forest type and inferential objectives. For example, the forester who conducted the Elk County cruise described in Section 1.2.2 typically uses a seven-by-seven chain SYS grid.

If you do use a sample size calculation, as recommended in Section 11.3.6, it’s likely that the number of grid points won’t exactly match the computed \(n\) due to how the grid falls within the boundary. In such cases, we agree with the following sage words.

My advice is simply to take the number of points that are indicated [by the grid], and live with the slight variability. Sample sizes that are “needed” … are just estimated anyway.

— Kim Iles

If you really must have a specific sample size, then the best approach is to specify a denser grid than needed and randomly or systematically remove points until the desired sample size is reached (see, e.g., K. Iles (2003)).

The grid orientation can be adjusted to improve field logistics. For example, aligning lines with a forest boundary may simplify navigation. This was the approach used in the Elk County cruise (Figure 1.2).

Although rare, it’s possible for a population to exhibit a systematic pattern that aligns with the grid’s spacing and orientation, which could introduce bias. For example, a grid aligned with plantation rows might consistently place sampling locations between rows, leading to unrepresentative samples. In such cases, rotating the grid to avoid alignment with periodic features can mitigate bias.

If a known gradient exists in the population—for example, productivity increasing with elevation—orienting the grid perpendicular to that gradient can improve the representativeness of the sample.

13.2.2 Positioning the grid

To avoid bias, the grid must be randomly placed. This is typically done by selecting a random point start within the areal sampling frame and extending the grid in each direction from that point. If rotation is applied, it should be performed about this random start point.

The random point start can be one of the sampling locations or used as an anchor point from which to extend the grid. In the development that follows, we treat the starting point as a location on the sampling grid.

In the next section, we introduce a function that generates systematic sampling grids in R. This function allows users to specify grid spacing, orientation, and how sampling locations are ordered. The code and examples also help highlight some practical considerations when laying out systematic designs.

13.2.3 Creating a systematic sampling grid with make_sys_grid()

The make_sys_grid() function creates a rectangular grid of points with user-specified sampling interval (\(D_p\)) and spacing between lines (\(D_l\)) within a bounding polygon that defines the areal sampling frame. The grid begins from a random start point placed within the polygon. The resulting grid can be rotated and sampling locations ordered. Here’s how it works given values for the following arguments:

  1. A random start point is placed within the polygon.
  2. A grid with spacing given by D_p and D_l is generated around the random start point, extended far enough in each direction to account for possible rotation.
  3. The grid points are sorted in a “snake” pattern based on the chosen starting corner specified by origin.
  4. If specified, the grid is rotated around the random start point by the given angle (in degrees).
  5. Finally, points falling outside the polygon are clipped, and the results are returned as an sf object along with other supporting sf objects.

Sorting the grid points via origin can be useful for planning the cruiser’s navigation along lines. Allowed origin values are "se", "sw", "ne", and "nw", corresponding to southeast, southwest, northeast, and northwest corners, respectively.

The make_sys_grid() function is included in the "scripts/utils.R" script that comes with the book (see Section 5.4 for how to access it using source()). You’re encouraged to read through the function on your own. The comments walk through each step, and many components should look familiar. For anything new, check the help pages or online resources.

The function returns a list with four elements146:
- grid: an sf object of point features representing the ordered sampling locations;
- path: an sf object of a single LINESTRING connecting the locations in order;
- random_point_start: an sf object of the point used to position the grid;
- boundary: an sf object of the polygon(s) used to clip the grid.

The grid object includes columns point_id, line_id, and point_id_in_line, which hold unique point IDs ordered from the given origin, line IDs, and point-within-line IDs, respectively. Other columns in grid will be inherited from boundary by spatial intersection. These attributes will be useful for other designs (e.g., stratified sampling in Section 13.3).

Let’s apply make_sys_grid() to set up a SYS design for the Elk County cruise described in Section 1.2.2. In that cruise, the forester used a square grid with sampling locations spaced every seven chains (recall that one chain equals 66 feet).

The make_sys_grid() function expects the boundary polygon to be in a projection that uses the same distance units as D_l and D_p. The Elk County property is stored in the "datasets/PA/Elk_bounds" ESRI shapefile (Environmental Systems Research Institute 1998), which uses a geographic coordinate system (i.e., decimal degrees of latitude and longitude)—not an intuitive measurement system for field work. Instead, we want the boundary in a projected coordinate system that allows us to specify D_l and D_p in feet or meters.

After reading the shapefile using st_read(), the code below reprojects the boundary to a projection appropriate for northern Pennsylvania and consistent with the cruise protocol. This is done using st_transform().147

library(sf)
library(tidyverse)
source("scripts/utils.R")

# Read in the bounding polygon.
elk_bounds <- st_read("datasets/PA", "Elk_bounds", quiet = TRUE)
elk_bounds <- st_transform(elk_bounds, crs = st_crs(2271))

Once reprojected, elk_bounds has units in feet, so we specify D_l and D_p in feet when calling make_sys_grid(). Because we don’t specify values for angle and origin, the defaults of 0 and "se", respectively, are used. To ensure the same random point start is generated each time the code is run, we set the random seed prior to calling the function.

# Set random seed.
set.seed(1)

# Create the SYS grid.
sys_grid <- make_sys_grid(elk_bounds, D_l = 7 * 66, D_p = 7 * 66)

We can visualize the resulting SYS design by passing the sys_grid object to the plot_sys_grid() function, defined in the "scripts/utils.R" script. The resulting map is shown in Figure 13.1. You’re encouraged to explore the plot_sys_grid() function to see how it builds the graphic using ggplot(). Compared to the simpler examples in Chapter 9, this plot is a bit more involved, with a customized legend and multiple spatial layers.148

Systematic sampling design for an Elk County property cruise.

FIGURE 13.1: Systematic sampling design for an Elk County property cruise.

Let’s create a few variations on the design shown in Figure 13.1 by rotating the grid and choosing a different origin—while keeping the same random point start. This isolates the effect of orientation and ordering. To reduce code clutter, we don’t show the calls to plot_sys_grid() or the patchwork functions used to produce the composite graphic shown in Figure 13.2.

# (a) Rotate grid 45 degrees.
set.seed(1)
sys_grid_a <- make_sys_grid(elk_bounds, 
                            D_l = 7 * 66, D_p = 7 * 66, angle = 45)

# (b) Northwest corner start.
set.seed(1)
sys_grid_b <- make_sys_grid(elk_bounds, 
                            D_l = 7 * 66, D_p = 7 * 66, origin = "nw")

# (c) Rotation and northwest corner start.
set.seed(1)
sys_grid_c <- make_sys_grid(elk_bounds, 
                            D_l = 7 * 66, D_p = 7 * 66, 
                            angle = 45,  origin = "nw")
Systematic sampling design for an Elk County cruise under different grid rotations and starting origins.

FIGURE 13.2: Systematic sampling design for an Elk County cruise under different grid rotations and starting origins.

Notice in Figures 13.1 and 13.2 the random point start remains in the same location—it’s the only random component in the systematic sampling design. Once the starting point is fixed, all other sampling locations are determined by the specified spacing and rotation. This highlights a key feature of systematic sampling: although the layout may appear highly structured, the randomness introduced by the initial point ensures the design retains its probabilistic foundation. As a result, estimators derived from such a design can still be used for valid statistical inference under appropriate assumptions.

With a sampling grid in place, we now turn to estimation using data collected under systematic sampling.

13.2.4 Estimation using SYS

The SYS estimators for the population mean \(\mu\) (11.1) and total \(\tau\) (11.6) are the same unbiased sample mean \(\bar{y}\) (11.15) and total \(\hat{t}\) (11.22) used under SRS (Section 11.3). Both are unbiased under SYS.

There is, however, no unbiased estimator for the sample variance (and thus no estimator for the standard error of the mean) under SYS. The most common solution is to apply the SRS estimators for variance (11.17) and standard error (11.21).

In some cases, however, the SRS variance and standard error of the mean estimators can overestimate uncertainty, yielding conservative confidence intervals. In others, they can underestimate it. Unfortunately, it’s not straightforward to know which case you’re in without additional information about the population.

It’s easy to build some intuition about why SYS lacks a valid variance estimator. Recall from Section 11.3 that the sample variance estimator used for SRS assumes the \(n\) sampled units are drawn independently and at random from the population. This independence is critical: it justifies the use of \(n - 1\) in the denominator of the variance estimator (11.17), which reflects the number of independent pieces of information available to estimate population variance.

Now contrast that with SYS. Here, a single random point start determines the entire sample. All subsequent locations are fixed by the grid layout. So, while we observe \(n\) units, only one is selected at random. The remaining \(n - 1\) are not independent—they’re determined once the start is fixed.

From this perspective, SYS behaves like a sample of size one. And plugging \(n = 1\) into the SRS variance estimator (11.17) results in division by zero which is undefined. This highlights the core issue: SYS does not produce a sample of independent observations, so the usual SRS variance estimator(s) don’t apply.

Again, despite these limitations, the SRS variance estimator is commonly used as a practical approximation for SYS. In many situations, it provides a reasonable estimate of the sample variance and, hence, the standard error of the mean. Still, understanding how the population is structured—and how that structure interacts with the sampling grid—is key to interpreting results. See Kershaw et al. (2016) and K. Iles (2003) for further discussion, including cautions and justification for why the SRS variance estimator is often considered acceptable for SYS data.

13.2.5 Advantages and disadvantages of systematic sampling

Systematic sampling offers several practical benefits and is widely used in forestry, but like all designs, it involves trade-offs. Here we summarize its main strengths and limitations, particularly in comparison to simple random sampling (SRS).

13.2.5.1 Advantages

  • Efficient field layout: SYS designs are easy to implement in the field. Sampling locations fall on a predictable grid, allowing cruisers to follow straight lines, reducing travel time and simplifying navigation.

  • Improved spatial coverage: Grid-based sampling ensures sampling locations are evenly distributed across the population, helping to capture spatial heterogeneity and reduce the risk of missing important features—a common concern with SRS.

  • Simplifies remeasurement: Because sampling locations follow a fixed, repeatable pattern, they are easier to relocate during future inventory cycles.

  • Cost-effective: The combination of reduced travel time and simplified logistics typically results in lower per-unit sampling cost compared to SRS.

  • Good performance in many populations: In the absence of strong periodicity or spatial trends aligned with the grid, SYS often provides better precision than SRS for the same sample size.

13.2.5.2 Disadvantages

  • No valid variance estimator: SYS lacks a design-based estimator for the sample variance and standard error. While the SRS estimator is commonly used as an approximation, it may over- or underestimate uncertainty, depending on the population structure.

  • Potential bias from periodicity: If the population exhibits spatial patterns that align with the grid spacing or orientation (e.g., plantation rows, soil type bands), SYS can produce biased estimates.

  • Requires spatial data infrastructure: Implementing SYS effectively typically requires GIS tools and spatial data, which may be a barrier in some settings.

  • Assumes homogeneity within grid cells: SYS works best when the population is spatially continuous or smoothly varying. If abrupt changes occur between grid cells, important features may be missed.

Despite these drawbacks, SYS remains a go-to design in operational forestry due to its simplicity, spatial balance, and low implementation cost. When used thoughtfully—with attention to population structure and proper randomization of the starting point—SYS can yield highly effective inventory data.

13.3 Stratified sampling

13.3.1 Estimation using stratified sampling

13.3.2 Allocation of sample plots

13.3.2.1 Proportional allocation

13.3.2.2 Optimum allocation

13.3.3 Sample size determination

13.3.4 Illustrative analysis

13.3.5 Advantages and disadvantages of stratifed sampling

13.4 Cluster sampling

13.4.1 Estimation using cluster sampling

13.4.2 Illustrative analysis

13.4.3 Advantages and disadvantages of cluster sampling

13.5 Multistage sampling

13.5.1 Estimation using multistage sampling

13.5.2 Illustrative analysis

13.5.3 Estimation with unequal sized primaries and varying amounts of secondaries

13.5.4 Advantages and disadvantages of multistage sampling

13.6 Sampling with covariates

13.6.1 Regression Sampling

13.6.1.1 Regression Estimation

13.6.1.2 Ratio Estimation

13.6.2 Double sampling

13.6.2.1 Regression Estimation

13.6.2.2 Ratio Estimation

13.7 Summary and outlook

13.8 Exercises

References

Burkhart, H. E., T. E. Avery, and B. P. Bullock. 2018. Forest Measurements: Sixth Edition. Waveland Press.
Environmental Systems Research Institute. 1998. ESRI Shapefile Technical Description. Environmental Systems Research Institute, Inc. https://www.esri.com/library/whitepapers/pdfs/shapefile.pdf.
Iles, K. 2003. A Sampler of Inventory Topics: A Practical Discussion for Resource Samplers, Concentrating on Forest Inventory Techniques. Kim Iles & Associates.
Kershaw, J. A., M. J. Ducey, T. W. Beers, and B. Husch. 2016. Forest Mensuration. Wiley.

  1. You can export any of these spatial objects using the sf function st_write(), which supports a wide range of formats. For example, st_write(sys_grid$grid, "sys_grid_points.shp") saves the sampling locations as an ESRI Shapefile, which can be read by many GPS units and mapping applications.↩︎

  2. The specific projection is NAD83 / Pennsylvania South (ft), which corresponds to EPSG:2271.↩︎

  3. In plot_sys_grid(), notice how the geom_sf() layers automatically render each sf object based on its geometry type—points, lines, or polygons—making it easy to build clean and consistent maps. You can modify the function to adjust colors, symbols, or other plot features to suit your needs.↩︎

Want to know when the book is for sale? Enter your email so we can let you know.