Summary
The jackknife method is often used for variance estimation in sample surveys but has only been developed for a limited class of sampling designs. We propose a jackknife variance estimator which is defined for any without-replacement unequal probability sampling design. We demonstrate design consistency of this estimator for a broad class of point estimators. A Monte Carlo study shows how the proposed estimator may improve on existing estimators.
Keywords: Inclusion probabilities, Linearization, Pseudovalues, Smooth function of means, Stratification
1. Introduction
Jackknife methods are widely used for standard error estimation in sample surveys (e.g. Wolter (1985) and Shao and Tu (1995)). Tukey's (1958) original idea of jackknife variance estimation has been developed to handle stratified multistage sampling by Lee (1973), Jones (1974), Kish and Frankel (1974) and Krewski and Rao (1981), among others, and the properties of various forms of the jackknife estimator for this case have been studied both theoretically and empirically (e.g. Krewski and Rao (1981), Rao and Wu (1985), Kovar et al. (1988), Rao et al. (1992) and Shao and Tu (1995)). The restriction of the jackknife method to stratified multistage designs constrains its applicability compared, for example, with linearization estimators, which have been defined for any unequal probability sampling design without replacement (Särndal et al. (1992), section 5.5). In this paper we address this constraint by proposing, in Section 3, a jackknife variance estimator that is applicable to the same general class of sampling designs.
Our approach is based on the analogy between the jackknife and linearization methods, in which the analytic derivative in linearization is replaced by a numerical approximation (Davison and Hinkley (1997), page 50). The estimator that is proposed is a jackknife analogue of a standard linearization variance estimator for unequal probability designs. The same estimator was effectively also proposed by Campbell (1980) in an impressively general paper, which seems unfortunately to have received little attention in the subsequent survey sampling literature. This paper goes beyond Campbell (1980) by investigating the properties of this estimator both theoretically and numerically.
The class of point estimators, for which the variance estimator proposed is defined, is set out in Section 2. We demonstrate in Section 4 that the estimator is consistent for the same asymptotic variance as the linearization estimator. We support this result with a small simulation study in Section 6 comparing the sampling properties of our estimator with three existing jackknife variance estimators that are described in Section 5.
2. The class of point estimators
Before considering variance estimation, it is necessary to define the point estimator, the variance of which is to be estimated. We consider a finite population 𝒰={1,…,i,…,N} containing N units and suppose that values yqi, q=1,…,Q, for Q survey variables are associated with the unit that is labelled i. We assume that a sample 𝒮⊂𝒰 is selected according to a probability sampling design and that there is no non-response.
We motivate the class of point estimators by first defining a class of population parameters θ of interest. We assume that this parameter can be expressed as a function of means, θ=g(μ1,…,μQ), where g(·) is a smooth function (see Appendix A) from ℝQ to ℝ and μq is the finite population mean, μq=N−1Σi ∈ 𝒰yqi. This definition of θ includes most parameters of interest arising in common survey applications, such as ratios, subpopulation means and correlation and regression coefficients. We assume that θ is a scalar for simplicity although the approach could be generalized to multivariate θ.
We now define the point estimator as the substitution estimator θ̂=g(μ̂1,…,μ̂Q), where

μ̂q = Σi ∈ 𝒮 wi yqi

is the Hájek (1981) ratio estimator of μq, the weight wi is given by

wi = (N̂πi)−1, (1)

N̂ = Σi ∈ 𝒮 πi−1 is an unbiased estimator of N and πi denotes the first-order inclusion probability of unit i. Many parameters of interest in surveys, e.g. ratios and correlation coefficients, are invariant to multiplication of each μq in g(μ1,…,μQ) by a common constant; in such cases the specification of N̂ in equation (1) is arbitrary and θ̂ could be viewed alternatively as a function of estimated totals.
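The Hájek ratio estimator and the substitution estimator of Section 2 can be sketched in a few lines of Python. This is an illustrative implementation under the notation above; the function names are our own, not from the paper.

```python
# Sketch of the Hajek ratio estimator mu_hat_q = sum_{i in S} w_i y_qi,
# with weights w_i = 1/(N_hat pi_i) and N_hat = sum_{i in S} 1/pi_i
# (equation (1)).  Function names are illustrative.

def hajek_weights(pi):
    """Weights of equation (1); they sum to 1 over the sample."""
    n_hat = sum(1.0 / p for p in pi)   # unbiased estimator of N
    return [1.0 / (n_hat * p) for p in pi]

def hajek_mean(y, pi):
    """Hajek ratio estimator of a single population mean mu_q."""
    return sum(w * yi for w, yi in zip(hajek_weights(pi), y))

def substitution_estimator(g, ys, pi):
    """theta_hat = g(mu_hat_1, ..., mu_hat_Q) for Q parallel y-lists."""
    return g(*[hajek_mean(y, pi) for y in ys])

# Example: a ratio of two means, a typical smooth function g.
theta_hat = substitution_estimator(lambda m1, m2: m1 / m2,
                                   [[2.0, 4.0], [1.0, 1.0]],
                                   [0.25, 0.5])
```

Note that the weights are self-normalizing, so the Hájek mean reduces to the ordinary sample mean when all the πi are equal.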
3. The proposed jackknife variance estimator
We adopt a design-based approach and consider the estimation of the variance of θ̂ with respect to the sampling design. We propose to estimate this variance by

v̂(θ̂) = Σi ∈ 𝒮 Σj ∈ 𝒮 {(πij−πiπj)/πij} ε(i)ε(j), (2)

where πij denotes the probability that both units i and j are selected (with the convention πii=πi),

ε(j) = (1−wj){θ̂−θ̂(j)}, (3)

θ̂(j)=g{μ̂1(j),…,μ̂Q(j)}, μ̂q(j)=Σi ∈ 𝒮−j wi(j)yqi with wi(j)={N̂(j)πi}−1 and N̂(j)=Σi ∈ 𝒮−j πi−1, 𝒮−j consists of 𝒮 with the jth unit deleted and n is the size of the sample 𝒮.
The estimator in equation (2) takes the form of the variance estimator of Horvitz and Thompson (1952) for the sample sum of empirical influence values (Davison and Hinkley (1997), chapter 2), where these empirical influence values are numerically approximated by the jackknife pseudovalues. This is analogous to the linearization variance estimator (Särndal et al. (1992), page 175) which takes the same form but with the empirical influence values obtained by analytic differentiation. This perspective was first set out by Campbell (1980), who noted how both these estimators could be constructed but did not evaluate their properties in detail.
The factor 1−wi is a correction for unequal πi, reducing the contribution of observations which have higher πi-values and thus make smaller contributions to the variance. The inclusion of this factor ensures that equation (2) reduces to the usual linearization variance estimator (Särndal et al. (1992), page 182) when θ̂ is the Hájek estimator μ̂1, say, in which case ε(i) reduces to wi(y1i−μ̂1). The (1−wi)-correction was suggested by Campbell (1980), who noted an algebraic equivalence with the weighted jackknife method of Hinkley (1977).
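The construction above can be sketched directly: compute the delete-one pseudovalues with the (1−wj) correction, then apply the Horvitz–Thompson variance form to them. This is a minimal sketch assuming the reconstructed form of equations (2) and (3); all names are illustrative.

```python
# Sketch of the proposed jackknife variance estimator:
#   v = sum_i sum_j {(pi_ij - pi_i pi_j)/pi_ij} eps_(i) eps_(j),
# with pseudovalues eps_(j) = (1 - w_j)(theta_hat - theta_hat_(j)).

def hajek_mean(y, pi):
    """Hajek ratio estimator of the population mean (Section 2)."""
    n_hat = sum(1.0 / p for p in pi)
    return sum(yi / (p * n_hat) for yi, p in zip(y, pi))

def jackknife_variance(theta, y, pi, pi2):
    """theta(y, pi): point estimator; pi2[i][j]: joint inclusion
    probabilities, with the convention pi2[i][i] = pi[i]."""
    n = len(y)
    n_hat = sum(1.0 / p for p in pi)
    w = [1.0 / (n_hat * p) for p in pi]          # weights of equation (1)
    full = theta(y, pi)
    # delete-one pseudovalues with the (1 - w_j) correction
    eps = [(1.0 - w[j]) * (full - theta(y[:j] + y[j+1:], pi[:j] + pi[j+1:]))
           for j in range(n)]
    # Horvitz-Thompson variance form applied to the pseudovalues
    return sum((pi2[i][j] - pi[i] * pi[j]) / pi2[i][j] * eps[i] * eps[j]
               for i in range(n) for j in range(n))

# Example: simple random sampling without replacement, n = 4 from N = 10,
# so pi_i = 0.4 and pi_ij = 4*3/(10*9) for i != j.
pi = [0.4] * 4
pi2 = [[0.4 if i == j else 4 * 3 / (10 * 9) for j in range(4)]
       for i in range(4)]
v = jackknife_variance(hajek_mean, [1.0, 2.0, 3.0, 4.0], pi, pi2)
```

The double loop is O(n²) in the sample size, on top of the n re-evaluations of θ̂; for large samples the pseudovalues would be computed once and cached, as here.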
4. Consistency
In this section we consider the design consistency of the variance estimator proposed. Building on the analogy between linearization and jackknife variance estimation, we follow the approach of Särndal et al. (1992), who treated the linearization variance estimator under an unequal probability design as an estimator of an approximate linearized variance and then referred to other evidence that this approximate variance agrees well with the actual variance in large samples (Särndal et al. (1992), page 175). The approximate linearized variance (Robinson and Särndal, 1983) varL(θ̂) in our case (using expressions (5.5.10) and (5.7.4) in Särndal et al. (1992)) is given by

varL(θ̂) = Σi ∈ 𝒰 Σj ∈ 𝒰 (πij−πiπj)(zi/πi)(zj/πj), (4)

where

zi = N−1 ∇(μ)T(yi−μ),

yi=(y1i,…,yQi)T, ∇(x) denotes the gradient of g(·) at x ∈ ℝQ and it is assumed that g(·) is continuous and differentiable at μ=(μ1,…,μQ)T.
To demonstrate the consistency of the proposed variance estimator for the approximate linearized variance, we first define our asymptotic framework. Let {𝒮t} be a sequence of samples selected from the sequence of nested finite populations {𝒰t} of sizes Nt by a sequence of sampling designs, such that 𝒮t is composed of a fixed number nt of distinct elements selected from 𝒰t (nt<Nt) for t=1,2,…. For simplicity of notation, the index t will be suppressed in what follows and all limiting processes will be understood to be as t→∞. We shall denote by →p and →d respectively convergence in probability and convergence in distribution as t→∞.
Theorem 1
Provided that the linearization variance estimator (11) is design consistent and under regularity assumptions that are given in Appendix A, the proposed variance estimator (2) is also design consistent, i.e.

v̂(θ̂)/varL(θ̂) →p 1. (5)
The proof of theorem 1 is given in Appendix A.
It follows as a corollary of theorem 1 that if

{θ̂−θ}/√varL(θ̂) →d N(0,1), (6)

i.e. if appropriate conditions hold for the linearization variance estimator to generate asymptotically valid confidence intervals, then by Slutsky's lemma

{θ̂−θ}/√v̂(θ̂) →d N(0,1).

Confidence intervals based on v̂(θ̂) will then be asymptotically valid.
The key requirement for condition (6) to hold is that the Horvitz–Thompson estimators underlying the definition of θ̂ are asymptotically normal. Sufficient conditions for asymptotic normality have been investigated to a limited extent in the survey sampling literature, but some examples of conditions are given by Hájek (1964) and Rosén (1972).
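The corollary licenses the usual Wald construction: a nominal (1−α) interval is θ̂ ± z(α/2)√v̂(θ̂). A minimal sketch, with the critical value z passed in (1.96 for a nominal 95% interval):

```python
# Asymptotically valid Wald interval implied by condition (6) plus
# Slutsky's lemma: theta_hat +/- z * sqrt(v_hat).
import math

def wald_interval(theta_hat, v_hat, z=1.96):
    half = z * math.sqrt(v_hat)
    return theta_hat - half, theta_hat + half
```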
5. Alternative jackknife variance estimators
For comparison with the variance estimator proposed, we now consider some alternative jackknife estimators that have been proposed in the literature. The standard jackknife variance estimator of θ̂ (Tukey, 1958) is defined by

{(n−1)/n} Σj ∈ 𝒮 {θ̂(j)−θ̃}2, (7)

where θ̃ = n−1 Σj ∈ 𝒮 θ̂(j). If we ignore the finite population correction and if we assume that the sample is selected by simple random sampling without replacement, equation (2) reduces to equation (7). The variance estimator in equation (7) has been shown to be consistent for independent and identically distributed observations (e.g. Shao (1989, 1993) and Shao and Tu (1995)).
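The standard delete-one jackknife of equation (7) is short enough to sketch in full; for θ̂ the sample mean it reproduces the classical estimate s²/n, which makes a convenient check.

```python
# Sketch of the standard delete-one jackknife (equation (7)):
#   v_JK = (n-1)/n * sum_j (theta_(j) - theta_tilde)^2,
# where theta_tilde is the average of the n leave-one-out estimates.

def tukey_jackknife(theta, y):
    n = len(y)
    loo = [theta(y[:j] + y[j+1:]) for j in range(n)]  # theta_(j)
    tilde = sum(loo) / n                              # theta_tilde
    return (n - 1) / n * sum((t - tilde) ** 2 for t in loo)

# For theta = sample mean this equals s^2/n, with s^2 the unbiased
# sample variance.
v = tukey_jackknife(lambda v_: sum(v_) / len(v_), [1.0, 2.0, 3.0, 4.0])
```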
For the case of stratified simple random sampling without replacement, Lee (1973) (see also Kish and Frankel (1974)) proposed the variance estimator

Σh {(nh−1)/nh} Σj ∈ 𝒮h {θ̂(hj)−θ̃h}2, (8)

where 𝒮h is the sample of size nh in the hth stratum 𝒰h and θ̂(hj) is the point estimator computed with unit j of stratum h deleted. For comparison, equation (2) reduces under this design to

Σh (1−fh)(1−wh)2 {nh/(nh−1)} Σj ∈ 𝒮h {θ̂(hj)−θ̃h}2, (9)

where θ̃h = nh−1 Σj ∈ 𝒮h θ̂(hj), fh=nh/Nh and wh is the common value of the wi in stratum h. Ignoring the finite population correction, equation (9) is the jackknife estimator that was proposed by Jones (1974). Thus, when the nh are large and the finite population correction is negligible, equation (8) is close to equation (9). It is worth noting that equation (9) naturally includes a finite population correction which is absent in equation (8).
Rao et al. (1992) described a customary ‘delete cluster’ jackknife variance estimator for a general weighted point estimator in stratified multistage designs. For the case when the clusters are single units and the weights are the Horvitz–Thompson weights πi−1, their estimator reduces to

Σh {(nh−1)/nh} Σi ∈ 𝒮h {θ̂(hi)−θ̂}2, (10)

where θ̂(hi) is computed by omitting unit i ∈ 𝒮h and by modifying the weights so that πj−1 is replaced by {nh/(nh−1)}πj−1 for all j ∈ 𝒮h and the weight πj−1 stays unaltered for all other j.
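The reweighting step is the essential part of this estimator: deleting a unit inflates the remaining Horvitz–Thompson weights in its stratum by nh/(nh−1). The sketch below assumes the common delete-one form with squared deviations from the full-sample estimate; names are illustrative.

```python
# Sketch of a delete-one jackknife with stratum reweighting, assuming
# the form v = sum_h (n_h-1)/n_h * sum_{i in S_h} (theta_(hi) - theta)^2.

def weighted_mean(y, w):
    """A simple weighted point estimator used for illustration."""
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)

def delete_one_jackknife(theta, y, pi, strata):
    """theta(y, w): weighted estimator; strata[i]: stratum label of i."""
    base_w = [1.0 / p for p in pi]          # Horvitz-Thompson weights
    full = theta(y, base_w)
    v = 0.0
    for h in sorted(set(strata)):
        idx_h = [i for i, s in enumerate(strata) if s == h]
        n_h = len(idx_h)
        ssq = 0.0
        for i in idx_h:
            y_i, w_i = [], []
            for j in range(len(y)):
                if j == i:
                    continue                 # unit i is deleted
                # inflate same-stratum weights by n_h/(n_h - 1)
                scale = n_h / (n_h - 1) if strata[j] == h else 1.0
                y_i.append(y[j])
                w_i.append(scale * base_w[j])
            ssq += (theta(y_i, w_i) - full) ** 2
        v += (n_h - 1) / n_h * ssq
    return v
```

Each stratum needs nh ≥ 2 for the reweighting factor to be defined, mirroring the "at least two units per stratum" requirement of replication methods.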
6. Monte Carlo study
In this section, the proposed variance estimator (2) is compared numerically with the alternative jackknife estimators (7), (8) and (10). We use a population frame given in Valliant et al. (2000), appendix B, and available at the John Wiley World Wide Web site ftp://ftp.wiley.com/public/sci_tech_med/finite_populations. This population frame is extracted from the September 1976 Current Population Survey in the USA. We duplicate this population frame five times to create an artificial population of N=2390 individuals from which samples will be selected. This population is stratified into H=3 strata. The variables that are of interest are the number of hours worked per week (y1i) and the weekly wages (y2i). The population parameter that is considered is the finite population correlation coefficient between these two variables,

ρ = σ12/(σ1σ2),

where σ12=Σi ∈ 𝒰(y1i−μ1)(y2i−μ2) and σk2=Σi ∈ 𝒰(yki−μk)2 (k=1,2). The population value is ρ=0.49. We propose to estimate ρ by the substitution estimator

ρ̂ = σ̂12/(σ̂1σ̂2),

where σ̂12=Σi ∈ 𝒮 wi(y1i−μ̂1)(y2i−μ̂2) and σ̂k2=Σi ∈ 𝒮 wi(yki−μ̂k)2 (k=1,2).
We consider a stratified sampling design with proportional allocation with at least two units selected per stratum, using the Chao (1982) sampling design for selection within each stratum. The πi are proportional to a skewed size variable correlated with the y2i, with a correlation coefficient of 0.83. The size variable has a coefficient of variation of 1.22, a Fisher coefficient of skewness of 3.13 and a kurtosis of 14.7. The πij are computed exactly by using an expression given by Chao (1982).
For each simulation, 10 000 samples were selected to compute the empirical relative bias

RB(v̂) = bias(v̂)/var(θ̂),

where bias(v̂) is the difference between the empirical mean of v̂ over the 10 000 samples and var(θ̂), and the empirical relative root-mean-square error

RRMSE(v̂) = [Ê{v̂−var(θ̂)}2]1/2/var(θ̂)

of equations (2), (7), (8) and (10), where Ê denotes the empirical mean. The variance var(θ̂) is the empirical variance of the 10 000 observed values of θ̂.
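The two Monte Carlo summaries are straightforward to compute; a minimal sketch, with V standing for the empirical variance of the point estimates:

```python
# Empirical relative bias RB = {mean(v_hat) - V} / V and relative
# root-mean-square error RRMSE = sqrt(mean((v_hat - V)^2)) / V,
# where V is the empirical variance of theta_hat over the replicates.
import math

def relative_bias(v_hats, V):
    return (sum(v_hats) / len(v_hats) - V) / V

def rrmse(v_hats, V):
    return math.sqrt(sum((v - V) ** 2 for v in v_hats) / len(v_hats)) / V
```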
The relative bias for the various estimators is given in Table 1 for several sampling fractions f=n/N. The second column gives the relative bias of θ̂, RB(θ̂). Estimators (7), (8) and (10) seriously overestimate the variance. For all the sampling fractions that were considered, the proposed estimator (2) has negligible bias. Table 2 gives the RRMSE for equations (2), (7), (8) and (10). We see that the proposed estimator (2) has the smallest RRMSE for almost every value of f.
Table 1. Relative bias (%) with and without finite population correction (FPC)

| f | RB(θ̂) | Equation (2) | Equation (7) | Equation (7) with FPC | Equation (8) | Equation (8) with FPC | Equation (10) | Equation (10) with FPC |
|---|---|---|---|---|---|---|---|---|
| 0.03 | −6.16 | 1.18 | 20.18 | 16.56 | 16.86 | 13.34 | 18.66 | 15.09 |
| 0.05 | −4.30 | −1.08 | 12.85 | 7.23 | 11.05 | 5.52 | 12.22 | 6.63 |
| 0.07 | −2.76 | −2.34 | 9.33 | 1.65 | 8.12 | 0.52 | 8.99 | 1.33 |
| 0.10 | −2.08 | 0.43 | 11.39 | 0.25 | 10.53 | −0.52 | 11.20 | 0.08 |
| 0.12 | −1.93 | −0.01 | 10.58 | −2.69 | 9.88 | −3.31 | 10.46 | −2.81 |
| 0.15 | −1.30 | 1.70 | 12.69 | −4.24 | 12.11 | −4.73 | 12.60 | −4.31 |
| 0.20 | −0.88 | 0.77 | 12.96 | −9.63 | 12.53 | −9.98 | 12.91 | −9.67 |
| 0.40 | −0.45 | −1.16 | 22.68 | −26.39 | 22.44 | −26.53 | 22.66 | −26.40 |
Table 2. Relative root-mean-square error (%) with and without finite population correction (FPC)

| f | Equation (2) | Equation (7) | Equation (7) with FPC | Equation (8) | Equation (8) with FPC | Equation (10) | Equation (10) with FPC |
|---|---|---|---|---|---|---|---|
| 0.03 | 91.13 | 126.78 | 122.52 | 123.71 | 119.61 | 124.46 | 120.29 |
| 0.05 | 74.95 | 97.67 | 92.28 | 96.44 | 91.20 | 96.86 | 91.55 |
| 0.07 | 66.67 | 81.56 | 75.34 | 80.84 | 74.78 | 81.10 | 74.95 |
| 0.10 | 59.25 | 71.24 | 63.29 | 70.74 | 62.96 | 71.00 | 63.10 |
| 0.12 | 55.35 | 64.88 | 56.39 | 64.50 | 56.18 | 64.74 | 56.29 |
| 0.15 | 50.08 | 58.15 | 48.41 | 57.83 | 48.29 | 58.03 | 48.33 |
| 0.20 | 43.24 | 50.36 | 40.11 | 50.13 | 40.09 | 50.30 | 40.07 |
| 0.40 | 28.67 | 40.17 | 33.05 | 40.00 | 33.15 | 40.14 | 33.05 |
To see whether the difference between the bias of equations (2), (7), (8) and (10) is due to the finite population correction, we have multiplied the variance estimators (7), (8) and (10) by 1−f. The RB- and the RRMSE-values are given in the columns that are headed by ‘with FPC’ in Tables 1 and 2. We see that, for large sampling fractions, this correction tends to lead to underestimation of the variance. For small sampling fractions, the finite population correction cannot eliminate the large positive bias. This may be caused by the skewness of the πi and the small sample size.
7. Discussion
The jackknife variance estimator that is proposed in equation (2) is applicable to general unequal probability designs and is design consistent in circumstances where the linearization variance estimator is consistent. A Monte Carlo study shows that the estimator proposed can demonstrate clear improvements compared with existing jackknife estimators. It naturally includes a finite population correction, which is usually absent in the standard jackknife methods, and may be of particular use for surveys with large sampling fractions.
The jackknife method proposed may be extended in various ways. Point estimators, such as calibration estimators (e.g. Deville and Särndal (1992)), which employ auxiliary population information may often be expressible as functions of means if the function g(·) may be specified in terms of this auxiliary finite population information. The method may in principle be extended to other point estimators which may be expressed as differentiable functionals (Hampel, 1974; Campbell, 1980), although it is well known that the consistency result will not extend to all non-smooth functions of means, such as quantiles.
The practical advantage of the method proposed is its breadth of applicability. A potential disadvantage is that it is constructed by deleting one sample element at a time in contrast with the usual deletion of clusters and this may lead to a major increase in computation. Furthermore, the method assumes that joint inclusion probabilities πij for sample units are available. If not, then various approximations to these joint inclusion probabilities may be used (e.g. Hájek (1964) and Berger (1998)). Multistage sampling with unequal probability sampling without replacement at each stage merits particular further research. The application of the method proposed when the first- and second-order inclusion probabilities are available for each stage of sampling and the potential use of equation (2) at each stage could be considered and compared with standard jackknife methods which delete primary sampling units.
Acknowledgements
The authors are grateful to J. N. K. Rao (Carleton University, Canada) and to two referees for helpful comments.
References
1. Berger, Y. G. (1998) Rate of convergence to asymptotic variance for the Horvitz–Thompson estimator. J. Statist. Planng Inf., 74, 149–168.
2. Campbell, C. (1980) A different view of finite population estimation. Proc. Surv. Res. Meth. Sect. Am. Statist. Ass., 319–324.
3. Chao, M. T. (1982) A general purpose unequal probability sampling plan. Biometrika, 69, 653–656.
4. Davison, A. C. and Hinkley, D. V. (1997) Bootstrap Methods and Their Application. Cambridge: Cambridge University Press.
5. Deville, J. C. and Särndal, C. E. (1992) Calibration estimators in survey sampling. J. Am. Statist. Ass., 87, 376–382.
6. Hájek, J. (1964) Asymptotic theory of rejective sampling with varying probabilities from a finite population. Ann. Math. Statist., 35, 1491–1523.
7. Hájek, J. (1981) Sampling from a Finite Population. New York: Dekker.
8. Hampel, F. R. (1974) The influence curve and its role in robust estimation. J. Am. Statist. Ass., 69, 383–393.
9. Harville, D. A. (1997) Matrix Algebra from a Statistician's Perspective. New York: Springer.
10. Hinkley, D. V. (1977) Jackknife in unbalanced situations. Technometrics, 19, 285–292.
11. Horvitz, D. G. and Thompson, D. J. (1952) A generalization of sampling without replacement from a finite universe. J. Am. Statist. Ass., 47, 663–685.
12. Isaki, C. T. and Fuller, W. A. (1982) Survey design under the regression superpopulation model. J. Am. Statist. Ass., 77, 89–96.
13. Jones, H. L. (1974) Jackknife estimation of functions of stratum means. Biometrika, 61, 343–348.
14. Kish, L. and Frankel, M. R. (1974) Inference from complex samples (with discussion). J. R. Statist. Soc. B, 36, 1–37.
15. Kovar, J. G., Rao, J. N. K. and Wu, C. F. J. (1988) Bootstrap and other methods to measure errors in survey estimates. Can. J. Statist., 16, 25–45.
16. Krewski, D. and Rao, J. N. K. (1981) Inference from stratified samples: properties of the linearization, jackknife and balanced repeated replication methods. Ann. Statist., 9, 1010–1019.
17. Lee, K. (1973) Variance estimation in stratified sampling. J. Am. Statist. Ass., 68, 336–342.
18. Rao, J. N. K. and Wu, C. F. J. (1985) Inference from stratified samples: second-order analysis of three methods for nonlinear statistics. J. Am. Statist. Ass., 80, 620–630.
19. Rao, J. N. K., Wu, C. F. J. and Yue, K. (1992) Some recent work on resampling methods for complex surveys. Surv. Methodol., 18, 209–217.
20. Robinson, P. M. and Särndal, C. E. (1983) Asymptotic properties of the generalized regression estimator in probability sampling. Sankhya B, 45, 240–248.
21. Rosén, B. (1972) Asymptotic theory for successive sampling with varying probabilities without replacement, I. Ann. Math. Statist., 43, 373–397.
22. Särndal, C. E., Swensson, B. and Wretman, J. H. (1992) Model Assisted Survey Sampling. New York: Springer.
23. Shao, J. (1989) The efficiency and consistency of approximation to the jackknife variance estimator. J. Am. Statist. Ass., 84, 114–119.
24. Shao, J. (1993) Differentiability of statistical functionals and consistency of the jackknife. Ann. Statist., 21, 61–75.
25. Shao, J. and Tu, D. (1995) The Jackknife and Bootstrap. New York: Springer.
26. Tukey, J. W. (1958) Bias and confidence in not-quite large samples (abstract). Ann. Math. Statist., 29, 614.
27. Valliant, R., Dorfman, A. H. and Royall, R. M. (2000) Finite Population Sampling and Inference: a Prediction Approach. New York: Wiley.
28. Wolter, K. M. (1985) Introduction to Variance Estimation. New York: Springer.
29. Yates, F. and Grundy, P. M. (1953) Selection without replacement from within strata with probability proportional to size. J. R. Statist. Soc. B, 15, 253–261.
Appendix A: Assumptions and proof of theorem 1
The following assumptions will be made.
- (a)
v̂L(θ̂)/varL(θ̂) →p 1, where v̂L(θ̂) is the linearization variance estimator that is given by

v̂L(θ̂) = Σi ∈ 𝒮 Σj ∈ 𝒮 {(πij−πiπj)/πij}(ẑi/πi)(ẑj/πj), (11)

where

ẑi = N̂−1 ∇(μ̂)T(yi−μ̂),

with

μ̂ = (μ̂1,…,μ̂Q)T.
- (b)
|1−wi| ⩾ α > 0 for all i ∈ 𝒰, where α is a constant (free of t).
- (c)

{n varL(θ̂)}−1 = O(1).
- (d)
, for all τ2, where ║·║ denotes the Euclidean norm defined by ║A║=tr(ATA)1/2.
- (e)
, where
(12)
- (f)
, where
(13)
- (g)
∇(x) is Lipschitz continuous of order δ>0 (e.g. Shao and Tu (1995), page 43) in the sense that

║∇(x1)−∇(x2)║ ⩽ λ║x1−x2║δ

for a constant λ>0, where x1 and x2 are in the neighbourhood of μ.
- (h)
.
Assumption (a) states that the linearization variance estimator is consistent. An example of sufficient conditions for this assumption to hold can be found in Krewski and Rao (1981). Assumption (b) ensures that none of the weights (1) can approach 1, which would represent a degenerate design. Assumption (c) holds in the standard circumstances where the linearized variance decreases with rate n−1 (Shao and Tu (1995), page 260). It holds when varL(θ̂) ⩾ ν n−1, where ν is a positive constant. This inequality is similar to the Cramér–Rao lower bound. Assumption (d) is an assumption about the behaviour of the weights and the existence of moments of the yi, which would hold, for example, if the nwi and the yi were bounded. Assumptions (e) and (f) are mild assumptions on the design, similar to ones in Isaki and Fuller (1982). For example, with simple random sampling without replacement, Gs = 1−n/N = Op(1) and Hs=0. Moreover, if the condition of Yates and Grundy (1953) holds, Dij<0 for all i and j, implying that Hs=0. Assumptions (g) and (h) are smoothness requirements of the function g(·).
A.1. Proof of theorem 1
From the mean value theorem, we have
where ξi is a point between μ̂ and μ̂(i) and the remainder is given by
Thus,
where
(14)
It can be shown that
(15)
implying that
(16)
Thus, by substituting equation (16) into equation (2), we obtain
with
(17)
(18)
Hence, theorem 1 follows if we may show
(19)
(20)
(21)
Assumption (a) implies expression (19). It is therefore only necessary to show expressions (20) and (21). We start by showing expression (20). From equation (17),
Furthermore, by definition of and in expressions (12) and (13), we have
(22)
where
and
By the Cauchy inequality,
Now, as
with , we have
(23)
where
(24)
Moreover,
(25)
(26)
Thus, assumption (e) and inequality (23) imply that , if and . The Cauchy inequality (e.g. Harville (1997), page 62) further implies that
Combining this last inequality with equation (15), we obtain
(27)
Assumption (g) implies that there are constants λ>0 and δ>0 such that
(28)
As ξi is a point between μ̂ and μ̂(i), we have ║ξi−μ̂║ ⩽ ║μ̂(i)−μ̂║. Combining this last inequality with equation (15), we obtain
which combined with inequality (28) gives
Now, using assumption (b), we have
(29)
Thus, inequalities (27) and (29) imply that
(30)
First, we show that . Combining inequalities (25) and (30), we obtain
(31)
Assumption (c) implies that
(32)
Now assumption (d) and expressions (31) and (32) imply that , i.e.
(33)
Secondly, we show that . Combining inequalities (26) and (30), we obtain
(34)
Now assumption (d) and expressions (34) and (32) imply that , i.e.
(35)
Thirdly, assumption (e) and expressions (23), (33) and (35) imply that
(36)
Now, we show that . We have by the Cauchy inequality
Thus, assumption (f) and expressions (33) and (35) imply that
(37)
Consequently, expression (20) follows from expressions (36) and (37). To complete the proof we need to show expression (21). By the triangle inequality, equation (18) implies that
with , where
and
By the Cauchy inequality, and , with
(38)
Thus, expression (21) follows from assumptions (e) and (f), if we can show that . The Cauchy inequality implies that . By substituting the last inequality and inequality (30) into equation (38), we obtain
(39)
Now, from assumptions (c) and (d) and expressions (32) and (39), we have
which implies expression (21), completing the proof.
© 2005 Royal Statistical Society