pps Sampling with Minitab

Duane Meeter

STA 4222/5225

Suppose we use Minitab on the counties.xls file from Exercise 8 to estimate the total number of physicians in the 100 counties, considered as a population. The county populations vary in size from a few thousand to six million, so pps sampling should be helpful. We can use pps in two equivalent ways. First, consider the counties as elements and their population as a measure of size (mos), which should be correlated with the number of physicians (pages 49 and 50 of the notes). Second, treat the county as a cluster of m elements, each of which is either a physician or not, and estimate p, the proportion of physicians in the population (pages 51 and 52, where the mean is the same as the proportion, since the data are 0 or 1). Then, since we know the population size M, we can estimate pM = t in the obvious way. We will use the first method.

To use the pps estimator, we need the cumulative sums of the mos (county populations).

Calc>Calculator

Store result in variable ( you provide )

Functions>Partial Sums

Expression PARS(c5)

OK

This gives, in each row, the sum of the populations of all counties up to and including that row (the cum mos).

Sumpop

36023
109547
115955
135216
……..
(96 more)

To sample counties, Calc>Random Data>Uniform

Generate 10 rows of data

Store in column(s) ( you provide )

Lower endpoint: 1

Upper endpoint: 11853116.99

The above generates 10 random numbers on the interval 1 to 11853116.

Manip>Sort Sort column(s) ( give same column )

Store sorted column(s) in ( give same column )

Sort by column ( give same column )

sortcum

59896 722123 1914255 3117305 6049152 6204787

7310578 9329580 9982401 10983753

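Now select the first county whose cum mos in Sumpop is at least 59896, that is, the row whose entry is greater than or equal to 59896 while the entry above it is smaller (36023 < 59896 <= 109547, so county 2). Continue in the same way with the remaining sorted random numbers until ten counties are selected, and store the row numbers in a column.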

selbypps

2 9 18 18 18 18 22 49 63 82

Note that the most populous county, Cook (row 18), was selected four times.
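The cumulative-size selection just described can be sketched in Python. This is only an illustration, not part of the Minitab session: the function name pps_sample and the three-unit toy population are hypothetical, and the sketch draws uniform numbers between 0 and the total rather than from 1 to 11853116.99 as in the dialog above. Only the first four Sumpop entries and the first sorted random number are taken from the handout.

```python
import random
from bisect import bisect_left
from itertools import accumulate


def pps_sample(sizes, n, rng):
    """Select n units with replacement, with probability proportional to size.

    Returns 1-based row numbers in ascending order, like selbypps.
    """
    cum = list(accumulate(sizes))  # cumulative mos, like Sumpop
    total = cum[-1]
    draws = sorted(rng.uniform(0, total) for _ in range(n))
    # bisect_left finds the first row i with cum[i] >= u,
    # i.e. cum[i-1] < u <= cum[i].
    return [bisect_left(cum, u) + 1 for u in draws]


# Check against the handout: first four Sumpop entries, first random number.
sumpop_head = [36023, 109547, 115955, 135216]
first_pick = bisect_left(sumpop_head, 59896) + 1
print(first_pick)  # 2, since 36023 < 59896 <= 109547

# Hypothetical toy population of three units with sizes 10, 20 and 70.
toy_sample = pps_sample([10, 20, 70], 5, random.Random(1))
print(toy_sample)
```

Because the draws are sorted before the lookup, the selected row numbers come out in ascending order, and a large unit can appear several times, as Cook did.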

Compute the selection probability p_i for each county by calculating TOTPOP/11853116 (column pi). The physician totals for the selected counties are in ppsPhys; the estimates are in y/pi.

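Row  selbypps  ppsPhys        pi    y/pi
  1         2       44  0.003039    7097
  2         9     2851  0.006203   39896
  3        18    15153  0.000541   34949
  4        18    15153  0.001625   34949
  5        18    15153  0.000645   34949
  6        18    15153  0.015893   34949
  7        22       81  0.000181   25000
  8        49     4189  0.000211  129410
  9        63       86  0.071459    9328
 10        82       47  0.001694    8363
 11                     0.013602
 12                     0.013592
……….
(88 more)

Note that the pi column lists the selection probabilities of all 100 counties in worksheet order, so its rows do not line up with the ten selected counties in the other columns. For example, the pi for county 9 is the 0.071459 in row 9, and 2851/0.071459 agrees with the 39896 shown for county 9 up to rounding.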
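The first few pi values can be checked from the cumulative sums, since each county's population is the difference between consecutive Sumpop entries. The short Python sketch below is an illustration only; the variable names are made up, and it uses just the four Sumpop entries and the total printed earlier in this handout.

```python
sumpop_head = [36023, 109547, 115955, 135216]  # first four Sumpop entries
total_pop = 11853116                           # divisor used for pi

# Each county's population is the difference of consecutive cumulative sums.
previous = [0] + sumpop_head[:-1]
p = [(cur - prev) / total_pop for cur, prev in zip(sumpop_head, previous)]
print([round(x, 6) for x in p])  # [0.003039, 0.006203, 0.000541, 0.001625]

# County 9: physicians divided by its selection probability (row 9 of pi).
y_over_p_9 = 2851 / 0.071459
print(round(y_over_p_9))  # about 39897; the worksheet shows 39896
```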

The square root of the estimated variance of the pps estimator on p. 50 is the standard error of the mean of the y/pi values. That mean, 35,889, is the pps estimate of t, and its standard error is a whopping 11,144. An approximate 95% confidence interval for the total number of physicians in these 100 counties is 35,889 ± 2(11,144), or 13,601 < t < 58,177!

Variable   N    Mean  Median  TrMean  StDev  SE Mean
y/pi      10   35889   34949   27798  35240    11144
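The Mean, StDev and SE Mean in this summary, and the confidence interval built from them, can be reproduced from the ten y/pi values. The Python sketch below is offered only as a check on the arithmetic; it uses the standard library's statistics module, the variable names are made up, and the multiplier 2 is inferred from the interval quoted in the text.

```python
from math import sqrt
from statistics import mean, stdev

# The ten y/pi values from the worksheet (Cook County appears four times).
y_over_p = [7097, 39896, 34949, 34949, 34949, 34949, 25000, 129410, 9328, 8363]

n = len(y_over_p)
t_hat = mean(y_over_p)            # pps estimate of the total t
se = stdev(y_over_p) / sqrt(n)    # standard error of that estimate

# Approximate 95% interval using a multiplier of 2.
lower, upper = t_hat - 2 * se, t_hat + 2 * se
print(t_hat, round(se), round(lower, 1), round(upper, 1))
```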

Looking at the above data, we see that row 49, St. Louis City, has an unusually high number of physicians for its population: its y/pi of 129,410 is far above the others. Remember: always plot the data! Since some counties have zero physicians, Minitab will not plot that variable on the log scale. Instead, I calculated log(physicians + 1); the plot is below. When the data are plotted in the original scale, the detail is lost, but the magnitude of the outlier is more apparent. We are plotting the entire population; in practice, we would have only the sample values, so it would be harder to spot a problem.
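A small Python sketch of the log(physicians + 1) transformation follows. It assumes base-10 logs, since the handout does not say which base was used, and it takes the ten srsPhys sample values listed later in this handout purely as example input, not the full population that was plotted.

```python
import math

# srsPhys values from the handout, used here only as example input.
phys = [0, 98, 6, 10, 2851, 1, 16, 130, 2, 7]

# The log of 0 is undefined, which is why a county with zero physicians
# cannot be placed on a log scale directly.
try:
    math.log10(0)
    zero_ok = True
except ValueError:
    zero_ok = False

# Adding 1 first maps a count of zero to log10(1) = 0.
log_phys_plus1 = [math.log10(y + 1) for y in phys]
print(zero_ok)
print([round(v, 2) for v in log_phys_plus1])
```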

For comparison, I used srs with n = 10.

Calc>Random Data>Sample From Columns

Sample 10 rows from columns c6

Store samples in ( pick an empty column )

srsPhys

0 98 6 10 2851 1 16 130 2 7

Then Stat>Basic Statistics>Display Descriptive Statistics (name the column)

Variable   N    Mean  Median  TrMean  StDev  SE Mean
srsPhys   10     312       9      34    893      282

The estimated number of physicians in the N = 100-county population is N times the sample mean, 100 × 312 = 31,200. The sample mean has a standard error s/√n of 282 (before the fpc). Multiplying by the fpc factor √(1 - 10/100) = 0.949 gives 268, so a 95% confidence bound on the county mean, 312 ± 2(268), is

-224 < m < 848.

To get the interval for t, multiply by N = 100: -22,400 < t < 84,800.

This is the estimate for Case 2: M unknown, in cluster sampling.
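The srs calculation can be checked the same way. The Python sketch below is an illustration only, with made-up variable names; it applies the fpc factor and a multiplier of 2, which together reproduce the interval -224 to 848 quoted above. Because it works from the unrounded mean 312.1, its totals differ slightly from the rounded figures in the text.

```python
from math import sqrt
from statistics import mean, stdev

srs_phys = [0, 98, 6, 10, 2851, 1, 16, 130, 2, 7]  # srsPhys from the handout
N = 100                                            # counties in the population
n = len(srs_phys)

ybar = mean(srs_phys)                   # sample mean per county
se_mean = stdev(srs_phys) / sqrt(n)     # s / sqrt(n), before the fpc
se_fpc = se_mean * sqrt(1 - n / N)      # after the finite population correction

lo_mean, hi_mean = ybar - 2 * se_fpc, ybar + 2 * se_fpc
t_hat = N * ybar                        # estimated total
lo_t, hi_t = N * lo_mean, N * hi_mean   # interval for the total

print(round(ybar, 1), round(se_mean), round(se_fpc))
print(round(lo_mean), round(hi_mean))
print(round(t_hat), round(lo_t), round(hi_t))
```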