
STATA Program Notes

# Biostatistics:  A Guide to Design, Analysis, and Discovery

Chapter 15: Analysis of Survey Data

Program Note 15.1 - Simple random sampling without replacement

In Example 15.4, a simple random sample of 8 health departments was selected from 60 rural county health departments.  The following STATA commands can be used to create the data displayed in Table 15.4.

input obs prof mph
1 21 14
2 18  8
3  9  3
4 13  6
5 15  8
6 22 13
7 30 17
8 27 15
end

We also use the commands below to create N, the population size of 60 rural county health departments, and wt, the sampling weight, which is the inverse of the sampling probability and equals 60/8.

gen N = 60
gen wt = 60/8

Using the STATA command svyset, we declare the survey design settings.  Because the design was a simple random sample, we specify the sample weight as a pweight inside brackets [ ].  STATA allows other weight types, but pweight is the one of interest here because it refers to the inverse of the probability of selection into the sample.  Next, the design option fpc, followed by the variable holding the population size in parentheses ( ), requests a finite-population correction.

svyset [pweight=wt], fpc(N)

      pweight: wt
          VCE: linearized
     Strata 1: <one>
         SU 1: <observations>
        FPC 1: N

To estimate the total number of workers with an MPH degree, we use the following STATA command.

svy: total mph

This provides the following STATA output:

(running total on estimation sample)

Survey: Total estimation

Number of strata =       1          Number of obs    =       8
Number of PSUs   =       8          Population size  =      60
                                    Design df        =       7

--------------------------------------------------------------
             |             Linearized
             |      Total   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
         mph |        630   97.32126      399.8718    860.1282
--------------------------------------------------------------
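As a hand check on the output above, the total and its standard error can be reproduced from the results stored by summarize. This is only a sketch and not part of the original note: it assumes the eight observations entered with input are still in memory, and it uses the usual simple-random-sampling variance of an estimated total, N^2 (1 - n/N) s^2 / n.

```stata
* Hand check of svy: total mph (assumes the 8-observation data set is in memory)
quietly summarize mph
* weighted total: (N/n) * sum of mph = 7.5 * 84
display "Total     = " (60/8) * r(sum)
* standard error, including the finite-population correction (1 - n/N)
display "Std. Err. = " sqrt(60^2 * (1 - 8/60) * r(Var) / r(N))
```

Both values should agree with the total of 630 and the standard error of 97.32126 reported by svy: total.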

For the second analysis, we create the variable totxmph by multiplying mph by 1150, the number of professional workers in the population.

gen totxmph = 1150*mph

svyset [pweight=wt], fpc(N)

      pweight: wt
          VCE: linearized
     Strata 1: <one>
         SU 1: <observations>
        FPC 1: N

svy: ratio totxmph prof

The results of running the STATA command svy: ratio are shown below.  See the text for more details regarding differences in the estimated standard errors.

(running ratio on estimation sample)

Survey: Ratio estimation

Number of strata =       1          Number of obs    =       8
Number of PSUs   =       8          Population size  =      60
                                    Design df        =       7

     _ratio_1: totxmph/prof

--------------------------------------------------------------
             |             Linearized
             |      Ratio   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
    _ratio_1 |   623.2258   29.92157      552.4725    693.9791
--------------------------------------------------------------
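The point estimate in this output is simply 1150 times the ratio of the sample sums, because the common weight of 60/8 cancels between numerator and denominator. A minimal sketch of that check follows; the scalar names sum_mph and sum_prof are introduced here for illustration only, and the eight-observation data set is assumed to be in memory.

```stata
* Hand check of the ratio estimate (scalar names are illustrative)
quietly summarize mph
scalar sum_mph = r(sum)
quietly summarize prof
scalar sum_prof = r(sum)
* 1150 * 84 / 155
display "Ratio estimate = " 1150 * sum_mph / sum_prof
```

The displayed value should match the 623.2258 reported by svy: ratio. The standard error is not reproduced here because it depends on the linearized variance of the ratio.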

Program Note 15.2 - Analysis of subpopulations

The subpop option in STATA can be used to perform a subpopulation analysis.  Use the STATA file adult4.dta and its corresponding data dictionary to conduct the data analysis.  For example, assume that we are interested in calculating the mean bmi for African Americans.  We begin by specifying the psu, pweight, and strata.

svyset psu [pweight=wgt], strata(stra)

      pweight: wgt
          VCE: linearized
     Strata 1: stra
         SU 1: psu
        FPC 1: <zero>

The command below attempts to perform a subpopulation analysis by selecting African Americans with non-missing data on bmi.

svy: mean bmi if bmi != 99 & race == 2

Because only one PSU remained in the 13th and 15th strata, the linearized standard error estimate and the 95% confidence interval for the mean cannot be calculated, as shown below.

(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      23          Number of obs    =    2847
Number of PSUs   =      44          Population size  = 1075.27
                                    Design df        =      21

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
         bmi |   27.27701          .             .           .
--------------------------------------------------------------

Note: Missing standard error due to stratum with single sampling unit; see help svydes.
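The note above points to svydes, which describes the strata and sampling units of the survey design. A hedged sketch of how it might be used to locate the affected strata is given below; it assumes the svyset settings are still in effect and that the single option, which restricts the listing to strata with one sampling unit, is available in the installed version of STATA.

```stata
* List only the strata left with a single PSU under the same if restriction
svydes if bmi != 99 & race == 2, single
```

If the option behaves as assumed, the listing should identify the two strata mentioned above.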

However, the correct way to calculate the average bmi for African Americans is to use the STATA option subpop.  The subpop option requires an indicator variable that equals ‘1’ if a participant’s race is African American and ‘0’ otherwise.  Before we use the tab command to list the values of the race variable, we use the STATA command drop as shown below.  The drop command removes the observations for which bmi equals ‘99’.

drop if bmi == 99

tab race

       race |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |      3,906       39.38       39.38
          2 |      2,958       29.82       69.19
          3 |      2,593       26.14       95.33
          4 |        463        4.67      100.00
------------+-----------------------------------
      Total |      9,920      100.00

The commands below show the creation of the indicator variable aa.

gen aa = 1 if race == 2
replace aa = 0 if race == 1 | race == 3 | race == 4
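The same indicator can also be built in one step from a logical expression, which evaluates to 1 when true and 0 when false. The sketch below is an alternative rather than the note's own method; the variable name aa2 is hypothetical, and the check assumes race takes only the values 1 through 4 shown in the tabulation.

```stata
* One-line construction of the indicator (aa2 is an illustrative name)
gen aa2 = (race == 2) if !missing(race)
* stops with an error if the two indicators ever disagree
assert aa == aa2
```

The if !missing(race) qualifier leaves aa2 missing, rather than 0, for any observation with a missing race value.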

The command below calculates the linearized standard error estimate and the 95% confidence interval for the mean.

svy: mean bmi, subpop(aa)

(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      23          Number of obs    =    9372
Number of PSUs   =      46          Population size  = 9644.12
                                    Subpop. no. obs  =    2847
                                    Subpop. size     = 1075.27
                                    Design df        =      23

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
         bmi |   27.27701   .1804639       26.9037    27.65033
--------------------------------------------------------------

Program Note 15.3 - Descriptive Analysis

The following STATA commands calculate the weighted means, proportions, standard errors, and design effects shown in Example 15.7.  The structure of the commands is similar to those in Program Note 15.2.  Because variables of interest have missing values, we begin by eliminating missing values from the data set and by creating an indicator variable for Hispanics using the STATA commands below.  The estimates provided below differ slightly from the values presented in the book because we are not using imputation.

gen hispanic = 1 if race == 3
replace hispanic = 0 if race == 1 | race == 2 | race == 4
recode vit (2=0)
recode smoke (2=0)

svy: mean age aa hispanic educat avsbp bmi vit smoke

(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      23          Number of obs    =    9165
Number of PSUs   =      46          Population size  = 9446.76
                                    Design df        =      23

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
         age |   43.41835   .5655883      42.24834    44.58836
          aa |   .1111356   .0096133      .0912491    .1310222
    hispanic |   .0498226    .006869       .035613    .0640322
      educat |   12.40544   .1204095      12.15635    12.65452
       avsbp |     122.14   .3717584      121.3709     122.909
         bmi |   25.95808    .117135      25.71577    26.20039
         vit |   .4304797   .0123517      .4049283    .4560312
       smoke |   .5138776   .0118874      .4892866    .5384686
--------------------------------------------------------------
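The table above reports means and standard errors but not the design effects mentioned at the start of this note. One way to obtain them, offered here as a sketch rather than as the command used for the book, is the postestimation command estat effects, run immediately after svy: mean while its results are still the active estimates.

```stata
* Display design effects (DEFF and DEFT) for the means just estimated
estat effects
```

Because these notes do not use imputation, any design effects obtained this way may differ slightly from those in Example 15.7, as the means do.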

Program Note 15.4 - Contingency Table Analysis

svyset psu [pweight=wgt], strata(stra)

svy: tab vit edu, col

(running tabulate on estimation sample)

Number of strata   =        23                  Number of obs      =      9911
Number of PSUs     =        46                  Population size    =  9916.576
                                                Design df          =        23

--------------------------------------
          |            edu
      vit |     1      2      3  Total
----------+---------------------------
        0 | .6662  .6009  .4856  .5706
        1 | .3338  .3991  .5144  .4294
          |
    Total |     1      1      1      1
--------------------------------------
  Key:  column proportions

  Pearson:
    Uncorrected   chi2(2)         =  227.5422
    Design-based  F(1.65, 37.92)  =   28.7910     P = 0.0000

svyset psu [pweight=wgt], strata(stra)

svy: tab vit edu, col subpop(hispanic)

(running tabulate on estimation sample)

Number of strata   =        23                  Number of obs      =      9911
Number of PSUs     =        46                  Population size    =  9916.576
                                                Subpop. no. of obs =      2592
                                                Subpop. size       =   538.998
                                                Design df          =        23

--------------------------------------
          |            edu
      vit |     1      2      3  Total
----------+---------------------------
        0 | .7385  .6728  .5719  .6914
        1 | .2615  .3272  .4281  .3086
          |
    Total |     1      1      1      1
--------------------------------------
  Key:  column proportions

  Pearson:
    Uncorrected   chi2(2)         =  186.5486
    Design-based  F(1.97, 45.27)  =   24.3900     P = 0.0000

Program Note 15.5 - Regression Analysis

svy: regress avsbp height weight age sex vit

(running regress on estimation sample)

Survey: Linear regression

Number of strata   =        23                  Number of obs      =      9368
Number of PSUs     =        46                  Population size    =   9920.06
                                                Design df          =        23
                                                F(   5,     19)    =    937.30
                                                Prob > F           =    0.0000
                                                R-squared          =    0.3925

------------------------------------------------------------------------------
             |             Linearized
       avsbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      height |  -.4008937   .1022887    -3.92   0.001    -.6124939   -.1892935
      weight |    .091683   .0047982    19.11   0.000     .0817571    .1016088
         age |    .600425   .0131726    45.58   0.000     .5731754    .6276746
         sex |   4.029257    .654631     6.16   0.000      2.67505    5.383465
         vit |  -1.196061   .4194095    -2.85   0.009    -2.063675   -.3284459
       _cons |   106.2809   6.765258    15.71   0.000     92.28585    120.2759
------------------------------------------------------------------------------

Program Note 15.6 - Logistic Regression Analysis

gen edu1 = 0
replace edu1 = 1 if educat == 12
gen edu2 = 0
replace edu2 = 1 if educat > 12
gen male = 0
replace male = 1 if sex == 1

svy: logistic vit male edu1 edu2

(running logistic on estimation sample)

Survey: Logistic regression

Number of strata   =        23                  Number of obs      =      9920
Number of PSUs     =        46                  Population size    =   9920.06
                                                Design df          =        23
                                                F(   3,     21)    =     63.08
                                                Prob > F           =    0.0000

------------------------------------------------------------------------------
             |             Linearized
         vit | Odds Ratio   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   .6071846   .0351149    -8.63   0.000     .5387209     .684349
        edu1 |    1.28964   .1134639     2.89   0.008     1.075043    1.547075
        edu2 |   2.148065   .1960688     8.38   0.000     1.778458    2.594485
------------------------------------------------------------------------------