# anova — Analysis of variance and covariance


Syntax

anova varname [termlist] [if] [in] [weight] [, options]

where termlist is a factor-variable list (see [U] 11.4.3 Factor variables) with the following additional features:

• Variables are assumed to be categorical; use the c. factor-variable operator to override this.

• The | symbol (indicating nesting) may be used in place of the # symbol (indicating interaction).

• The / symbol is allowed after a term and indicates that the following term is the error term for the preceding terms.

    options              Description

    Model
      repeated(varlist)  variables in terms that are repeated-measures variables
      partial            use partial (or marginal) sums of squares
      sequential         use sequential sums of squares
      noconstant         suppress constant term
      dropemptycells     drop empty cells from the design matrix

    Adv. model
      bse(term)          between-subjects error term in repeated-measures ANOVA
      bseunit(varname)   variable representing lowest unit in the between-subjects error term
      grouping(varname)  grouping variable for computing pooled covariance matrix

bootstrap, by, fp, jackknife, and statsby are allowed; see [U] 11.1.10 Prefix commands. Weights are not allowed with the bootstrap prefix; see [R] bootstrap. aweights are not allowed with the jackknife prefix; see [R] jackknife. aweights and fweights are allowed; see [U] 11.1.6 weight. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands.

Menu

Statistics > Linear models and related > ANOVA/MANOVA > Analysis of variance and covariance

Description

The anova command fits analysis-of-variance (ANOVA) and analysis-of-covariance (ANCOVA) models for balanced and unbalanced designs, including designs with missing cells; for repeated-measures ANOVA; and for factorial, nested, or mixed designs.

The regress command (see [R] regress) will display the coefficients, standard errors, etc., of the regression model underlying the last run of anova.

If you want to fit one-way ANOVA models, you may find the oneway or loneway command more convenient; see [R] oneway and [R] loneway. If you are interested in MANOVA or MANCOVA, see [MV] manova.

Options

Model

repeated(varlist) indicates the names of the categorical variables in the terms that are to be treated as repeated-measures variables in a repeated-measures ANOVA or ANCOVA.

partial presents the ANOVA table using partial (or marginal) sums of squares. This setting is the default. Also see the sequential option.

sequential presents the ANOVA table using sequential sums of squares.

noconstant suppresses the constant term (intercept) from the ANOVA or regression model.

dropemptycells drops empty cells from the design matrix. If c(emptycells) is set to keep (see [R] set emptycells), this option temporarily resets it to drop before running the ANOVA model. If c(emptycells) is already set to drop, this option does nothing.

Adv. model

bse(term) indicates the between-subjects error term in a repeated-measures ANOVA. This option is needed only in the rare case when the anova command cannot automatically determine the between-subjects error term.

bseunit(varname) indicates the variable representing the lowest unit in the between-subjects error term in a repeated-measures ANOVA. This option is rarely needed because the anova command automatically selects the first variable listed in the between-subjects error term as the default for this option.

grouping(varname) indicates a variable that determines which observations are grouped together in computing the covariance matrices that will be pooled and used in a repeated-measures ANOVA. This option is rarely needed because the anova command automatically selects the combination of all variables except the first (or as specified in the bseunit() option) in the between-subjects error term as the default for grouping observations.

Remarks and examples

Remarks are presented under the following headings:

    Introduction
    One-way ANOVA
    Two-way ANOVA
    N-way ANOVA
    Weighted data
    ANCOVA
    Nested designs
    Mixed designs
    Latin-square designs
    Repeated-measures ANOVA
    Video examples

Introduction

anova uses least squares to fit the linear models known as ANOVA or ANCOVA (henceforth referred to simply as ANOVA models).

If your interest is in one-way ANOVA, you may find the oneway command to be more convenient; see [R] oneway.

Structural equation modeling provides a more general framework for fitting ANOVA models; see the Stata Structural Equation Modeling Reference Manual.

ANOVA was pioneered by Fisher. It features prominently in his texts on statistical methods and his design of experiments (1925, 1935). Many books discuss ANOVA; see, for instance, Altman (1991); van Belle et al. (2004); Cobb (1998); Snedecor and Cochran (1989); or Winer, Brown, and Michels (1991). For a classic source, see Scheffé (1959). Kennedy and Gentle (1980) discuss ANOVA’s computing problems. Edwards (1985) is concerned primarily with the relationship between multiple regression and ANOVA. Acock (2014, chap. 9) illustrates his discussion with Stata output. Repeated-measures ANOVA is discussed in Winer, Brown, and Michels (1991); Kuehl (2000); and Milliken and Johnson (2009). Pioneering work in repeated-measures ANOVA can be found in Box (1954); Geisser and Greenhouse (1958); Huynh and Feldt (1976); and Huynh (1978). For a Stata-specific discussion of ANOVA contrasts, see Mitchell (2012, chap. 7–9).

One-way ANOVA

anova, entered without options, performs and reports standard ANOVA. For instance, to perform a one-way layout of a variable called endog on exog, you would type anova endog exog.

Example 1: One-way ANOVA

We run an experiment varying the amount of fertilizer used in growing apple trees. We test four concentrations, using each concentration in three groves of 12 trees each. Later in the year, we measure the average weight of the fruit.

If all had gone well, we would have had 3 observations on the average weight for each of the four concentrations. Instead, two of the groves were mistakenly leveled by a confused man on a large bulldozer. We are left with the following data:

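    . use http://www.stata-press.com/data/r13/apple
    (Apple trees)
    . list, abbrev(10) sepby(treatment)

         +--------------------+
         | treatment   weight |
         |--------------------|
      1. |         1    117.5 |
      2. |         1    113.8 |
      3. |         1    104.4 |
         |--------------------|
      4. |         2     48.9 |
      5. |         2     50.4 |
      6. |         2     58.9 |
         |--------------------|
      7. |         3     70.4 |
      8. |         3     86.9 |
         |--------------------|
      9. |         4     87.7 |
     10. |         4     67.3 |
         +--------------------+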

To obtain one-way ANOVA results, we type

    . anova weight treatment

                       Number of obs =      10     R-squared     =  0.9147
                       Root MSE      = 9.07002     Adj R-squared =  0.8721

          Source |  Partial SS    df       MS           F     Prob > F
      -----------+----------------------------------------------------
           Model |  5295.54433     3  1765.18144      21.46     0.0013
                 |
       treatment |  5295.54433     3  1765.18144      21.46     0.0013
                 |
        Residual |  493.591667     6  82.2652778
      -----------+----------------------------------------------------
           Total |    5789.136     9  643.237333

We find significant (at better than the 1% level) differences among the four concentrations.

Although the output is a usual ANOVA table, let’s run through it anyway. Above the table is a summary of the underlying regression. The model was fit on 10 observations, and the root mean squared error (Root MSE) is 9.07. The R² for the model is 0.9147, and the adjusted R² is 0.8721.

The first line of the table summarizes the model. The sum of squares (Partial SS) for the model is 5295.5 with 3 degrees of freedom (df). This line results in a mean square (MS) of 5295.5/3 ≈ 1765.2. The corresponding F statistic is 21.46 and has a significance level of 0.0013. Thus the model appears to be significant at the 0.13% level.

The next line summarizes the first (and only) term in the model, treatment. Because there is only one term, the line is identical to that for the overall model.

The third line summarizes the residual. The residual sum of squares is 493.59 with 6 degrees of freedom, resulting in a mean squared error of 82.27. The square root of this latter number is reported as the Root MSE.

The model plus the residual sum of squares equals the total sum of squares, which is reported as 5789.1 in the last line of the table. This is the total sum of squares of weight after removal of the mean. Similarly, the model plus the residual degrees of freedom sum to the total degrees of freedom, 9. Remember that there are 10 observations. Subtracting 1 for the mean, we are left with 9 total degrees of freedom.
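The arithmetic behind this table can be checked outside Stata. The following is a minimal Python sketch, not part of the Stata documentation; it assumes only the ten observations listed above, and all variable names (`groups`, `ss_model`, and so on) are illustrative.

```python
# Apple data from Example 1: treatment level -> observed average fruit weights.
groups = {
    1: [117.5, 113.8, 104.4],
    2: [48.9, 50.4, 58.9],
    3: [70.4, 86.9],
    4: [87.7, 67.3],
}

values = [w for ws in groups.values() for w in ws]
n = len(values)
grand_mean = sum(values) / n

# Total SS: squared deviations of every observation from the grand mean.
ss_total = sum((w - grand_mean) ** 2 for w in values)

# Residual SS: squared deviations of each observation from its own group mean.
ss_resid = 0.0
for ws in groups.values():
    group_mean = sum(ws) / len(ws)
    ss_resid += sum((w - group_mean) ** 2 for w in ws)

# Model SS is what remains; degrees of freedom follow the same split.
ss_model = ss_total - ss_resid
df_model = len(groups) - 1
df_resid = n - len(groups)
df_total = n - 1

ms_model = ss_model / df_model
ms_resid = ss_resid / df_resid
f_stat = ms_model / ms_resid
root_mse = ms_resid ** 0.5
r_squared = ss_model / ss_total
adj_r_squared = 1 - ms_resid / (ss_total / df_total)

print(f"Model    SS = {ss_model:.5f}  df = {df_model}  MS = {ms_model:.5f}  F = {f_stat:.2f}")
print(f"Residual SS = {ss_resid:.6f}  df = {df_resid}  MS = {ms_resid:.7f}")
print(f"Total    SS = {ss_total:.3f}  df = {df_total}")
print(f"Root MSE = {root_mse:.5f}  R-squared = {r_squared:.4f}  Adj R-squared = {adj_r_squared:.4f}")
```

The significance level (Prob > F) is omitted because it needs the F distribution, which the Python standard library does not provide.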

Technical note

Rather than using the anova command, we could have performed this analysis by using the oneway command. Example 1 in [R] oneway repeats this same analysis. You may wish to compare the output.

Type regress to see the underlying regression model corresponding to an ANOVA model fit using the anova command.

Example 2: Regression table from a one-way ANOVA

Returning to the apple tree experiment, we found that the fertilizer concentration appears to significantly affect the average weight of the fruit. Although that finding is interesting, we next want to know which concentration appears to grow the heaviest fruit. One way to find out is by examining the underlying regression coefficients.

    . regress, baselevels

          Source |       SS       df       MS              Number of obs =      10
    -------------+------------------------------           F(  3,     6) =   21.46
           Model |  5295.54433     3  1765.18144           Prob > F      =  0.0013
        Residual |  493.591667     6  82.2652778           R-squared     =  0.9147
    -------------+------------------------------           Adj R-squared =  0.8721
           Total |    5789.136     9  643.237333           Root MSE      =    9.07

    ------------------------------------------------------------------------------
          weight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
       treatment |
              1  |          0  (base)
              2  |  -59.16667   7.405641    -7.99   0.000    -77.28762   -41.04572
              3  |     -33.25   8.279758    -4.02   0.007    -53.50984   -12.99016
              4  |      -34.4   8.279758    -4.15   0.006    -54.65984   -14.14016
                 |
           _cons |      111.9   5.236579    21.37   0.000     99.08655    124.7134
    ------------------------------------------------------------------------------

See [R] regress for an explanation of how to read this table. The baselevels option of regress displays a row indicating the base category for our categorical variable, treatment. In summary, we find that concentration 1, the base (omitted) group, produces significantly heavier fruits than concentrations 2, 3, and 4; concentration 2 produces the lightest fruits; and concentrations 3 and 4 appear to be roughly equivalent.
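In a one-way layout with treatment 1 as the base level, the constant is the base-group mean and each coefficient is the difference between a group mean and the base-group mean. The Python sketch below (an illustration written for this passage, with hypothetical variable names) reproduces the coefficients and standard errors from the apple data under that assumption.

```python
# Apple data from Example 1: treatment level -> observed average fruit weights.
groups = {
    1: [117.5, 113.8, 104.4],
    2: [48.9, 50.4, 58.9],
    3: [70.4, 86.9],
    4: [87.7, 67.3],
}
base = 1  # base (omitted) level, as in the regress output

means = {k: sum(ws) / len(ws) for k, ws in groups.items()}
n_obs = sum(len(ws) for ws in groups.values())

# Mean squared error: pooled within-group variance on n - k degrees of freedom.
ss_resid = sum((w - means[k]) ** 2 for k, ws in groups.items() for w in ws)
mse = ss_resid / (n_obs - len(groups))

# _cons is the base-group mean; each coefficient is a difference from it.
cons = means[base]
coef = {k: means[k] - cons for k in groups}

# Standard errors for a mean and for a difference of two independent means.
se_cons = (mse / len(groups[base])) ** 0.5
se = {
    k: (mse * (1 / len(groups[k]) + 1 / len(groups[base]))) ** 0.5
    for k in groups
    if k != base
}

print(f"_cons = {cons:.1f} (SE {se_cons:.6f})")
for k in sorted(se):
    print(f"treatment {k}: coef = {coef[k]:.5f}, SE = {se[k]:.6f}, t = {coef[k] / se[k]:.2f}")
```

Treatments 3 and 4 share a standard error because both groups have two observations.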

Example 3: ANOVA replay

We previously typed anova weight treatment to produce and display the ANOVA table for our apple tree experiment. Typing regress displays the regression coefficients. We can redisplay the ANOVA table by typing anova without arguments:

    . anova

                       Number of obs =      10     R-squared     =  0.9147
                       Root MSE      = 9.07002     Adj R-squared =  0.8721

          Source |  Partial SS    df       MS           F     Prob > F
      -----------+----------------------------------------------------
           Model |  5295.54433     3  1765.18144      21.46     0.0013
                 |
       treatment |  5295.54433     3  1765.18144      21.46     0.0013
                 |
        Residual |  493.591667     6  82.2652778
      -----------+----------------------------------------------------
           Total |    5789.136     9  643.237333

Two-way ANOVA

You can include multiple explanatory variables with the anova command, and you can specify interactions by placing ‘#’ between the variable names. For instance, typing anova y a b performs a two-way layout of y on a and b. Typing anova y a b a#b performs a full two-way factorial layout. The shorthand anova y a##b does the same.

With the default partial sums of squares, when you specify interacted terms, the order of the terms does not matter. Typing anova y a b a#b is the same as typing anova y b a b#a.


Example 4: Two-way factorial ANOVA

The classic two-way factorial ANOVA problem, at least as far as computer manuals are concerned, is a two-way ANOVA design from Afifi and Azen (1979).

Fifty-eight patients, each suffering from one of three different diseases, were randomly assigned to one of four different drug treatments, and the change in their systolic blood pressure was recorded. Here are the data:

                 Drug 1                   Drug 2                  Drug 3            Drug 4
    Disease 1    42, 44, 36, 13, 19, 22   28, 23, 34, 42, 13      1, 29, 19         24, 9, 22, -2, 15
    Disease 2    33, 26, 33, 21           34, 33, 31, 36          11, 9, 7, 1, -6   27, 12, 12, -5, 16, 15
    Disease 3    31, -3, 25, 25, 24       3, 26, 28, 32, 4, 16    21, 1, 9, 3       22, 7, 25, 5, 12

Let’s assume that we have entered these data into Stata and stored the data as systolic.dta. Below we use the data, list the first 10 observations, summarize the variables, and tabulate the control variables:

    . use http://www.stata-press.com/data/r13/systolic
    (Systolic Blood Pressure Data)
    . list in 1/10

         +---------------------------+
         | drug   disease   systolic |
         |---------------------------|
      1. |    1         1         42 |
      2. |    1         1         44 |
      3. |    1         1         36 |
      4. |    1         1         13 |
      5. |    1         1         19 |
         |---------------------------|
      6. |    1         1         22 |
      7. |    1         2         33 |
      8. |    1         2         26 |
      9. |    1         2         33 |
     10. |    1         2         21 |
         +---------------------------+

    . summarize

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
            drug |        58         2.5    1.158493          1          4
         disease |        58    2.017241    .8269873          1          3
        systolic |        58    18.87931    12.80087         -6         44

    . tabulate drug disease

               |        Patient's Disease
     Drug Used |         1          2          3 |     Total
    -----------+---------------------------------+----------
             1 |         6          4          5 |        15
             2 |         5          4          6 |        15
             3 |         3          5          4 |        12
             4 |         5          6          5 |        16
    -----------+---------------------------------+----------
         Total |        19         19         20 |        58

Each observation in our data corresponds to one patient, and for each patient we record drug, disease, and the increase in the systolic blood pressure, systolic. The tabulation reveals that the data are not balanced: the drug-disease cells do not contain equal numbers of patients. Stata does not require that the data be balanced. We can perform a two-way factorial ANOVA by typing

    . anova systolic drug disease drug#disease

                           Number of obs =      58     R-squared     =  0.4560
                           Root MSE      = 10.5096     Adj R-squared =  0.3259

                  Source |  Partial SS    df       MS           F     Prob > F
            -------------+----------------------------------------------------
                   Model |  4259.33851    11  387.212591       3.51     0.0013
                         |
                    drug |  2997.47186     3  999.157287       9.05     0.0001
                 disease |  415.873046     2  207.936523       1.88     0.1637
            drug#disease |  707.266259     6   117.87771       1.07     0.3958
                         |
                Residual |  5080.81667    46  110.452536
            -------------+----------------------------------------------------
                   Total |  9340.15517    57  163.862371
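Because this model contains both main effects and the full drug#disease interaction, its fitted values are simply the twelve cell means. Under that observation, the Model, Residual, and Total lines can be reproduced from the raw data, as in the following Python sketch (illustrative only; the names are hypothetical, and the per-term partial sums of squares are not computed here).

```python
# Systolic data from Example 4, keyed by (drug, disease).
cells = {
    (1, 1): [42, 44, 36, 13, 19, 22], (1, 2): [33, 26, 33, 21], (1, 3): [31, -3, 25, 25, 24],
    (2, 1): [28, 23, 34, 42, 13], (2, 2): [34, 33, 31, 36], (2, 3): [3, 26, 28, 32, 4, 16],
    (3, 1): [1, 29, 19], (3, 2): [11, 9, 7, 1, -6], (3, 3): [21, 1, 9, 3],
    (4, 1): [24, 9, 22, -2, 15], (4, 2): [27, 12, 12, -5, 16, 15], (4, 3): [22, 7, 25, 5, 12],
}

values = [y for ys in cells.values() for y in ys]
n_obs = len(values)
grand_mean = sum(values) / n_obs

# Total SS about the grand mean; residual SS about each cell's own mean.
ss_total = sum((y - grand_mean) ** 2 for y in values)
ss_resid = sum(
    (y - sum(ys) / len(ys)) ** 2 for ys in cells.values() for y in ys
)
ss_model = ss_total - ss_resid

# One parameter per nonempty cell, minus one for the constant.
df_model = len(cells) - 1
df_resid = n_obs - len(cells)

ms_resid = ss_resid / df_resid
f_model = (ss_model / df_model) / ms_resid
root_mse = ms_resid ** 0.5

print(f"Model    SS = {ss_model:.5f}  df = {df_model}  F = {f_model:.2f}")
print(f"Residual SS = {ss_resid:.5f}  df = {df_resid}  MS = {ms_resid:.6f}")
print(f"Total    SS = {ss_total:.5f}  df = {n_obs - 1}  Root MSE = {root_mse:.4f}")
```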

Although Stata’s table command does not perform ANOVA, it can produce useful summary tables of your data (see [R] table):

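    . table drug disease, c(mean systolic) row col f(%8.2f)

    ----------------------------------------------
              |         Patient's Disease
    Drug Used |       1        2        3    Total
    ----------+-----------------------------------
            1 |   29.33    28.25    20.40    26.07
            2 |   28.00    33.50    18.17    25.53
            3 |   16.33     4.40     8.50     8.75
            4 |   13.60    12.83    14.20    13.50
              |
        Total |   22.79    18.21    15.80    18.88
    ----------------------------------------------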

These are simple means and are not influenced by our anova model. More useful is the margins command (see [R] margins) that provides marginal means and adjusted predictions. Because drug is the only significant factor in our ANOVA, we now examine the adjusted marginal means for drug.

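    . margins drug, asbalanced

    Adjusted predictions                              Number of obs   =         58

    Expression   : Linear prediction, predict()
    at           : drug            (asbalanced)
                   disease         (asbalanced)

    ------------------------------------------------------------------------------
                 |            Delta-method
                 |     Margin   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            drug |
              1  |   25.99444   2.751008     9.45   0.000     20.45695    31.53194
              2  |   26.55556   2.751008     9.65   0.000     21.01806    32.09305
              3  |   9.744444   3.100558     3.14   0.003     3.503344    15.98554
              4  |   13.54444   2.637123     5.14   0.000     8.236191     18.8527
    ------------------------------------------------------------------------------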

These adjusted marginal predictions are not equal to the simple drug means (see the total column from the table command); they are based upon predictions from our ANOVA model. The asbalanced option of margins corresponds with the interpretation of the F statistic produced by ANOVA: each cell is given equal weight regardless of its sample size (see the following three technical notes). You can omit the asbalanced option and obtain predictive margins that take into account the unequal sample sizes of the cells.

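    . margins drug

    Predictive margins                                Number of obs   =         58

    Expression   : Linear prediction, predict()

    ------------------------------------------------------------------------------
                 |            Delta-method
                 |     Margin   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            drug |
              1  |   25.89799   2.750533     9.42   0.000     20.36145    31.43452
              2  |   26.41092   2.742762     9.63   0.000     20.89003    31.93181
              3  |   9.722989   3.099185     3.14   0.003     3.484652    15.96132
              4  |   13.55575   2.640602     5.13   0.000      8.24049      18.871
    ------------------------------------------------------------------------------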
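The difference between the two sets of margins comes down to how the cell means are weighted. The Python sketch below is an illustration rather than a description of how margins is implemented. It assumes that, for this full factorial model, each prediction equals the corresponding cell mean, so the balanced margin is an equal-weight average over diseases and the predictive margin weights each disease by its share of the 58 patients.

```python
# Systolic data from Example 4, keyed by (drug, disease).
cells = {
    (1, 1): [42, 44, 36, 13, 19, 22], (1, 2): [33, 26, 33, 21], (1, 3): [31, -3, 25, 25, 24],
    (2, 1): [28, 23, 34, 42, 13], (2, 2): [34, 33, 31, 36], (2, 3): [3, 26, 28, 32, 4, 16],
    (3, 1): [1, 29, 19], (3, 2): [11, 9, 7, 1, -6], (3, 3): [21, 1, 9, 3],
    (4, 1): [24, 9, 22, -2, 15], (4, 2): [27, 12, 12, -5, 16, 15], (4, 3): [22, 7, 25, 5, 12],
}
drugs = sorted({drug for drug, _ in cells})
diseases = sorted({disease for _, disease in cells})

cell_mean = {key: sum(ys) / len(ys) for key, ys in cells.items()}

# asbalanced: every disease counts equally, whatever its sample size.
asbalanced = {
    d: sum(cell_mean[(d, s)] for s in diseases) / len(diseases) for d in drugs
}

# Predictive margins: every patient is treated as if given drug d, so each
# disease is weighted by the number of patients who have it.
n_disease = {s: sum(len(cells[(d, s)]) for d in drugs) for s in diseases}
n_total = sum(n_disease.values())
predictive = {
    d: sum(n_disease[s] * cell_mean[(d, s)] for s in diseases) / n_total
    for d in drugs
}

for d in drugs:
    print(f"drug {d}: asbalanced = {asbalanced[d]:.5f}, predictive = {predictive[d]:.5f}")
```

The two columns differ only slightly here because the disease totals (19, 19, and 20) are nearly equal.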

Technical note

How do you interpret the significance of terms like drug and disease in unbalanced data? If you are familiar with SAS, the sums of squares and the F statistic reported by Stata correspond to SAS type III sums of squares. (Stata can also calculate sequential sums of squares, but we will postpone that topic for now.)

Let’s think in terms of the following table:

$$
\begin{array}{l|ccc|c}
 & \text{Disease 1} & \text{Disease 2} & \text{Disease 3} & \\ \hline
\text{Drug 1} & \mu_{11} & \mu_{12} & \mu_{13} & \mu_{1\cdot} \\
\text{Drug 2} & \mu_{21} & \mu_{22} & \mu_{23} & \mu_{2\cdot} \\
\text{Drug 3} & \mu_{31} & \mu_{32} & \mu_{33} & \mu_{3\cdot} \\
\text{Drug 4} & \mu_{41} & \mu_{42} & \mu_{43} & \mu_{4\cdot} \\ \hline
 & \mu_{\cdot 1} & \mu_{\cdot 2} & \mu_{\cdot 3} & \mu_{\cdot\cdot}
\end{array}
$$

In this table, $\mu_{ij}$ is the mean increase in systolic blood pressure associated with drug $i$ and disease $j$, while $\mu_{i\cdot}$ is the mean for drug $i$, $\mu_{\cdot j}$ is the mean for disease $j$, and $\mu_{\cdot\cdot}$ is the overall mean.

If the data are balanced, meaning that there are equal numbers of observations going into the calculation of each mean $\mu_{ij}$, the row means, $\mu_{i\cdot}$, are given by

$$\mu_{i\cdot} = \frac{\mu_{i1} + \mu_{i2} + \mu_{i3}}{3}$$

In our case, the data are not balanced, but we define the $\mu_{i\cdot}$ according to that formula anyway. The test for the main effect of drug is the test that

$$\mu_{1\cdot} = \mu_{2\cdot} = \mu_{3\cdot} = \mu_{4\cdot}$$

To be absolutely clear, the F test of the term drug, called the main effect of drug, is formally equivalent to the test of the three constraints:

$$\frac{\mu_{11} + \mu_{12} + \mu_{13}}{3} = \frac{\mu_{21} + \mu_{22} + \mu_{23}}{3}$$

$$\frac{\mu_{11} + \mu_{12} + \mu_{13}}{3} = \frac{\mu_{31} + \mu_{32} + \mu_{33}}{3}$$

$$\frac{\mu_{11} + \mu_{12} + \mu_{13}}{3} = \frac{\mu_{41} + \mu_{42} + \mu_{43}}{3}$$

In our data, we obtain a significant F statistic of 9.05 and thus reject those constraints.
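One way to see this equivalence numerically is the classical weighted-squares-of-means calculation for a two-way layout with no empty cells. The Python sketch below is an illustration written for this note and is not Stata's internal algorithm. It tests the equality of the four unweighted row means directly, weighting each by the inverse of its variance, and under that assumption it recovers the partial sum of squares and F statistic reported for drug. All names are hypothetical.

```python
# Systolic data from Example 4, keyed by (drug, disease).
cells = {
    (1, 1): [42, 44, 36, 13, 19, 22], (1, 2): [33, 26, 33, 21], (1, 3): [31, -3, 25, 25, 24],
    (2, 1): [28, 23, 34, 42, 13], (2, 2): [34, 33, 31, 36], (2, 3): [3, 26, 28, 32, 4, 16],
    (3, 1): [1, 29, 19], (3, 2): [11, 9, 7, 1, -6], (3, 3): [21, 1, 9, 3],
    (4, 1): [24, 9, 22, -2, 15], (4, 2): [27, 12, 12, -5, 16, 15], (4, 3): [22, 7, 25, 5, 12],
}
drugs = sorted({drug for drug, _ in cells})
diseases = sorted({disease for _, disease in cells})
n_levels = len(diseases)

cell_mean = {key: sum(ys) / len(ys) for key, ys in cells.items()}

# Residual mean square from the full factorial model (error term for the F test).
n_obs = sum(len(ys) for ys in cells.values())
ss_resid = sum((y - cell_mean[key]) ** 2 for key, ys in cells.items() for y in ys)
mse = ss_resid / (n_obs - len(cells))

# Unweighted row mean for each drug: the average of its three cell means.
row_mean = {d: sum(cell_mean[(d, s)] for s in diseases) / n_levels for d in drugs}

# Var(row_mean[d]) = sigma^2 * sum_s(1 / n_ds) / 3^2; the weight is its inverse.
weight = {
    d: n_levels ** 2 / sum(1 / len(cells[(d, s)]) for s in diseases) for d in drugs
}

# Weighted sum of squares of the row means about their weighted average.
pooled = sum(weight[d] * row_mean[d] for d in drugs) / sum(weight.values())
ss_drug = sum(weight[d] * (row_mean[d] - pooled) ** 2 for d in drugs)
df_drug = len(drugs) - 1
f_drug = (ss_drug / df_drug) / mse

print(f"drug: SS = {ss_drug:.3f}, df = {df_drug}, F = {f_drug:.2f}")
```

This shortcut relies on every cell being nonempty; with missing cells the row means above are not all defined.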

Technical note

Stata can display the symbolic form underlying the test statistics it presents, as well as display other test statistics and their symbolic forms; see Obtaining symbolic forms in [R] anova postestimation. Here is the result of requesting the symbolic form for the main effect of drug in our data:

    . test drug, symbolic

     drug
               1  -(r2+r3+r4)
               2   r2
               3   r3
               4   r4
     disease
               1   0
               2   0
               3   0
     drug#disease
            1  1  -1/3 (r2+r3+r4)
            1  2  -1/3 (r2+r3+r4)
            1  3  -1/3 (r2+r3+r4)
            2  1   1/3 r2
            2  2   1/3 r2
            2  3   1/3 r2
            3  1   1/3 r3
            3  2   1/3 r3
            3  3   1/3 r3
            4  1   1/3 r4
            4  2   1/3 r4
            4  3   1/3 r4
     _cons         0

This says exactly what we said in the previous technical note.

Technical note

Saying that there is no main effect of a variable is not the same as saying that it has no effect at all. Stata’s ability to perform ANOVA on unbalanced data can easily be put to ill use.

For example, consider the following table of the probability of surviving a bout with one of two diseases according to the drug administered to you:

                 Disease 1   Disease 2
    Drug 1           1           0
    Drug 2           0           1

If you have disease 1 and are administered drug 1, you live. If you have disease 2 and are administered drug 2, you live. In all other cases, you die.

This table has no main effects of either drug or disease, although there is a large interaction effect. You might now be tempted to reason that because there is only an interaction effect, you would be indifferent between the two drugs in the absence of knowledge about which disease infects you. Given an equal chance of having either disease, you reason that it does not matter which drug is administered to you — either way, your chances of surviving are 0.5.

You may not, however, have an equal chance of having either disease. If you knew that disease 1 was 100 times more likely to occur in the population, and if you knew that you had one of the two diseases, you would express a strong preference for receiving drug 1.

When you calculate the significance of main effects on unbalanced data, you must ask yourself why the data are unbalanced. If the data are unbalanced for random reasons and you are making predictions for a balanced population, the test of the main effect makes perfect sense. If, however, the data are unbalanced because the underlying populations are unbalanced and you are making predictions for such unbalanced populations, the test of the main effect may be practically, if not statistically, meaningless.
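The survival example can be worked through in a few lines. The Python sketch below is illustrative, with hypothetical names. It takes the two-by-two survival table as given and compares each drug's survival probability when the diseases are equally likely with the case where disease 1 is 100 times more likely than disease 2.

```python
# Probability of surviving, keyed by (drug, disease), from the table above.
survival = {(1, 1): 1.0, (1, 2): 0.0, (2, 1): 0.0, (2, 2): 1.0}
drugs = (1, 2)


def drug_survival(drug, p_disease1):
    """Survival probability for a drug when P(disease 1) = p_disease1."""
    return p_disease1 * survival[(drug, 1)] + (1 - p_disease1) * survival[(drug, 2)]


# Main effect of drug: compare unweighted row means (equal weight per disease).
main_effect = {d: (survival[(d, 1)] + survival[(d, 2)]) / 2 for d in drugs}

# Equal prevalence matches the balanced view, so the drugs look identical.
balanced = {d: drug_survival(d, 0.5) for d in drugs}

# Disease 1 is 100 times more likely than disease 2: P(disease 1) = 100/101.
skewed = {d: drug_survival(d, 100 / 101) for d in drugs}

for d in drugs:
    print(
        f"drug {d}: row mean = {main_effect[d]:.2f}, "
        f"equal prevalence = {balanced[d]:.2f}, 100:1 prevalence = {skewed[d]:.4f}"
    )
```

The row means are identical, so there is no main effect of drug, yet under the 100:1 prevalence drug 1 is clearly the better choice.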

Example 5: ANOVA with missing cells

Stata can perform ANOVA not only on unbalanced populations, but also on populations that are so unbalanced that entire cells are missing. For instance, using our systolic blood pressure data, let’s refit the model eliminating the drug 1–disease 1 cell. Because anova follows the same syntax as all other Stata commands, we can explicitly specify the data to be used by typing the if qualifier at the end of the anova command. Here we want to use the data that are not for drug 1 and disease 1:

    . anova systolic drug##disease if !(drug==1 & disease==1)

                           Number of obs =      52     R-squared     =  0.4545
                           Root MSE      = 10.1615     Adj R-squared =  0.3215

                  Source |  Partial SS    df       MS           F     Prob > F
            -------------+----------------------------------------------------
                   Model |  3527.95897    10  352.795897       3.42     0.0025
                         |
                    drug |  2686.57832     3  895.526107       8.67     0.0001
                 disease |  327.792598     2  163.896299       1.59     0.2168
            drug#disease |  703.007602     5   140.60152       1.36     0.2586
                         |
                Residual |  4233.48333    41  103.255691
            -------------+----------------------------------------------------
                   Total |  7761.44231    51  152.185143

Here we used drug##disease as a shorthand for drug disease drug#disease.
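The degrees of freedom in this table follow from counting the cells that remain. The Python sketch below is illustrative, with hypothetical names. It drops the drug 1, disease 1 cell and recomputes the Model, Residual, and Total lines, assuming as before that the full factorial model fits each remaining cell by its own mean.

```python
# Systolic data from Example 4, keyed by (drug, disease).
cells = {
    (1, 1): [42, 44, 36, 13, 19, 22], (1, 2): [33, 26, 33, 21], (1, 3): [31, -3, 25, 25, 24],
    (2, 1): [28, 23, 34, 42, 13], (2, 2): [34, 33, 31, 36], (2, 3): [3, 26, 28, 32, 4, 16],
    (3, 1): [1, 29, 19], (3, 2): [11, 9, 7, 1, -6], (3, 3): [21, 1, 9, 3],
    (4, 1): [24, 9, 22, -2, 15], (4, 2): [27, 12, 12, -5, 16, 15], (4, 3): [22, 7, 25, 5, 12],
}

# Equivalent of the qualifier: if !(drug==1 & disease==1)
kept = {key: ys for key, ys in cells.items() if key != (1, 1)}

values = [y for ys in kept.values() for y in ys]
n_obs = len(values)
grand_mean = sum(values) / n_obs

ss_total = sum((y - grand_mean) ** 2 for y in values)
ss_resid = sum((y - sum(ys) / len(ys)) ** 2 for ys in kept.values() for y in ys)
ss_model = ss_total - ss_resid

# Eleven nonempty cells leave 10 model df; 3 for drug and 2 for disease
# leave 5 (not 6) for the interaction.
df_model = len(kept) - 1
df_interaction = df_model - 3 - 2
df_resid = n_obs - len(kept)
root_mse = (ss_resid / df_resid) ** 0.5

print(f"Model    SS = {ss_model:.5f}  df = {df_model} (interaction df = {df_interaction})")
print(f"Residual SS = {ss_resid:.5f}  df = {df_resid}  Root MSE = {root_mse:.4f}")
print(f"Total    SS = {ss_total:.5f}  df = {n_obs - 1}")
```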

anova — Analysis of variance and covariance

stata.com

Syntax Remarks and examples

Menu Stored results

Description References

Options Also see

Syntax

anova varname termlist if in weight , options

where termlist is a factor-variable list (see [U] 11.4.3 Factor variables) with the following additional features:

• Variables are assumed to be categorical; use the c. factor-variable operator to override this.

• The | symbol (indicating nesting) may be used in place of the # symbol (indicating interaction).

• The / symbol is allowed after a term and indicates that the following term is the error term for the preceding terms.

options

Description

Model

repeated(varlist) partial sequential noconstant dropemptycells

variables in terms that are repeated-measures variables use partial (or marginal) sums of squares use sequential sums of squares suppress constant term drop empty cells from the design matrix

Adv. model

bse(term) bseunit(varname) grouping(varname)

between-subjects error term in repeated-measures ANOVA variable representing lowest unit in the between-subjects error term grouping variable for computing pooled covariance matrix

bootstrap, by, fp, jackknife, and statsby are allowed; see [U] 11.1.10 Preﬁx commands. Weights are not allowed with the bootstrap preﬁx; see [R] bootstrap. aweights are not allowed with the jackknife preﬁx; see [R] jackknife. aweights and fweights are allowed; see [U] 11.1.6 weight. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands.

Menu

Statistics > Linear models and related > ANOVA/MANOVA > Analysis of variance and covariance

Description

The anova command ﬁts analysis-of-variance (ANOVA) and analysis-of-covariance (ANCOVA) models for balanced and unbalanced designs, including designs with missing cells; for repeated-measures ANOVA; and for factorial, nested, or mixed designs.

1

2 anova — Analysis of variance and covariance

The regress command (see [R] regress) will display the coefﬁcients, standard errors, etc., of the regression model underlying the last run of anova.

If you want to ﬁt one-way ANOVA models, you may ﬁnd the oneway or loneway command more convenient; see [R] oneway and [R] loneway. If you are interested in MANOVA or MANCOVA, see [MV] manova.

Options

£

£

Model

repeated(varlist) indicates the names of the categorical variables in the terms that are to be treated

as repeated-measures variables in a repeated-measures ANOVA or ANCOVA.

partial presents the ANOVA table using partial (or marginal) sums of squares. This setting is the default. Also see the sequential option.

sequential presents the ANOVA table using sequential sums of squares.

noconstant suppresses the constant term (intercept) from the ANOVA or regression model.

dropemptycells drops empty cells from the design matrix. If c(emptycells) is set to keep (see [R] set emptycells), this option temporarily resets it to drop before running the ANOVA model. If c(emptycells) is already set to drop, this option does nothing.

£

£

Adv. model

bse(term) indicates the between-subjects error term in a repeated-measures ANOVA. This option

is needed only in the rare case when the anova command cannot automatically determine the

between-subjects error term.

bseunit(varname) indicates the variable representing the lowest unit in the between-subjects error term in a repeated-measures ANOVA. This option is rarely needed because the anova command automatically selects the ﬁrst variable listed in the between-subjects error term as the default for this option.

grouping(varname) indicates a variable that determines which observations are grouped together in computing the covariance matrices that will be pooled and used in a repeated-measures ANOVA. This option is rarely needed because the anova command automatically selects the combination of all variables except the ﬁrst (or as speciﬁed in the bseunit() option) in the between-subjects error term as the default for grouping observations.

Remarks and examples

Remarks are presented under the following headings:

Introduction One-way ANOVA Two-way ANOVA N-way ANOVA Weighted data ANCOVA Nested designs Mixed designs Latin-square designs Repeated-measures ANOVA Video examples

stata.com

anova — Analysis of variance and covariance 3

Introduction

anova uses least squares to ﬁt the linear models known as ANOVA or ANCOVA (henceforth referred to simply as ANOVA models).

If your interest is in one-way ANOVA, you may ﬁnd the oneway command to be more convenient; see [R] oneway.

Structural equation modeling provides a more general framework for ﬁtting ANOVA models; see the Stata Structural Equation Modeling Reference Manual.

ANOVA was pioneered by Fisher. It features prominently in his texts on statistical methods and his design of experiments (1925, 1935). Many books discuss ANOVA; see, for instance, Altman (1991); van Belle et al. (2004); Cobb (1998); Snedecor and Cochran (1989); or Winer, Brown, and Michels (1991). For a classic source, see Scheffe´ (1959). Kennedy and Gentle (1980) discuss ANOVA’s computing problems. Edwards (1985) is concerned primarily with the relationship between multiple regression and ANOVA. Acock (2014, chap. 9) illustrates his discussion with Stata output. Repeated-measures ANOVA is discussed in Winer, Brown, and Michels (1991); Kuehl (2000); and Milliken and Johnson (2009). Pioneering work in repeated-measures ANOVA can be found in Box (1954); Geisser and Greenhouse (1958); Huynh and Feldt (1976); and Huynh (1978). For a Stata-speciﬁc discussion of ANOVA contrasts, see Mitchell (2012, chap. 7–9).

One-way ANOVA

anova, entered without options, performs and reports standard ANOVA. For instance, to perform a one-way layout of a variable called endog on exog, you would type anova endog exog.

Example 1: One-way ANOVA

We run an experiment varying the amount of fertilizer used in growing apple trees. We test four concentrations, using each concentration in three groves of 12 trees each. Later in the year, we measure the average weight of the fruit.

If all had gone well, we would have had 3 observations on the average weight for each of the four concentrations. Instead, two of the groves were mistakenly leveled by a confused man on a large bulldozer. We are left with the following data:

. use http://www.stata-press.com/data/r13/apple (Apple trees) . list, abbrev(10) sepby(treatment)

treatment weight

1.

1 117.5

2.

1 113.8

3.

1 104.4

4.

2

48.9

5.

2

50.4

6.

2

58.9

7.

3

70.4

8.

3

86.9

9.

4

87.7

10.

4

67.3

4 anova — Analysis of variance and covariance

To obtain one-way ANOVA results, we type

. anova weight treatment Source

Number of obs =

10

Root MSE

= 9.07002

Partial SS df

MS

R-squared

= 0.9147

Adj R-squared = 0.8721

F

Prob > F

Model 5295.54433

3 1765.18144

21.46

0.0013

treatment 5295.54433

3 1765.18144

21.46

0.0013

Residual 493.591667

6 82.2652778

Total

5789.136

9 643.237333

We ﬁnd signiﬁcant (at better than the 1% level) differences among the four concentrations.

Although the output is a usual ANOVA table, let’s run through it anyway. Above the table is a summary of the underlying regression. The model was ﬁt on 10 observations, and the root mean squared error (Root MSE) is 9.07. The R2 for the model is 0.9147, and the adjusted R2 is 0.8721.

The ﬁrst line of the table summarizes the model. The sum of squares (Partial SS) for the model is 5295.5 with 3 degrees of freedom (df). This line results in a mean square (MS) of 5295.5/3 ≈ 1765.2. The corresponding F statistic is 21.46 and has a signiﬁcance level of 0.0013. Thus the model appears to be signiﬁcant at the 0.13% level.

The next line summarizes the ﬁrst (and only) term in the model, treatment. Because there is only one term, the line is identical to that for the overall model.

The third line summarizes the residual. The residual sum of squares is 493.59 with 6 degrees of freedom, resulting in a mean squared error of 82.27. The square root of this latter number is reported as the Root MSE.

The model plus the residual sum of squares equals the total sum of squares, which is reported as 5789.1 in the last line of the table. This is the total sum of squares of weight after removal of the mean. Similarly, the model plus the residual degrees of freedom sum to the total degrees of freedom, 9. Remember that there are 10 observations. Subtracting 1 for the mean, we are left with 9 total degrees of freedom.

Technical note Rather than using the anova command, we could have performed this analysis by using the

oneway command. Example 1 in [R] oneway repeats this same analysis. You may wish to compare the output.

Type regress to see the underlying regression model corresponding to an ANOVA model ﬁt using the anova command.

Example 2: Regression table from a one-way ANOVA

Returning to the apple tree experiment, we found that the fertilizer concentration appears to signiﬁcantly affect the average weight of the fruit. Although that ﬁnding is interesting, we next want to know which concentration appears to grow the heaviest fruit. One way to ﬁnd out is by examining the underlying regression coefﬁcients.

anova — Analysis of variance and covariance 5

. regress, baselevels

Source

SS

Model Residual

5295.54433 493.591667

Total

5789.136

df

MS

3 1765.18144 6 82.2652778

9 643.237333

Number of obs =

F( 3,

6) =

Prob > F

=

R-squared

=

Adj R-squared =

Root MSE

=

10 21.46 0.0013 0.9147 0.8721

9.07

weight

treatment 1 2 3 4

_cons

Coef. Std. Err.

t P>|t|

0 -59.16667

-33.25 -34.4

(base) 7.405641 8.279758 8.279758

111.9 5.236579

-7.99 -4.02 -4.15

21.37

0.000 0.007 0.006

0.000

[95% Conf. Interval]

-77.28762 -53.50984 -54.65984

99.08655

-41.04572 -12.99016 -14.14016

124.7134

See [R] regress for an explanation of how to read this table. The baselevels option of regress displays a row indicating the base category for our categorical variable, treatment. In summary, we ﬁnd that concentration 1, the base (omitted) group, produces signiﬁcantly heavier fruits than concentration 2, 3, and 4; concentration 2 produces the lightest fruits; and concentrations 3 and 4 appear to be roughly equivalent.

Example 3: ANOVA replay

We previously typed anova weight treatment to produce and display the ANOVA table for our apple tree experiment. Typing regress displays the regression coefﬁcients. We can redisplay the ANOVA table by typing anova without arguments:

. anova

Source

Number of obs =

10

Root MSE

= 9.07002

Partial SS df

MS

R-squared

= 0.9147

Adj R-squared = 0.8721

F

Prob > F

Model 5295.54433

3 1765.18144

21.46

0.0013

treatment 5295.54433

3 1765.18144

21.46

0.0013

Residual 493.591667

6 82.2652778

Total

5789.136

9 643.237333

Two-way ANOVA

You can include multiple explanatory variables with the anova command, and you can specify interactions by placing ‘#’ between the variable names. For instance, typing anova y a b performs a two-way layout of y on a and b. Typing anova y a b a#b performs a full two-way factorial layout. The shorthand anova y a##b does the same.

With the default partial sums of squares, when you specify interacted terms, the order of the terms does not matter. Typing anova y a b a#b is the same as typing anova y b a b#a.


Example 4: Two-way factorial ANOVA

The classic two-way factorial ANOVA problem, at least as far as computer manuals are concerned, is a two-way ANOVA design from Afifi and Azen (1979).

Fifty-eight patients, each suffering from one of three different diseases, were randomly assigned to one of four different drug treatments, and the change in their systolic blood pressure was recorded. Here are the data:

| | Disease 1 | Disease 2 | Disease 3 |
|---|---|---|---|
| Drug 1 | 42, 44, 36, 13, 19, 22 | 33, 26, 33, 21 | 31, -3, 25, 25, 24 |
| Drug 2 | 28, 23, 34, 42, 13 | 34, 33, 31, 36 | 3, 26, 28, 32, 4, 16 |
| Drug 3 | 1, 29, 19 | 11, 9, 7, 1, -6 | 21, 1, 9, 3 |
| Drug 4 | 24, 9, 22, -2, 15 | 27, 12, 12, -5, 16, 15 | 22, 7, 25, 5, 12 |

Let’s assume that we have entered these data into Stata and stored the data as systolic.dta. Below we use the data, list the first 10 observations, summarize the variables, and tabulate the control variables:

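. use http://www.stata-press.com/data/r13/systolic

(Systolic Blood Pressure Data)

. list in 1/10

| | drug | disease | systolic |
|---|---|---|---|
| 1. | 1 | 1 | 42 |
| 2. | 1 | 1 | 44 |
| 3. | 1 | 1 | 36 |
| 4. | 1 | 1 | 13 |
| 5. | 1 | 1 | 19 |
| 6. | 1 | 1 | 22 |
| 7. | 1 | 2 | 33 |
| 8. | 1 | 2 | 26 |
| 9. | 1 | 2 | 33 |
| 10. | 1 | 2 | 21 |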

. summarize

| Variable | Obs | Mean | Std. Dev. | Min | Max |
|---|---|---|---|---|---|
| drug | 58 | 2.5 | 1.158493 | 1 | 4 |
| disease | 58 | 2.017241 | .8269873 | 1 | 3 |
| systolic | 58 | 18.87931 | 12.80087 | -6 | 44 |

. tabulate drug disease

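| Drug Used | Patient’s Disease: 1 | Patient’s Disease: 2 | Patient’s Disease: 3 | Total |
|---|---|---|---|---|
| 1 | 6 | 4 | 5 | 15 |
| 2 | 5 | 4 | 6 | 15 |
| 3 | 3 | 5 | 4 | 12 |
| 4 | 5 | 6 | 5 | 16 |
| Total | 19 | 19 | 20 | 58 |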


Each observation in our data corresponds to one patient, and for each patient we record drug, disease, and the increase in the systolic blood pressure, systolic. The tabulation reveals that the data are not balanced: there are not equal numbers of patients in each drug–disease cell. Stata does not require that the data be balanced. We can perform a two-way factorial ANOVA by typing

. anova systolic drug disease drug#disease

Number of obs = 58; R-squared = 0.4560; Root MSE = 10.5096; Adj R-squared = 0.3259

| Source | Partial SS | df | MS | F | Prob > F |
|---|---|---|---|---|---|
| Model | 4259.33851 | 11 | 387.212591 | 3.51 | 0.0013 |
| drug | 2997.47186 | 3 | 999.157287 | 9.05 | 0.0001 |
| disease | 415.873046 | 2 | 207.936523 | 1.88 | 0.1637 |
| drug#disease | 707.266259 | 6 | 117.87771 | 1.07 | 0.3958 |
| Residual | 5080.81667 | 46 | 110.452536 | | |
| Total | 9340.15517 | 57 | 163.862371 | | |
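Because this model includes the full drug#disease interaction and every cell is observed, the fitted value for each patient is simply the mean of that patient's cell. The Model, Residual, and Total rows can therefore be reproduced without any matrix algebra. The Python sketch below is an illustration only, not Stata code; the `cells` dictionary re-enters the data table from this example, and the helper name `sum_sq_dev` is our own. The partial sums of squares for the individual terms require comparing constrained fits and are not attempted here.

```python
import math

# Change in systolic blood pressure, keyed by (drug, disease).
cells = {
    (1, 1): [42, 44, 36, 13, 19, 22], (1, 2): [33, 26, 33, 21],
    (1, 3): [31, -3, 25, 25, 24],
    (2, 1): [28, 23, 34, 42, 13], (2, 2): [34, 33, 31, 36],
    (2, 3): [3, 26, 28, 32, 4, 16],
    (3, 1): [1, 29, 19], (3, 2): [11, 9, 7, 1, -6],
    (3, 3): [21, 1, 9, 3],
    (4, 1): [24, 9, 22, -2, 15], (4, 2): [27, 12, 12, -5, 16, 15],
    (4, 3): [22, 7, 25, 5, 12],
}

def sum_sq_dev(values):
    """Sum of squared deviations of values from their own mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

all_obs = [v for obs in cells.values() for v in obs]
n = len(all_obs)

ss_total = sum_sq_dev(all_obs)                            # about the grand mean
ss_resid = sum(sum_sq_dev(obs) for obs in cells.values()) # about the cell means
ss_model = ss_total - ss_resid

df_model = len(cells) - 1      # 12 cell means minus the constant
df_resid = n - len(cells)

f_model = (ss_model / df_model) / (ss_resid / df_resid)
r2 = ss_model / ss_total
root_mse = math.sqrt(ss_resid / df_resid)

print(f"n = {n}, Model SS = {ss_model:.5f} (df {df_model}), "
      f"Residual SS = {ss_resid:.5f} (df {df_resid})")
print(f"F = {f_model:.2f}, R-squared = {r2:.4f}, Root MSE = {root_mse:.4f}")
```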

Although Stata’s table command does not perform ANOVA, it can produce useful summary tables of your data (see [R] table):

. table drug disease, c(mean systolic) row col f(%8.2f)

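| Drug Used | Patient’s Disease: 1 | Patient’s Disease: 2 | Patient’s Disease: 3 | Total |
|---|---|---|---|---|
| 1 | 29.33 | 28.25 | 20.40 | 26.07 |
| 2 | 28.00 | 33.50 | 18.17 | 25.53 |
| 3 | 16.33 | 4.40 | 8.50 | 8.75 |
| 4 | 13.60 | 12.83 | 14.20 | 13.50 |
| Total | 22.79 | 18.21 | 15.80 | 18.88 |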

These are simple means and are not influenced by our anova model. More useful is the margins command (see [R] margins) that provides marginal means and adjusted predictions. Because drug is the only significant factor in our ANOVA, we now examine the adjusted marginal means for drug.

. margins drug, asbalanced

Adjusted predictions; Number of obs = 58

Expression: Linear prediction, predict()

at: drug (asbalanced), disease (asbalanced)

| drug | Margin | Delta-method Std. Err. | t | P>\|t\| | [95% Conf. | Interval] |
|---|---|---|---|---|---|---|
| 1 | 25.99444 | 2.751008 | 9.45 | 0.000 | 20.45695 | 31.53194 |
| 2 | 26.55556 | 2.751008 | 9.65 | 0.000 | 21.01806 | 32.09305 |
| 3 | 9.744444 | 3.100558 | 3.14 | 0.003 | 3.503344 | 15.98554 |
| 4 | 13.54444 | 2.637123 | 5.14 | 0.000 | 8.236191 | 18.8527 |

These adjusted marginal predictions are not equal to the simple drug means (see the total column from the table command); they are based upon predictions from our ANOVA model. The asbalanced option of margins corresponds with the interpretation of the F statistic produced by ANOVA: each cell is given equal weight regardless of its sample size (see the following three technical notes). You can omit the asbalanced option and obtain predictive margins that take into account the unequal sample sizes of the cells.

. margins drug

Predictive margins; Number of obs = 58

Expression: Linear prediction, predict()

| drug | Margin | Delta-method Std. Err. | t | P>\|t\| | [95% Conf. | Interval] |
|---|---|---|---|---|---|---|
| 1 | 25.89799 | 2.750533 | 9.42 | 0.000 | 20.36145 | 31.43452 |
| 2 | 26.41092 | 2.742762 | 9.63 | 0.000 | 20.89003 | 31.93181 |
| 3 | 9.722989 | 3.099185 | 3.14 | 0.003 | 3.484652 | 15.96132 |
| 4 | 13.55575 | 2.640602 | 5.13 | 0.000 | 8.24049 | 18.871 |
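The difference between the simple means, the asbalanced margins, and the predictive margins comes down to how the cell means are weighted. For this fully interacted model the predictions are the cell means themselves, so the three sets of point estimates can be sketched in a few lines of Python (an illustration under that assumption, not Stata code; the names `simple`, `balanced`, and `predictive` are ours, and the standard errors are not reproduced):

```python
# Change in systolic blood pressure, keyed by (drug, disease).
cells = {
    (1, 1): [42, 44, 36, 13, 19, 22], (1, 2): [33, 26, 33, 21],
    (1, 3): [31, -3, 25, 25, 24],
    (2, 1): [28, 23, 34, 42, 13], (2, 2): [34, 33, 31, 36],
    (2, 3): [3, 26, 28, 32, 4, 16],
    (3, 1): [1, 29, 19], (3, 2): [11, 9, 7, 1, -6],
    (3, 3): [21, 1, 9, 3],
    (4, 1): [24, 9, 22, -2, 15], (4, 2): [27, 12, 12, -5, 16, 15],
    (4, 3): [22, 7, 25, 5, 12],
}

def mean(values):
    return sum(values) / len(values)

drugs = sorted({drug for drug, _ in cells})
diseases = sorted({dis for _, dis in cells})
n_total = sum(len(obs) for obs in cells.values())
# Patients per disease, pooled over drugs: the weights behind plain margins.
n_disease = {dis: sum(len(cells[drug, dis]) for drug in drugs)
             for dis in diseases}

simple, balanced, predictive = {}, {}, {}
for drug in drugs:
    cell_mean = {dis: mean(cells[drug, dis]) for dis in diseases}
    # Simple mean: pool every patient who received this drug.
    simple[drug] = mean([v for dis in diseases for v in cells[drug, dis]])
    # asbalanced: each disease cell gets equal weight.
    balanced[drug] = mean(list(cell_mean.values()))
    # Predictive margin: weight each cell mean by the disease's sample share.
    predictive[drug] = sum(n_disease[dis] * cell_mean[dis]
                           for dis in diseases) / n_total
    print(f"drug {drug}: simple {simple[drug]:.2f}, "
          f"asbalanced {balanced[drug]:.5f}, predictive {predictive[drug]:.5f}")
```

The simple means weight each cell by its own size, the asbalanced margins weight the three disease cells equally, and the predictive margins weight them 19/58, 19/58, and 20/58.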

Technical note

How do you interpret the significance of terms like drug and disease in unbalanced data? If you are familiar with SAS, the sums of squares and the F statistic reported by Stata correspond to SAS type III sums of squares. (Stata can also calculate sequential sums of squares, but we will postpone that topic for now.)

Let’s think in terms of the following table:

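| | Disease 1 | Disease 2 | Disease 3 | |
|---|---|---|---|---|
| Drug 1 | $\mu_{11}$ | $\mu_{12}$ | $\mu_{13}$ | $\mu_{1\cdot}$ |
| Drug 2 | $\mu_{21}$ | $\mu_{22}$ | $\mu_{23}$ | $\mu_{2\cdot}$ |
| Drug 3 | $\mu_{31}$ | $\mu_{32}$ | $\mu_{33}$ | $\mu_{3\cdot}$ |
| Drug 4 | $\mu_{41}$ | $\mu_{42}$ | $\mu_{43}$ | $\mu_{4\cdot}$ |
| | $\mu_{\cdot 1}$ | $\mu_{\cdot 2}$ | $\mu_{\cdot 3}$ | $\mu_{\cdot\cdot}$ |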

In this table, $\mu_{ij}$ is the mean increase in systolic blood pressure associated with drug $i$ and disease $j$, while $\mu_{i\cdot}$ is the mean for drug $i$, $\mu_{\cdot j}$ is the mean for disease $j$, and $\mu_{\cdot\cdot}$ is the overall mean.

If the data are balanced, meaning that there are equal numbers of observations going into the calculation of each mean $\mu_{ij}$, the row means, $\mu_{i\cdot}$, are given by

$$\mu_{i\cdot} = \frac{\mu_{i1} + \mu_{i2} + \mu_{i3}}{3}$$

In our case, the data are not balanced, but we define the $\mu_{i\cdot}$ according to that formula anyway. The test for the main effect of drug is the test that

$$\mu_{1\cdot} = \mu_{2\cdot} = \mu_{3\cdot} = \mu_{4\cdot}$$

To be absolutely clear, the F test of the term drug, called the main effect of drug, is formally equivalent to the test of the three constraints:

$$\frac{\mu_{11} + \mu_{12} + \mu_{13}}{3} = \frac{\mu_{21} + \mu_{22} + \mu_{23}}{3}$$

$$\frac{\mu_{11} + \mu_{12} + \mu_{13}}{3} = \frac{\mu_{31} + \mu_{32} + \mu_{33}}{3}$$

$$\frac{\mu_{11} + \mu_{12} + \mu_{13}}{3} = \frac{\mu_{41} + \mu_{42} + \mu_{43}}{3}$$

In our data, we obtain a significant F statistic of 9.05 and thus reject those constraints.

Technical note

Stata can display the symbolic form underlying the test statistics it presents, as well as display other test statistics and their symbolic forms; see Obtaining symbolic forms in [R] anova postestimation. Here is the result of requesting the symbolic form for the main effect of drug in our data:

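. test drug, symbolic

| Term | Symbolic coefficient |
|---|---|
| drug | |
| 1 | -(r2+r3+r4) |
| 2 | r2 |
| 3 | r3 |
| 4 | r4 |
| disease | |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| drug#disease | |
| 1 1 | -1/3 (r2+r3+r4) |
| 1 2 | -1/3 (r2+r3+r4) |
| 1 3 | -1/3 (r2+r3+r4) |
| 2 1 | 1/3 r2 |
| 2 2 | 1/3 r2 |
| 2 3 | 1/3 r2 |
| 3 1 | 1/3 r3 |
| 3 2 | 1/3 r3 |
| 3 3 | 1/3 r3 |
| 4 1 | 1/3 r4 |
| 4 2 | 1/3 r4 |
| 4 3 | 1/3 r4 |
| _cons | 0 |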

This says exactly what we said in the previous technical note.
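One way to check the correspondence is to write each cell mean in an overparameterized form, with one parameter per row of the symbolic output. The symbols $c$, $\alpha_i$, $\beta_j$, and $\gamma_{ij}$ below are our own labels for _cons, the drug terms, the disease terms, and the drug#disease terms; they are an assumption made for this sketch, not Stata notation:

$$\mu_{ij} = c + \alpha_i + \beta_j + \gamma_{ij}, \qquad \mu_{i\cdot} = c + \alpha_i + \frac{1}{3}\sum_{j=1}^{3}\beta_j + \frac{1}{3}\sum_{j=1}^{3}\gamma_{ij}$$

For arbitrary weights $r_2$, $r_3$, $r_4$, the constant and the disease terms cancel in each difference $\mu_{i\cdot} - \mu_{1\cdot}$, leaving

$$\sum_{i=2}^{4} r_i\,(\mu_{i\cdot} - \mu_{1\cdot}) = -(r_2+r_3+r_4)\,\alpha_1 + \sum_{i=2}^{4} r_i\,\alpha_i - \frac{1}{3}(r_2+r_3+r_4)\sum_{j=1}^{3}\gamma_{1j} + \frac{1}{3}\sum_{i=2}^{4} r_i\sum_{j=1}^{3}\gamma_{ij}$$

The coefficients on the right match the symbolic output term by term, including the zeros for disease and _cons. Requiring the expression to be zero for every choice of $r_2$, $r_3$, $r_4$ gives the three constraints $\mu_{1\cdot} = \mu_{i\cdot}$ for $i = 2, 3, 4$.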

Technical note

Saying that there is no main effect of a variable is not the same as saying that it has no effect at all. Stata’s ability to perform ANOVA on unbalanced data can easily be put to ill use.

For example, consider the following table of the probability of surviving a bout with one of two diseases according to the drug administered to you:

| | Drug 1 | Drug 2 |
|---|---|---|
| Disease 1 | 1 | 0 |
| Disease 2 | 0 | 1 |

If you have disease 1 and are administered drug 1, you live. If you have disease 2 and are administered drug 2, you live. In all other cases, you die.

This table has no main effects of either drug or disease, although there is a large interaction effect. You might now be tempted to reason that because there is only an interaction effect, you would be indifferent between the two drugs in the absence of knowledge about which disease infects you. Given an equal chance of having either disease, you reason that it does not matter which drug is administered to you — either way, your chances of surviving are 0.5.

You may not, however, have an equal chance of having either disease. If you knew that disease 1 was 100 times more likely to occur in the population, and if you knew that you had one of the two diseases, you would express a strong preference for receiving drug 1.
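The argument is easy to check numerically. The Python sketch below is an illustration only: it encodes the survival table above and reads "100 times more likely" as a 100:1 ratio of disease 1 to disease 2 (so a probability of 100/101), which is our interpretation. The function name `survival_by_drug` is hypothetical.

```python
# Survival probabilities from the table: survive[disease][drug].
survive = {1: {1: 1.0, 2: 0.0},
           2: {1: 0.0, 2: 1.0}}

def survival_by_drug(p_disease1):
    """Return P(survive | drug) given the probability of having disease 1."""
    prevalence = {1: p_disease1, 2: 1.0 - p_disease1}
    return {drug: sum(prevalence[dis] * survive[dis][drug] for dis in (1, 2))
            for drug in (1, 2)}

equal = survival_by_drug(0.5)          # either disease equally likely
skewed = survival_by_drug(100 / 101)   # disease 1 is 100 times as likely

print("equal prevalence: ", equal)
print("skewed prevalence:", skewed)
```

With equal prevalence both drugs give a survival probability of 0.5, which is what the absence of a main effect describes. With the skewed prevalence, drug 1 gives roughly 0.99 and drug 2 roughly 0.01, even though the table of cell values is unchanged.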

When you calculate the significance of main effects on unbalanced data, you must ask yourself why the data are unbalanced. If the data are unbalanced for random reasons and you are making predictions for a balanced population, the test of the main effect makes perfect sense. If, however, the data are unbalanced because the underlying populations are unbalanced and you are making predictions for such unbalanced populations, the test of the main effect may be practically, if not statistically, meaningless.

Example 5: ANOVA with missing cells

Stata can perform ANOVA not only on unbalanced populations, but also on populations that are so unbalanced that entire cells are missing. For instance, using our systolic blood pressure data, let’s refit the model eliminating the drug 1–disease 1 cell. Because anova follows the same syntax as all other Stata commands, we can explicitly specify the data to be used by typing the if qualifier at the end of the anova command. Here we want to use the data that are not for drug 1 and disease 1:

. anova systolic drug##disease if !(drug==1 & disease==1)

Number of obs = 52; R-squared = 0.4545; Root MSE = 10.1615; Adj R-squared = 0.3215

| Source | Partial SS | df | MS | F | Prob > F |
|---|---|---|---|---|---|
| Model | 3527.95897 | 10 | 352.795897 | 3.42 | 0.0025 |
| drug | 2686.57832 | 3 | 895.526107 | 8.67 | 0.0001 |
| disease | 327.792598 | 2 | 163.896299 | 1.59 | 0.2168 |
| drug#disease | 703.007602 | 5 | 140.60152 | 1.36 | 0.2586 |
| Residual | 4233.48333 | 41 | 103.255691 | | |
| Total | 7761.44231 | 51 | 152.185143 | | |

Here we used drug##disease as a shorthand for drug disease drug#disease.
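The degrees of freedom in this table can be understood by counting cells. With one of the 12 cells empty, only 11 cell means can be estimated, so the model has 10 degrees of freedom and the interaction keeps what is left after the main effects. The Python sketch below (an illustration, not Stata code; the `cells` dictionary re-enters the data from Example 4) drops the drug 1, disease 1 cell and recomputes the Model, Residual, and Total rows from the remaining cell means:

```python
# Change in systolic blood pressure, keyed by (drug, disease).
cells = {
    (1, 1): [42, 44, 36, 13, 19, 22], (1, 2): [33, 26, 33, 21],
    (1, 3): [31, -3, 25, 25, 24],
    (2, 1): [28, 23, 34, 42, 13], (2, 2): [34, 33, 31, 36],
    (2, 3): [3, 26, 28, 32, 4, 16],
    (3, 1): [1, 29, 19], (3, 2): [11, 9, 7, 1, -6],
    (3, 3): [21, 1, 9, 3],
    (4, 1): [24, 9, 22, -2, 15], (4, 2): [27, 12, 12, -5, 16, 15],
    (4, 3): [22, 7, 25, 5, 12],
}

def sum_sq_dev(values):
    """Sum of squared deviations of values from their own mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

# Mirror the qualifier: if !(drug==1 & disease==1)
kept = {key: obs for key, obs in cells.items() if key != (1, 1)}
all_obs = [v for obs in kept.values() for v in obs]
n = len(all_obs)

df_model = len(kept) - 1                            # 11 estimable cell means - 1
df_resid = n - len(kept)
df_interaction = df_model - (4 - 1) - (3 - 1)       # remainder after main effects

ss_total = sum_sq_dev(all_obs)
ss_resid = sum(sum_sq_dev(obs) for obs in kept.values())
ss_model = ss_total - ss_resid
f_model = (ss_model / df_model) / (ss_resid / df_resid)

print(f"n = {n}, df: model {df_model}, interaction {df_interaction}, "
      f"residual {df_resid}")
print(f"Model SS = {ss_model:.5f}, Residual SS = {ss_resid:.5f}, "
      f"F = {f_model:.2f}")
```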
