One-way ANOVA (the F-ratio test)

one-way ANOVA is the standard parametric test of difference between 3 or more samples; it is the parametric equivalent of the Kruskal-Wallis test

H0: μ1 = μ2 = μ3 = ... = μk (the population means are all equal)

so that if just one is different, we will reject H0

we expect the sample means to differ; the question is whether they differ significantly from each other

H0: differences are not significant - differences between sample means have been generated in a random sampling process

H1: differences are significant - the sample means are likely to have come from populations with different means

assumptions

1) data must be at the interval/ratio level

2) sample data must be drawn from a normally distributed population

3) sample data must be drawn from independent random samples

4) the populations from which the samples are drawn must have the same/similar variances (the homoscedasticity assumption); a quick check of assumptions 2 and 4 is sketched below
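
a minimal sketch of how assumptions 2 and 4 might be checked in practice, assuming SciPy is installed (the sample data are the Olympic winning times used later in these notes):

    from scipy import stats

    samples = [
        [10.8, 11.0, 10.8, 10.8],   # 1900-1912
        [10.8, 10.6, 10.8, 10.3],   # 1920-1932
        [10.3, 10.3, 10.4, 10.5],   # 1936-1956
    ]

    # assumption 2: normality of each sample; Shapiro-Wilk has little
    # power with n = 4, so treat this as a rough check only
    for s in samples:
        print(stats.shapiro(s))

    # assumption 4: homoscedasticity - Levene's test compares group variances
    print(stats.levene(*samples))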

ANOVA examines differences in means by looking at estimates of variances

essentially ANOVA poses the following partition of a data value

observation = overall mean + deviation of group mean from overall mean + deviation of observation from group mean

the overall mean is a constant to all observations

deviation of the group mean from the overall mean is taken to represent the effect of belonging to a particular group

deviation from the group mean is taken to represent the effect of all other variables other than the group variable
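
for instance, in example 1 below the overall mean of all ten observations works out to 34/10 = 3.4, so the observation 7 in sample 1 partitions as 7 = 3.4 + (3.7 - 3.4) + (7 - 3.7) = 3.4 + 0.3 + 3.3 (using the rounded sample mean 3.7)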

examples:

consider the numbers below as constituting one data set

example 1:

                observations            x̄k
    sample 1    1    7    3             3.7
    sample 2    3    3    4             3.3
    sample 3    3    7    1    2        3.3

much of the variation is within each row, so it is difficult to tell whether the means are significantly different

as stated before, the variability in all observations can be divided into 3 parts

1) variations due to differences within rows

2) variations due to differences between rows

3) variations due to sampling errors

example 2:

                observations            x̄k
    sample 1    1    1    1    2        1.25
    sample 2    3    3    4             3.33
    sample 3    7    7    8    7        7.25

much of the variation is between the rows, so it is easy to tell that the means are significantly different

[note: one-way ANOVA does not need equal numbers of observations in each row]

in example 1, between-row variation is small compared to within-row variation, therefore F will be small (large values of F indicate difference)

conclude there is no significant difference between means

in example 2, between-row variation is large compared to within-row variation, therefore F will be large

conclude there is a significant difference
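
a quick way to see this contrast numerically is to run both examples through SciPy's one-way ANOVA routine (a minimal sketch, assuming SciPy is installed; note that the unequal group sizes cause no difficulty):

    from scipy.stats import f_oneway

    # example 1: variation is mostly within rows, so F should be very small
    print(f_oneway([1, 7, 3], [3, 3, 4], [3, 7, 1, 2]))

    # example 2: variation is mostly between rows, so F should be very large
    print(f_oneway([1, 1, 1, 2], [3, 3, 4], [7, 7, 8, 7]))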

the first step in ANOVA is to make two estimates of the variance of the hypothesized common population

1) the within samples variance estimate

2) the between samples variance estimate

the within samples variance estimate is

within samples variance estimate = Σ(x - x̄k)² / (N - k), summing over every observation in every sample

where

k is the number of samples

N is the total number of individuals in all samples

x̄k is the mean of sample k

the between samples variance estimate is

between samples variance estimate = Σ n(x̄k - x̄G)² / (k - 1), summing over the k samples

where x̄G is the grand mean (the mean of all observations) and n is the number of observations in a sample

having calculated two estimates of the variance of the hypothesized common population, we ask how probable it is that both values are estimates of the same population variance

to answer this we use the statistic known as the F ratio:

F = between samples variance estimate / within samples variance estimate

significance: critical values are available from tables

df are calculated as follows (both df values appear in the sketch below)

1) for the between samples variance estimate, df = the number of sample means minus 1 (k - 1)

2) for the within samples variance estimate, df = the total number of individuals in the data minus the number of samples (N - k)
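
to make the two estimates, their degrees of freedom and the F ratio concrete, here is a minimal sketch in plain Python (the function names are illustrative, not from any library):

    def mean(xs):
        return sum(xs) / len(xs)

    def f_ratio(samples):
        k = len(samples)                      # number of samples
        N = sum(len(s) for s in samples)      # total number of individuals
        grand = mean([x for s in samples for x in s])   # grand mean

        # between samples variance estimate, df = k - 1
        between = sum(len(s) * (mean(s) - grand) ** 2 for s in samples) / (k - 1)

        # within samples variance estimate, df = N - k
        within = sum((x - mean(s)) ** 2 for s in samples for x in s) / (N - k)

        return between / within, (k - 1, N - k)

    # the Olympic winning times from the worked example below give
    # F of about 9.03 with df = (2, 9)
    print(f_ratio([[10.8, 11.0, 10.8, 10.8],
                   [10.8, 10.6, 10.8, 10.3],
                   [10.3, 10.3, 10.4, 10.5]]))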

since the calculations are somewhat involved, they are best organized in a table

example

winning times for the men's Olympic 100 meter dash over three time periods

                    winning time (in seconds)           x̄k
    1900-1912       10.8    11.0    10.8    10.8        10.85
    1920-1932       10.8    10.6    10.8    10.3        10.625
    1936-1956       10.3    10.3    10.4    10.5        10.375

source: Chatterjee, S. and Chatterjee, S. (1982) 'New lamps for old: an exploratory analysis of running times in Olympic Games', Applied Statistics, 31, 14-22.

H0: there is no significant difference in winning times; the differences in means have been generated by a random sampling process

H1: There are significant differences in winning times. Given observed differences in sample means, it is likely they have been drawn from different populations.

significance level: α = 0.01; if we reject H0 we can be 99% confident the samples come from populations with different means

example table

             1900-1912    1920-1932    1936-1956
             10.8         10.8         10.3
             11.0         10.6         10.3
             10.8         10.8         10.4
             10.8         10.3         10.5
    Σx       43.4         42.5         41.5
    n        4            4            4
    x̄        10.85        10.625       10.375

x̄G = 10.62 (the grand mean: 127.4/12)

 

calculation of the within samples variance estimate

                (x - x̄)                              (x - x̄)²
    1900-1912   1920-1932   1936-1956     1900-1912   1920-1932   1936-1956
     -.05        .175       -.075          .0025       .0306       .0056
      .15       -.025       -.075          .0225       .0006       .0056
     -.05        .175        .025          .0025       .0306       .0006
     -.05       -.325        .125          .0025       .1056       .0156
                              Σ(x - x̄)²:   .03         .1674       .0274

within samples variance estimate = (.03 + .1674 + .0274) / (12 - 3) = .2248 / 9 ≈ .025

calculation of the between samples variance estimate

                 x̄         n      n(x̄ - x̄G)²
    1900-1912    10.85     4      4(10.85 - 10.62)² = 4(.0529) = .2116
    1920-1932    10.625    4      4(10.625 - 10.62)² = 4(.0000) = 0
    1936-1956    10.375    4      4(10.375 - 10.62)² = 4(.0600) = .24

between samples variance estimate = (.2116 + 0 + .24) / (3 - 1) = .4516 / 2 = .2258

analysis of variance table

                         variance estimate    df
    between samples      .2258                2 (k - 1)
    within samples       .025                 9 (N - k)

F = .2258 / .025 = 9.03

critical value = 8.02 at α = .01 (page 277 in the text)

since 9.03 > 8.02, we reject the null hypothesis
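
the result can be cross-checked against SciPy's built-in routine (a minimal sketch, assuming SciPy is installed):

    from scipy.stats import f_oneway

    result = f_oneway([10.8, 11.0, 10.8, 10.8],
                      [10.8, 10.6, 10.8, 10.3],
                      [10.3, 10.3, 10.4, 10.5])
    print(result)   # F comes out near 9.03, with a p-value below .01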

computational formula

one problem in using these formulas, however, is that if the means are approximated, the multiple subtractions compound rounding errors. to get around this, the formulas can be rewritten as

SST = Σx² - T²/N

SSR = Σ(Ti²/ni) - T²/N

N = total number of observations

T is the total sum of observations

it can be shown that

Total sum of squares = Between row sum of squares + within row sum of squares

SST = SSR + SSE

        1900-1912           1920-1932           1936-1956
        x       x²          x       x²          x       x²
        10.8    116.64      10.8    116.64      10.3    106.09
        11.0    121.00      10.6    112.36      10.3    106.09
        10.8    116.64      10.8    116.64      10.4    108.16
        10.8    116.64      10.3    106.09      10.5    110.25
totals          470.92              451.73              430.59

Σx² = 470.92 + 451.73 + 430.59 = 1353.24

where T = 43.4 + 42.5 + 41.5 = 127.4 (the sum of all observations) and Ti is the sum of the observations in row i:

SST = Σx² - T²/N = 1353.24 - (127.4)²/12 = 1353.24 - 1352.5633 = .6767 ≈ .68

SSR = Σ(Ti²/ni) - T²/N = (43.4² + 42.5² + 41.5²)/4 - 1352.5633 = 1353.0150 - 1352.5633 = .4517 ≈ .45

SSE = SST - SSR = .68 - .45 = .23
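
these shortcut calculations are easy to verify in plain Python (a minimal sketch; the variable names are illustrative):

    samples = [[10.8, 11.0, 10.8, 10.8],
               [10.8, 10.6, 10.8, 10.3],
               [10.3, 10.3, 10.4, 10.5]]

    allx = [x for s in samples for x in s]
    N = len(allx)          # total number of observations
    T = sum(allx)          # sum of all observations

    SST = sum(x * x for x in allx) - T * T / N                     # total
    SSR = sum(sum(s) ** 2 / len(s) for s in samples) - T * T / N   # between rows
    SSE = SST - SSR                                                # within rows

    # prints values matching .68, .45 and .23 from the worked example,
    # up to the two-decimal rounding used there
    print(SST, SSR, SSE)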

 

note the F ratio is slightly different when using the shortcut formula: 8.82 vs 9.03

this is a rare situation where the shortcut comes out a little less accurate than the long method; the discrepancy here is due to rounding SST, SSR and SSE to two decimals before dividing, not to the formulas themselves

    source of variation              df           sum of squares    mean square           F-statistic
    between rows/groups/samples      k - 1 (2)    SSR (.45)         SSR/(k - 1) (.225)    MSR/MSE (8.82)
    within rows/groups/samples       N - k (9)    SSE (.23)         SSE/(N - k) (.0255)
    total                            N - 1 (11)   SST (.68)

the F test is one-tailed only (there is no such thing as a two-tailed F test here); it doesn't make sense to specify a direction of difference with k samples

computed value of F (9.03 by the long method, 8.82 by the shortcut) > critical value

df1 = 2, df2 = 9, α = 0.05, critical value = 4.26

df1 = 2, df2 = 9, α = 0.01, critical value = 8.02
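
these critical values can be reproduced with SciPy's F distribution (a minimal sketch, assuming SciPy is installed):

    from scipy.stats import f

    print(f.ppf(0.95, 2, 9))   # critical value at alpha = 0.05, about 4.26
    print(f.ppf(0.99, 2, 9))   # critical value at alpha = 0.01, about 8.02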

we can be more than 99% certain that the differences between the time periods are significant and that the observations in each time period are drawn from populations with different means
