A colleague presented a crosstable on the willingness of kindergarten staff to report suspected child abuse:
Kindergarten (percentages)  Urban  Rural
Would report                 67.7   37.5
Would not report             32.3   62.5
I wanted to enhance the analysis by adding confidence intervals, so I obtained the absolute numbers:
              Urban  Rural  Total
Kindergarten     31     24     55
To create R objects that represent this data, use:
K.urban <- round(c(.677, .323) * 31)
K.rural <- round(c(.375, .625) * 24)
my.matrix <- matrix(c(K.urban, K.rural), byrow = TRUE, ncol = 2)
rownames(my.matrix) <- c("Urban", "Rural")
colnames(my.matrix) <- c("Would report", "Would not report")
Inspect the resulting table:
my.matrix
      Would report Would not report
Urban           21               10
Rural            9               15
Ensure that the original table of percentages can be recovered from the row proportions of my.matrix:
sapply(1:2, function(x) {
  100 * round(my.matrix[x, ] / sum(my.matrix[x, ]), 3)
})
[,1] [,2]
Would report 67.7 37.5
Would not report 32.3 62.5
So, we have a good object to work on.
Now, for each estimate, what is the confidence interval at a confidence level of 0.95?
First, the confidence interval for the first item, "Would report", is given by binom.test:
binom.test(K.urban)$conf.int[1:2]
[1] 0.4862702 0.8331764
binom.test(K.rural)$conf.int[1:2]
[1] 0.1879929 0.5940636
Since 0.48 is less than 0.59, the two confidence intervals overlap, so this sample cannot show a statistically significant difference between the willingness to report cases of suspected child abuse in rural and urban kindergartens. Or can it?
chisq.test(my.matrix)
Pearson's Chi-squared test with Yates' continuity correction
data: my.matrix
X-squared = 3.8447, df = 1, p-value = 0.0499
Here, p < 0.05, thus we have a significant difference.
The difference between the binom.test and the chisq.test lies in the size of n. In the first case, n is defined as the number of cases in each category: rural (n = 24) and urban (n = 31). In the chisq.test, n is defined as the total number of cases (n = 55).
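The point about n can be made concrete; a minimal sketch re-using the counts from above:

```r
# Re-create the counts from the percentages (same as above)
K.urban <- round(c(.677, .323) * 31)
K.rural <- round(c(.375, .625) * 24)
my.matrix <- matrix(c(K.urban, K.rural), byrow = TRUE, ncol = 2)

sum(K.urban)    # n for the urban binom.test: 31
sum(K.rural)    # n for the rural binom.test: 24
sum(my.matrix)  # n for the chisq.test: all 55 cases at once
```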
So, my idea of providing confidence intervals was a bad one. The chi-square test is better since it does not require slicing the data into small (less useful) pieces.
prop.test
While the chi-square test shows that there is (most likely) a non-random association between type of location and willingness to report, it does not specifically test the hypothesis that the willingness to report is lower in rural than in urban kindergartens (or, stated differently, higher in urban than in rural kindergartens). To test that hypothesis we need a one-sided prop.test.
Since the first row of my.matrix happens to be urban, and reporting rather than not reporting is the first column (the column representing "success" in test terminology), the question tested could be stated like this:
Is the willingness to report (proportion of "success") higher ("greater") in urban (first row) than in rural (second row) kindergartens?
prop.test(my.matrix, alternative = c("greater"))
2-sample test for equality of proportions with continuity correction
data: my.matrix
X-squared = 3.8447, df = 1, p-value = 0.02495
alternative hypothesis: greater
95 percent confidence interval:
0.05216612 1.00000000
sample estimates:
prop 1 prop 2
0.6774194 0.3750000
Indeed it is: the p-value now drops to about 0.025.
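The components of the prop.test result can also be extracted programmatically, which is handy when reporting; a small sketch (re-creating the urban-first matrix from above):

```r
K.urban <- round(c(.677, .323) * 31)
K.rural <- round(c(.375, .625) * 24)
my.matrix <- matrix(c(K.urban, K.rural), byrow = TRUE, ncol = 2)

res <- prop.test(my.matrix, alternative = "greater")
res$p.value                        # about 0.025
res$estimate                       # the two proportions, 0.677 and 0.375
res$estimate[1] - res$estimate[2]  # point estimate of the difference, about 0.30
res$conf.int                       # one-sided lower bound, about 0.052
```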
If we, for some reason, want to express the hypothesis the other way around, we must exchange the order of the rows:
my.matrix <- matrix(c(K.rural, K.urban), byrow = TRUE, ncol = 2)
rownames(my.matrix) <- c("Rural", "Urban")
prop.test(my.matrix, alternative = c("less"))

        2-sample test for equality of proportions with continuity correction

data:  my.matrix
X-squared = 3.8447, df = 1, p-value = 0.02495
alternative hypothesis: less
95 percent confidence interval:
 -1.00000000 -0.05216612
sample estimates:
   prop 1    prop 2
0.3750000 0.6774194
I knew something was not quite right with prop.test, and the help page of binom.test has it: prop.test is an approximate test. The help page of prop.test is not optimal here; it refers to binom.test if you want "an exact test of a binomial hypothesis". However, if you have two categories of units (two groups) and want to compare the "proportion of success" between these categories (or groups), what you really want is fisher.test (see http://en.wikipedia.org/wiki/Fisher%27s_exact_test for a background).
fisher.test(my.matrix, alternative = c("less"))

        Fisher's Exact Test for Count Data

data:  my.matrix
p-value = 0.02461
alternative hypothesis: true odds ratio is less than 1
95 percent confidence interval:
 0.0000000 0.8414671
sample estimates:
odds ratio
 0.2927832
Reading up more on Fisher's exact test, I realized that there exists another test that for small n is probably even better than Fisher's exact test: Barnard's test. For an explanation of the advantages of Barnard's test, see Conditional versus Unconditional Exact Tests for Comparing Two Binomials and Analysis of 2 x 2 tables of frequencies: matching test to experimental design.
Strangely enough, there is no implementation of Barnard's test on CRAN, but Tal Galili provides one here.
Barnardextest(my.matrix)
$contingency.table
   A  B
A 21 10
B  9 15

$Wald.Statistic
[1] 2.233813

$Nuisance.parameter
[1] 0.8901

$p.value.one.tailed
[1] 0.01558175
EDIT 2010-09-21 START: After reading "Testing the Equality of Two Independent Binomial Proportions" by Roderick J. A. Little, I have reverted back to Fisher's test again; Barnard's test seems too dependent on ancillary statistics for my taste. Little retells an example originally given by Fisher in a correspondence with Barnard, where the colour of flowering plants was of interest. The plants could be red or white, or not develop flowers at all. Barnard's test takes into account both the number of plants that actually flowered and the number of plants that did not flower, while Fisher's test is applied only to the plants that did flower. Fisher's point is that Barnard's test rewards the inferior horticulturist, who only brings 3/4 of all plants to flowering, with a lower p-value, but that reward is not earned (or persuasive). Consider a good horticulturist who brings all plants to flowering: if he brings up four plants and all of them become red, the p-value of the hypothesis that the likelihood for red is higher than the likelihood for white is (1/2)*(1/2)*(1/2)*(1/2) = 0.0625. For a horticulturist who only succeeds with 3 out of 4 plants, the p-value (of Barnard's test) is (1/2)*(1/2)*(1/2)*(1/2)*(3/4)*(3/4)*(3/4)*(3/4) = 0.0198.
Barnard himself was persuaded by this argument of Fisher's.
I will no longer hold "lacking Barnard's test" against R. EDIT 2010-09-21 STOP
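Fisher's arithmetic in the flowering-plant example above is easy to check:

```r
# Good horticulturist: four plants, all flower, all turn out red
p.good <- (1/2)^4

# Inferior horticulturist: the same four red flowers, but Barnard-style
# accounting also multiplies in the 3/4 flowering rate for each plant
p.inferior <- (1/2)^4 * (3/4)^4

round(p.good, 4)      # 0.0625
round(p.inferior, 4)  # 0.0198
```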
In this case, demonstrating a significant difference between rural and urban areas was quite sufficient; the data was from a pilot study.
But in other cases one would rather report some kind of effect measure than just whether p is below 0.05 or not. The output of the Fisher test includes a confidence interval for the odds ratio, which is a relevant measure of effect.
fisher.test(my.matrix, alternative = c("less"))

        Fisher's Exact Test for Count Data

data:  my.matrix
p-value = 0.02461
alternative hypothesis: true odds ratio is less than 1
95 percent confidence interval:
 0.0000000 0.8414671
sample estimates:
odds ratio
 0.2927832
The odds for reporting in rural and urban areas, respectively, are defined as the proportion reporting divided by the proportion not reporting:
odds.rural <- .375/.625
odds.urban <- .677/.323
round(odds.rural, 2)
[1] 0.6
round(odds.urban, 2)
[1] 2.1
The odds ratio is, of course, defined as the ratio of these odds:
odds.ratio <- odds.rural/odds.urban
odds.ratio
[1] 0.2862629
But the value given for the odds ratio by Fisher's exact test is a more elaborate measure, an estimate that is not so sensitive to small numbers in one cell of the table; see http://en.wikipedia.org/wiki/Odds_ratio#Alternative_estimators_of_the_odds_ratio.
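The difference between the two estimates can be seen directly; a sketch comparing the plain cross-product odds ratio from the counts with the conditional maximum-likelihood estimate that fisher.test reports (rural-first matrix, as above):

```r
K.urban <- round(c(.677, .323) * 31)
K.rural <- round(c(.375, .625) * 24)
my.matrix <- matrix(c(K.rural, K.urban), byrow = TRUE, ncol = 2)

# Plain (cross-product) sample odds ratio: (9/15) / (21/10)
or.sample <- (my.matrix[1, 1] / my.matrix[1, 2]) /
  (my.matrix[2, 1] / my.matrix[2, 2])
or.sample                        # about 0.286

# Conditional MLE reported by fisher.test
fisher.test(my.matrix)$estimate  # about 0.293
```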
TODO: Interpretation of the odds ratio from prop.test() or fisher.test().

The differences in p-values between Barnard's test (0.01558) and the other two, Fisher's test (0.02461) and prop.test (0.02495), are considerable. There was also another data set, where the differences were much smaller.
School (percentages)  Urban  Rural
Would report           69.2   58.6
Would not report       30.8   41.4
And the absolute numbers were as follows:
        Urban  Rural  Total
School     39     29     68
S.urban <- round(c(.692, .308) * 39)
S.rural <- round(c(.586, .414) * 29)
my.matrix <- matrix(c(S.rural, S.urban), byrow = TRUE, ncol = 2)
fisher.test(my.matrix, alternative = c("less"))

        Fisher's Exact Test for Count Data

data:  my.matrix
p-value = 0.2577
alternative hypothesis: true odds ratio is less than 1
95 percent confidence interval:
 0.000000 1.651135
sample estimates:
odds ratio
 0.6340203
So, for schools this sample does not indicate any difference in willingness to report between rural and urban areas.