Combining categories in a chi-square test

Discussion in 'CT3' started by Simon C, Oct 3, 2009.

  1. Simon C

    Simon C Member

    In a chi-square test relating to goodness of fit or contingency tables, how do we decide which categories to combine when expected frequencies are below 5?

    I had previously thought that where the categories are ordered in some way, adjacent categories should be combined. However, in the ActEd notes (Chapter 12, Question 21), non-adjacent categories are combined: the youngest and oldest age categories are put together.

    I appreciate that in this instance combining non-adjacent categories gives a higher number of degrees of freedom. This is because we lose only 1 degree of freedom, rather than the 2 we would lose by combining each sparse category with an adjacent one.
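
The degrees-of-freedom count above can be checked with a short sketch. It assumes a goodness-of-fit test with no parameters estimated from the data, so that the degrees of freedom are the number of categories minus 1. The five age bands and the function name are hypothetical and are not taken from the ActEd question.

```python
def gof_degrees_of_freedom(num_categories, num_estimated_params=0):
    """Degrees of freedom for a chi-square goodness-of-fit test."""
    return num_categories - 1 - num_estimated_params


# Hypothetical ordered age bands; the youngest and oldest are assumed
# to be the ones with expected frequencies below 5.
age_bands = ["<20", "20-39", "40-59", "60-79", "80+"]

# Option A: pool the two sparse end bands into a single category.
pooled_ends = [["<20", "80+"], ["20-39"], ["40-59"], ["60-79"]]

# Option B: pool each sparse end band with its adjacent neighbour.
pooled_adjacent = [["<20", "20-39"], ["40-59"], ["60-79", "80+"]]

df_original = gof_degrees_of_freedom(len(age_bands))
df_pooled_ends = gof_degrees_of_freedom(len(pooled_ends))
df_pooled_adjacent = gof_degrees_of_freedom(len(pooled_adjacent))

print(df_original, df_pooled_ends, df_pooled_adjacent)
```

Under these assumptions, pooling the two end bands leaves one more degree of freedom than pooling each with its neighbour.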

    However the two categories that have been combined here have nothing in common and do not naturally sit together. Therefore why is it appropriate to combine them?

    Thanks in advance for any help provided.
     
    Last edited by a moderator: Oct 3, 2009
  2. I think the general rule is that you combine the category which has an expected frequency <5 with the category which has the next lowest expected frequency. It doesn't have to be adjacent.
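
One possible reading of this rule, as a minimal sketch: repeatedly pool the category with the smallest expected frequency into the one with the next lowest, until every expected frequency is at least 5. The function names, the labels A to E and all counts are made up for illustration; only the threshold of 5 comes from the thread.

```python
def combine_small_categories(labels, observed, expected, threshold=5.0):
    """Pool the smallest-expected category with the next lowest until all
    expected frequencies reach the threshold.

    Returns a list of (labels_in_group, observed, expected) tuples,
    sorted by expected frequency.
    """
    groups = [([label], obs, exp)
              for label, obs, exp in zip(labels, observed, expected)]
    while len(groups) > 1:
        groups.sort(key=lambda group: group[2])
        smallest, next_lowest = groups[0], groups[1]
        if smallest[2] >= threshold:
            break  # every category now meets the threshold
        merged = (smallest[0] + next_lowest[0],
                  smallest[1] + next_lowest[1],
                  smallest[2] + next_lowest[2])
        groups = [merged] + groups[2:]
    return groups


def chi_square_statistic(groups):
    """Sum of (O - E)^2 / E over the (possibly pooled) categories."""
    return sum((obs - exp) ** 2 / exp for _, obs, exp in groups)


# Hypothetical data: categories A and E have expected frequencies below 5.
labels = ["A", "B", "C", "D", "E"]
observed = [1, 12, 20, 14, 3]
expected = [2.5, 11.5, 18.0, 14.0, 4.0]

pooled = combine_small_categories(labels, observed, expected)
for group_labels, obs, exp in pooled:
    print("+".join(group_labels), obs, exp)
print("statistic:", chi_square_statistic(pooled))
```

Here A (expected 2.5) is pooled with E (expected 4), even though they are not adjacent, because E has the next lowest expected frequency. Whether that pooling is meaningful for the data at hand is a separate question that the sketch does not address.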
     
  3. Simon C

    Simon C Member

    Thanks very much. I’m still a little unclear on why this is appropriate though.

    My query can also be extended to situations where the categories are not ordered in any way:

    Consider a contingency table test where we are trying to see if the subject that a student chooses to study is independent of their sex.

    The possible subjects are French, German, Biology, Chemistry and Physical Education.

    Let us say we get an expected frequency of 3 for French and 4 for Biology so that these categories need to be combined.
    We could combine them together to create a new category “French and Biology” and lose only 1 degree of freedom.

    However, due to the similarities between types of subject, I might expect that the gender trends for French and German might be more similar than those for French and Biology. I wonder if it would therefore be more meaningful to combine French and German to create a new category "Languages" and combine Biology and Chemistry to create a new category "Sciences". These groupings seem to sit more naturally with each other but of course we would then lose 2 degrees of freedom instead of one.
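
The two groupings can be compared with a small sketch. It assumes a 2 x 5 table (sex by subject) tested for independence, so the degrees of freedom are (rows - 1)(columns - 1). The counts are invented so that the expected frequencies for French and Biology come out at 3 and 4, matching the example; the helper names are hypothetical.

```python
def contingency_df(num_rows, num_cols):
    """Degrees of freedom for a chi-square test of independence."""
    return (num_rows - 1) * (num_cols - 1)


def merge_columns(table, columns, grouping):
    """Sum the counts of the columns named in each group, row by row."""
    index = {name: i for i, name in enumerate(columns)}
    return [[sum(row[index[name]] for name in group) for group in grouping]
            for row in table]


def expected_frequencies(table):
    """Expected counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


subjects = ["French", "German", "Biology", "Chemistry", "Physical Education"]

# Invented counts (rows: male, female), chosen only to reproduce the
# expected frequencies of 3 for French and 4 for Biology.
counts = [
    [2, 9, 3, 12, 14],
    [4, 11, 5, 8, 12],
]

# Grouping 1: pool the two sparse subjects together.
french_biology = [["French", "Biology"], ["German"], ["Chemistry"],
                  ["Physical Education"]]
# Grouping 2: pool by subject type ("Languages" and "Sciences").
by_subject_type = [["French", "German"], ["Biology", "Chemistry"],
                   ["Physical Education"]]

print("original:", contingency_df(2, len(subjects)),
      expected_frequencies(counts)[0])
for name, grouping in [("French+Biology", french_biology),
                       ("Languages/Sciences", by_subject_type)]:
    merged = merge_columns(counts, subjects, grouping)
    print(name, contingency_df(len(merged), len(merged[0])),
          expected_frequencies(merged)[0])
```

With these made-up counts both groupings bring every expected frequency up to at least 5. The first keeps 3 degrees of freedom and the second keeps 2, which is the trade-off described above.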

    So in summary I’m still a bit confused about which factors we should take into account when combining categories: is it just the degrees of freedom, or does the appropriateness of the combinations also matter? Also, does it make any difference whether we are doing a goodness of fit test or a contingency table test?

    Thanks in advance for any help provided.
     
    Last edited by a moderator: Oct 5, 2009
  4. Categories with low expected frequencies are combined because the chi-square test is unreliable when an expected frequency is less than 5.

    If the frequency is too small, the chi-square value is overestimated, and if it is too large, the chi-square value is underestimated. Hence we combine with the category that has the lowest frequency.

    As mentioned in your example, if we combine the languages together and German had an expected frequency of, say, 15, then by adding French we might get an incorrect (overestimated) value for the chi-square statistic.
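
One way to see why small expected frequencies matter is to look at a single cell's contribution to the statistic, (O - E)^2 / E. The sketch below is an illustration only: it uses the expected frequencies 3 and 15 mentioned in the thread and an arbitrary, assumed deviation of 2 between observed and expected counts.

```python
def cell_contribution(observed, expected):
    """One category's contribution to the chi-square statistic."""
    return (observed - expected) ** 2 / expected


deviation = 2  # assumed difference between observed and expected counts

small_expected = cell_contribution(3 + deviation, 3)
large_expected = cell_contribution(15 + deviation, 15)

print(round(small_expected, 3), round(large_expected, 3))
```

The same deviation contributes roughly 1.33 when the expected frequency is 3 but only about 0.27 when it is 15, so a sparse category can dominate the statistic.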

    Hope this helps.
     
    Last edited by a moderator: Oct 5, 2009
