Some questions about CMP CS2 CH17 to CH21

Discussion in 'CS2' started by ykai, Aug 12, 2023.

  1. ykai

    ykai Ton up Member

    1. CMP CS2-CH20-question 20.2-(i)
    Where does the notation u come from?
    I can't find it in the question.

    2. CMP CS2-CH21-question 21.9-(ii)
    How do we decide how many nodes to use in the calculation?
    Each tree clearly has 3 nodes (tests 1, 2, 3 and tests 4, 5, 6), but the answer uses only 2 for each tree.

    I think it may be, for tree 1:
    G_1=1-(3/4)^2-(1/4)^2
    G_2=1-1^2
    G_3=1-(4/8)^2-(4/8)^2

    G=4*G_1+0+8*G_3

    I think it may be, for tree 2:
    G_1=1-(5/6)^2-(1/6)^2
    G_2=1-1^2
    G_3=1-(3/6)^2-(3/6)^2

    G=6*G_1+6*G_3

    3. CS2 Assignment X5-5.4-(ii)
    Why is a copula with lower tail dependence suitable?
    Is this because, in the graph in CMP CS2-CH17-page29, a copula with lower tail dependence is denser at low probabilities?
    Does this mean that the correlation is higher when the probability is low?
     
    Last edited: Aug 12, 2023
  2. ykai

    ykai Ton up Member

    For 2, I mean G_2 is 0 because all the items are of the same type.

    I got this from CMP CS2-CH21-page35: "So, if all the items are of the same type, the probability will be 0."
    sum^K_(k=1)[p_jk(1-p_jk)]

    On pages 37 and 38:
    the first tree sums from node 1 to 2
    the second tree sums from node 1 to 3
    Shouldn't question 21.9 sum from node 1 to 4 for each tree?
    I mean sum^3_(k=1)[p_jk(1-p_jk)].

    For greedy splitting, we should sum from node 1 to 3 to maximise the reduction in a loss function.
    To maximise (3-node formula - 4-node formula), shouldn't we use the 3-node formula, since the 4-node formula = 0?
    Max [sum^3_(k=1)[p_jk(1-p_jk)]-sum^4_(k=1)[p_jk(1-p_jk)]]
     
    Last edited: Aug 13, 2023
  3. ykai

    ykai Ton up Member

    4. What is the meaning of "As K -> infinity, the upper limit tends to 1" in CH21-page61?
     
  4. Andrew Martin

    Andrew Martin ActEd Tutor Staff Member

    Hi Ykai

    1. Apologies, there is actually a typo in the question. The first line should read:

    .... Poisson distribution with parameter mu.

    The 'X = 1' is wrong. Again, apologies for the confusion here; we will get a correction issued.

    2. This question is asking about the initial split point. So, for the first tree, it is considering the outcome of test 1 only, which splits the data into a node with BBBBCCCDDDD and a node with AAAB.

    3. The solutions here are comparing the probabilities and saying, on the basis of this calculation, that the Clayton copula appears more suitable given the probability is higher. This is what we'd expect due to the widowhood effect / broken heart syndrome. The solution doesn't discuss tail dependency.

    4. As K tends to infinity, 1/K tends to 0 and 1 - 1/K tends to 1. In terms of what this means, it is saying that the worst possible purity score increases as the number of categories increases. For example, if K = 2, the worst possible purity score is 1 - 1/2 = 1/2. If K = 3, the worst possible purity score is 1 - 1/3 = 2/3, and so on.

    Hope this helps

    Andy
     
    ykai likes this.
  5. ykai

    ykai Ton up Member

    Thank you for your reply, but I still have one question about 4.
    What I want to know is: what is the relationship between the "measure of purity for the jth external node" and "1-1/K"?
    How does this result come about?
     
  6. Andrew Martin

    Andrew Martin ActEd Tutor Staff Member

    The measure of purity (or perhaps more accurately, impurity) for a particular node j is given by:

    sum(k = 1, K) [pjk(1-pjk)]

    where pjk is the proportion of individuals in node j in category k and K is the total number of categories.

    This formula is effectively the probability that two items selected at random are of different types. So, the higher this is, the more impure the node is (the more mixed up the node is with different categories). The lower this is, the purer the node. The lowest value possible is 0, where it is not possible to pick two items of different types (as a perfectly pure node contains only one category).
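
    As a rough check of this interpretation, the sketch below compares the formula with a direct count of item pairs of different types. It assumes the two items are drawn independently with replacement (that assumption is mine, not stated in the thread), and the function names are hypothetical. The example nodes AAAB and BBBBCCCDDDD are the ones from the first split mentioned earlier in the thread.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def impurity_from_labels(labels):
    """sum over categories k of p_jk * (1 - p_jk) for one node."""
    n = len(labels)
    counts = Counter(labels)
    return sum(Fraction(c, n) * (1 - Fraction(c, n)) for c in counts.values())


def prob_different_types(labels):
    """Probability that two items drawn WITH replacement (assumption) differ."""
    n = len(labels)
    differing = sum(1 for a, b in product(labels, repeat=2) if a != b)
    return Fraction(differing, n * n)


for node in ("AAAB", "BBBBCCCDDDD", "BBBB"):
    items = list(node)
    print(node, impurity_from_labels(items), prob_different_types(items))
```

    For each node the two numbers agree, and the pure node BBBB gives 0.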

    The value of 1 - 1/K is the worst possible value (highest) this measure can take, which depends on the number of categories, K.

    For example, the worst possible purity for a node when there are two categories is when there is an equal number of each category in that node (so it is not pure at all, it is as mixed up as it can be). In this case, we get:

    sum(k = 1, K) [pjk(1-pjk)] = 0.5 * (1-0.5) + 0.5 * (1-0.5) = 0.5.

    Using the formula of 1 - 1/K for the worst case, we get 1- 1/2 = 0.5, which matches as expected.
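
    The same check can be repeated for other values of K. The sketch below is only illustrative (the function names are hypothetical): it evaluates the measure for a node split equally across K categories and compares it with 1 - 1/K, using exact fractions to avoid rounding.

```python
from fractions import Fraction


def impurity(proportions):
    """sum over categories k of p_jk * (1 - p_jk), given the proportions."""
    return sum(p * (1 - p) for p in proportions)


def worst_case(K):
    """Upper limit 1 - 1/K quoted in the thread."""
    return 1 - Fraction(1, K)


for K in (2, 3, 4, 10, 100):
    equal_split = [Fraction(1, K)] * K  # node as mixed up as it can be
    print(K, impurity(equal_split), worst_case(K))

# An unequal split with K = 2 is less impure than the equal split.
print(impurity([Fraction(3, 4), Fraction(1, 4)]), worst_case(2))
```

    The equal split matches 1 - 1/K in every case, and the values move towards 1 as K grows.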

    Hope this helps!

    Andy
     
    ykai likes this.
  7. ykai

    ykai Ton up Member

    Thank you for your reply!
    I understand it completely now!
     
