Ch 13 knowing the distribution of Y_i

Molly · Mar 12, 2023

Hi all,

It isn't obvious to me how to decide the distribution of the response variables, and therefore the link function.

Is there a rule of thumb?

Thanks

CapitalActuary · Mar 12, 2023

Response variable distribution choice

There are 'standard' choices of distribution based on the properties you expect your response variable to have. Here are some examples - by no means exhaustive. I will add to the examples if I find the time / energy / as I feel inclined to!

Poisson / Negative binomial
The simplest choice for modelling a count, like the number of claims, is the Poisson distribution. It takes integer values, which makes it suitable for counting, it's non-negative, and it doesn't limit the maximum value we can get. All of these make it a reasonable choice for modelling something we're counting which can be potentially unbounded.
One limitation of the Poisson distribution is that the variance is equal to the mean. For data where we expect the variance to be bigger than the mean (overdispersion), we would usually pick the negative binomial distribution instead. This has the same properties I listed above for the Poisson distribution, but it has two parameters instead of one, which allows the variance to be larger than the mean - i.e. it's a more flexible choice than Poisson in this case.
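A quick way to see the mean/variance point is to simulate both distributions and compare variance divided by mean. This is only a sketch: the mean of 3.0, the shape of 2.0, the seed and the helper names are illustrative assumptions, and the negative binomial is generated as a Poisson with a gamma-distributed rate, which is one standard construction.

```python
import math
import random
from statistics import fmean, pvariance


def poisson_sample(rng, lam):
    """Draw one Poisson(lam) value using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def negative_binomial_sample(rng, mean, shape):
    """Draw a negative binomial value as a Poisson with a gamma-distributed rate."""
    rate = rng.gammavariate(shape, mean / shape)  # gamma with expected value `mean`
    return poisson_sample(rng, rate)


def dispersion_ratio(counts):
    """Variance divided by mean: about 1 for Poisson, above 1 when overdispersed."""
    return pvariance(counts) / fmean(counts)


rng = random.Random(0)  # fixed seed so the run is repeatable
poisson_counts = [poisson_sample(rng, 3.0) for _ in range(20000)]
negbin_counts = [negative_binomial_sample(rng, 3.0, 2.0) for _ in range(20000)]

print("Poisson variance/mean:", round(dispersion_ratio(poisson_counts), 2))
print("Negative binomial variance/mean:", round(dispersion_ratio(negbin_counts), 2))
```

If the same ratio computed on real claim counts sits well above 1, that is a hint that Poisson is too restrictive.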

Bernoulli / binomial
For modelling something which either happens or doesn't happen, like the result of an experiment with two possible outcomes, a Bernoulli random variable describes this perfectly. The binomial distribution is a natural extension of the Bernoulli distribution, for an experiment which is repeated several times over independently. E.g. if I toss a coin once, the outcome heads/tails can be modelled as a Bernoulli random variable. If I toss a coin 5 times, the number of heads (or the number of tails) can be modelled as a binomial random variable.

I could also model whether it rains tomorrow using a Bernoulli random variable. I might be on shakier ground if I tried to model the number of days it rains over the next week as a binomial random variable though, because the days probably are not independent of one another. E.g. if it's raining on the first 4 days, it's probably more likely to rain on the fifth as well, because rain on the first four days suggests we're in a rainy time of year. (I'm not a weather expert, just trying to give a plausible simple example!)
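For the coin example, the binomial probabilities can be written out directly. A minimal sketch, assuming a fair coin; the function name is my own:

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials, each with success probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n_tosses, p_heads = 5, 0.5
for heads in range(n_tosses + 1):
    print(heads, "heads:", binomial_pmf(heads, n_tosses, p_heads))

# With n = 1 the binomial reduces to the Bernoulli distribution.
print("single toss, P(heads):", binomial_pmf(1, 1, p_heads))
```

Note the formula relies on the tosses being independent, which is exactly the assumption that looks doubtful in the rain example.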

Normal
The normal distribution is a common distribution for modelling outcomes which can be negative or positive, and don't need to be whole numbers.

Part of the reason for this is the central limit theorem (CLT). This says that the sum of a number of independent and identically distributed random variables with finite variances will (once suitably standardised) tend to a normal distribution as the number of variables grows - regardless of the underlying distribution of the random variables! E.g. if I roll a dice once, this has probability 1/6 of being each of the numbers 1,2,3,4,5,6 - which isn't really like the normal distribution. But if I roll the dice 30 or more times and add up the results, the outcome is more or less normally distributed, even though the original distribution wasn't.
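The dice example is easy to check by simulation. A rough sketch, where the number of experiments and the seed are arbitrary choices of mine:

```python
import random
from statistics import fmean, pstdev

rng = random.Random(1)  # fixed seed so the run is repeatable
n_rolls, n_experiments = 30, 20000

# Each experiment: roll a fair dice 30 times and add up the results.
totals = [sum(rng.randint(1, 6) for _ in range(n_rolls)) for _ in range(n_experiments)]

theory_mean = n_rolls * 3.5              # a single roll has mean 3.5
theory_sd = (n_rolls * 35 / 12) ** 0.5   # a single roll has variance 35/12

# For a normal distribution roughly 68% of outcomes lie within one standard deviation.
within_one_sd = sum(abs(t - theory_mean) <= theory_sd for t in totals) / n_experiments

print("simulated mean and sd:", round(fmean(totals), 2), round(pstdev(totals), 2))
print("theoretical mean and sd:", theory_mean, round(theory_sd, 2))
print("share within one sd:", round(within_one_sd, 3))
```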

The normal distribution is also well understood, relatively 'nice' to work with and do computations with, and commonly used so people are familiar with it. All of this makes it a good choice for simple models.

Non-normal
However, the normal distribution is clearly unsuitable for some types of modelling, if any of the following is true:

Our variable has 'fat tails', i.e. extreme events are more likely than the normal distribution would suggest. This is true for a lot of financial data, including investment returns and the loss distributions of insurance claims.
Our variable has to be only positive or only negative.
Our variable is skewed and/or asymmetric. For example, most of the time car insurance claims are not that huge, but every once in a while you might get a multi-car pileup and lots of people needing medical care for the rest of their lives. This would result in skewness, because most of the time the claim size is below the mean, but every once in a while it's very far above the mean, which would push the third central moment to be quite large.

If you think your response variable should be fat tailed and symmetric, and it can be positive or negative, the Student t distribution is a natural choice. This is essentially just a fatter tailed version of the normal distribution, where if you tend the degrees of freedom to infinity you actually recover a normal distribution.
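To get a feel for how much fatter the Student t tails are, here is a sketch comparing the chance of landing more than 4 units from zero. The 3 degrees of freedom, the cutoff of 4 and the seed are assumptions for illustration; the t draws are built as a standard normal divided by the square root of a chi-squared over its degrees of freedom.

```python
import random
from statistics import NormalDist


def student_t_sample(rng, df):
    """Draw Student t as Z / sqrt(V / df), Z standard normal, V chi-squared(df)."""
    z = rng.gauss(0.0, 1.0)
    v = rng.gammavariate(df / 2, 2.0)  # chi-squared(df) is gamma(shape=df/2, scale=2)
    return z / (v / df) ** 0.5


rng = random.Random(4)  # fixed seed so the run is repeatable
df, cutoff, n_draws = 3, 4.0, 20000

# Simulated share of t draws beyond the cutoff in either direction.
t_tail = sum(abs(student_t_sample(rng, df)) > cutoff for _ in range(n_draws)) / n_draws

# Exact two-sided tail probability for the standard normal.
normal_tail = 2 * (1 - NormalDist().cdf(cutoff))

print("Student t tail share:", t_tail)
print("normal tail probability:", normal_tail)
```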

If your response variable should be positive, skewed, and fat tailed, then the lognormal distribution is a reasonable choice. If you take the log of a lognormal variable, you get a normally distributed one.
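The lognormal properties can be checked in a few lines. A sketch only: the parameters mu = 0 and sigma = 1, the seed and the variable names are my own choices, not anything from the notes.

```python
import math
import random
from statistics import fmean, median, pstdev

rng = random.Random(2)  # fixed seed so the run is repeatable
mu, sigma = 0.0, 1.0

claims = [rng.lognormvariate(mu, sigma) for _ in range(20000)]
log_claims = [math.log(c) for c in claims]

print("all positive:", min(claims) > 0)
# Right skew shows up as the mean sitting above the median.
print("mean, median:", round(fmean(claims), 2), round(median(claims), 2))
# Taking logs should bring back a normal distribution with the mu and sigma above.
print("mean, sd of logs:", round(fmean(log_claims), 2), round(pstdev(log_claims), 2))
```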

Geometric / exponential
These are suitable for modelling things which are positive valued and are commonly used for modelling waiting times. The geometric distribution is perfect for modelling the number of experiments it takes until a success happens, where the experiments are independent. In other words the number of 'Bernoulli trials' it takes until a success happens. The geometric distribution is integer valued and the exponential distribution is basically a continuous version of it, which can take any non-negative values. The exponential distribution also arises as the waiting time distribution between events in a Poisson process.
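Both waiting time distributions are simple to simulate. In this sketch the success probability 0.2 and the rate 0.2 are made-up values, chosen so that both have a mean wait of 5:

```python
import random
from statistics import fmean


def trials_until_success(rng, p):
    """Count independent Bernoulli(p) trials up to and including the first success."""
    trials = 1
    while rng.random() >= p:  # a failure occurs with probability 1 - p
        trials += 1
    return trials


rng = random.Random(3)  # fixed seed so the run is repeatable
p_success, rate = 0.2, 0.2

geometric_draws = [trials_until_success(rng, p_success) for _ in range(20000)]
exponential_draws = [rng.expovariate(rate) for _ in range(20000)]

# Geometric waits are whole numbers with mean 1/p; exponential waits are
# continuous with mean 1/rate.
print("geometric mean wait:", round(fmean(geometric_draws), 2))
print("exponential mean wait:", round(fmean(exponential_draws), 2))
```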

There is so much to say about distribution choice it almost feels impossible to do justice. I think a lot of it is easiest to pick up by example. If you do lots of questions you'll end up getting a feel for which distributions are used when, and you might start to understand why.

Link function
You'll have read in the notes, I expect, that there is a canonical link function for each distribution choice in the exponential family. However, perhaps it also helps to give you some intuition about why a link function is useful in a GLM.

The point of the link function 'g' is to cope with the generalisation from a linear model E[Y]=XB (where B is the vector of coefficients and X is your data) to a generalised linear model (GLM) where g(E[Y])=XB. In a standard linear regression, your model XB can take values between -infinity and +infinity, because there are no special constraints on the value of XB.

However, in a GLM setting, we can choose Y to have a Poisson distribution, for example. Then it doesn't make sense to have E[Y]=XB as the left hand side must be positive and the right hand side goes between -infinity and +infinity. So you introduce a link function 'g' so that g(E[Y]) can go between -infinity and +infinity as well, to match the range of the right hand side XB. Now we can have a sensible model like g(E[Y])=XB where both sides go between -infinity and +infinity. The canonical link function for the Poisson distribution is g=log; note here that log(E[Y]) will go between -infinity and +infinity. Equivalently, the inverse of g is the 'exp' function, and the model E[Y]=exp(XB) makes sense because both sides are guaranteed to be positive.
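Here is a small sketch of the log link in action. The coefficients and data rows are invented purely for illustration; the point is that however negative XB gets, exp(XB) stays positive.

```python
import math


def linear_predictor(x_row, coefficients):
    """XB for one observation: the sum of each covariate times its coefficient."""
    return sum(x * b for x, b in zip(x_row, coefficients))


def poisson_mean(eta):
    """Inverse of the log link: E[Y] = exp(XB)."""
    return math.exp(eta)


coefficients = [0.5, -1.2]                            # hypothetical B, first entry is the intercept
rows = [[1.0, 0.0], [1.0, 2.0], [1.0, 10.0]]          # hypothetical X with a leading 1 per row

for x_row in rows:
    eta = linear_predictor(x_row, coefficients)
    mu = poisson_mean(eta)
    # eta can be any real number, mu is always positive, and log(mu) recovers eta.
    print("XB =", round(eta, 2), " E[Y] =", mu, " log(E[Y]) =", round(math.log(mu), 2))
```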

Let's do another example. Let's choose Y to have a Bernoulli distribution instead, so Y can be either 0 or 1. So E[Y] has to be between 0 and 1. So now it's helpful to choose some g such that g(E[Y]) goes between -infinity and +infinity. The canonical link function for the Bernoulli distribution is the logit function g(p)=logit(p)=log(p/(1-p)), which will map values of p between 0 and 1 to values between -infinity and +infinity. Again, this will make your model logit(E[Y])=XB make sense because both sides have the same range again.
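The logit function and its inverse are short enough to write out. A minimal sketch, with function names of my own choosing:

```python
import math


def logit(p):
    """Map a probability strictly between 0 and 1 to the whole real line."""
    return math.log(p / (1 - p))


def inverse_logit(eta):
    """Map any real number back to a probability between 0 and 1."""
    if eta >= 0:
        return 1 / (1 + math.exp(-eta))
    z = math.exp(eta)  # this branch avoids overflow for very negative eta
    return z / (1 + z)


for p in (0.01, 0.25, 0.5, 0.75, 0.99):
    eta = logit(p)
    print("p =", p, " logit(p) =", round(eta, 3), " back again:", round(inverse_logit(eta), 3))
```

So fitting logit(E[Y])=XB is the same as saying E[Y]=inverse_logit(XB), which always lands between 0 and 1.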

Your link function also describes the non-linear relationship E[Y] and XB have with each other. E.g. in the Bernoulli case I could choose some other function g which is not the canonical link function logit, but which would also accomplish g(E[Y]) going between -infinity and +infinity (the probit function is one such choice). But the choice of a different g would impact my regression, because it describes the non-linear way in which E[Y] depends on XB. It's sensible to choose the canonical link function in most cases.
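To illustrate how a different g changes things, this sketch compares the logit link with the probit link for a Bernoulli response. Probit (the inverse of the standard normal CDF) is my own choice of alternative for the comparison. Both map probabilities onto the whole real line, but the same value of XB gives a different E[Y] under each.

```python
import math
from statistics import NormalDist


def inverse_logit(eta):
    """E[Y] implied by the logit link for a given XB."""
    return 1 / (1 + math.exp(-eta))


def inverse_probit(eta):
    """E[Y] implied by the probit link: the standard normal CDF evaluated at XB."""
    return NormalDist().cdf(eta)


for eta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(
        "XB =", eta,
        " logit link E[Y] =", round(inverse_logit(eta), 4),
        " probit link E[Y] =", round(inverse_probit(eta), 4),
    )
```

The two agree at XB=0 and then pull apart, so the fitted coefficients would differ depending on which link you pick.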

Finally, you might find it useful to see the nice table of the 'canonical choices' for link functions of various response variable distribution choices here: https://en.wikipedia.org/wiki/Generalized_linear_model#Link_function

Molly · Mar 13, 2023

Hi,

Thank you so so much for this answer, can't tell you how much I appreciate the detail you have gone into here!

Thank you also for the notes on the link function, you have explained this so much more clearly than the CMP and I feel much more comfortable now!

Thanks!
