Test to know when to use GLM over Linear Regression? - Cross Validated
Linear model vs Generalised linear model vs Generalised mixed effect model - big confusion
What is the difference between the general linear model (GLM) and the generalized linear model (GZLM)?
modeling - General Linear Model vs. Generalized Linear Model (with an identity link function?) - Cross Validated
As with many other cases in statistics, the goal of finding a single test to replace one's judgement is a bad one.
There are several sources of information you can and should use while deciding: the theoretical expectation of the distribution, prior empirical work on the topic, the properties of the data (e.g. is it truncated or zero-inflated?), and the residual distributions and other diagnostics after fitting models. But there is no single, general test (or even a set of tests) that will tell you what to do.
And there cannot be one. I recognise the intuitive appeal of having a decision tree to follow when making such a choice, especially in an area that is complex and new to you. But there are few hard boundaries in the areas you need to consider, and so this decision does not lend itself well to such a workflow. You need to use judgement, and developing that will take time and practice.
Another great answer from @mkt on this forum. Here are a few more pointers you might find useful.
GLMs include some widely used types of regression models:
- Binary Logistic Regression Models;
- Binomial Logistic Regression Models;
- Multinomial Logistic Regression Models;
- Ordinal Logistic Regression Models;
- Poisson Regression Models;
- Beta Regression Models;
- Gamma Regression Models.
As pointed out by @COOLSerdash in his comment, beta regression models share some features with GLMs (McCullagh and Nelder 1989) - such as a linear predictor, a link function, and a dispersion parameter - but are NOT special cases of the GLM framework. However, I included them in the above list because of their similarity to GLMs and their practical value.
A good place to start would be to familiarize yourself with each of these types of models and when each might be used.
Binary Logistic Regression Models
These types of models are used to model the relationship between a binary dependent variable Y and a set of independent variables X1, ..., Xp.
For example, Y could represent the survival status of patients at a local hospital assessed 30 days following a surgical intervention for treating a particular disease such that Y = 1 for a patient who survived and Y = 0 for a patient who died. Furthermore, if p = 2, then X1 could represent Age (expressed in years) and X2 could represent gender. For all the subsequent examples below, it will be assumed that p = 2 and that X1 and X2 will have the same meaning as in the current example.
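A sketch of this example in Python's statsmodels (the data are simulated and the variable names `survived`, `age`, `female` are my own illustrative choices, not from the original study description):

```python
# Binary logistic regression: 30-day survival (1/0) on Age and gender.
# Simulated data; coefficients below are assumed for illustration.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 500
age = rng.uniform(40, 90, n)
female = rng.integers(0, 2, n)
eta = 8.0 - 0.10 * age + 0.5 * female       # true linear predictor
p = 1 / (1 + np.exp(-eta))                  # inverse logit
survived = rng.binomial(1, p)
df = pd.DataFrame({"survived": survived, "age": age, "female": female})

fit = smf.logit("survived ~ age + female", data=df).fit(disp=0)
print(fit.params)   # coefficients on the log-odds scale
```

In R the analogous call would be `glm(survived ~ age + female, family = binomial)`.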
Binomial Logistic Regression Models
These types of models are used to model the relationship between a binomial dependent variable Y and a set of independent variables X1, ..., Xp.
For example, Y could represent the number of correct questions (out of 10) answered by patients on a questionnaire eliciting their knowledge of the symptoms associated with their disease.
Multinomial Logistic Regression Models
These types of models are used to model the relationship between a nominal dependent variable Y with more than 2 categories and a set of independent variables X1, ..., Xp.
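For instance, Y could be a nominal outcome with three unordered categories (say, discharge destination: home, rehabilitation facility, other - labels I am inventing for illustration). A sketch with simulated data in statsmodels:

```python
# Multinomial logistic regression: 3-category nominal outcome on age, gender.
# Simulated data; category-specific coefficients are illustrative assumptions.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 600
age = rng.uniform(40, 90, n)
female = rng.integers(0, 2, n)
# Linear predictors for categories 1 and 2, relative to baseline category 0
u1 = -2.0 + 0.03 * age
u2 = -1.0 + 0.01 * age + 0.3 * female
expu = np.column_stack([np.ones(n), np.exp(u1), np.exp(u2)])
probs = expu / expu.sum(axis=1, keepdims=True)
y = np.array([rng.choice(3, p=p_i) for p_i in probs])
df = pd.DataFrame({"y": y, "age": age, "female": female})

fit = smf.mnlogit("y ~ age + female", data=df).fit(disp=0)
print(fit.params)   # one coefficient column per non-baseline category
```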
Ordinal Logistic Regression Models
These types of models are used to model the relationship between an ordinal dependent variable Y and a set of independent variables X1, ..., Xp.
For example, Y could represent the degree of pain experienced by patients immediately after surgery, expressed on an ordinal scale from 1 to 5, where 1 stands for no pain and 5 stands for severe pain.
Poisson Regression Models
These types of models are used to model the relationship between a count dependent variable Y and a set of independent variables X1, ..., Xp.
For example, Y could represent the number of hospital days (out of 30) on which patients had to use pain-relieving medication following their surgery.
Beta Regression Models
These types of models are used to model the relationship between a dependent variable Y expressed as a continuous proportion taking values in the open interval (0,1) and a set of independent variables X1, ..., Xp.
For example, if the disease in question is a brain disease, Y could represent the fraction of the brain area still affected by disease 30 days post-surgery relative to the total brain area for patients who survived the surgery.
Gamma Regression Models
These types of models are used to model the relationship between a positive-valued, continuous dependent variable Y and a set of independent variables X1, ..., Xp.
For example, Y could represent the healthcare utilization costs of patients who survived up to the 30-day mark.
Hi, I am a PhD student and have been studying statistics with R for some months. I have done courses on basic statistics (linear and logistic regression, ANOVA, etc.) and now would like to go deeper.
For one study I am conducting I have been suggested to look at generalised and logistic mixed effect models. I have looked online and found a lot of info, but now I am very confused!
So far, I have understood the following:
- linear models can be used when the residuals follow a normal distribution, and generalised linear models when the residuals do not follow a normal distribution. Is this correct? I wasn't told anything about this during my courses on linear models. How do I run this in R?
- mixed effect models are used when you have repeated measurements for the same subjects. But I am not sure about this. Additionally, I am not sure how to run this analysis in R.
I have found lots of books on these topics, but they go very deep and include lots of maths and formulae. What I would like is a book/course that explains when it is best to use each model, how to run the analysis, and which assumptions to check - without going into formulae and other technical stuff.
Does anyone have something like this to recommend? If not, it would be great if someone could explain a bit and maybe give some examples of analyses.
Thanks!
A generalized linear model specifying an identity link function and a normal family distribution is exactly equivalent to a (general) linear model. If you're getting noticeably different results from each, you're doing something wrong.
Note that specifying an identity link is not the same thing as specifying a normal distribution. The distribution and the link function are two different components of the generalized linear model, and each can be chosen independently of the other (although certain links work better with certain distributions, so most software packages specify the choice of links allowed for each distribution).
Some software packages may report noticeably different $p$-values when the residual degrees of freedom are small if it calculates these using the asymptotic normal and chi-square distributions for all generalized linear models. All software will report $p$-values based on Student's $t$- and Fisher's $F$-distributions for general linear models, as these are more accurate for small residual degrees of freedom as they do not rely on asymptotics. Student's $t$- and Fisher's $F$-distributions are strictly valid for the normal family only, although some other software for generalized linear models may also use these as approximations when fitting other families with a scale parameter that is estimated from the data.
I would like to include my experience in this discussion. I have seen that a generalized linear model (specifying an identity link function and a normal family distribution) is identical to a general linear model only when you use the maximum likelihood estimate as the scale parameter method. Otherwise, if "fixed value = 1" is chosen as the scale parameter method, you get very different $p$-values. My experience suggests that "fixed value = 1" should usually be avoided. I'm curious to know if someone knows when it is appropriate to choose "fixed value = 1" as the scale parameter method. Thanks in advance. Mark
I was recently having a debate with a data scientist (with little statistical training) about GLMs. He believes that GLMs (such as logistic regression) are linear. I have some statistical training, and as far as I have heard, many of my peers don't consider GLMs to be linear.
I started probing further to substantiate my claim and came across these quotes in a Quora answer.
Link to the post -> https://www.quora.com/Why-is-logistic-regression-considered-a-linear-model
For the benefit of others, this answer is at odds with what statisticians have meant by "linear model" ever since the term "generalized linear model" was introduced. The answer a statistician would give to this question is "logistic regression *is not* a linear model." A statistician calls a model "linear" if the mean of the response is a linear function of the parameters, and this is clearly violated for logistic regression. Logistic regression is a *generalized linear model*. Generalized linear models are, despite their name, not generally considered linear models. They have a linear component, but the model itself is nonlinear due to the nonlinearity introduced by the link function.
I think this group has a significant number of statisticians. Hence I wanted to ask you: do you consider GLMs to be linear? Do you agree with the quoted text above?
I would like to double check that GLMs are a type of regression analysis please.
Thanks!
A general linear model doesn't generalize the function of $X$.
Indeed, assuming you mean modelling $E(Y|X)$ (with independent errors), whether or not there's a transformed predictor doesn't change things -- either way it would still be called a linear model (the conditional mean is a linear function of the parameters).
That is to say, consider $\alpha+\beta \psi(X)$. Now let $X^* = \psi(X)$. Then in terms of this new variable (the one used in the estimation) we have $\alpha+\beta X^*$. So either a linear model or a general linear model will be able to incorporate a transformation, $\psi$, (of the independent variable or variables) without difficulty.
Instead, with a multivariate response (each observation is a vector of values), a general linear model generalizes the covariance structure of the error term, so that the model allows for correlated errors within the observation vector (that is, the components of $\mathbf{y}_i$ may be correlated).
This feature allows us to place under one banner t-tests, ANOVA, regression, MANOVA, MANCOVA, multivariate regression and a number of other models/tools (while the multivariate techniques wouldn't necessarily be seen as covered by the term 'linear model', though the usage does vary).
[Between observations there is still independence; if you want instead to generalize to correlated errors between-observations, that would be generalized least squares as fcop pointed out in comments.]
In a linear model, we define the prediction (regression) function using a linear structure as follows: $y\approx E(y|x)=\omega_0 + \omega^\top x.$
In a generalized linear model, by contrast, we define the prediction function as either linear in the parameters or nonlinear in the parameters through a linear argument ($\omega^\top x +\omega_0$).
That is, the hypothesis function for a generalized linear model is $h(x)=g(\omega^\top x +\omega_0)$, where $g$ may be a linear or nonlinear function (known in machine learning as an activation function). When estimating the hypothesis function, we focus on estimating the parameters $\omega$ and $\omega_0$ only, as $g$ is fixed in advance.
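A tiny numerical sketch of this hypothesis function with a logistic $g$, the choice used by logistic regression (the weights and input here are illustrative assumptions):

```python
# h(x) = g(w.x + w0) with a logistic g: the linear predictor eta is linear
# in the parameters, but h is nonlinear overall because g is nonlinear.
import numpy as np

def g(z):
    """Logistic (sigmoid) function: maps the linear predictor into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

w0, w = -1.0, np.array([2.0, -0.5])   # assumed parameters
x = np.array([1.0, 2.0])              # assumed input

eta = w @ x + w0      # linear predictor: 2*1 - 0.5*2 - 1 = 0
h = g(eta)            # hypothesis value: g(0) = 0.5
print(eta, h)
```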