Why not approach classification through regression?

"...approach classification problem through regression..." By "regression" I will assume you mean linear regression, and I will compare this approach to the "classification" approach of fitting a logistic regression model.

Before we do this, it is important to clarify the distinction between regression and classification models. Regression models predict a continuous variable, such as rainfall amount or sunlight intensity. They can also predict probabilities, such as the probability that an image contains a cat. A probability-predicting regression model can be used as part of a classifier by imposing a decision rule - for example, if the probability is 50% or more, decide it's a cat.

Logistic regression predicts probabilities, and is therefore a regression algorithm. However, it is commonly described as a classification method in the machine learning literature, because it can be (and is often) used to make classifiers. There are also "true" classification algorithms, such as SVM, which only predict an outcome and do not provide a probability. We won't discuss this kind of algorithm here.

Linear vs. Logistic Regression on Classification Problems

As Andrew Ng explains it, with linear regression you fit a polynomial through the data. In the example below, say, we're fitting a straight line through a {tumor size, tumor type} sample set:

Above, malignant tumors get $1$ and non-malignant ones get $0$, and the green line is our hypothesis $h(x)$. To make predictions we may say that for any given tumor size $x$, if $h(x)$ gets bigger than $0.5$ we predict malignant tumor, otherwise we predict benign.

Looks like this way we could correctly predict every single training set sample, but now let's change the task a bit.

Intuitively it's clear that all tumors larger than a certain threshold are malignant. So let's add another sample with a huge tumor size, and run linear regression again:

Now our $h(x) > 0.5 \rightarrow \text{malignant}$ rule doesn't work anymore. To keep making correct predictions we need to change it to $h(x) > 0.2$ or something - but that's not how the algorithm should work.

We cannot change the hypothesis each time a new sample arrives. Instead, we should learn it off the training set data, and then (using the hypothesis we've learned) make correct predictions for the data we haven't seen before.
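The effect described above can be reproduced with a few lines of plain Python. This is only a sketch: the tumor sizes and labels are invented for illustration (they are not the data behind the original plots), and `fit_line`, `predict` and `crossing` are hypothetical helper names. It fits a least-squares line, applies the $h(x) > 0.5$ rule, then repeats the fit after adding one very large malignant tumor.

```python
def fit_line(xs, ys):
    """Closed-form ordinary least squares for h(x) = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def predict(a, b, x, threshold=0.5):
    """Decision rule from the text: malignant (1) if h(x) > threshold."""
    return 1 if a + b * x > threshold else 0


def crossing(a, b, threshold=0.5):
    """Tumor size at which the fitted line reaches the threshold."""
    return (threshold - a) / b


# Invented training set: sizes 1..4 are benign (0), sizes 5..8 malignant (1).
sizes = [1, 2, 3, 4, 5, 6, 7, 8]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

a, b = fit_line(sizes, labels)
preds_before = [predict(a, b, x) for x in sizes]
cut_before = crossing(a, b)

# Add one huge malignant tumor and refit the line.
sizes2 = sizes + [40]
labels2 = labels + [1]

a2, b2 = fit_line(sizes2, labels2)
preds_after = [predict(a2, b2, x) for x in sizes2]
cut_after = crossing(a2, b2)

print("cut-off size before:", round(cut_before, 2), "predictions:", preds_before)
print("cut-off size after: ", round(cut_after, 2), "predictions:", preds_after)
```

With this toy data the extra point flattens the line, so the size at which $h(x)$ crosses $0.5$ moves to the right and the smallest malignant tumor falls on the wrong side of the rule, even though nothing about that tumor changed.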

Hope this explains why linear regression is not the best fit for classification problems! Also, you might want to watch VI. Logistic Regression. Classification video on ml-class.org which explains the idea in more detail.

EDIT

probabilityislogic asked what a good classifier would do. In this particular example you would probably use logistic regression which might learn a hypothesis like this (I'm just making this up):

Note that both linear regression and logistic regression give you a straight line (or a higher order polynomial) but those lines have different meanings:

$h(x)$ for linear regression interpolates, or extrapolates, the output and predicts the value for $x$ we haven't seen. It's simply like plugging a new $x$ and getting a raw number, and is more suitable for tasks like predicting, say car price based on {car size, car age} etc.
$h(x)$ for logistic regression tells you the probability that $x$ belongs to the "positive" class. This is why it is called a regression algorithm - it estimates a continuous quantity, the probability. However, if you set a threshold on the probability, such as $h(x) > 0.5$ , you obtain a classifier, and in many cases this is what is done with the output from a logistic regression model. This is equivalent to putting a line on the plot: all points sitting above the classifier line belong to one class while the points below belong to the other class.
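To make the second reading concrete, here is a small sketch of a logistic hypothesis used as a classifier. In the same spirit as the made-up curve above, the parameters `INTERCEPT` and `SLOPE` are picked by hand rather than fitted, and the tumor sizes are invented; the point is only to show how a linear score is turned into a probability and how a threshold on that probability gives a class.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Hypothetical, hand-picked parameters (not learned from data).
INTERCEPT = -9.0
SLOPE = 2.0


def h(x):
    """Probability that a tumor of size x belongs to the positive class."""
    return sigmoid(INTERCEPT + SLOPE * x)


def classify(x, threshold=0.5):
    """Turn the probability into a class label with a decision threshold."""
    return 1 if h(x) >= threshold else 0


# Invented sizes, including one very large malignant tumor.
sizes = [1, 2, 3, 4, 5, 6, 7, 8, 40]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 1]

probs = [h(x) for x in sizes]
preds = [classify(x) for x in sizes]

# h(x) = 0.5 exactly where the linear score is zero.
boundary = -INTERCEPT / SLOPE

for x, p, c in zip(sizes, probs, preds):
    print(f"size={x:>2}  h(x)={p:.3f}  class={c}")
print("decision boundary at size", boundary)
```

Because the sigmoid saturates, the very large tumor simply receives a probability close to 1 under these parameters; its score does not have to lie on a straight line through the 0/1 labels, which is the constraint that caused trouble for linear regression.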

So, the bottom line is that in a classification scenario we use completely different reasoning and a completely different algorithm than in a regression scenario.

Answer from andreister on Stack Exchange

Towards Data Science

Regression for Classification | Hands on Experience | Towards Data Science

January 23, 2025 - Fundamentally, classification is about predicting a label and regression is about predicting a quantity. Why can't linear regression be used for classification? The main reason is that the predicted values are continuous, not probabilistic.

Turing

Scikit-Learn Cheatsheet: Methods For Classification and Regression

Common examples of regression tasks include stock market price prediction, estimation of regional sales for various products in a factory, demand prediction for a particular item based on past sales records, and so on. Classification is where we train a model to classify data into well-defined categories, based on previous data labels.

Discussions

Regression to Solve Classification problem: Good or Rubbish?

Have you tried logistic regression? More on reddit.com

r/datascience

October 26, 2022

Why doesn’t linear regression work on classification problems?

So think of it this way: I ask you if something is a dog or cat. This is a classification problem. Linear regression returns a continuous value. So 0, 0.5, 0.2. I ask you to tell me if something is a dog or cat. You reply 0.2. This makes no sense. This is what you're doing when you apply linear regression to classification. What you can do is interpret these values as probabilities. And then designate 0 as absolute cat and 1 as absolute dog. Then anything greater than 0.5 is a dog, anything less is a cat. To do this we apply a function to 'squash' all continuous outputs of the linear regression to simply be between 0 and 1. We do this with a logistic function. You now have logistic regression. More on reddit.com

r/learnmachinelearning

November 27, 2023
