What is logistic in the logistic regression?

Suppose that you are interviewing for a data scientist role. You are asked about logistic regression, and you answer all sorts of questions: how to run it in Python, how to perform feature selection, and how to use the model for prediction. For the last question you answer that if you have the estimates of the regression coefficients and the values of the features, then you perform the necessary multiplications and additions, and the result is L = log(p/(1-p)), where p is the probability of the event to be predicted. This transformation is known as the logit transformation. From it you can calculate p as exp(L)/(1+exp(L)). Then comes the critical question: why is that?
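As a minimal sketch of that prediction step, here is what it looks like with NumPy; the coefficients and feature values are made up purely for illustration:

    import numpy as np

    # Hypothetical fitted coefficients: intercept and two features
    beta = np.array([-1.5, 0.8, 0.3])

    # One observation, with a leading 1 for the intercept
    x = np.array([1.0, 2.0, -0.5])

    # Linear predictor: L = log(p / (1 - p)), the log-odds
    L = np.dot(beta, x)

    # Invert the logit transformation to get the predicted probability
    p = np.exp(L) / (1 + np.exp(L))
    print(L, p)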

One possible answer is that since p is between 0 and 1, L can be any real number, from -∞ to ∞, and therefore the logistic regression is transformed into “regular” linear regression. However, this answer is wrong. In the training data, the values of the event/label to be predicted or classified are either 0 or 1. You cannot apply the logit transformation to zeros and ones. And even if you could, the linear regression assumptions would not hold.

A more sophisticated answer is to say that the logit transformation is the link function of choice. This choice has a nice property: if β is the coefficient of a feature X, then exp(β) is a (biased) estimate of the corresponding odds ratio. This is useful if you want to identify risk factors, e.g. for a disease like cancer. But if you are interested in predictions, you may not care about the odds ratio (although you should).
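As a quick illustration of that property, here is a sketch using statsmodels on simulated data; the coefficient values and sample size are made up for the example:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 5000

    # Simulated binary feature and outcome with a known log-odds-ratio of 0.7
    x = rng.integers(0, 2, size=n)
    logit_p = -1.0 + 0.7 * x
    p = np.exp(logit_p) / (1 + np.exp(logit_p))
    y = rng.binomial(1, p)

    # Fit a logistic regression and exponentiate the coefficients
    X = sm.add_constant(x)
    fit = sm.Logit(y, X).fit(disp=0)
    print(np.exp(fit.params))  # exp(beta) estimates the odds ratio; exp(0.7) is about 2.01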

Let’s assume that we know how to explain what a link function is. The question still remains: why not choose another link function? Almost any inverse of a distribution function will do the trick. Why not choose the inverse of the normal distribution function as the link function?
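To see why almost any inverse distribution function is a candidate, note that an inverse CDF maps probabilities in (0, 1) onto the whole real line. A small sketch with SciPy, comparing the logit (the inverse of the logistic CDF) with the probit (the inverse of the normal CDF):

    import numpy as np
    from scipy.stats import logistic, norm

    p = np.array([0.05, 0.25, 0.5, 0.75, 0.95])

    # logit: inverse of the standard logistic CDF, identical to log(p / (1 - p))
    print(logistic.ppf(p))
    print(np.log(p / (1 - p)))

    # probit: inverse of the standard normal CDF, an equally valid link
    print(norm.ppf(p))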

Moreover, history does not support this answer. The logit transformation and the logistic regression model came first. The model was developed in the 1960s by Sir David Cox, the same David Cox who later introduced the proportional hazards model (see https://papers.tinbergen.nl/02119.pdf). The extension of the model to generalized linear models with various link functions came later.

So the question remains: what is logistic in the logistic regression?

The key is in the statistical model of the logistic regression, or any other binary regression. Let’s review the model.

Suppose you have data on a response/label/outcome Y that takes values of zeros and ones. Let’s assume, for the sake of simplicity, that you have only one feature/predictor X, which can be a variable of any type.

The key assumption of the model is that there exists a continuous/latent/unobservable variable Y* that relates somehow to the observed values of Y. Note that Y* is not a part of your data. It is a part of your model.

The next assumption is about the relationship between Y and Y*. You assume that Y equals 1 if Y* is above some threshold, and otherwise Y equals zero. Furthermore, assume, without loss of generality, that this threshold is zero.

This modelling assumption is not new, and most readers are familiar with the approach. This is how the perceptron, the building block of neural networks, works. The idea itself is much older: Karl Pearson used similar modelling when he attempted to develop a correlation coefficient for categorical data, back in the 1910s.

Assuming you know the values of Y*, you can model the relationship between Y* and X using simple linear regression:

Y* = β0 + β1·X + ε

The third and last assumption is about the distribution of the error term ε (epsilon). As I said before, you can choose any distribution you like. If you assume, for example, that ε is normally distributed, you will get something that is called probit regression. But if you assume that ε follows the logistic distribution, that is,

P(ε ≤ x) = exp(x)/(1 + exp(x)),

then with these three assumptions and some basic probability and algebra you get logistic regression — a regression with a logit link function.

For simplicity I will assume that X is a discrete variable. One can do the whole trick for a continuous X by using density functions instead.

Let

p(x) = P(Y = 1 | X = x)

be the conditional probability that Y equals 1 given that X is equal to some value x. Since Y is binary, this probability determines the whole conditional distribution of Y.

Since Y = 1 if and only if Y* > 0, we get that

p(x) = P(Y = 1 | X = x) = P(Y* > 0 | X = x).

Substituting the linear regression model for Y*, we get that

P(Y* > 0 | X = x) = P(β0 + β1·x + ε > 0) = P(ε > -(β0 + β1·x)).

Using the third assumption, which states that the distribution of ε is logistic (and therefore symmetric around zero), we get that

P(ε > -(β0 + β1·x)) = P(ε ≤ β0 + β1·x) = exp(β0 + β1·x)/(1 + exp(β0 + β1·x)),

so that p(x) = exp(β0 + β1·x)/(1 + exp(β0 + β1·x)),

and therefore

log(p(x)/(1 - p(x))) = β0 + β1·x,

that is, the logit of p(x) is a linear function of x.
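Here is a small simulation sketch of the whole argument, with made-up coefficient values: generate the latent Y* from the linear model with logistic errors, threshold it at zero to obtain Y, and check that a logistic regression fitted to the observed zeros and ones recovers the coefficients:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(42)
    n = 20000
    beta0, beta1 = -0.5, 1.2            # hypothetical true coefficients

    x = rng.normal(size=n)
    eps = rng.logistic(size=n)          # third assumption: logistic errors
    y_star = beta0 + beta1 * x + eps    # the latent linear model
    y = (y_star > 0).astype(int)        # threshold assumption: Y = 1 iff Y* > 0

    # Fitting logistic regression to the observed (x, y) should recover beta0, beta1
    fit = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
    print(fit.params)   # estimates should be close to [-0.5, 1.2]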

Good luck in your interview!

 

Some comments on AB testing implementation

Many job postings in the technology sector (mainly, but not only, for Data Scientist positions) require knowledge and/or experience in “AB testing”.

What is AB testing? A brief look at Wikipedia reveals that this is a method for assessing the impact of a change when it is carried out. For example, one may seek to know what will happen following a modification of a web page, e.g. whether adding a picture will increase the number of clicks. A and B are the situation before the change and the situation after it, respectively. According to Wikipedia, Google started to implement AB testing in 2000, and the approach began to spread in the technology world around 2011. Wikipedia also rightly notes that AB testing is actually an implementation of the experimental design that William Sealy Gosset developed in 1908.
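In its simplest form the comparison boils down to a test of two proportions. A minimal sketch with statsmodels, using hypothetical click counts for the two versions of the page:

    from statsmodels.stats.proportion import proportions_ztest

    # Hypothetical results: clicks out of visitors for version A and version B
    clicks = [310, 370]
    visitors = [5000, 5000]

    # Two-sided z-test for the difference between the two click-through rates
    z_stat, p_value = proportions_ztest(count=clicks, nobs=visitors)
    print(z_stat, p_value)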

Although this is a significant methodological advancement for the digital technology world, I think it is a naive approach, especially in light of the many advances that have occurred in this field since 1908 (of course, in your company you do it better). The main problem of this methodology is that it is usually implemented one factor at a time, thus ignoring possible interactions between variables. Ronald Fisher had already identified this problem in the 1920s and proposed initial solutions (such as two-way ANOVA). Of course, more advanced solutions were proposed by his successors.
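A sketch of a two-factor analysis in the spirit of Fisher's solution, assuming a hypothetical experiment that varies both a picture and a headline on the page; the data and effect sizes are simulated purely for illustration:

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(1)
    n = 2000

    # Hypothetical 2x2 factorial design: picture (no/yes) x headline (old/new)
    picture = rng.integers(0, 2, size=n)
    headline = rng.integers(0, 2, size=n)

    # Simulated outcome with main effects and an interaction between the factors
    clicks = 10 + 2 * picture + 1 * headline + 3 * picture * headline + rng.normal(0, 4, size=n)
    df = pd.DataFrame({"picture": picture, "headline": headline, "clicks": clicks})

    # Two-way ANOVA with an interaction term; testing one factor at a time would miss it
    model = smf.ols("clicks ~ C(picture) * C(headline)", data=df).fit()
    print(sm.stats.anova_lm(model, typ=2))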

Other problems can arise in planning the experiment itself: How is the sample size determined? How is an unbiased (i.e. representative) sample chosen? How should the results be analyzed, if at all, and what is the appropriate statistical analysis method?

More things to consider: Is there any awareness of the possible errors and of the probabilities with which they occur? And if there is such awareness, what is done to control the magnitude of these probabilities?
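The sample-size question above and the error probabilities here are two sides of the same calculation: a standard power analysis fixes the probability of a Type I error (alpha) and of a Type II error (via the power) and returns the required sample size. A sketch with statsmodels, using made-up click-through rates:

    from statsmodels.stats.power import NormalIndPower
    from statsmodels.stats.proportion import proportion_effectsize

    # Hypothetical baseline rate of 10% and a hoped-for improvement to 12%
    effect = proportion_effectsize(0.12, 0.10)

    # Sample size per group for alpha = 0.05 (Type I error) and power = 0.8
    # (i.e. a Type II error probability of 0.2)
    n_per_group = NormalIndPower().solve_power(effect_size=effect,
                                               alpha=0.05,
                                               power=0.8,
                                               ratio=1.0)
    print(round(n_per_group))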

And finally: is there a distinction between a statistically significant effect and a meaningful effect?
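The distinction matters because with enough traffic almost any difference becomes statistically significant. A small illustration with made-up numbers: a lift of one tenth of a percentage point, measured on a million visitors per version, clears the usual 5% significance threshold even though it may be of no practical importance:

    from statsmodels.stats.proportion import proportions_ztest

    # Hypothetical counts: 10.0% vs 10.1% click-through on a million visitors each
    clicks = [100000, 101000]
    visitors = [1000000, 1000000]

    z_stat, p_value = proportions_ztest(count=clicks, nobs=visitors)
    print(p_value)  # below 0.05, yet a 0.1 point difference may be negligible in practice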

I recently visited a large and successful technology company, where I was presented with several tables of “data analysis”. I recognized many of the failures I have just mentioned: there was no sample size rationale, interactions were not considered, faulty analysis methods were applied, no one cared about the error probabilities, and every effect was considered meaningful.

In another company, two product managers performed two “independent” tests (that is, an AB test and a CD test) using the same population as a sample. The population consisted of new customers, and therefore it was not a representative sample of the company’s customers. Worse, the possible interaction between the AB and CD tests was not considered, as the parallel existence of the two experiments was discovered only after the fact, when their results had already been implemented. In order to avoid such a mess in the future, I suggested that someone coordinate all the experiments. The response was that this is not possible because of the organization’s competitive culture.

You may say: “What do you want? The fact is that they are doing well.” But the truth is that they have succeeded despite the problems with their methodology, not thanks to it, especially since the core of their algorithms is based on probability and statistics.

Oren Tzur put it nicely on Twitter:

“I think the argument is that it’s cheap and immediate and you see results even if there is no ‘good model’ and even if mistakes cannot be fixed or even detected. The approach is: why should I invest in it? Sometimes it works.”

Rafael Cohen also wrote to me on Twitter:

“When I come to a certain field, I assume that the expert knows something and that my analysis should help him … I brought a designer to the site; I will not run an AB test on every pixel … even if I have thousands of users a day, I still do not want to waste them on a bad configuration … a naïve statistical analysis would require 80,000 observations for each experiment … it is likely that someone who uses fewer observations because of a gut feeling gets reasonable results and creates enough revenue for his company …”

This is mediocrity. Why think and plan, asks Cohen, if you can use a naïve approach and get something that sometimes works? Who cares that you could have done better?

A few years ago, I gave a talk on the future of statistics in the industry at the ISA annual meeting. I will repeat the main points here.

Sam Wilks, a former president of the American Statistical Association, paraphrased H.G. Wells, the renowned science fiction writer: “Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write.”

As far as the pharmaceutical industry is concerned, the future predicted by Wells and Wilks is already here. Statistics is central to all of the industry’s research, development, and manufacturing processes. No one dares to embark on a clinical trial without close statistical support, and in recent years there has been a growing demand for statistical support at even earlier stages of drug development, as well as in production processes.

I hope that awareness of the added value that statistics brings will seep into the technology industry. As the use of statistics increases, so does the need for statistical thinking on the part of the participants in the process, preferably with the help of someone who “knows much more statistics than the average programmer”, as Oren Tzur phrased it.