Cancer clusters and the Poisson distributions

On March 1, 2019, an article was published in Israel’s Ynetnews website, under the title “The curious case of the concentration of cancer”. The story reports on a concentration of cancer cases in the town of Rosh Ha’ayin in central Israel.

In the past few years dozens of cases of cancer have been discovered in the center of Rosh Ha’ayin. About 40 people have already died of the disease. Residents are sure that the cause of the disease is cellular antennas on the roof of a building belonging to the municipality. “Years we cry and no one listens”, They say, “People die one after the other.”

I do not underestimate the pain of the residents of Rosh Ha’ayin. I also do not intend to discuss the numbers mentioned in the article. I accept them as they are. I just want to relate only to the claim that the cause of the disease is cellular antennas. It is easy (at least for me) to explain why this claim is at least questionable: there are many more cellular antennas in many places, and around them there is no high rate of morbidity in cancer. If the antennas are carcinogenic, then they need cancer everywhere, not just in Rosh Ha’ayin. On the other hand, I can’t blame them for blaming the antennas. People try to rationalize what they see and find a cause.

I also must emphasize that the case of Rosh Ha’ayin must not be neglected. If it is not because the cellular antennas, it is possible that there is some other risk factor, and the state authorities must investigate.

So why does Rosh Ha’ayin have such a large cluster of cancer morbidity? One possible answer is that there is an environmental factor (other than the antennas) that does not exist elsewhere. Another possible answer is that there may be a non-environmental factor that does not exist elsewhere, maybe a genetic factor (most of the town’s residents are immigrants from Yemen and their descendants). A third and particularly sad possibility is that the local residents suffer from really bad luck. Statistics rules can be cruel.

Clusters happen: if there are no local (environmental or other) risk factors that cause cancer (or any other disease), and the disease spreads randomly across the whole country, then clusters are formed. If the country is divided into units of equal size, then the number of cases in a given area unit follows a Poisson distribution. Then there is a small, but not negligible possibility that one of these units will contain a large number of cases. The problem is that there is no way of knowing in advance where it will happen.

The opposite is also true: If the distribution of the number of cases in a given area unit is a Poisson distribution, then it can be concluded that the dispersion on the surface is random.

I will demonstrate the phenomenon using simulation.

Consider a hypothetical country that has a perfect square shape, and its size is 100 x 100 kilometers. I randomized 400 cases of morbidity across the country:

# generate n random points on 0:100 x 0:100
> set.seed(21534)
> n=400
> x=runif(n)*100
> y=runif(n)*100
> dat=data.frame(x,y)
> head(dat)
x y
1 15.73088 8.480265
2 12.77018 78.652808
3 45.50406 31.316797
4 86.46181 6.669138
5 27.25488 48.164316
6 17.42388 98.429575

I plotted the 400 cases.

> # plot the points
> plot(dat$x, dat$y, ylim=c(0,100), xlim=c(0,100), 
+ asp=1, frame.plot=FALSE, axes=FALSE,
+ xlab=' ', ylab=' ', col='aquamarine', pch=16,
+ las=1, xaxs="i", yaxs="i",
+ )
> axis(1, at=0:10*10)
> axis(2, at=0:10*10, las=1, pos=0)
> axis(3, at=0:10*10, labels=FALSE)
> axis(4, at=0:10*10, labels=FALSE, pos=100)

Here’s the map I got.

Next, I divided the map into 100 squares, each 10 x 10 kilometers:

> #draw gridlines
> for (j in 1:9){
+ lines(c(0,100), j*c(10,10))
+ lines(j*c(10,10), c(0,100))
+ }

In order to count the cases in each square, I assigned row and column numbers for each of the squares, and recorded the position (row and column) for every case/dot:

> # row and column numbers and cases positions
> dat$row=0
> dat$col=0
> dat$pos=0
> for (j in 1:nrow(dat)){
+ dat$row=ceiling(dat$y/10)
+ dat$col=ceiling(dat$x/10)
+ dat$pos=10*(dat$row-1)+dat$col
+ }

Now I can count the number of points/cases in each square:

> # calculate number of points for each position
> # ppp=points per position
> dat$count=1
> ppp=aggregate(count~pos, dat, sum)
> dat=dat[,-6]

But of course, it is possible that there are squares with zero cases; (actually the data frame ppp has only 97 rows). Let’s identify them:

> # add positions with zero counts, if any
> npp=nrow(ppp)
> if(npp<100){
+ w=which(!(1:100 %in% ppp$pos))
+ addrows=(npp+1):(npp+length(w))
+ ppp[addrows,1]=w
+ ppp[addrows,2]=0
+ ppp=ppp[order(ppp$pos),]
+ }

And now we can get the distribution of number of cases in each of the 100 squares:

> # distribution of number of points/cases in each position
> tb=table(ppp$count)
> print(tb)
0  1  2  3  4  5  6  7  8  9 11 
3  9 12 21 15 17 13  5  1  3  1 

We see that there is one very unlucky cluster with 11 cases, and there also 3 squares with 9 cases each. Let’s paint them on the map:

> # identify largest cluster
> mx=max(ppp$count)
> loc=which(ppp$count==11)
> clusters=dat[dat$pos %in% loc,]
> points(clusters$x, clusters$y, col='red', pch=16)
> # identify second lasrgest cluster/s
> loc=which(ppp$count==9)
> clusters=dat[dat$pos %in% loc,]
> points(clusters$x, clusters$y, col='blue', pch=16)

Let’s also mark the squares with zero points/cases. In order to do this, we first need to identify the row and column locations of these squares:

> # identify sqaures without cases
> # find row and column locations
> loc=which(ppp$count==0)
> zeroes=data.frame(loc)
> zeroes$row=ceiling(zeroes$loc/10)
> zeroes$col=zeroes$loc %% 10
> w=which(zeroes$col==0)
> if(length(w)>0){
+ zeroes$col[w]=10
+ }
> print(zeroes)
loc row col
1 8 1 8
2 31 4 1
3 99 10 9

So there is one empty square in the 8th column of the first row, one in the first column of the 4th row, and one in the 9th column of the 10th row. Let’s mark them. To do that, we need to know the coordinates of each of the four vertices of these squares:

# mark squares with zero cases
> for (j in 1:nrow(zeroes)){
+ h1=(zeroes$col[j]-1)*10
+ h2=h1+10
+ v1=(zeroes$row[j]-1)*10
+ v2=v1+10
+ lines(c(h1,h2), c(v1,v1), lwd=3, col='purple')
+ lines(c(h1,h2), c(v2,v2), lwd=3, col='purple')
+ lines(c(h1,h1), c(v1,v2), lwd=3, col='purple')
+ lines(c(h2,h2), c(v1,v2), lwd=3, col='purple')
+ }

Do you see any pattern?

How well does the data fit the Poisson distribution? We can perform a goodness of fit test.
Let’s do the log-likelihood chi-square test (also known as the G-test):

> # log likelihood chi square to test the goodness of fit 
> # of the poisson distribution to the data
> # the obserevd data
> observed=as.numeric(tb)
> values=as.numeric(names(tb))
> # estimate the poisson distribution parameter lambda
> # it is the mean number of cases per square
> lambda=nrow(dat)/100
> print(lambda)
[1] 4
> # calculate the expected values according to 
> # a poisson distribution with mean lambda
> expected=100*dpois(values, lambda)
> # view the data for the chi-square test
> poisson_data=data.frame(values, observed, expected)
> print(poisson_data)
values observed expected
1 0 3 1.8315639
2 1 9 7.3262556
3 2 12 14.6525111
4 3 21 19.5366815
5 4 15 19.5366815
6 5 17 15.6293452
7 6 13 10.4195635
8 7 5 5.9540363
9 8 1 2.9770181
10 9 3 1.3231192
11 11 1 0.1924537
> # calculate the degrees of freedom
> df=max(values)
> print(df)
[1] 11
> # calculate the test statistic and p-value
> g2=sum(observed*log(observed/expected))
> pvalue=1-pchisq(g2,df)
> log_likelihood_chi_squrae_test=data.frame(g2, df, pvalue)
> print(log_likelihood_chi_squrae_test)
g2 df pvalue
1 4.934042 11 0.9343187

We cannot reject the hypothesis that the data follows the Poison distribution. This does not imply, of course, that the data follows the Poisson distribution, but we can say that the Poisson model fits the data well.

The delta method and its implementation in R

Suppose that you have a sample of a variable of interest, e.g. the heights of men in certain population, and for some obscured reason you are interest not in the mean height μ but in its square μ². How would you inference on μ², e.g. test a hypothesis or calculate a confidnce interval? The delta method is the trick you need.

The delta method is mathematical assertion that can yield estimates for the varinance of functons of statistics under mild condition. Loosley speaking, let bₙ is an estimate of β, where n is the sample size, and the distribution of bₙ is approximately normal with mean β and variance σ²/n.

Then, if g is a function, then g(bₙ) is approximately normal with mean g(β) and and variance [g’(β)² ]⋅σ²/n, provided that the sample size is large. Don’t let the calculus scare you. R’s deltmethod function will handle the calculus for you.

If bₙ is a Maximum Likelihood Estimate (MLE) of β then under mild conditions it is approximately normal for a large sample size, and g(bₙ) and g’(bₙ) are MLEs for g(β) and g’(β) respectively. Add on top of this a MLE for σ, and you can implement statistical inference.

I will demonstrate the use of the delta method using the Titanic survival data. For the sake of the example I will use only 3 variables: Survived is a 0/1 variable indicating whther a passenger survived the sinking of the Titanic, Age is obviously the passenger’s age, and Parch is the passenger’s number of parents/children aboard. Lets take a look at some of the data:

> # install.packages("titanic")
> library(titanic)
> titanic=titanic_train[, c(1, 2, 6,8)]
> head(titanic)
  PassengerId Survived Age Parch
1           1        0  22     0
2           2        1  38     0
3           3        1  26     0
4           4        1  35     0
5           5        0  35     0
6           6        0  NA     0

Let π be the probability of surviving. Then the odds of survival is π/(1-π). The sample size is n=891 is considered large, so we can apply the Central Limit Theorem to conclude that p is approximately normal with mean π and variance π/(1-π)/n. Then π can be estimated by p=0.384, the odds are estimated by p/(1-p)=0.623, and the variance of p is estimated by 0.000265:

> p_survival=mean(titanic$Survived)
> print(p_survival)
[1] 0.3838384
> survival_odds=p_survival/(1-p_survival)
> print(survival_odds)
[1] 0.6229508
> n=nrow(titanic)
> var_p_survival=(p_survival*(1-p_survival))/n
> print(var_p_survival)
[1] 0.0002654394

Let g(x)=p/(1-x). Then g`(x)=1/(1-x)², (you can check this at so the variane of the odds can be estimated by

Var(p/(1-p))=[g’(p)² ]⋅σ²/n=[1/(1-p)²]² ⋅ [p(1-p)/n].

This can be implemented in the following R code:

> dev_g_odds=1/(1-survival_odds)^2
> var_odds=(dev_g_odds^2)*var_p_survival
> print(var_odds)
[1] 0.01313328
> se_odds=sqrt(var_odds)
> print(se_odds)
[1] 0.1146005

But of course, instead of doing all the calculus, you can use the deltamethod function of R’s msm package.

The function has three parameters:

  • g is a formula object representating of the transformation g(x). The formula variables must be labeled x1, x2 and so on.
  • mean is the estimate of g(β)
  • cov is the estimate of var(β) which is σ²/n

The use of the delta function provides the same result:

> # install.packages("msm")
> library(msm)
> se_odds_delta=deltamethod(g=~x1/(1-x1), mean=survival_odds, cov=var_p_survival)
> print(se_odds_delta)
[1] 0.1146005

The second example considers logistic regression. We will model the (logit of the) probability of survival using Age nd Parch. Using R’s glm function we can obtain estimates to the logistic regression coefficients band their standard errors se_b:

> model=glm(Survived ~ Age + Parch, family=binomial(link="logit")
> model_beta=data.frame(summary(model)$coefficients[2:3, 1:2])
> names(model_beta)=c("b", "se_b")
> print(model_beta)
                 b        se_b
Age   -0.008792681 0.005421838
Parch  0.191531303 0.090643635

Since exp(β) is usually interpreted as the odd ratio (OR) of the response variable with regard to the regression independent variable, reseachers are interested in inference on exp(β). In order to perform such inference one nees to estimate the standard error of exp(β). Since b is a Maximum Likelihood Estimate for β, it is approximately normal and its variance is estimated by se_b², and the delta method can be applied.

The calculus in this case is easy: g(x)=exp(x), so g’(x)=exp(x). Therefore, the standard error of exp(β) can be estimated by exp(b)⋅ se_b:

> model_beta$OR=exp(model_beta$b)
> model_beta$se_OR=exp(model_beta$b)*model_beta$se_b
> print(model_beta)
                 b        se_b        OR       se_OR
Age   -0.008792681 0.005421838 0.9912459 0.005374374
Parch  0.191531303 0.090643635 1.2111027 0.109778755

To use the deltamethod function, we will first use the vcov function to obtain the variance-covariance matrix of the resression coefficient estimates, and the variances will be the inputs of the deltamethod function:

> vc=vcov(model)[2:3, 2:3]
> print(vc)
               Age        Parch
Age   2.939632e-05 9.236876e-05
Parch 9.236876e-05 8.216269e-03

> model_beta$se_OR_delta[1]=deltamethod(~exp(x1), mean=model_beta$b[1], cov=vc[1,1])

> model_beta$se_OR_delta[2]=deltamethod(~exp(x1), mean=model_beta$b[2], cov=vc[2,2])

> print(model_beta)
                 b        se_b        OR       se_OR se_OR_delta
Age   -0.008792681 0.005421838 0.9912459 0.005374374 0.005374374
Parch  0.191531303 0.090643635 1.2111027 0.109778755 0.109778755

Of course, we get the same results.


Powerball demystified

The US Powerball lottery hysteria took another step when no one won the big jackpot in the last draw that took place on October 20, 2018. So, the total jackpot is now 2.22 billion dollars. I am sure that you want to win this jackpot. I myself want to win it.

Actually, there are two different lotteries: The Mega Million lottery prize is about 1.6 billion dollars, and the probability of winning when playing a single ticket is about 1 in 302 million. The Powerball lottery jackpot is “only” 620 million dollars but the probability of winning is slightly better: about 1 in 292 million.

The probability of winning both jackpots is therefore the multiplication of the two probabilities stated above, which 1 about 1 in 88000000000000000.

Let’s not be greedy, and aim just to 1.6 billion jackpot, although its probability of winning is slightly worse.

First, it should be noted that although the probability of winning is small, it is still positive. So if you buy a ticket you get a chance. If you do not buy a ticket you will not win, period.

Second, is buying a ticket a good investment? It looks like it is. The price of a ticket is two dollars. On the average, you will win the jackpot with probability of 1 to 302 million, and lose your two dollars dollar with probability of nearly 1. Therefore, your average return is about the jackpot multiplied by the probability of winning it minus the price of the ticket. Since the probability of winning is 1 in 302 million and the jackpot is 1600 million, then the expected return is 1600/302–2 , which is positive — about 3.30 dollars. Therefore, you should play. Or shouldn’t you?

The above figure — expected value of 3.30 dollars is an expectation of money. It is not money. You are not going to gain this expected sum of money when you play the lottery once. You either win the jackpot or lose your money. Of course, if you get a chance to participate in such a lottery with such a jackpot as many times as you wish, you should play, and the law of large numbers will be in your favor. This is not going to happen, of course. You only get to play this game once.

The next interesting question is what is the probability that someone will win?

Assume that you roll a die. The probability of rolling 6 is 1 to six. If two people roll a die, then the probability of at least one of them rolling six is about 1 in 3.3. If 3 people roll a die then the probability of at least one of them rolling six is even better: 1 in 137, and so on. The lottery is similar. Think of a lottery ticket as a die, only that the probability of rolling 6 is 1 to 302 million. If two people are rolling suck a dice, i.e. buying a lottery ticket, then the probability that at least one of them rolling a six is slightly better than 1 to 302 million. How many people should buy a lottery ticket to make the probability of a least one win greater than 5%? 10%? 50%? What is the probability that two or more people will share the jackpot? These probabilities depend on the amount of tickets sold. The more ticket sold, the higher the probability that someone wins. If you know the number of tickets sold, you can be approximated these probabilities using the Poisson distribution. You can also back-calculated the number of tickets need to be sold in order to set the probability that someone wins to any level you like. I’ll skip the technical details. According to my calculations, the number of tickets need to be sold to ensure that the probability of at least one winner exceeds 0.5 is about 210 million.

But wait: the price of a ticket is 2 dollars dollar, and there are only 302 million possible combinations of numbers. So, if I buy all possible tickets, it will cost me only 604 million, and I am guaranteed to win 1.6 billion. This is a net profit of nearly a billion dollars. Not bad. Can I do it?

The short answer is “Yes”. The long answer is “probably not”.

Yes. It was done before. In 1992, the jackpot of the Virginia lottery was 27 million dollars, while the probability of winning the jackpot was about 1 to 7 million, and the price of a ticket was 1 dollar. So it was possible to buy 7 million tickets for 7 million dollars to ensure a jackpot of 27 million. A consortium of 2500 small investors aimed to raise the money to buy all these tickets. However, they managed to buy only about 5 million tickets. They still managed to win, (See: The improbability Principle — David Hand, page 120)

To buy 302 million tickets is a different story. First you need to have 604 million dollars in cash. Second, you need to actually to buy all these tickets, and you have only 4 days to do all these shopping. In 4 days there are 345600 seconds, so you need to buy nearly 900 tickets per second, while you make sure that each ticket carries a different combination of numbers. The logistics may be difficult. In 1992, that consortium mange to but only 5 million tickets in about a week. You may not be able to buy all the tickets you need.

Second, when you win, you will probably want the money in cash, and not in annuity payments. So, the 1.6 billion reduces to 910 million, and the government will take 25% tax (and more later). You will end up with 684 million. Still a net profit of 80 million dollars. Not bad.

Third, it is possible that you will share the jackpot with someone else, maybe even more than one person. As we already saw, there is a good chance for this scenario — if 210 million tickets are sold then the probability for sharing the jackpot is about 50%. Even if you share the jackpot with just another person, your share is just 800 million, and if you want to cash it it will shrink to 342 million, a net loss of 262 million. That’s not good.

And finally: should you buy a ticket? Yes, buy one ticket. It’s fun. And for two dollars you buy the right to hope winning the jackpot, as Prof Robert Aumann, the Nobel Prize winner said.

A brief history of clinical trials

The earliest report of a clinical trial is probably provided in the Book of Daniel. Daniel and a group of other Jewish people who stayed at the palace of the king of Babylon, did not want to eat the king’s non-Kosher food and preferred a vegetarian diet. To show that vegetarian and Kosher diet is healthier, Daniel suggested to conduct an experiment. There were two “treatment group” in this trial. One group ate the royal Babylonian food, the other kept the vegetarian diet. The health of the groups was compared after a follow-up period of 10 days. The conclusion was that the vegetarian diet is healthier.

The first modern clinical trial is James Lind’s scurvy trial, which many consider to be the starting point of modern medicine. This is the first documented controlled clinical trial (if you ignore Daniel’s trial). Lind conducted an experiment to test possible treatments for scurvy, the leading cause of death among sailors by the end of the 18th century. In a relatively brief voyage in the Mediterranean in 1749, Linde divided the 12 sailors who fell sick during the voyage to six equal groups. They were all hosted in the same place on the ship and were given the same menu, which was distinguished only by the experimental treatment given to them. The treatments were: drinking a liter of cider a day, drinking 25 drops of sulfuric acid three times a day, drinking two tablespoons of vinegar three times a day, drinking half a liter of seawater a day, an ointment made of garlic, mustard, and radish, or eating two oranges and lemon a day. The citrus patients had recovered completely, and the condition of cider patients improved slightly. The comparison between the groups allowed Lind to evaluate the efficacy of each treatment in relation to other therapeutic alternatives.

The next milestone is William Watson’s trial of treatments to reduce the risk of smallpox. Already in the 11th century it was known that anyone who had this disease and survived would not get sick again. As a result, a practice of immunization of the disease by “mild infection” of healthy people was developed. However, among the doctors there were disagreements about optimal adhesion and treatment for infection. Watson conducted a series of three clinical trials at London Children’s Hospital in 1767. His methodology was similar to that of Lind: The children participating in each trial were divided into groups, and in each group, controlled infection was performed using a bladder from an early stage of the disease. Each group was given a different adjuvant treatment that was supposed to reduce the risk of infection. Watson’s experiments had a number of innovations compared to Lind’s experiment. Watson ensured that in each treatment group there was an equal number of boys and girls to avoid possible bias in case the response to treatment was different between the genders. In addition, one group in each trial did not receive supplementary treatment but served as a control group. Most importantly, Watson was the first to report a quantitative measurement of results. The measure of success of treatment was the number of smallpox that occurred in each child participating in the trial. He also performed a basic statistical analysis and published the average number of blisters per child in each group. Watson concluded that conventional treatments to reduce risk, including mercury, various plants and laxatives, were ineffective.

The next significant milestone is the milk experiment in the Lancashire county of Scotland in the early 20th century. The purpose of the trials was to determine whether daily milk intake improves the growth of children compared to children who did not drink milk on a daily basis, and to check whether there is a difference in growth rates between children fed fresh milk and those fed in pasteurized milk. The experiment, conducted in 1930, was large-scale and included a total of about 20,000 children aged 6–12, who studied in 67 schools. About 5,000 were fed in fresh milk, about 5,000 in pasteurized milk, and approximately 10,000 children were assigned to the control group. The height and weight of the children were measured at the beginning of the experiment (February 1930) and at the end (June 1930). The conclusion was that a daily diet of milk improves the growth of children, and that there is no significant difference between fresh milk and pasteurized milk. The researchers also concluded that children’s age had no effect on the effect of growth rate.

This experiment entered my list because of the criticism leveled at it. The critics included Fisher and Bartlett, but the most comprehensive criticism was cast by William Sealy Gosset, also known as “Student”. In an article published in Biometrika, Gosset actually set rules that were necessary to ensure the validity of a clinical trial. First, he noted that in each school the children were treated with fresh milk or pasteurized milk, but the two groups were not represented in any school. As a result, it is not possible to directly compare fresh and pasteurized milk, due to differences between the different schools. He also noted that the treatments were assigned by the teachers in each class and not randomly. As a result, students in the control group were larger in their body size than students in the treatment groups. Thirdly he notes that although the measurements were conducted in February and June, the weight measurements did not consider the weights of the children cloths. Winter clothes are heavier than spring / summer clothes, and the weight difference between clothes offset the real weight differences. The researchers assumed that the difference in the weight of the clothes would be similar among the groups, but Gosset argued that the bias in the distribution of students to economically affected groups — children from affluent families were usually included in the control groups — meant that the weight of the control group’s winter clothing would be higher.

Gosset concluded that the results did not support the conclusion that there is no difference between a diet with fresh milk and a pasteurized milk diet, and claimed that it is impossible to conclude that there is no connection between age and the change in growth rate. He also mentions the analysis of Fisher and Bartlett that showed that fresh milk has an advantage over pasteurized milk as to the rate of growth.

Following his criticism, Gosset made a number of recommendations, including a proposal to conduct the experiment in a group of twins, one of whom will be fed milk and the other will serve as a control (or one of them will be fed in fresh milk and the other in pasteurized milk to compare the two types of milk). I think that such planning is not accepted ethically today. A more practical recommendation is to re-analyze the data collected to try to overcome the bias created in the non-random allocation to treatment and control groups. His ultimate recommendation was to re-conduct the experiment, this time using randomization, considering bias due to the weight of the clothes worn by each student, and planning the experiment so that each school has representation for the three treatment groups.

The main recommendation of Gosset, to ensure random allocation of patients to groups, was not immediately accepted, as this idea was perceived by some of the scientific community as “unethical”. It should be noted that the principle of randomization was only presented by Fisher in 1923, and there was still insufficient recognition of its importance.

The first clinical trial with random assignment to a treatment and control groups was conducted only in 1947, and is the fourth in my list. This is an experiment to test the efficacy of streptomycin antibiotics to treat pneumonia. Due to the short supply of antibiotics, there was no choice but to decide by “lottery” between the patients who will receive antibiotic treatment and who will not, and thus the planning of the experiment overcame the ethical barrier. However, the experiment was not double blind, and placebo was not used.

It should be noted that there has already been a precedent for a double blind trial: the first clinical trial using the double-blind method was conducted in 1943 to test the efficacy of penicillin as a treatment for common cold. Patients did not know whether they were treated with penicillin, or whether they were treated with placebo. The doctors who treated the patients did not know what treatment each patient received. Such a design prevents bias that may result from doctors’ prior judgment about the efficacy of the treatment, and in fact forces them to give an objective opinion about the patient’s medical condition. However, this trial did not randomize patients for treatment or control.

The debate regarding the importance of the principles outlined by Gosset and Fisher was finally decided in the trial to test the efficacy of Salk’s vaccine against polio virus, carried out in 1954. In fact, two trials were conducted. The main trial, led by Paul Meier, was a double-blind randomized trial, showing a 70% reduction in Polio-related paralysis in the treatment group compared to the control group. The size of the large sample (about 400,000 children aged 6–8) helped to establish external validity of the results. At the same time, another trial was conducted, in which the allocation of treatment (vaccination or placebo) was not random. 725,000 first and third graders who participated in the experiment served as a control group, to which 125,000 second grade children whose parents refused the vaccine were added. Their data were compared with the data of 225,000 second graders whose parents agreed to vaccinate them. A total of more than one million students participated in the experiment, almost three times the size of Meier’s trial. However, this trial results showed a decrease of only 44% in polio-related paralysis. Later analysis found that the effect was reduced due to bias related to the socioeconomic status of the treatment group. Many children in this group belonged to more affluent families, and in this population stratum polio incidence was higher because the proportion of children vaccinated naturally (the polio was mild and recovered without documentation) was lower due to the higher level of sanitation in their environment. The polio trails established the fact that the most important feature of a clinical trial is the randomization , and that only a random and double-blind allocation ensures the internal validity of the experiment.


  • Boylston, AW (2002). Clinical investigation of smallpox in 1767.New England Journal of Medicine, 346 (17), 1326–1328.
  • Leighton G, McKinlay P (1930). Milk consumption and the growth of school-children. Department of Health for Scotland, Edinburgh and London: HM Stationery Office.
  • Student (1931). The Lanarkshire Milk Experiment. Biometrika 23: 398–406
  • Fisher RA, Bartlett S (1931). Pasteurised and raw milk. Nature 127: 591–592.
  • Medical Research Council Streptomycin in Tuberculosis Trials Committee. (1948).
  • Streptomycin treatment for pulmonary tuberculosis. BMJ, 2 , 769–82.
  • Hart, PDA (1999). A change in scientific approach: from alternation to randomized allocation in clinical trials in the 1940s.BMJ, 319 (7209), 572–573.
  • Meier, Paul. “Polio trial: an early efficient clinical trial.” Statistics in medicine 9.1–2 (1990): 13–16.

What is logistic in the logistic regression?

Suppose that you are interviewed for a data scientist role. You are asked about logistic regression, and you answer all sorts of questions: How to run it in Python, how would you perform feature selection, and how would you use it for prediction. For the last question you answer that if you have the estimated of the regression coefficients and the data of the features, then you perform the necessary multiplications and additions, and the result will be L=log(p/(1-p)) where p is the probability of the event to be predicted. This transformation is known as the logit transformation. From this you can calculate p as exp(L)/(1+exp(L)). Then comes the critical question: why is that?

One possible answer is that since p is between 0 and 1, then L is between -∞ to ∞, that is, it can be any real number, and therefore the logistic regression is transformed into “regular” linear regression. However, this answer is wrong. In the train data, the values of the event/label to be predicted or classified are either 0 or 1. You cannot apply the logit transformations to zeros and ones. And even if you could, the linear regression assumptions do not hold.

A more sophisticated answer is to say that the logit transformation is the link function of choice. This choice has a nice property: if β is the coefficient of a feature X, then exp(β) is a (biased) estimate for the corresponding odds ratio. This is useful if you want to identify risk factors , e.g. for a disease like cancer. But if you are interested in predictions, you may not care about the odds ratio (although you should).

Let’s assume that we know to explain what a link function is. The question still remains: Why not choose another link function? Almost any inverse distribution function will do this trick. Why not choose the inverse of the normal distribution function as the link function?

Moreover, the history does not support this answer. The logit transformation and the logistic regression model came first. This model was developed by Sir David Cox in the 1960’s — the same David Cox who later introduced the proportional hazards model (see . The extension of the model to general linear models using various link functions came later.

So the question remains: what is logistic in the logistic regression?

The key is in the statistical model of the logistic regression, or any other binary regression. Let’s review the model.

Suppose you have data of a response/label/outcome Y that takes values of zeroes and ones. Let assume for the sake of simplicity that you have only one feature/predictor X, which can be any type of variable.

The key assumption of the model is that there exists a continuous/latent/unobservable Y* that relates somehow to the observed values of YNote that Y* is not a part of your data. It is a part of your model.

The next assumption is about the relationship between Y and Y*. You assume that Y equals to 1 if the signal of Y* is above some threshold, and otherwise Y is equal to zero. Furthermore, assume, without loss of generality, that this threshold is zero.

This model assumption is not new, and most of the readers are familiar with this approach. This is how the perceptron, the building block of neural networks, works. The idea itself is much older. Karl Pearson used similar modelling when he attempted to develop a correlation coefficient for categorical data, back in the 1910’s.

Assuming you know the values of Y*, you can model the relationship between Y* and X using simple linear regression:

The third and last assumption is about the distribution of the errorepsilon. As I said before, you can choose any distribution you like. If you assume, for example, that epsilon is normally distributed, you will get something that is called probit regression. But if assume that epsilon follows the logistic distribution:

then these three assumptions and some basic probability and algebra you get a logistic regression — a regression with a logit link function.

For simplicity I will assume that X is discrete variable. One can do the whole trick for any X by using density functions for a continuous X.


be the conditional distribution of Y given that X is equal to some value x.

Since Y=1 if and only if Y*>0, we get that

By the second assumption of the model we get that

Using the third assumption that states that the distribution of Y* is logistic we get that

and therefore

Good luck in your interview!


Some comments on AB testing implementation

Many job postings in the field of technology (mainly for Data Scientist jobs, but not only) require knowledge and/or experience in “AB testing”.

What is AB testing? A brief inspection at Wikipedia reveals that this is a method for assessing the impact of a certain change when it is carried out. For example, one may sick to know what will happen following a modification of a web page, e.g. whether adding a picture will increase the number of clicks, etc. A and B are the situations before the change, and the subsequent situation, respectively. According to Wikipedia, Google started to implement AB testing in 2000, and this approach began to spread in the technology world in about 2011. Wikipedia also rightly notes that AB testing is actually an implementation of the experimental design that William Sealy Gosset developed in 1908.

Although this is a significant methodological advancement in the digital technology world, I think that this is a naive approach, especially in light of the many advances that occurred in this field since 1908. (of course in your company you do it better). The main problem of this methodology is that it is usually implemented using one factor at a time, thus ignoring possible interactions between a number of variables. Ronald Fisher had already presented this problem in the 1920s and proposed preliminary solutions (such as two-way ANOVA). Of course, there are more advanced solutions that were proposed by his successors.

Other problems can arise in planning the experiment itself: How is the sample size determined? How to choose an unbiased (i.e. representative) sample? How to analyze the results, if at all? And what is the appropriate statistical analysis method?

More things to consider: is there any awareness of the possible errors and the probabilities in which they occur? And if there is such awareness, what is done in order to control the magnitude of these probabilities?

And finally: is there a distinction between a statistically significant effect and a meaningful effect?

I recently visited at a large and successful technology company, where I was presented with several tables of “data analysis”. I recognized many of the failures I have just mentioned: no sample size rationale, interactions were not considered, faulty analysis was applied, no one cares about the error probabilities, and every effect is considered as meaningful.

In another company, two product managers performed two “independent” tests (that is, AB testing and CD testing) using the same population as a sample. The population consisted of new customers, therefore it was not a representative sample of the company’s customers. Worse, the AD and CD interaction was not considered, as the parallel existence of the two experiments was discovered only after the fact, and their results were already implemented. In order to avoid such a mess in the future, I suggested that someone will coordinate all the experiments. The response was that it is not possible because of the organization’s competitive culture.

You can say, “What do you want, the fact is that they do well?” But the truth is that they have succeeded despite the problems with their methodology, especially when the core of their algorithm is based on probability and statistics.

Oren Tzur put it nicely on Twitter:

“I think the argument is that it’s cheap and immediate and you see results even if there is no “good model”, and that mistakes cannot be fixed or even indicated. The approach is “Why should I invest in it? Sometimes it works.”

Rafael Cohen also wrote to me on Twitter:

“When I come to a certain field, I assume that the expert knows something and that my analysis should help him … I took a designer to the site, I will not do AB test on every pixel … even if I have thousands of users a day, I still want not to waste them on bad configuration … the naïve statistical analysis would require 80,000 observations for each experiment… it is likely that someone who uses less observations because of a gut feeling gets reasonable results and creates enough revenue to his company …

This is mediocrity. Why think and plan, asks Cohen, if you can use a naïve approach and get something that sometimes works? Who cares that you could do better?

A few years ago, I gave a talk on the future of statistics in the industry at the ISA annual meeting. I will repeat the main points here.

Sam Wilks, former president of the American Statistical Association, paraphrased H.G. Wells, a pronounced science fiction writer: “Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write.”

As far as the pharmaceutical industry is concerned, the future predicted by Wells and Wilks is already here. Statistics are central to all the research, development, and manufacturing processes of the pharmaceutical industry. No one dares to embark on a clinical trial without close statistical support, and in recent years there has been a growing demand for statistical support in even earlier stages of drug development, as well as in production processes.

I hope that awareness of the added value that statistics bring with it will seep into the technological industry as the use of statistics increases, so is the need for statistical thinking on the part of the participants in the process, and making so with someone who “knows a much more statistics than the average programmer”, as Oren Tzur phrased it.

How to make children eat more vegetables

Let’s start from the end: I do not know how to make children eat more vegetables or even eat some vegetables. At least with my children, success is minimal. But two researchers from the University of Colorado had an idea: we would serve them the vegetables on plates with pictures of vegetables. To test whether the idea works, they conducted an experiment whose results were published in the prestigious journal JAMA Pediatrics . Because the results have been published you can guess that the result of the experiment was positive. But, did they really prove that their idea works?

Design of the experiment and its results

18 kindergarten and school classes (children aged 3–8) were selected in one of the suburbs of Denver. At first the children were offered fruits and vegetables when they were given white plates. In each class a bowl of fruits and a bowl of vegetables were placed, and each child took fruit and vegetables for himself or herself and ate them as he pleased. The weights of the vegetables and fruits were recorded before they were served to the children, and when the children had finished their meal, the researchers weighed the remaining fruits and vegetables. The difference between the weights (before and after the meal) was divided by the number of children, and thus the average amount of fruit and vegetables each child ate was obtained. Fruit and vegetables averages were also calculated separately. The researchers repeated these measurements three times per class.

After a while, the measurements were repeated the same way, but this time the children were given plates with pictures of vegetables and fruits. The result was an average increase of 13.82 grams in vegetables consumption, between 3 and 5 years of age. This result is statistically significant. In percentages it sounds much better: this is an increase of almost 47%.

So, what’s the problem? There are several problems.

First problem — extra precision

I will start with what is seemingly not a problem, but a warning: over-precision. When super precise results are published, you have to start worrying. I would like to emphasize: I mean precision, not accuracy. Accuracy refers to the distance between the measured value and the real, unobserved value, and is usually measured by standard deviation or confidence interval. The issue here is about precision: the results are reported at the level of two decimal places; they are very precise. I’m not saying it’s not important, but from my experience, when someone exaggerates, you have to look more thoroughly at what’s going on. Precision of two digits after decimal when it comes to grams seems excessive to me. You can of course think differently, but that’s the warning signal that made me read the article to the end and think about what was described in it.

Second problem — on whom was the experiment conducted?

The second problem is much more fundamental: the choice of the experimental unit, or unit of observation . The experimental units here are the classrooms. The observations were made at the class level. The researchers measured how many vegetables and fruits were eaten by all the children in the class. They did not measure how many vegetables and fruits each child ate. Although they calculated an average for a child, I suppose everyone knows that the average alone is a problematic measure: it ignores the variation between the children. Before experimental intervention, Each child ate an average of about 30 grams of vegetables at a meal, but I do not think there will be anyone who disagrees with the statement that each child ate a different amount of vegetables. What is the standard deviation? We do not know, and the researchers do not know, but this is essential, because the difference between the children affects the final conclusion. Because the researchers ignored (regardless of the reason) the variation between the children, they practically assumed that the variance was very low, in fact zero. Had the researchers consider this variation, the conclusions of the experiment would be different: the confidence intervals would be different, and wider than the confidence intervals calculated by the researchers.

Another type of variance that was not considered is the variation within children. Let me explain: Even if we watched one child and saw that on average he ate 30 grams of vegetables at every meal, at different meals he eats a different amount of vegetables. The same the question arises again: What is the standard deviation? This standard deviation also has an impact on the final conclusion of the experiment. Of course, each child has a different standard deviation, and this variability should also be taken into consideration.

A third type of variation that was not considered is the variation between children of different ages: it is reasonable to assume that an 8-year-old will react differently to a painted plate than a 3-year-old. An 8-year-old will probably eat more vegetables than a 3-year-old.

I think that the researchers did not pay attention to all these issues. The words variation, adjust or covariate do not appear in the article. Because the researchers ignored these sources of variation, the confidence intervals they calculate are too narrow to reflect the real differences between the children and the types of successes.

Finally, although the experimental unit was the class, the results were reported as measurements were made at the child’s level. In my opinion, this also shows that the researchers were not aware of the variation between and within the children. For them, class and child are one and the same.

Third problem — what about the control?

There is no control group in this experiment. At a first sight, there is no problem: according to the design of the experiment, each class constitutes its own control group. After all, the children received the vegetables in white plates as well as plates with paintings of vegetables and fruits. But I think that’s not enough.

There are lots of types of plates for children, with drawings by Bob the Builder, Disney characters, Adventure Bay, Thomas the engine, and the list goes on. Could it be that the change was due to the very fact of the paintings themselves, and not because they are paintings of vegetables and fruits? Maybe a child whose meal is served on a plate with pictures of his favorite superhero will eat even more vegetables? The experimental design does not answer this question. A control group is needed. In my opinion, two control groups are needed in this experiment. In one of them the children initially get white plates, and then plates of Thomas the engine, Disney or superheroes, depending on their age and preferences. In the second control group there will be children who will initially receive “ordinary” plates (i.e. Thomas, Disney, etc.) and then plates with paintings of vegetables and fruits.

Fourth problem — subgroup analysis

Although the age group of the children in the study was 3–8, the researchers discuss the results for children in ages 3–5. What happened to children at age 6–8? Was the analysis for the two (or more) age groups pre-defined? The researchers do not provide this information.

Fifth Problem — What does all this mean?

First, it was found that there was a statistically significant change in the consumption of vegetables, but no significant change was observed in the fruit. The researchers referred to this in a short sentence: a possible explanation, they said, is the ceiling effect . Formally they are right. ceiling effect is a statistical phenomenon, and that is what happened here. The really important question they did not answer: Why did this effect occur?

And the most important question: Is the significant change also meaningful? What does the difference of 14 grams (sorry, 13.82 grams) mean? The researchers did not address this question. I’ll give you some food for thought. I went to my local supermarket and weighted one cucumber and one tomato (yes, it’s a small sample, I know). The weight of the cucumber was 126 grams, and the weight of the tomato was 124 grams. In other words, each child ate on average an extra half a bite of a tomato or a cucumber. Is this amount of vegetables meaningful in terms of health and / or nutrition? The researchers did not address this question, nor did the editors of the journal.


It is possible that plates with vegetables and fruit paintings cause children to eat more vegetables and fruits. This is indeed an interesting hypothesis. The study that was described here does not answer this question. The manner in which it was planned and implemented does not allow even a partial answer to this question, apparently due to the lack of basic statistical thinking.