We must abandon the p-value

Just a couple of weeks ago I wrote here that we should not abandon statistical significance. No, I have not changed my mind. We should not abandon statistical significance, but we must abandon p-values.

Executive summary: p-values create the false impression that statistical significance is a continuous concept. It is not; it is dichotomous.

You are welcome to read on.

I will not get into a lengthy discussion of the difference between Fisher’s significance testing and Neyman and Pearson’s hypothesis testing. In short: Fisher had a null hypothesis, a test statistic and a p-value. The decision whether or not to reject the null hypothesis based on the p-value was flexible. Indeed, in one place Fisher suggested using the 5% threshold for deciding significance, but in other places he used different thresholds.

Neyman and Pearson’s approach is totally different. In their framework there are two hypotheses: the null hypothesis and the alternative hypothesis. The Neyman-Pearson lemma provides a tool for constructing a decision rule. Given data, you evaluate the likelihood of these data twice: once assuming that the null hypothesis is true, and once assuming that the alternative hypothesis is true. You reject the null hypothesis in favor of the alternative if the ratio of the two likelihoods exceeds a pre-defined threshold. This threshold is determined by the significance level: an acceptable probability of falsely rejecting the null hypothesis (in the frequentist sense) that meets your scientific standards.

When I was a young undergraduate student, we did not have easy access to computers at my university (yes, I am that old). So we calculated the test statistic (Z, t, F, chi-square) and looked in a table similar to this one to see whether our test statistic exceeded the threshold.

Fisher was a genius, and so was Karl Pearson. They developed statistical tests such as the F and chi-square tests based on geometrical considerations. But when you construct a test for equality of means, as in ANOVA, or a test for independence of two categorical variables using the NP lemma, you get the same tests. This opened a computational shortcut: the NP decision rule is equivalent to a p-value decision rule. You can decide to reject the null by comparing your test statistic to the value you look up in a table, or by comparing the p-value to your pre-determined significance level. You will get the same result either way. But still, you either reject the null hypothesis or you don’t. The size of the test statistic or the p-value does not matter. There are no such things as “nearly significant”, “almost significant”, “borderline significant” or “very significant”. If your pre-determined significance level is 0.05 and your p-value is 0.051, that is really depressing. But if you declare that your result is “almost significant”, it means that you changed your decision rule after the fact.
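
To see the equivalence concretely, here is a minimal sketch in R with a made-up test statistic (the one-sided z-test setup and the value 1.8 are purely illustrative): the critical-value rule and the p-value rule return exactly the same decision.

# one-sided z-test of H0: mu = 0 vs H1: mu > 0, at the 5% significance level
alpha = 0.05
z = 1.8                                            # a made-up observed test statistic
# decision rule 1: compare the statistic to the critical value from the table
reject_by_critical_value = z >= qnorm(1 - alpha)   # critical value is about 1.645
# decision rule 2: compare the p-value to the significance level
reject_by_p_value = (1 - pnorm(z)) <= alpha
print(c(reject_by_critical_value, reject_by_p_value))   # always identical

Whatever value you plug in for z, the two logical values agree; the exact size of the p-value adds nothing to the decision.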

Let me illustrate it with a story my teacher and mentor, Professor Zvi Gilula, told me once.

Someone robbed a convenience store in the middle of nowhere. The police caught him and brought him to court. There were three eye witnesses, and a video from the security camera. The jury found him guilty and he was sentenced to serve 6 months in jail. In statistical jargon, the null hypothesis states that the defendant is not guilty, and the jury rejected the null hypothesis.

Let’s look at another robber. This man robbed a convenience store in New York City, next to Yankee Stadium.  The robbery occurred when the game just ended, the Yankees won, 47000 happy fans witnessed the robbery, and they all came to testify in court, because the poor robber was a Red Sox fan. The jury found him guilty.

Is the second robber guiltier than the first robber? Is the first robber “almost not guilty” or “borderline guilty”? Is the second robber “very guilty”? What would be the right punishment for the second robber? 10 years? Life imprisonment? A death sentence? Do you think now that if you were on the jury of the first robber you would acquit him?

Here is what we need to do if we really want to improve science: when publishing results, one should provide all the necessary information, such as data, summary statistics, test statistics and more, and then state: “we reject the null hypothesis at the 5% significance level”, or maybe “we do not reject the null hypothesis at the 5% significance level” (or any other reasonable and justified significance level). No p-values. And if someone wants to know whether the null hypothesis would be rejected at the 5.01% significance level, they can go and calculate their own p-value. P-values should be abandoned.

We should not abandon statistical significance

The call to ban statistical significance is not new. Several researchers, most notably Andrew Gelman from Columbia University, have been promoting this idea for the last few years. What is new is the recent article in the prestigious journal Nature that brought the issue up again. Actually, it is a petition signed by about 800 researchers, some of them statisticians (and yes, Gelman signed this petition too). Gelman and Blake McShane from Northwestern University, who is one of the lead authors of the current petition, also expressed this idea in another provocative Nature article titled “Five ways to fix statistics”, published in November 2017.

I think that it is not a good idea. It is not throwing the baby out with the bathwater. It is attempting to take the baby and throw it out of the sixth-floor window.

Let me make it clear: I do not oppose taking into account prior knowledge, plausibility of mechanism, study design and data quality, real-world costs and benefits, and other factors (I am actually quoting here what Gelman and McShane wrote in the 2017 article). I think that when editors are considering a paper that was submitted for publication, they must take all of this into account, provided that the results are statistically significant.

“Abandoning statistical significance” is just a nice way of saying “abandoning the Neyman-Pearson Lemma”. The lemma guarantees that likelihood ratio tests control the rate of false positive results at the acceptable significance level, which is conventionally 5%. The rate is guaranteed in a frequentist manner. I do not know how many studies are published every year; maybe tens of thousands, maybe even more. If all researchers apply the Neyman-Pearson theory with a significance level of 5%, then among the studies in which the null hypothesis is true, the proportion of false positive results approaches 0.05. No wonder that Gelman, who is a Bayesian statistician, opposes this.
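
As a rough illustration (a simulation sketch with arbitrary settings, not a description of any real body of studies): if the null hypothesis is true in every study and every study is tested at the 5% level, about 5% of the studies come out significant.

# simulate many "studies" in which the null hypothesis is true
set.seed(1)
n_studies = 10000
p_values = replicate(n_studies, t.test(rnorm(30), rnorm(30))$p.value)   # two identical groups
# proportion of false positive results at the 5% significance level
print(mean(p_values < 0.05))   # should be close to 0.05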

Once statistical significance is abandoned, bad things will happen. First, the rate of false positive published results will rise. On the other hand, there is no guarantee that the rate of false negative results (which are not published anyway) will fall. The Neyman-Pearson Lemma provides protection against false negative results as well; granted, you need an appropriate sample size to control that rate. But if you do not care about statistical significance anymore, you will not care about controlling the false negative rate either. Researchers will have an incentive to perform many small-sample studies: such studies are less expensive and have higher variation, so the chance of getting a nice, publishable effect is larger. This phenomenon is known as “the law of small numbers”. What about external validity and reproducibility? In a world of “publish or perish”, nobody seems to care.
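
Here is a small sketch of that incentive (hypothetical numbers, two-arm studies with no true effect): small studies produce “impressive” effect estimates far more often than large ones.

# estimated effect (difference in group means) when the true effect is zero
set.seed(2)
effect = function(n_per_arm) mean(rnorm(n_per_arm)) - mean(rnorm(n_per_arm))
small_studies = replicate(10000, effect(10))
large_studies = replicate(10000, effect(100))
# how often does the estimated effect exceed half a standard deviation?
print(mean(abs(small_studies) > 0.5))   # common with 10 subjects per arm
print(mean(abs(large_studies) > 0.5))   # rare with 100 subjects per arm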

And this brings us to the question of how one would determine whether an effect is publishable. I can think of two possible mechanisms. One possibility is to calculate a p-value and apply a “soft” threshold. P is 0.06? That’s fine. What about 0.1? It depends. 0.7? No way (I hope). A second possibility is the creation of arbitrary rules of thumb that have no theoretical basis. In this paper, for example, the researchers expressed the effects in terms of Hedges’ g, and considered an effect meaningful if g>1.3. Why 1.3? They did not provide any justification.

I worked for many years in the pharmaceutical industry, and now I work in a research institute affiliated with a large healthcare organization. False positive results, as well as false negative results, can lead to serious consequences. Forget about the money involved; it can be a matter of life and death. Therefore, I care about the error rates, and I strongly support hypothesis testing and insisting on statistical significance. And you should care too.

How to select a seed for simulation or randomization

If you need to generate a randomization list for a clinical trial, do some simulations, or perhaps perform a huge bootstrap analysis, you need a way to draw random numbers. Putting many pieces of paper in a hat and drawing them is possible in theory, but you will probably be using a computer. The computer, however, does not generate random numbers; it generates pseudo-random numbers. They look and feel almost like real random numbers, but they are not random. Each number in the sequence is calculated from its predecessor, so the sequence has to begin somewhere; it begins with the seed, the first number in the sequence.

Knowing the seed is a good idea. It enables reproducing the analysis, the simulation or the randomization list. If you run a clinical trial, reproducibility is crucial. You must know at the end of the trial which treatment each patient was randomized to; otherwise you will have to throw all your data in the garbage. During the years I worked at Teva Pharmaceuticals, we took every possible safety measure: we burned the randomization lists, the randomization SAS code and the randomization seed onto a CD and kept it in a fire-proof safe. We also kept all this information on analog media; yes, we printed the lists, the SAS code and the seed on paper, and these were also kept in the safe.

Using the same seed every time is not a good idea. If you use the same seed every time, you get the same sequence of pseudo-random numbers every time, and therefore your numbers are not pseudo-random anymore. Selecting a different seed every time is good practice.

How do you select the seed? Taking a number out of your head is still not a good idea. Our heads are biased. Like passwords, people tend to use some seeds more often than other possible seeds. Don’t be surprised if you see codes with seeds like 123, 999 or 31415.

The best practice is to choose a random seed, but this creates a circular problem: you need a random number in order to generate random numbers. You can still turn to your old hat and some pieces of paper, but there is a simple alternative: generate the seed from the computer clock.

This is how you do it in R using the Sys.time() function: get the system time, convert it to an integer, and you’re done. In practice, I take only the last 5 digits of that integer. And of course, I keep the chosen seed on record. I also save the Sys.time() value and its integer value, just in case. Finally, I hard-code the seed I get. This is the R code:

> # set the initial seed as the computer time
> initial_seed=Sys.time()
> print (initial_seed)
[1] "2019-03-16 10:56:37 IST"

> # convert the time into a numeric variable
> initial_seed=as.integer(initial_seed)
> print (initial_seed)
[1] 1552726597

> # take the last five digits of the initial seed
> the_seed=initial_seed %% 100000
> print(the_seed)
[1] 26597

> # do your simulation
> set.seed(26597)
> print(rnorm(3))
[1] -0.4759375 -1.9624872 -0.1601236

> # reproduce your simulation
> set.seed(26597)
> print(rnorm(3))
[1] -0.4759375 -1.9624872 -0.1601236
>

Cancer clusters and the Poisson distribution

On March 1, 2019, an article was published in Israel’s Ynetnews website, under the title “The curious case of the concentration of cancer”. The story reports on a concentration of cancer cases in the town of Rosh Ha’ayin in central Israel.

In the past few years dozens of cases of cancer have been discovered in the center of Rosh Ha’ayin. About 40 people have already died of the disease. Residents are sure that the cause of the disease is the cellular antennas on the roof of a building belonging to the municipality. “For years we have been crying out and no one listens,” they say. “People die one after the other.”

I do not underestimate the pain of the residents of Rosh Ha’ayin. I also do not intend to discuss the numbers mentioned in the article; I accept them as they are. I only want to address the claim that the cause of the disease is the cellular antennas. It is easy (at least for me) to explain why this claim is at least questionable: there are many more cellular antennas in many other places, and around them there is no high rate of cancer morbidity. If the antennas were carcinogenic, they would cause cancer everywhere, not just in Rosh Ha’ayin. On the other hand, I can’t blame the residents for blaming the antennas. People try to rationalize what they see and find a cause.

I also must emphasize that the case of Rosh Ha’ayin must not be neglected. If it is not because of the cellular antennas, it is possible that there is some other risk factor, and the state authorities must investigate.

So why does Rosh Ha’ayin have such a large cluster of cancer morbidity? One possible answer is that there is an environmental factor (other than the antennas) that does not exist elsewhere. Another possible answer is that there may be a non-environmental factor that does not exist elsewhere, maybe a genetic factor (most of the town’s residents are immigrants from Yemen and their descendants). A third and particularly sad possibility is that the local residents suffer from really bad luck. The rules of statistics can be cruel.

Clusters happen: if there are no local (environmental or other) risk factors that cause cancer (or any other disease), and the disease spreads randomly across the whole country, clusters will still form. If the country is divided into area units of equal size, then the number of cases in a given unit follows a Poisson distribution, and there is a small, but not negligible, probability that one of these units will contain a large number of cases. The problem is that there is no way of knowing in advance where this will happen.
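
To get a feel for how “not negligible” this is, here is a short sketch under the setup I simulate below (400 cases spread at random over 100 equal squares, so the count in each square is approximately Poisson with mean 4):

# probability of a cluster of 11 or more cases in at least one of 100 squares,
# treating the counts as (approximately) independent Poisson with mean 4
lambda = 400/100
p_one_square = 1 - ppois(10, lambda)       # a given square: roughly 0.003
p_any_square = 1 - ppois(10, lambda)^100   # at least one square: roughly 0.25
print(c(p_one_square, p_any_square))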

The opposite is also true: if the number of cases per area unit follows a Poisson distribution, then it can be concluded that the cases are dispersed randomly over the area.

I will demonstrate the phenomenon using simulation.

Consider a hypothetical country shaped like a perfect square, 100 x 100 kilometers. I scattered 400 cases of morbidity at random across the country:

# generate n random points on 0:100 x 0:100
> set.seed(21534)
> n=400
> x=runif(n)*100
> y=runif(n)*100
> dat=data.frame(x,y)
> head(dat)
         x         y
1 15.73088  8.480265
2 12.77018 78.652808
3 45.50406 31.316797
4 86.46181  6.669138
5 27.25488 48.164316
6 17.42388 98.429575

I plotted the 400 cases.

> # plot the points
> plot(dat$x, dat$y, ylim=c(0,100), xlim=c(0,100),
+ asp=1, frame.plot=FALSE, axes=FALSE,
+ xlab=' ', ylab=' ', col='aquamarine', pch=16,
+ las=1, xaxs="i", yaxs="i")
> axis(1, at=0:10*10)
> axis(2, at=0:10*10, las=1, pos=0)
> axis(3, at=0:10*10, labels=FALSE)
> axis(4, at=0:10*10, labels=FALSE, pos=100)

Here’s the map I got.

Next, I divided the map into 100 squares, each 10 x 10 kilometers:

> #draw gridlines
> for (j in 1:9){
+ lines(c(0,100), j*c(10,10))
+ lines(j*c(10,10), c(0,100))
+ }
>

In order to count the cases in each square, I assigned row and column numbers for each of the squares, and recorded the position (row and column) for every case/dot:

> # row and column numbers and case positions
> # (vectorized; no loop is needed)
> dat$row=ceiling(dat$y/10)
> dat$col=ceiling(dat$x/10)
> dat$pos=10*(dat$row-1)+dat$col
>

Now I can count the number of points/cases in each square:

> # calculate number of points for each position
> # ppp=points per position
> dat$count=1
> ppp=aggregate(count~pos, dat, sum)
> dat=dat[,-6]

But of course, it is possible that there are squares with zero cases (in fact, the data frame ppp has only 97 rows). Let’s identify them:

> # add positions with zero counts, if any
> npp=nrow(ppp)
> if(npp<100){
+ w=which(!(1:100 %in% ppp$pos))
+ addrows=(npp+1):(npp+length(w))
+ ppp[addrows,1]=w
+ ppp[addrows,2]=0
+ ppp=ppp[order(ppp$pos),]
+ }
>

And now we can get the distribution of number of cases in each of the 100 squares:

> # distribution of number of points/cases in each position
> tb=table(ppp$count)
> print(tb)
0  1  2  3  4  5  6  7  8  9 11 
3  9 12 21 15 17 13  5  1  3  1 
>

We see that there is one very unlucky cluster of 11 cases, and there are also 3 squares with 9 cases each. Let’s paint them on the map:

> # identify largest cluster
> mx=max(ppp$count)
> loc=which(ppp$count==mx)
> clusters=dat[dat$pos %in% loc,]
> points(clusters$x, clusters$y, col='red', pch=16)
> 
> # identify second largest cluster/s
> loc=which(ppp$count==9)
> clusters=dat[dat$pos %in% loc,]
> points(clusters$x, clusters$y, col='blue', pch=16)
>

Let’s also mark the squares with zero points/cases. In order to do this, we first need to identify the row and column locations of these squares:

> # identify squares without cases
> # find row and column locations
> loc=which(ppp$count==0)
> zeroes=data.frame(loc)
> zeroes$row=ceiling(zeroes$loc/10)
> zeroes$col=zeroes$loc %% 10
> w=which(zeroes$col==0)
> if(length(w)>0){
+ zeroes$col[w]=10
+ }
> print(zeroes)
  loc row col
1   8   1   8
2  31   4   1
3  99  10   9
>

So there is one empty square in the 8th column of the first row, one in the first column of the 4th row, and one in the 9th column of the 10th row. Let’s mark them. To do that, we need to know the coordinates of each of the four vertices of these squares:

# mark squares with zero cases
> for (j in 1:nrow(zeroes)){
+ h1=(zeroes$col[j]-1)*10
+ h2=h1+10
+ v1=(zeroes$row[j]-1)*10
+ v2=v1+10
+ lines(c(h1,h2), c(v1,v1), lwd=3, col='purple')
+ lines(c(h1,h2), c(v2,v2), lwd=3, col='purple')
+ lines(c(h1,h1), c(v1,v2), lwd=3, col='purple')
+ lines(c(h2,h2), c(v1,v2), lwd=3, col='purple')
+ }

Do you see any pattern?

How well does the data fit the Poisson distribution? We can perform a goodness of fit test.
Let’s do the log-likelihood chi-square test (also known as the G-test):

> # log likelihood chi square to test the goodness of fit 
> # of the poisson distribution to the data
> 
> # the observed data
> observed=as.numeric(tb)
> values=as.numeric(names(tb))
> 
> # estimate the poisson distribution parameter lambda
> # it is the mean number of cases per square
> lambda=nrow(dat)/100
> print(lambda)
[1] 4
> 
> # calculate the expected values according to 
> # a poisson distribution with mean lambda
> expected=100*dpois(values, lambda)
> 
> # view the data for the chi-square test
> poisson_data=data.frame(values, observed, expected)
> print(poisson_data)
   values observed   expected
1       0        3  1.8315639
2       1        9  7.3262556
3       2       12 14.6525111
4       3       21 19.5366815
5       4       15 19.5366815
6       5       17 15.6293452
7       6       13 10.4195635
8       7        5  5.9540363
9       8        1  2.9770181
10      9        3  1.3231192
11     11        1  0.1924537
> 
> # calculate the degrees of freedom
> df=max(values)
> print(df)
[1] 11
> 
> # calculate the test statistic and p-value
> g2=sum(observed*log(observed/expected))
> pvalue=1-pchisq(g2,df)
> log_likelihood_chi_square_test=data.frame(g2, df, pvalue)
> print(log_likelihood_chi_square_test)
        g2 df    pvalue
1 4.934042 11 0.9343187
>

We cannot reject the hypothesis that the data follows the Poisson distribution. This does not imply, of course, that the data follows the Poisson distribution, but we can say that the Poisson model fits the data well.

The delta method and its implementation in R

Suppose that you have a sample of a variable of interest, e.g. the heights of men in a certain population, and for some obscure reason you are interested not in the mean height μ but in its square μ². How would you do inference on μ², e.g. test a hypothesis or calculate a confidence interval? The delta method is the trick you need.

The delta method is a mathematical assertion that yields estimates for the variance of functions of statistics under mild conditions. Loosely speaking, let bₙ be an estimate of β, where n is the sample size, and suppose that the distribution of bₙ is approximately normal with mean β and variance σ²/n.

Then, if g is a function, g(bₙ) is approximately normal with mean g(β) and variance [g’(β)]²⋅σ²/n, provided that the sample size is large. Don’t let the calculus scare you: R’s deltamethod function will handle the calculus for you.

If bₙ is a Maximum Likelihood Estimate (MLE) of β, then under mild conditions it is approximately normal for a large sample size, and g(bₙ) and g’(bₙ) are MLEs for g(β) and g’(β), respectively. Add on top of this an MLE for σ, and you can implement statistical inference.
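
Before turning to real data, here is a minimal sketch of the μ² example above, using simulated heights (all numbers are made up for illustration). It compares the hand-written delta-method standard error with the one returned by the deltamethod function of the msm package, which is used later in this post:

# delta method for g(mu) = mu^2, with a simulated sample of heights
library(msm)
set.seed(123)
heights = rnorm(200, mean=175, sd=7)      # hypothetical heights in cm
m = mean(heights)
se = sd(heights)/sqrt(length(heights))    # estimated standard error of the mean
# by hand: g(x) = x^2, so g'(x) = 2x
se_mean_squared_by_hand = abs(2*m)*se
# the same calculation via deltamethod (cov is the variance of the estimate)
se_mean_squared_delta = deltamethod(~ x1^2, mean=m, cov=se^2)
print(c(se_mean_squared_by_hand, se_mean_squared_delta))   # the two agree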

I will demonstrate the use of the delta method using the Titanic survival data. For the sake of the example I will use only 3 variables: Survived is a 0/1 variable indicating whether a passenger survived the sinking of the Titanic, Age is obviously the passenger’s age, and Parch is the passenger’s number of parents/children aboard. Let’s take a look at some of the data:

> # install.packages("titanic")
> library(titanic)
> titanic=titanic_train[, c(1, 2, 6,8)]
> head(titanic)
  PassengerId Survived Age Parch
1           1        0  22     0
2           2        1  38     0
3           3        1  26     0
4           4        1  35     0
5           5        0  35     0
6           6        0  NA     0
>

Let π be the probability of surviving. Then the odds of survival are π/(1-π). The sample size n=891 is considered large, so we can apply the Central Limit Theorem to conclude that the sample proportion p is approximately normal with mean π and variance π(1-π)/n. Then π is estimated by p=0.384, the odds are estimated by p/(1-p)=0.623, and the variance of p is estimated by 0.000265:

> p_survival=mean(titanic$Survived)
> print(p_survival)
[1] 0.3838384
> survival_odds=p_survival/(1-p_survival)
> print(survival_odds)
[1] 0.6229508
> n=nrow(titanic)
> var_p_survival=(p_survival*(1-p_survival))/n
> print(var_p_survival)
[1] 0.0002654394
>

Let g(x)=x/(1-x). Then g’(x)=1/(1-x)² (you can check this at https://www.wolframalpha.com), so the variance of the odds can be estimated by

Var(p/(1-p)) = [g’(p)]²⋅σ²/n = [1/(1-p)²]²⋅[p(1-p)/n].

This can be implemented in the following R code:

> dev_g_odds=1/(1-survival_odds)^2
> var_odds=(dev_g_odds^2)*var_p_survival
> print(var_odds)
[1] 0.01313328
> se_odds=sqrt(var_odds)
> print(se_odds)
[1] 0.1146005
>

But of course, instead of doing all the calculus, you can use the deltamethod function of R’s msm package.

The function has three parameters:

  • g is a formula object representing the transformation g(x). The formula variables must be labeled x1, x2 and so on.
  • mean is the estimated value of β
  • cov is the estimate of var(β), which is σ²/n

The use of the deltamethod function provides the same result:

> # install.packages("msm")
> library(msm)
> se_odds_delta=deltamethod(g=~x1/(1-x1), mean=survival_odds, cov=var_p_survival)
> print(se_odds_delta)
[1] 0.1146005
>

The second example considers logistic regression. We will model the (logit of the) probability of survival using Age and Parch. Using R’s glm function we can obtain estimates of the logistic regression coefficients b and their standard errors se_b:

> model=glm(Survived ~ Age + Parch, family=binomial(link="logit"), data=titanic)
> model_beta=data.frame(summary(model)$coefficients[2:3, 1:2])
> names(model_beta)=c("b", "se_b")
> print(model_beta)
                 b        se_b
Age   -0.008792681 0.005421838
Parch  0.191531303 0.090643635
>

Since exp(β) is usually interpreted as the odds ratio (OR) of the response with respect to the corresponding independent variable, researchers are interested in inference on exp(β). In order to perform such inference one needs to estimate the standard error of exp(b). Since b is a Maximum Likelihood Estimate of β, it is approximately normal and its variance is estimated by se_b², so the delta method can be applied.

The calculus in this case is easy: g(x)=exp(x), so g’(x)=exp(x). Therefore, the standard error of exp(β) can be estimated by exp(b)⋅ se_b:

> model_beta$OR=exp(model_beta$b)
> model_beta$se_OR=exp(model_beta$b)*model_beta$se_b
> print(model_beta)
                 b        se_b        OR       se_OR
Age   -0.008792681 0.005421838 0.9912459 0.005374374
Parch  0.191531303 0.090643635 1.2111027 0.109778755
>

To use the deltamethod function, we first use the vcov function to obtain the variance-covariance matrix of the regression coefficient estimates; the variances will be the inputs of the deltamethod function:

> vc=vcov(model)[2:3, 2:3]
> print(vc)
               Age        Parch
Age   2.939632e-05 9.236876e-05
Parch 9.236876e-05 8.216269e-03

> model_beta$se_OR_delta[1]=deltamethod(~exp(x1), mean=model_beta$b[1], cov=vc[1,1])

> model_beta$se_OR_delta[2]=deltamethod(~exp(x1), mean=model_beta$b[2], cov=vc[2,2])

> print(model_beta)
                 b        se_b        OR       se_OR se_OR_delta
Age   -0.008792681 0.005421838 0.9912459 0.005374374 0.005374374
Parch  0.191531303 0.090643635 1.2111027 0.109778755 0.109778755
>

Of course, we get the same results.

 

Powerball demystified

The US Powerball lottery hysteria took another step when no one won the big jackpot in the last draw that took place on October 20, 2018. So, the total jackpot is now 2.22 billion dollars. I am sure that you want to win this jackpot. I myself want to win it.

Actually, there are two different lotteries: the Mega Millions jackpot is about 1.6 billion dollars, and the probability of winning it with a single ticket is about 1 in 302 million. The Powerball jackpot is “only” 620 million dollars, but the probability of winning is slightly better: about 1 in 292 million.

The probability of winning both jackpots is therefore the product of the two probabilities stated above, which is about 1 in 88,000,000,000,000,000.

Let’s not be greedy, and aim just for the 1.6 billion jackpot, even though its probability of winning is slightly worse.

First, it should be noted that although the probability of winning is small, it is still positive. So if you buy a ticket you get a chance. If you do not buy a ticket you will not win, period.

Second, is buying a ticket a good investment? It looks like it is. The price of a ticket is two dollars. You will win the jackpot with probability 1 in 302 million, and lose your two dollars with probability of nearly 1. Therefore, your average return is the jackpot multiplied by the probability of winning it, minus the price of the ticket. Since the probability of winning is 1 in 302 million and the jackpot is 1600 million dollars, the expected return is 1600/302 - 2, which is positive: about 3.30 dollars. Therefore, you should play. Or shouldn’t you?
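
For what it’s worth, here is that back-of-the-envelope calculation in R, using the figures quoted above:

# expected return of a single ticket (jackpot only, ignoring smaller prizes and taxes)
jackpot = 1.6e9            # dollars
p_win = 1/302e6            # probability of winning with one ticket
ticket_price = 2           # dollars
expected_return = jackpot*p_win - ticket_price
print(expected_return)     # about 3.3 dollars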

The figure above, an expected value of 3.30 dollars, is an expectation of money. It is not money. You are not going to gain this expected sum when you play the lottery once; you either win the jackpot or lose your money. Of course, if you could participate in such a lottery, with such a jackpot, as many times as you wish, you should play, and the law of large numbers would be in your favor. That is not going to happen, of course. You only get to play this game once.

The next interesting question is what is the probability that someone will win?

Assume that you roll a die. The probability of rolling a 6 is 1 in 6. If two people each roll a die, the probability that at least one of them rolls a 6 is about 1 in 3.3. If 3 people roll, the probability that at least one of them rolls a 6 is even better, about 1 in 2.4, and so on. The lottery is similar. Think of a lottery ticket as a die, only that the probability of rolling a 6 is 1 in 302 million. If two people roll such a die, i.e. buy a lottery ticket, the probability that at least one of them rolls a 6 is almost twice as large. How many people need to buy a lottery ticket to make the probability of at least one win greater than 5%? 10%? 50%? What is the probability that two or more people will share the jackpot? These probabilities depend on the number of tickets sold. The more tickets sold, the higher the probability that someone wins. If you know the number of tickets sold, you can approximate these probabilities using the Poisson distribution. You can also back-calculate the number of tickets that need to be sold in order to set the probability that someone wins to any level you like. I’ll skip the technical details; the sketch below shows the idea. According to my calculations, about 210 million tickets need to be sold for the probability of at least one winner to exceed 0.5.
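
Here is a sketch of the calculation I skipped, using the Poisson approximation (the 210 million figure is a hypothetical number of tickets sold): with N tickets sold, the number of winning tickets is approximately Poisson with mean N·p.

# the number of winners is approximately Poisson with mean n_sold * p_win
p_win = 1/302e6                                # one ticket's chance of hitting the jackpot
n_sold = 210e6                                 # assumed number of tickets sold
lambda = n_sold*p_win
p_at_least_one_winner = 1 - dpois(0, lambda)   # about 0.5 for these numbers
p_shared_jackpot = 1 - ppois(1, lambda)        # two or more winners
print(c(p_at_least_one_winner, p_shared_jackpot))
# back-calculation: tickets needed for a 50% chance of at least one winner
print(log(2)/p_win)                            # about 210 million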

But wait: the price of a ticket is 2 dollars, and there are only 302 million possible combinations of numbers. So, if I buy all possible tickets, it will cost me only 604 million dollars, and I am guaranteed to win 1.6 billion. That is a net profit of nearly a billion dollars. Not bad. Can I do it?

The short answer is “Yes”. The long answer is “probably not”.

Yes, it was done before. In 1992, the jackpot of the Virginia lottery was 27 million dollars, the probability of winning the jackpot was about 1 in 7 million, and the price of a ticket was 1 dollar. So it was possible to buy 7 million tickets for 7 million dollars and ensure a jackpot of 27 million. A consortium of 2500 small investors aimed to raise the money to buy all these tickets. However, they managed to buy only about 5 million tickets. They still managed to win (see The Improbability Principle by David Hand, page 120).

Buying 302 million tickets is a different story. First, you need to have 604 million dollars in cash. Then you need to actually buy all these tickets, and you have only 4 days to do all this shopping. There are 345,600 seconds in 4 days, so you need to buy nearly 900 tickets per second, while making sure that each ticket carries a different combination of numbers. The logistics may be difficult: in 1992, that consortium managed to buy only about 5 million tickets in about a week. You may not be able to buy all the tickets you need.

Second, when you win, you will probably want the money in cash, and not in annuity payments. So the 1.6 billion reduces to 910 million, and the government will take 25% tax (and more later). You will end up with 684 million, still a net profit of 80 million dollars. Not bad.

Third, it is possible that you will share the jackpot with someone else, maybe even with more than one person. As we already saw, there is a good chance of this scenario: if 210 million other tickets are sold, the probability of sharing the jackpot is about 50%. Even if you share the jackpot with just one other person, your share is only 800 million, and if you want to cash it, it shrinks to 342 million, a net loss of 262 million. That’s not good.

And finally: should you buy a ticket? Yes, buy one ticket. It’s fun. And for two dollars you buy the right to hope to win the jackpot, as Prof. Robert Aumann, the Nobel Prize winner, said.