Cancer clusters and the Poisson distributions

On March 1, 2019, an article was published in Israel’s Ynetnews website, under the title “The curious case of the concentration of cancer”. The story reports on a concentration of cancer cases in the town of Rosh Ha’ayin in central Israel.

In the past few years dozens of cases of cancer have been discovered in the center of Rosh Ha’ayin. About 40 people have already died of the disease. Residents are sure that the cause of the disease is cellular antennas on the roof of a building belonging to the municipality. “Years we cry and no one listens”, They say, “People die one after the other.”

I do not underestimate the pain of the residents of Rosh Ha’ayin. I also do not intend to discuss the numbers mentioned in the article. I accept them as they are. I just want to relate only to the claim that the cause of the disease is cellular antennas. It is easy (at least for me) to explain why this claim is at least questionable: there are many more cellular antennas in many places, and around them there is no high rate of morbidity in cancer. If the antennas are carcinogenic, then they need cancer everywhere, not just in Rosh Ha’ayin. On the other hand, I can’t blame them for blaming the antennas. People try to rationalize what they see and find a cause.

I also must emphasize that the case of Rosh Ha’ayin must not be neglected. If it is not because the cellular antennas, it is possible that there is some other risk factor, and the state authorities must investigate.

So why does Rosh Ha’ayin have such a large cluster of cancer morbidity? One possible answer is that there is an environmental factor (other than the antennas) that does not exist elsewhere. Another possible answer is that there may be a non-environmental factor that does not exist elsewhere, maybe a genetic factor (most of the town’s residents are immigrants from Yemen and their descendants). A third and particularly sad possibility is that the local residents suffer from really bad luck. Statistics rules can be cruel.

Clusters happen: if there are no local (environmental or other) risk factors that cause cancer (or any other disease), and the disease spreads randomly across the whole country, then clusters are formed. If the country is divided into units of equal size, then the number of cases in a given area unit follows a Poisson distribution. Then there is a small, but not negligible possibility that one of these units will contain a large number of cases. The problem is that there is no way of knowing in advance where it will happen.

The opposite is also true: If the distribution of the number of cases in a given area unit is a Poisson distribution, then it can be concluded that the dispersion on the surface is random.

I will demonstrate the phenomenon using simulation.

Consider a hypothetical country that has a perfect square shape, and its size is 100 x 100 kilometers. I randomized 400 cases of morbidity across the country:

# generate n random points on 0:100 x 0:100
> set.seed(21534)
> n=400
> x=runif(n)*100
> y=runif(n)*100
> dat=data.frame(x,y)
> head(dat)
x y
1 15.73088 8.480265
2 12.77018 78.652808
3 45.50406 31.316797
4 86.46181 6.669138
5 27.25488 48.164316
6 17.42388 98.429575

I plotted the 400 cases.

> # plot the points
> plot(dat$x, dat$y, ylim=c(0,100), xlim=c(0,100), 
+ asp=1, frame.plot=FALSE, axes=FALSE,
+ xlab=' ', ylab=' ', col='aquamarine', pch=16,
+ las=1, xaxs="i", yaxs="i",
+ )
> axis(1, at=0:10*10)
> axis(2, at=0:10*10, las=1, pos=0)
> axis(3, at=0:10*10, labels=FALSE)
> axis(4, at=0:10*10, labels=FALSE, pos=100)

Here’s the map I got.

Next, I divided the map into 100 squares, each 10 x 10 kilometers:

> #draw gridlines
> for (j in 1:9){
+ lines(c(0,100), j*c(10,10))
+ lines(j*c(10,10), c(0,100))
+ }
>

In order to count the cases in each square, I assigned row and column numbers for each of the squares, and recorded the position (row and column) for every case/dot:

> # row and column numbers and cases positions
> dat$row=0
> dat$col=0
> dat$pos=0
> for (j in 1:nrow(dat)){
+ dat$row=ceiling(dat$y/10)
+ dat$col=ceiling(dat$x/10)
+ dat$pos=10*(dat$row-1)+dat$col
+ }
>

Now I can count the number of points/cases in each square:

> # calculate number of points for each position
> # ppp=points per position
> dat$count=1
> ppp=aggregate(count~pos, dat, sum)
> dat=dat[,-6]

But of course, it is possible that there are squares with zero cases; (actually the data frame ppp has only 97 rows). Let’s identify them:

> # add positions with zero counts, if any
> npp=nrow(ppp)
> if(npp<100){
+ w=which(!(1:100 %in% ppp$pos))
+ addrows=(npp+1):(npp+length(w))
+ ppp[addrows,1]=w
+ ppp[addrows,2]=0
+ ppp=ppp[order(ppp$pos),]
+ }
>

And now we can get the distribution of number of cases in each of the 100 squares:

> # distribution of number of points/cases in each position
> tb=table(ppp$count)
> print(tb)
0  1  2  3  4  5  6  7  8  9 11 
3  9 12 21 15 17 13  5  1  3  1 
>

We see that there is one very unlucky cluster with 11 cases, and there also 3 squares with 9 cases each. Let’s paint them on the map:

> # identify largest cluster
> mx=max(ppp$count)
> loc=which(ppp$count==11)
> clusters=dat[dat$pos %in% loc,]
> points(clusters$x, clusters$y, col='red', pch=16)
> 
> # identify second lasrgest cluster/s
> loc=which(ppp$count==9)
> clusters=dat[dat$pos %in% loc,]
> points(clusters$x, clusters$y, col='blue', pch=16)
>

Let’s also mark the squares with zero points/cases. In order to do this, we first need to identify the row and column locations of these squares:

> # identify sqaures without cases
> # find row and column locations
> loc=which(ppp$count==0)
> zeroes=data.frame(loc)
> zeroes$row=ceiling(zeroes$loc/10)
> zeroes$col=zeroes$loc %% 10
> w=which(zeroes$col==0)
> if(length(w)>0){
+ zeroes$col[w]=10
+ }
> print(zeroes)
loc row col
1 8 1 8
2 31 4 1
3 99 10 9
>

So there is one empty square in the 8th column of the first row, one in the first column of the 4th row, and one in the 9th column of the 10th row. Let’s mark them. To do that, we need to know the coordinates of each of the four vertices of these squares:

# mark squares with zero cases
> for (j in 1:nrow(zeroes)){
+ h1=(zeroes$col[j]-1)*10
+ h2=h1+10
+ v1=(zeroes$row[j]-1)*10
+ v2=v1+10
+ lines(c(h1,h2), c(v1,v1), lwd=3, col='purple')
+ lines(c(h1,h2), c(v2,v2), lwd=3, col='purple')
+ lines(c(h1,h1), c(v1,v2), lwd=3, col='purple')
+ lines(c(h2,h2), c(v1,v2), lwd=3, col='purple')
+ }

Do you see any pattern?

How well does the data fit the Poisson distribution? We can perform a goodness of fit test.
Let’s do the log-likelihood chi-square test (also known as the G-test):

> # log likelihood chi square to test the goodness of fit 
> # of the poisson distribution to the data
> 
> # the obserevd data
> observed=as.numeric(tb)
> values=as.numeric(names(tb))
> 
> # estimate the poisson distribution parameter lambda
> # it is the mean number of cases per square
> lambda=nrow(dat)/100
> print(lambda)
[1] 4
> 
> # calculate the expected values according to 
> # a poisson distribution with mean lambda
> expected=100*dpois(values, lambda)
> 
> # view the data for the chi-square test
> poisson_data=data.frame(values, observed, expected)
> print(poisson_data)
values observed expected
1 0 3 1.8315639
2 1 9 7.3262556
3 2 12 14.6525111
4 3 21 19.5366815
5 4 15 19.5366815
6 5 17 15.6293452
7 6 13 10.4195635
8 7 5 5.9540363
9 8 1 2.9770181
10 9 3 1.3231192
11 11 1 0.1924537
> 
> # calculate the degrees of freedom
> df=max(values)
> print(df)
[1] 11
> 
> # calculate the test statistic and p-value
> g2=sum(observed*log(observed/expected))
> pvalue=1-pchisq(g2,df)
> log_likelihood_chi_squrae_test=data.frame(g2, df, pvalue)
> print(log_likelihood_chi_squrae_test)
g2 df pvalue
1 4.934042 11 0.9343187
>

We cannot reject the hypothesis that the data follows the Poison distribution. This does not imply, of course, that the data follows the Poisson distribution, but we can say that the Poisson model fits the data well.

9 thoughts on “Cancer clusters and the Poisson distributions

  1. Francis 2019-03-05 / 23:54

    I freely admit to the math being beyond my capabilities. Curious though– does your analysis also apply to conclusions such as, “veterans have twice as many suicides as non-veterans”? It seems to me that your 3 explanations for clustering (e.g., environmental, genetic, chance) might argue against a “military service increases suicide risk” explanation of a finding that comes from “binning” people according to some demographic characteristic. How would one go about testing the distribution of cases of veteran suicides–with some sort of national graphing routine? Is there some way to know how small the squares would have to be to define replicable or valid clusters (assuming that’s even an issue)?

    Thanks,

    Francis

    Like

    • Yossi Levy 2019-03-06 / 10:20

      These are excellent questions. I will try my best to answer 🙂
      First, let me clarify that my explanations for the clustering are no more than educated guesses, and there might be other posdible explanations.
      Regarding comparison of population: it all comes down to testing whether there is a significant and meaningful differences between the relevant parameters. For example, it is well known that Multiple Sclerosis relapses are generated in a Poisson process (at the patient level). So if you want to test if a new treatment is effective, you need to compare the relapse rates (“lambdas”) in a treated population and in an untreated population. This can be done in a setting of a controlled trial or in an observational study – if you are dealing with veteran suicides.
      Regarding the size of the squares (or any other unit in space or time): in theory the size does not matter. In practice you need to take into account your crude estimate of the process rate.

      Like

  2. Brandon 2019-03-06 / 03:50

    This was a great post, I always love seeing demonstrations of the Poisson distribution.

    Like

  3. Harry Pierson 2019-03-06 / 07:10

    In actuality, what you are “testing” is how good your random number generator is. You are conducting binomial trials, you are “flipping” 400 coins, where the probability of success (heads) is 1/100, and counting the number of heads.

    “Heads” in this case means the coin landed in a particular box.The expected number of heads in each box is 4, and the actual number in any specific box falls between 0 and 400 according to the binomial distribution. In particular, the probability no points in a box is

    It is well known that the Poisson distribution is a good approximation to the binomial distribution in this case (large number of trials n, small probability of success p, and the product np is not large.)

    The Poisson approximation to the binomial predicts there will be 1.83 cells out of 100 with no points, very close to the 3 that you observe.

    Click to access Poisson.pdf

    Beyond that, my problem with your approach is that the “map” has to consist of regions with equal populations, not equal areas.

    Otherwise, you can obviously expect more cases of cancer in areas with the highest population densities.

    Like

    • Yossi Levy 2019-03-06 / 09:58

      Thank you for all the insightful comments. I agree with all what you say, except for the claim that the process is binomial/multinomial. The two distributions are related, bit the data generation process is different. In any case, this is a different debate, and I might write about it in the future.

      I would like to emphasize that this was only a demonstration of what happens in a “clean” Poisson process. Life is much more complicated, of course.

      Like

      • Royi Avital 2019-03-11 / 10:40

        Hello,
        It seems to me that Harry’s analysis is right.
        Hence I don’t understand the line:

        “If the country is divided into units of equal size, then the number of cases in a given area unit follows a Poisson distribution. ”

        Can you derive the Poisson Distribution from the simulation you did or is it like Harry claimed – Only approximation?

        Like

  4. Mauro Cafundó de Morais 2019-03-07 / 22:34

    Hi there, thanks for the post!

    I’m not familiar with G-test, but according to Wikipedia, should it be something like:
    “`g2 = 2 * sum(observed * log(observed / expected))“` ?

    Cheers

    Like

You are welcome to comment

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.