## Visualization of the Debt/GDP ratio and national debt level

I saw this graph on Twitter a few days ago: [1]

Short googling revealed that this is a relatively old graph from October 2017. On one hand, this is a really cool visualization. On the other hand, it also belongs to Facebook pages such as Trust me, I’m a Statistician or Trust me, I’m a Data Scientist.

What do we see?

This is a kind of a pie chart. In a classic pie chart, the slices are in the form of “triangles”, or more precisely, circular sectors. In this chart the slices have other forms, including various types of triangles and quadrilaterals, other polygons, and shapes whose names I really don’t know. [2]

I admit that this chart confused me quite a bit. It presents national debt and Debt/GDP ratio data. Initially I looked at the Debt/GDP ratio, and for some reason I thought that the area of each slice in this chart represents the Debt/GDP ratio of each country, probably because my eye first caught the chart’s footer.

Actually, each slice shows the share of the country’s national debt out of the world’s total debts, so the areas of all pieces should sum to 100% [3]. We see clearly that the country with the largest share of debt out of the total world national debts (and therefore the highest absolute debt) is the United States. The country with the second largest share of debt is Japan, and China is third. Look for Italy, Germany, France and the United Kingdom. Can you determine which of the four states has a bigger share of the total debt by looking just at the area of their slices? Actually, their shares of the total debt are very similar.

The Debt/GDP ratio of each country is expressed by the color of its slice. The lighter the color, the higher the Debt/GDP ratio. You can immediately see that Japan has the highest Debt/GDP ratio, and I believe that most people will recognize that Greece also has a very high Debt/GDP ratio, actually the second highest ratio. Can you spot the country with the third largest ratio? It is Lebanon.  Look at the upper right area of the chart. Italy and Portugal, which occupy fourth and fifth place, are more prominent. Can you tell which country has the lowest Debt/GDP ratio?

Now that we understand the data presented in this graph, we can start looking for insights.

This chart is a two-dimensional chart, in the sense that it presents two different variables in the same graph. Such graphs are useful for exploring the relationship between the two variables. So, what is the relationship between the Debt/GDP ratio of a country and its share in the world’s total debts? Can you see anything? I can’t. It is to the credit of the authors that they did not try to discuss this matter at all.

Is there a better way to visualize these data? Of course there is. Let’s play.

I took the world’s Debt/GDP ratio data and the world’s GDP data from Wikipedia. For the purpose of the demonstration, I focused on the OECD countries data from 2017. I calculated the absolute debt of each country using the Debt/GDP ratio and the GDP data, and then I calculated for each country its share of the total OECD debt. The data are available here.
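The calculation described above can be sketched in a few lines of R. This is only an illustration with made-up numbers: the data frame `toy` and the columns `gdp` and `debt` are my assumptions, while `debt.gdp.ratio` and `share.of.debt` follow the column names used in the plotting code.

```r
# Hypothetical toy data (made-up numbers, not the Wikipedia figures)
toy <- data.frame(
  country        = c("A", "B", "C"),
  debt.gdp.ratio = c(50, 100, 200),    # Debt/GDP ratio, in percent
  gdp            = c(1000, 2000, 500)  # GDP, in any common currency unit
)

# Absolute debt = (Debt/GDP ratio) x GDP
toy$debt <- toy$debt.gdp.ratio / 100 * toy$gdp

# Share of each country in the total debt, in percent
toy$share.of.debt <- 100 * toy$debt / sum(toy$debt)

print(toy)
```

By construction, the shares sum to 100%.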

The simplest possible visualization for two dimensional data is a scatter plot, although it is not as cool as that pie chart. Let’s forget what we learned by looking at that pie chart, and start from scratch.

This code generates a basic scatter plot of the OECD data:

```r
# Set up an empty plotting region and draw custom axes
plot(c(0, 250), c(0, 40), axes = FALSE, type = "n", xlab = "Debt/GDP ratio (%)", ylab = "% Share of Debt", main = "", cex.main = 1)
axis(1, at = 50 * (0:5), cex.axis = 0.8)
axis(2, at = 10 * (0:4), las = 1, cex.axis = 0.8)
# One dot per OECD country
points(oecd$debt.gdp.ratio, oecd$share.of.debt, type = "p", pch = 16, cex = 1, col = "black")
```

The plot clearly shows that there are two outlier dots/countries: one country has a Debt/GDP ratio greater than 200%, and another has an awfully large debt, with a share of the total OECD debt higher than 30%.

A closer look reveals a country whose Debt/GDP ratio is greater than 150%, and two more countries whose Debt/GDP ratio is about 130%.

Since some economists believe that high debt is bad, and that a high Debt/GDP ratio is even worse, I decided to divide the countries into three groups: [4]

• The first group includes the countries whose Debt/GDP ratio is greater than 100% or whose share of the total debt is greater than 10%. These are countries that are in a “bad” economic situation according to these parameters.
• The second group includes the countries whose Debt/GDP ratio is less than 50% and whose share of the total debt is less than 2%. These are countries that are in a “good” economic situation according to these parameters.
• The third group includes the rest of the countries.

I decided to paint the dots that represent the countries which are in “bad” economic situation in red, and to add their names on the graph. I painted the dots that represent the countries which are in “good” economic situation in green. The rest of the dots are painted in orange.

This code generates the improved scatter plot:

```r
# w1: "bad" group, w3: "good" group, w2: all the other countries
w1 <- which(oecd$debt.gdp.ratio > 100 | oecd$share.of.debt > 10)
w3 <- which(oecd$debt.gdp.ratio < 50 & oecd$share.of.debt < 2)
w2 <- setdiff(seq_len(nrow(oecd)), c(w1, w3))  # 36 OECD countries in total

plot(c(0, 250), c(0, 40), axes = FALSE, type = "n", xlab = "Debt/GDP ratio (%)", ylab = "% Share of Debt", main = "", cex.main = 1)

axis(1, at = 50 * (0:5), cex.axis = 0.8)
axis(2, at = 10 * (0:4), las = 1, cex.axis = 0.8)

# "Bad" countries: red dots, labelled with the country name
points(oecd$debt.gdp.ratio[w1], oecd$share.of.debt[w1], type = "p", pch = 16, cex = 1, col = "red")

text(x = oecd$debt.gdp.ratio[w1],
     y = oecd$share.of.debt[w1],
     labels = oecd$country[w1],
     pos = 4, cex = 0.7)

# "Middle" countries in orange, "good" countries in green
points(oecd$debt.gdp.ratio[w2], oecd$share.of.debt[w2], type = "p", pch = 16, cex = 1, col = "orange")

points(oecd$debt.gdp.ratio[w3], oecd$share.of.debt[w3], type = "p", pch = 16, cex = 1, col = "green")
```

Now we can see that:

• The Debt/GDP ratio of the “good” states extends over the entire zero to 50% range, although the Debt/GDP of many countries in this group is close to 50%.
• The orange painted countries (the “middle” group) are roughly divided into two subgroups. The countries in the first subgroup have lower debts (as expressed by percent share of the total debt) and a Debt/GDP ratio in the range of 50 to 75%. The five countries in the second subgroup have higher debts, with no clear pattern for the Debt/GDP ratio.
• I cannot draw a general conclusion regarding the 6 countries which are in a “bad” economic situation according to my definitions.

[1] I did some minor edits to the graph for the purpose of my demonstration.

[2] Look for the United Kingdom at the bottom of the chart, for example.

[3] I didn’t check any of the data. I trust the authors that the data are accurate.

[4] I chose the cut points of 10%, 100% etc. according to my best judgment. If you know an economist who has a more accurate method to determine such cut points, please introduce him to me.

## A lot of bad statistics in one picture

I found this picture on Twitter, thanks to Kareem Carr who took a screenshot before the person who uploaded it deleted his tweet. I googled and found it on a website named Data Science Central (No link. Sorry). Can you tell what is wrong in this picture?

Well, almost everything is wrong in this picture.

The null hypothesis is that the average number of pizza slices is 4 and the alternative hypothesis is that it is much higher than 4. What does “much higher” mean? Normally the hypotheses would be H0:µ<=4 and H1:µ>4, or maybe H0:µ=4, H1:µ>4, if you are looking for a one-sided hypothesis. I understand “much higher” as µ>>4, or µ>4+Δ. Now the hypotheses are not exhaustive: there is a gap of Δ. If you studied a course on statistical inference and are familiar with the Neyman-Pearson Lemma, it should not be a problem. You can construct a likelihood ratio test. But then you do not need this picture. The way the hypotheses are stated, standard tests such as the z-test or the t-test are not valid.

Collect your data: there is not much information on how to collect the data except for a reminder to make sure that the observations are random. I suppose that the author assumes that the readers are familiar with standard sampling methods. Or maybe there is a “Data Collection Explained in One Picture” somewhere. I am too afraid to google it.

The author states that in this example the average is 5.6. What about the standard deviation and the sample size?

Next in the flowchart: test the result. The author requires setting the significance level, which is good. However, a good statistician will set the significance level before collecting the data. He will also set the desired power and determine the sample size.

Then you should run the test, e.g. chi-square, t-test or z-test. Again, this should be decided before collecting the data.

Next you need to decide if the result is significant. The author suggests four levels of significance according to the value of the p-value. This is totally wrong. The result is either significant or not significant. Statistical significance is not a quantitative property. The author suggests using the p-value not just for determining significance but also as a measure of the strength of evidence.

Oh, and one more thing: if p<=.01 it is also <=.05 and <=0.10.

Now let’s look at the nice graph with the bell-shaped curve, which is almost certainly the Normal Distribution density curve. But look at the Y-axis label: “probability of X number of slices eaten”.

No, it is not probability, it is density, under the assumption that the null hypothesis is true (and this assumption is not stated anywhere).

And it is not the density of the number of slices eaten per visit as the X-axis label says, but the density of the sample mean. Again, one should state that this distribution is derived assuming the null hypothesis is true.

And even if it was a probability, the maximal value of the curve would be much lower than 100%. In fact, the probability of a customer eating less than one slice is zero (otherwise, why would he pay for the pizza in the first place?), and as far as the restaurant is concerned, if the customer ate 1.5 slices, it counts as two slices – there is no recycling. In any case, the sample mean cannot be lower than 1. This brings me back to the statistical test. The variable under consideration is a count variable, therefore there are better alternatives to test the hypotheses, even though a normal approximation should work if the sample size is sufficiently large.

Next is the p-value calculation. The calculated p-value is 0.049. This means that if the mean under the null is 4, and the sample mean is 5.6, then we can back calculate and find that the standard error is 0.967 and the 95% significance threshold is 5.591. That’s fine.

Finally, the explanation of what the p-value actually is, is clumsy. The term “Probability value (Observed)” may lead someone to think that this is the probability of the observed sample mean. However, the author states correctly that the p-value is the area (under the normal curve) to the right of the observed result.
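The back calculation can be checked with a few lines of R. This is a sketch under my own assumptions: a one-sided z-test, and a p-value of 0.049, which is the value consistent with the standard error and the threshold quoted above. The variable names are mine.

```r
mu0   <- 4      # mean under the null hypothesis
xbar  <- 5.6    # observed sample mean
p     <- 0.049  # one-sided p-value (assumed, see the note above)
alpha <- 0.05   # significance level

# z-score that leaves an area of p to its right under the normal curve
z <- qnorm(1 - p)

# Standard error implied by the observed mean and its z-score
se <- (xbar - mu0) / z

# Smallest sample mean that is significant at the 5% level
threshold <- mu0 + qnorm(1 - alpha) * se

round(c(se = se, threshold = threshold), 3)
```

The two numbers should come out close to the 0.967 and 5.591 mentioned above.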

So what have we got here? A poor example, negligent descriptions of the statistical process and concepts (to say the least), a somewhat reasonable explanation of the p-value calculation, no explanation of what the p-value actually is and what it means, and finally abuse of the p-value and its interpretation.

And a personal note: why would I care? Someone is wrong on the internet. So what?

I care because I care about statistics. There are similar pictures, blog posts and YouTube videos all over the internet. And they cause bad statistics and bad science. P-values and other statistical concepts are abused on a daily basis, even in leading journals such as JAMA. And bad statistics lead to loss of money if we are lucky and to loss of lives if we are not so lucky. So I care. And you should care too.

Credit: some of the issues discussed in the post were brought up by Kareem and other Twitterers.

## We must abandon the p-value

Just a couple of weeks ago I wrote here that we should not abandon statistical significance. No, I did not change my mind. We should not abandon statistical significance but we must abandon the p-values.

Executive summary: p-values make a false impression that statistical significance is a continuous concept, but it is not. It is dichotomous.

You are welcome to read on.

I will not get into a lengthy discussion on the difference between Fisher’s significance testing and Neyman and Pearson’s hypothesis testing. In short: Fisher had a null hypothesis, a test statistic and a p-value. The decision whether or not to reject the null hypothesis based on the p-value was flexible. Indeed, in one place Fisher suggested using the 5% threshold for deciding significance, but in other places he used different thresholds.

Neyman and Pearson’s approach is totally different. In their framework there are two hypotheses: the null hypothesis and the alternative hypothesis. The Neyman-Pearson lemma provides us with a tool to construct a decision rule. Given data, you evaluate the likelihood of these data in two cases: once assuming that the null hypothesis is true, and then assuming that the alternative hypothesis is true. You reject the null hypothesis in favor of the alternative if the ratio of the two likelihoods exceeds a pre-defined threshold. This threshold is determined by the significance level: an acceptable probability of falsely rejecting the null hypothesis (in a frequentist manner) that meets your scientific standards.

When I was a young undergraduate student, we did not have easy access to computers at my university (yes, I am that old). So we calculated the test statistic (Z, t, F, chi-square) and looked in a table similar to this one to see if our test statistic exceeded the threshold.

Fisher was a genius, and so was Karl Pearson. They developed statistical tests such as F and Chi-square based on geometrical considerations. But when you try to construct a test for equality of means as in ANOVA, or a test for independence of two categorical variables, by using the NP lemma, you get the same tests. This opened a computational shortcut: the NP decision rule is equivalent to a p-value decision rule. You can decide to reject the null by comparing your test statistic to the value you look up in a table, or by comparing the p-value to your pre-determined significance level. You will get the same result either way. But still, you either reject the null hypothesis or you don’t. The size of the test statistic or the p-value does not matter. There are no such things as “nearly significant”, “almost significant”, “borderline significant” or “very significant”. If your pre-determined significance level is 0.05, and your p-value is 0.051, that’s really depressing. But if you declare that your result is “almost significant”, it means that you changed your decision rule after the fact.
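The equivalence of the two decision rules is easy to see in R. The following is a minimal sketch for a one-sided t-test; the test statistic and the degrees of freedom are hypothetical numbers that I made up for the illustration.

```r
alpha  <- 0.05  # pre-determined significance level
t.stat <- 2.10  # hypothetical observed t statistic
df     <- 20    # hypothetical degrees of freedom

# Rule 1: compare the test statistic to the value from the table
critical        <- qt(1 - alpha, df)
reject.by.table <- t.stat > critical

# Rule 2: compare the p-value to the significance level
p.value          <- pt(t.stat, df, lower.tail = FALSE)
reject.by.pvalue <- p.value < alpha

c(table = reject.by.table, p.value = reject.by.pvalue)

# The two rules agree for a whole grid of possible test statistics
stats <- seq(0, 4, by = 0.25)
all((stats > critical) == (pt(stats, df, lower.tail = FALSE) < alpha))
```

Either way the outcome is a yes/no decision. Nothing in the comparison says how far the statistic is from the threshold.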

Let me illustrate it with a story my teacher and mentor, Professor Zvi Gilula, told me once.

Someone robbed a convenience store in the middle of nowhere. The police caught him and brought him to court. There were three eye witnesses, and a video from the security camera. The jury found him guilty and he was sentenced to serve 6 months in jail. In statistical jargon, the null hypothesis states that the defendant is not guilty, and the jury rejected the null hypothesis.

Let’s look at another robber. This man robbed a convenience store in New York City, next to Yankee Stadium. The robbery occurred when the game just ended, the Yankees won, 47,000 happy fans witnessed the robbery, and they all came to testify in court, because the poor robber was a Red Sox fan. The jury found him guilty.

Is the second robber guiltier than the first robber? Is the first robber “almost not guilty” or “borderline guilty”? Is the second robber “very guilty”? What would be the right punishment to the second robber? 10 years? Life imprisonment? Death sentence? Do you think now that if you were in the jury of the first robber you would acquit him?

Here is what we need to do if we really want to improve science: when publishing results, one should provide all the necessary information, such as data, summary statistics, test statistics and more. And then state: “we reject the null hypothesis at the 5% significance level”, or maybe “we do not reject the null hypothesis at the 5% significance level” (or any other reasonable and justified significance level). No p-values. And if someone wants to know if the null hypothesis will be rejected at the 5.01% significance level, they can go and calculate their own p-value. P-values should be abandoned.

## We should not abandon statistical significance

The call to ban statistical significance is not new. Several researchers, the most notable of them being Andrew Gelman from Columbia University, have been promoting this idea in the last few years. What is new is the recent article in the prestigious journal Nature, which brought up the issue again. Actually, it is a petition signed by about 800 researchers, some of whom are statisticians (and yes, Gelman signed this petition too). Gelman and Blake McShane from Northwestern University, who is one of the lead authors of the current petition, also expressed this idea in another provocative article in Nature titled “Five ways to fix statistics”, published in November 2017.

I think that it is not a good idea. It is not throwing the baby out with the bathwater. It is attempting to take the baby and throw it out of the sixth-floor window.

Let me make it clear: I do not oppose taking into account prior knowledge, plausibility of mechanism, study design and data quality, real-world costs and benefits, and other factors (I actually quoted here what Gelman and McShane wrote in the 2017 article). I think that when editors are considering a paper that was submitted for publication, they must take all of this into account, provided that the results are statistically significant.

“Abandoning statistical significance” is just a nice phrase for abandoning the Neyman-Pearson Lemma. The lemma guarantees that likelihood ratio tests control the rate of the false positive results at the acceptable significance level, which is currently equal to 5%. The rate is guaranteed in a frequentist manner. I do not know how many studies are published every year. Maybe tens of thousands, maybe even more. If all the researchers are applying the Neyman-Pearson theory with a significance level of 5%, then the proportion of the false positive results approaches 0.05. No wonder that Gelman, who is a Bayesian statistician, opposes this.
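The frequentist guarantee can be illustrated with a small simulation. This is only a sketch under my own assumptions: 10,000 hypothetical studies of 30 observations each, normally distributed data, a one-sample t-test, and a null hypothesis that is true in every single study.

```r
set.seed(1)
n.studies <- 10000  # hypothetical number of studies
n         <- 30     # hypothetical sample size of each study
alpha     <- 0.05   # significance level

# Every study tests H0: mu = 0 on data for which H0 is actually true
p.values <- replicate(n.studies, t.test(rnorm(n), mu = 0)$p.value)

# Proportion of studies that (falsely) reject the null hypothesis
mean(p.values < alpha)
```

The proportion of false rejections should be close to 0.05, and it gets closer as the number of simulated studies grows.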

Once statistical significance is abandoned, bad things will happen. First, the rate of false positive published results will rise. On the other hand, there is no guarantee that the rate of false negative results (that are not published anyway) will fall. The Neyman-Pearson Lemma provides protection against false negative results as well. Granted, you need to have an appropriate sample size to control it. But if you don’t care for statistical significance anymore, you will not care for controlling the false negative rate. Researchers will get an incentive for performing many small sample studies. Such studies are less expensive, and have higher variation. The chance of getting a nice and publishable effect is larger. This phenomenon is known as “The law of small numbers”. What about external validity and reproducibility? In a world of “publish or perish”, nobody seems to care.

And this brings us to the question of how one would determine if an effect is publishable. I can think of two possible mechanisms. One possibility is to calculate a p-value and apply a “soft” threshold. P is 0.06? That’s fine. What about 0.1? It depends. 0.7? No way (I hope). A second possibility is the creation of arbitrary rules of thumb that have no theoretical basis. In this paper, for example, the researchers expressed the effects in terms of Hedges’ g, and considered an effect as meaningful if g>1.3. Why 1.3? They did not provide any justification.
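Here is a small simulation of the small-studies effect. It is a sketch with made-up numbers: the true effect of 0.2 standard deviations, the sample sizes of 10 and 200, and the cut-off of 0.5 for an “impressive” effect are all my assumptions.

```r
set.seed(2)
true.effect <- 0.2   # hypothetical true effect, in standard deviation units
n.studies   <- 5000  # number of simulated studies of each size

# Observed effect (the sample mean) in each simulated study of size n
observed.effects <- function(n) {
  replicate(n.studies, mean(rnorm(n, mean = true.effect, sd = 1)))
}

small <- observed.effects(10)   # many small studies
large <- observed.effects(200)  # the same number of large studies

# Spread of the observed effects in small and in large studies
c(sd.small = sd(small), sd.large = sd(large))

# Share of studies showing an "impressive" effect above 0.5
c(big.small = mean(small > 0.5), big.large = mean(large > 0.5))
```

The small studies show a much larger spread, so a noticeable share of them report an effect far above the true one, while the large studies almost never do.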

I worked many years in the pharmaceutical industry, and now I work in a research institute affiliated with a large healthcare organization. False positive results, as well as false negative results, can lead to serious consequences. Forget about the money involved. It can be a matter of life and death. Therefore, I care for the error rates, and strongly support hypothesis testing and insisting on statistical significance. And you should care too.

## How to make children eat more vegetables

Let’s start from the end: I do not know how to make children eat more vegetables or even eat some vegetables. At least with my children, success is minimal. But two researchers from the University of Colorado had an idea: serve the children their vegetables on plates with pictures of vegetables. To test whether the idea works, they conducted an experiment whose results were published in the prestigious journal JAMA Pediatrics. Because the results have been published, you can guess that the result of the experiment was positive. But did they really prove that their idea works?

### Design of the experiment and its results

18 kindergarten and school classes (children aged 3–8) were selected in one of the suburbs of Denver. At first the children were offered fruits and vegetables on white plates. In each class a bowl of fruits and a bowl of vegetables were placed, and each child took fruit and vegetables for himself or herself and ate them as he or she pleased. The weights of the vegetables and fruits were recorded before they were served to the children, and when the children had finished their meal, the researchers weighed the remaining fruits and vegetables. The difference between the weights (before and after the meal) was divided by the number of children, and thus the average amount of fruit and vegetables each child ate was obtained. Fruit and vegetable averages were also calculated separately. The researchers repeated these measurements three times per class.

After a while, the measurements were repeated the same way, but this time the children were given plates with pictures of vegetables and fruits. The result was an average increase of 13.82 grams in vegetable consumption among children between 3 and 5 years of age. This result is statistically significant. In percentages it sounds much better: this is an increase of almost 47%.
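A quick check in R shows what these two numbers imply about the starting point. This is only rough arithmetic: the increase is reported as “almost 47%”, so the baseline computed here is an approximation and the variable names are mine.

```r
increase.grams   <- 13.82  # reported average increase, in grams
increase.percent <- 47     # reported relative increase ("almost 47%")

# Average consumption before the intervention implied by the two numbers
baseline <- increase.grams / (increase.percent / 100)
round(baseline, 1)
```

In other words, the children started from roughly 30 grams of vegetables per meal.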

So, what’s the problem? There are several problems.

### First problem — extra precision

I will start with what is seemingly not a problem, but a warning: over-precision. When super precise results are published, you have to start worrying. I would like to emphasize: I mean precision, not accuracy. Accuracy refers to the distance between the measured value and the real, unobserved value, and is usually measured by standard deviation or confidence interval. The issue here is about precision: the results are reported at the level of two decimal places; they are very precise. I’m not saying it’s not important, but from my experience, when someone exaggerates, you have to look more thoroughly at what’s going on. Precision of two digits after the decimal point when it comes to grams seems excessive to me. You can of course think differently, but that’s the warning signal that made me read the article to the end and think about what was described in it.

### Second problem — on whom was the experiment conducted?

The second problem is much more fundamental: the choice of the experimental unit, or unit of observation. The experimental units here are the classrooms. The observations were made at the class level. The researchers measured how many vegetables and fruits were eaten by all the children in the class. They did not measure how many vegetables and fruits each child ate. Although they calculated an average per child, I suppose everyone knows that the average alone is a problematic measure: it ignores the variation between the children. Before the experimental intervention, each child ate an average of about 30 grams of vegetables at a meal, but I do not think anyone will disagree with the statement that each child ate a different amount of vegetables. What is the standard deviation? We do not know, and the researchers do not know, but this is essential, because the variation between the children affects the final conclusion. Because the researchers ignored (regardless of the reason) the variation between the children, they practically assumed that the variance was very low, in fact zero. Had the researchers considered this variation, the conclusions of the experiment would have been different: the confidence intervals would have been different, and wider than the confidence intervals calculated by the researchers.

Another type of variance that was not considered is the variation within children. Let me explain: even if we watched one child and saw that on average he ate 30 grams of vegetables at every meal, at different meals he eats a different amount of vegetables. The same question arises again: What is the standard deviation? This standard deviation also has an impact on the final conclusion of the experiment. Of course, each child has a different standard deviation, and this variability should also be taken into consideration.

A third type of variation that was not considered is the variation between children of different ages: it is reasonable to assume that an 8-year-old will react differently to a painted plate than a 3-year-old. An 8-year-old will probably eat more vegetables than a 3-year-old.

I think that the researchers did not pay attention to all these issues. The words variation, adjust or covariate do not appear in the article. Because the researchers ignored these sources of variation, the confidence intervals they calculated are too narrow to reflect the real differences between the children and the types of successes.

Finally, although the experimental unit was the class, the results were reported as if the measurements were made at the child level. In my opinion, this also shows that the researchers were not aware of the variation between and within the children. For them, class and child are one and the same.

### Third problem — what about the control?

There is no control group in this experiment. At first sight, there is no problem: according to the design of the experiment, each class constitutes its own control group. After all, the children received the vegetables on white plates as well as on plates with paintings of vegetables and fruits. But I think that’s not enough.

There are lots of types of plates for children, with drawings of Bob the Builder, Disney characters, Adventure Bay, Thomas the engine, and the list goes on. Could it be that the change was due to the very fact of the paintings themselves, and not because they are paintings of vegetables and fruits? Maybe a child whose meal is served on a plate with pictures of his favorite superhero will eat even more vegetables? The experimental design does not answer this question. A control group is needed. In my opinion, two control groups are needed in this experiment. In one of them the children initially get white plates, and then plates of Thomas the engine, Disney or superheroes, depending on their age and preferences. In the second control group there will be children who will initially receive “ordinary” plates (i.e. Thomas, Disney, etc.) and then plates with paintings of vegetables and fruits.

### Fourth problem — subgroup analysis

Although the age group of the children in the study was 3–8, the researchers discuss the results for children aged 3–5. What happened to the children aged 6–8? Was the analysis for the two (or more) age groups pre-defined? The researchers do not provide this information.

### Fifth problem — what does all this mean?

First, it was found that there was a statistically significant change in the consumption of vegetables, but no significant change was observed in the consumption of fruit. The researchers referred to this in a short sentence: a possible explanation, they said, is the ceiling effect. Formally they are right. The ceiling effect is a statistical phenomenon, and that is what happened here. The really important question they did not answer: Why did this effect occur?

And the most important question: Is the significant change also meaningful? What does the difference of 14 grams (sorry, 13.82 grams) mean? The researchers did not address this question. I’ll give you some food for thought. I went to my local supermarket and weighed one cucumber and one tomato (yes, it’s a small sample, I know). The weight of the cucumber was 126 grams, and the weight of the tomato was 124 grams. In other words, each child ate on average an extra half a bite of a tomato or a cucumber. Is this amount of vegetables meaningful in terms of health and / or nutrition? The researchers did not address this question, nor did the editors of the journal.
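To put the 13.82 grams in proportion, here is the arithmetic in R, using the weights of my (very small) supermarket sample. The variable names are mine.

```r
extra.grams <- 13.82  # reported average increase in vegetable consumption
cucumber    <- 126    # weight of one cucumber, in grams
tomato      <- 124    # weight of one tomato, in grams

# The extra amount eaten, as a percentage of one whole vegetable
share <- 100 * extra.grams / c(cucumber = cucumber, tomato = tomato)
round(share, 1)
```

The extra amount is roughly a tenth of a single cucumber or tomato.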

### Summary

It is possible that plates with vegetables and fruit paintings cause children to eat more vegetables and fruits. This is indeed an interesting hypothesis. The study that was described here does not answer this question. The manner in which it was planned and implemented does not allow even a partial answer to this question, apparently due to the lack of basic statistical thinking.