Creating MS Word reports using the officer package

Commonly, the final product that a data scientist or a statistician generates is a report, usually in MS Word format. The officer package enables generating such a report from within R. It also enables generating PowerPoint presentations, but this is beyond the scope of this post.

While the package has many great features, using it is not intuitive. The package manual covers no less than 80 pages, describing many functions that control various aspects of Word documents and PowerPoint presentations.

What I really need is to perform a few simple tasks: inserting tables and figures into my Word document, applying a standard format to all of my reports, and adding titles, empty lines and page breaks.

For these needs, I wrote a few functions that enable me to create standardized Word document reports using an intuitive syntax.

In addition to the officer package, the flextable package is also needed:

library(officer)
library(flextable)

The first function creates a new object that represents the Word document to be created. It simply wraps officer’s read_docx() function, with the caveat that if an object with the same name already exists, it is overwritten with a new “clean” object:

# create new word document
new.word.doc=function(){
  my.doc=read_docx()
  return(my.doc)
}

The next two functions add an empty line or a page break:

# add an empty line
add.empty.line=function(doc){
  body_add_par(doc, " ")
  return("empty line added")
}

#add page break
add.page.break=function(doc){
  body_add_break(doc, pos="after")
  return("page break added")
}

The next two functions are used to set the orientation of the next page to landscape, and then back to portrait:

# start landscape
start.landscape=function(doc){
  doc=body_end_section_continuous(doc)
  return("landscape orientation started")
}

# end landscape
end.landscape=function(doc){
  doc=body_end_section_landscape(doc)
  return("landscape orientation ended")
}

This function adds a title. I use the fp_text() function to set the font size to 14 and to make the title bold. I also add an empty line after the title:

# add a title
add.title=function(doc, my.title){
  my.prop=fp_text(font.size = 14, bold = TRUE, font.family = "Times")
  the.title=fpar(ftext(my.title, prop=my.prop))
  body_add_fpar(doc, the.title)
  body_add_par(doc, " ")
  return("title added")
}

The next function is a little bit more complicated. It adds to the document a figure that already exists as an image file in the working directory. The parameters h and w indicate the height and width of the image within the document, in inches. My usual values of choice are h=5 and w=5, so I set them as defaults. I align the figure to the center of the page.

# add an image, such as a jpg or png file
add.image=function(doc, image, h=5, w=5){
  body_add_img(doc, src=image,
               height=h, width=w,
               style="centered")
  return("image added")
}

The last function, add.table(), is the most complicated one. The table to be added is a data frame, and my function assumes that the variables in this data frame are either character, factor or numeric.

There are two parameters for setting the number of decimals to display for the numeric variables: the first receives the names of the numeric variables, and the second contains the number of decimals for each of these variables, respectively. Note that I do not check the inputs of these two parameters; I trust myself to provide the right input.

In addition, the function sets the format of the table (borders, fonts, and header formatting) according to my standard report style, using officer functions such as border_outer(), bold() etc.

# add a data frame as a table
add.table=function(doc, tbl, col.keys=NULL, col.digits=NULL){
  # create basic flextable
  f.table=qflextable(tbl)

  # set numbers of decimals for numeric variables, if specified
  if(!is.null(col.keys)){
    for(j in 1:length(col.keys)){
      f.table=colformat_num(x=f.table,
                            col_keys=col.keys[j],
                            digits=col.digits[j])
    }
  }
  
  # set table borders
  f.table=border_outer(f.table, part="all",
                       border=fp_border(color="black", width = 1))
  f.table=border_inner_h(f.table, part="all",
                         border=fp_border(color="black", width = 1))
  f.table=border_inner_v(f.table, part="all",
                         border=fp_border(color="black", width = 1))
  
  # set fonts
  f.table=font(f.table,  fontname = "Times", part = "all")
  # also set the table's header font as bold
  f.table=bold(f.table, part = "header")
  
  # add the table to the document
  flextable::body_add_flextable(doc, 
                                value = f.table, 
                                align = "left" )
  return("table added")
}

Now we are all set to create the report:

# create a histogram and save it as a png file:
png(filename="histogram.png", width = 6, height = 6, units = 'in', res = 300)
hist(mtcars$wt)
dev.off()

# create a data frame that will become a table in my report
wide.table=mtcars[1:6, ]
wide.table$car=rownames(wide.table)
wide.table=wide.table[, c(12, 1:11)]
narrow.table=wide.table[, 1:4]

# create a new document object
doc=new.word.doc()

# add the report title and an empty line
add.title(doc, "My report")
add.empty.line(doc)

add.title(doc, "narrow table")
add.table(doc, narrow.table, col.keys=c("mpg", "disp"), col.digits=c(1,0))

add.page.break(doc)

# add the histogram with an appropriate title
add.title(doc, "Histogram - portrait")
add.image(doc, "histogram.png", h=3, w=3)

# set the orientation to landscape
start.landscape(doc)

add.title(doc, "narrow table - landscape")
add.table(doc, narrow.table, col.keys=c("mpg", "disp"), col.digits=c(1,0))

add.empty.line(doc)

# add the wide table in landscape page orientation
add.title(doc, "wide table in landscape orientation")
add.table(doc, wide.table, col.keys=c("mpg", "disp"), col.digits=c(1,0))

#set the orientation back to portrait
end.landscape(doc)

add.title(doc, "narrow table")
add.table(doc, narrow.table, col.keys=c("mpg", "disp"), col.digits=c(1,0))

# generate the Word document using the print function
print(doc, target="My report.docx")

Since I cannot upload a Word document to WordPress, here is a screenshot of the report:

Visualization of the Debt/GDP ratio and national debt level

I saw this graph on Twitter a few days ago: [1]

A quick Google search revealed that this is a relatively old graph, from October 2017. On one hand, this is a really cool visualization. On the other hand, it also belongs in Facebook pages such as Trust me, I’m a Statistician or Trust me, I’m a Data Scientist.

What do we see?

This is a kind of a pie chart. In a classic pie chart, the slices are in the form of “triangles”, or more precisely, circular sectors. In this chart the slices have other forms, including various types of triangles and quadrilaterals, other polygons, and shapes whose names I really don’t know. [2]

I admit that this chart pretty much confused me. It presents national debt and Debt/GDP ratio data. Initially I focused on the Debt/GDP ratio, and for some reason I thought that the area of each slice in this chart represents the Debt/GDP ratio of each country, probably because my eye first caught the chart’s footer.

Actually, each slice shows the share of the country’s national debt out of the world’s total debt, so the areas of all slices should sum to 100% [3]. We see clearly that the country with the largest share of debt out of the total world national debt (and therefore the highest absolute debt) is the United States. The country with the second largest share of debt is Japan, and China is third. Look for Italy, Germany, France and the United Kingdom. Can you determine which of the four countries has the biggest share of the total debt just by looking at the area of their slices? Actually, their shares of the total debt are very similar.

The Debt/GDP ratio of each country is expressed by the color of its slice. The lighter the color, the higher the Debt/GDP ratio. You can immediately see that Japan has the highest Debt/GDP ratio, and I believe that most people will recognize that Greece also has a very high Debt/GDP ratio, actually the second highest. Can you spot the country with the third highest ratio? It is Lebanon. Look at the upper right area of the chart. Italy and Portugal, which occupy fourth and fifth place, are more prominent. Can you tell which country has the lowest Debt/GDP ratio?

Now that we understand the data presented in this graph, we can start looking for insights.

This chart is a two-dimensional chart, in the sense that it presents two different variables in the same graph. Such graphs are useful for exploring the relationship between the two variables. So, what is the relationship between the Debt/GDP ratio of a country and its share in the world’s total debts? Can you see anything? I can’t. It is to the credit of the authors that they did not try to discuss this matter at all.

Is there a better way to visualize these data? Of course there is. Let’s play.

I took the world’s Debt/GDP ratio data and the world’s GDP data from Wikipedia. For the purpose of the demonstration, I focused on the 2017 data for the OECD countries. I calculated the absolute debt of each country using the Debt/GDP ratio and the GDP data, and then I calculated each country’s share of the total OECD debt. The data are available here.
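
For reference, the calculation itself is simple arithmetic. Here is a minimal sketch of the data preparation, assuming the raw data sit in a data frame called oecd with a gdp column (the gdp column name is my assumption; debt.gdp.ratio and share.of.debt are the names used in the plotting code below):

# sketch of the debt calculations (the gdp column name is assumed)
# absolute debt = GDP times the Debt/GDP ratio
oecd$debt=oecd$gdp*oecd$debt.gdp.ratio/100
# each country's share of the total OECD debt, in percent
oecd$share.of.debt=100*oecd$debt/sum(oecd$debt)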

The simplest possible visualization for two dimensional data is a scatter plot, although it is not as cool as that pie chart. Let’s forget what we learned by looking at that pie chart, and start from scratch.

This code generates a basic scatter plot of the OECD data:

plot(c(0,250), c(0, 40), axes=FALSE, type="n", xlab="Debt/GDP ratio (%)", ylab="% Share of Debt", main="", cex.main=1)
axis(1, 50*(0:5),cex.axis=0.8)
axis(2,at=10*(0:4), las=1, cex.axis=0.8)
points(oecd$debt.gdp.ratio, oecd$share.of.debt, type="p", pch=16, cex=1, col="black")

The plot clearly shows that there are two outlier dots/countries: one country has a Debt/GDP ratio greater than 200%, and another country has an awfully large debt – its share of the total OECD debt is higher than 30%.

A closer look reveals a country whose Debt/GDP ratio is greater than 150%, and two more countries whose Debt/GDP ratio is about 130%.
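
If you prefer to list these countries rather than eyeball them, a quick check along these lines will do (the cut-off values here are arbitrary; the column names are the same ones used in the plotting code):

# list the outlier countries described above
oecd$country[oecd$debt.gdp.ratio>200]                             # Debt/GDP ratio above 200%
oecd$country[oecd$share.of.debt>30]                               # more than 30% of the total OECD debt
oecd$country[oecd$debt.gdp.ratio>125 & oecd$debt.gdp.ratio<=200]  # the next highest ratios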

Since some economists believe that high debt is bad, and that a high Debt/GDP ratio is even worse, I decided to divide the countries into three groups: [4]

  • The first group includes the countries with either a Debt/GDP ratio greater than 100% or a share of the total debt greater than 10%. These are countries that are in a “bad” economic situation according to these parameters.
  • The second group includes the countries whose Debt/GDP ratio is less than 50% and whose share of the total debt is less than 2%. These are countries that are in a “good” economic situation according to these parameters.
  • The third group includes the rest of the countries.

I decided to paint the dots that represent the countries which are in a “bad” economic situation in red, and to add their names to the graph. I painted the dots that represent the countries which are in a “good” economic situation in green. The rest of the dots are painted orange.

This code generates the improved scatter plot:

w1=which(oecd$debt.gdp.ratio>100 | oecd$share.of.debt>10)
w3=which(oecd$debt.gdp.ratio<50 & oecd$share.of.debt<2)
w2=which(!(1:36 %in% c(w1,w3)))

plot(c(0,250), c(0, 40), axes=FALSE, type="n", xlab="Debt/GDP ratio (%)", ylab="% Share of Debt", main="", cex.main=1)

axis(1, 50*(0:5),cex.axis=0.8)
axis(2,at=10*(0:4), las=1, cex.axis=0.8)

points(oecd$debt.gdp.ratio[w1], oecd$share.of.debt[w1], type="p", pch=16, cex=1, col="red")

text(x=oecd$debt.gdp.ratio[w1],
     y=oecd$share.of.debt[w1], 
     labels=oecd$country[w1], 
     pos=4, cex=0.7)

points(oecd$debt.gdp.ratio[w2], oecd$share.of.debt[w2], type="p", pch=16, cex=1, col="orange")

points(oecd$debt.gdp.ratio[w3], oecd$share.of.debt[w3], type="p", pch=16, cex=1, col="green")

Now we can see that:

  • The Debt/GDP ratio of the “good” countries extends over the entire zero to 50% range, although the Debt/GDP ratio of many countries in this group is close to 50%.
  • The orange-painted countries (the “middle” group) are roughly divided into two subgroups. The countries in the first subgroup have lower debts (as expressed by their percent share of the total debt) and a Debt/GDP ratio in the range of 50 to 75%. The five countries in the second subgroup have higher debts, with no clear pattern for the Debt/GDP ratio.
  • I cannot draw a general conclusion regarding the 6 countries which are in a “bad” economic situation according to my definitions.

Comments

[1] I did some minor edits to the graph for the purpose of my demonstration.

[2] Look for the United Kingdom at the bottom of the chart, for example.

[3] I didn’t check any of the data. I trust the authors that the data are accurate.

[4] I chose the cut points of 10%, 100% etc. according to my best judgment. If you know an economist who has a more accurate method to determine such cut points, please introduce them to me.

How to select a seed for simulation or randomization

If you need to generate a randomization list for a clinical trial, do some simulations or perhaps perform a huge bootstrap analysis, you need a way to draw random numbers. Putting many pieces of paper in a hat and drawing them is possible in theory, but you will probably be using a computer for this. The computer, however, does not generate random numbers. It generates pseudo-random numbers. They look and feel almost like real random numbers, but they are not random. Each number in the sequence is calculated from its predecessor, so the sequence has to begin somewhere; it begins with the seed – the first number in the sequence.
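
To illustrate the idea that each number is derived from its predecessor, here is a toy generator of the classic linear congruential type. This is only a sketch for illustration; it is not the algorithm that R actually uses (by default, R uses the Mersenne Twister):

# a toy pseudo-random number generator: each new value is a fixed
# function of the previous value, so the seed determines the whole sequence
toy.generator=function(seed, n){
  m=2^31-1
  a=16807
  x=numeric(n)
  state=seed
  for (i in 1:n){
    state=(a*state) %% m   # the next state depends only on the current state
    x[i]=state/m           # scale to the interval (0,1)
  }
  return(x)
}

# the same seed always produces the same sequence
toy.generator(seed=12345, n=5)
toy.generator(seed=12345, n=5)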

Knowing the seed is a good idea. It enables reproducing the analysis, the simulation or the randomization list. If you run a clinical trial, reproducibility is crucial. You must know at the end of the trial which patient was randomized to which treatment; otherwise you will have to throw all your data in the garbage. During the years I worked at Teva Pharmaceuticals, we took every possible safety measure: we burned the randomization lists, the randomization SAS code and the randomization seed onto a CD and kept it in a fireproof safe. We also kept all this information on analog media. Yes, we printed the lists, the SAS code and the seed on paper, and these were also kept in the safe.

Using the same seed every time is not a good idea. If you use the same seed every time, you get the same sequence of pseudo-random numbers every time, and therefore your numbers are not pseudo-random anymore. Selecting a different seed every time is good practice.

How do you select the seed? Taking a number out of your head is still not a good idea. Our heads are biased. Like passwords, people tend to use some seeds more often than other possible seeds. Don’t be surprised if you see codes with seeds like 123, 999 or 31415.

The best practice is to choose a random seed, but this creates a circular problem: you need a random number to pick your seed. You can still turn to your old hat and some pieces of paper, but there is a simpler alternative: generate the seed from the computer clock.

This is how you do it in R using the Sys.time() function: get the system time, convert it to an integer, and you’re done. In practice, I take only the last 5 digits of that integer. And of course, I keep the chosen seed on record. I also save the Sys.time() value and its integer value, just in case. Finally, I hardcode the seed I get. This is the R code:

> # set the initial seed as the computer time
> initial_seed=Sys.time()
> print (initial_seed)
[1] "2019-03-16 10:56:37 IST"

> # convert the time into a numeric variable
> initial_seed=as.integer(initial_seed)
> print (initial_seed)
[1] 1552726597

> # take the last five digits of the initial seed
> the_seed=initial_seed %% 100000
> print(the_seed)
[1] 26597

> # do your simulation
> set.seed(26597)
> print(rnorm(3))
[1] -0.4759375 -1.9624872 -0.1601236

> # reproduce your simulation
> set.seed(26597)
> print(rnorm(3))
[1] -0.4759375 -1.9624872 -0.1601236
>

Cancer clusters and the Poisson distribution

On March 1, 2019, an article was published on Israel’s Ynetnews website, under the title “The curious case of the concentration of cancer”. The story reports on a concentration of cancer cases in the town of Rosh Ha’ayin in central Israel.

In the past few years, dozens of cases of cancer have been discovered in the center of Rosh Ha’ayin. About 40 people have already died of the disease. Residents are sure that the cause of the disease is the cellular antennas on the roof of a building belonging to the municipality. “Years we cry and no one listens,” they say. “People die one after the other.”

I do not underestimate the pain of the residents of Rosh Ha’ayin. I also do not intend to discuss the numbers mentioned in the article; I accept them as they are. I just want to address the claim that the cause of the disease is the cellular antennas. It is easy (at least for me) to explain why this claim is at least questionable: there are many more cellular antennas in many other places, and around them there is no high rate of cancer morbidity. If the antennas were carcinogenic, they would cause cancer everywhere, not just in Rosh Ha’ayin. On the other hand, I can’t blame the residents for blaming the antennas. People try to rationalize what they see and find a cause.

I also must emphasize that the case of Rosh Ha’ayin must not be neglected. If it is not because of the cellular antennas, it is possible that there is some other risk factor, and the state authorities must investigate.

So why does Rosh Ha’ayin have such a large cluster of cancer morbidity? One possible answer is that there is an environmental factor (other than the antennas) that does not exist elsewhere. Another possible answer is that there may be a non-environmental factor that does not exist elsewhere, maybe a genetic factor (most of the town’s residents are immigrants from Yemen and their descendants). A third and particularly sad possibility is that the local residents simply suffer from really bad luck. The rules of statistics can be cruel.

Clusters happen: if there are no local (environmental or other) risk factors that cause cancer (or any other disease), and the disease spreads randomly across the whole country, then clusters will still form. If the country is divided into units of equal size, then the number of cases in a given area unit follows a Poisson distribution, and there is a small but not negligible probability that one of these units will contain a large number of cases. The problem is that there is no way of knowing in advance where it will happen.
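
To get a feeling for “small but not negligible”, we can calculate the probability of seeing at least one heavily loaded unit. For example, under a Poisson distribution with a mean of 4 cases per unit (the mean in the simulation below), the chance that at least one of 100 units contains 11 or more cases comes out to roughly 25%:

# probability that a single unit contains 11 or more cases,
# when the mean number of cases per unit is 4
p.single=1-ppois(10, lambda=4)
print(p.single)

# probability that at least one of 100 independent units
# contains 11 or more cases
p.at.least.one=1-(1-p.single)^100
print(p.at.least.one)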

The opposite is also true: If the distribution of the number of cases in a given area unit is a Poisson distribution, then it can be concluded that the dispersion on the surface is random.

I will demonstrate the phenomenon using simulation.

Consider a hypothetical country that has a perfect square shape, and its size is 100 x 100 kilometers. I randomly scattered 400 cases of morbidity across the country:

# generate n random points on 0:100 x 0:100
> set.seed(21534)
> n=400
> x=runif(n)*100
> y=runif(n)*100
> dat=data.frame(x,y)
> head(dat)
         x         y
1 15.73088  8.480265
2 12.77018 78.652808
3 45.50406 31.316797
4 86.46181  6.669138
5 27.25488 48.164316
6 17.42388 98.429575

I plotted the 400 cases.

> # plot the points
> plot(dat$x, dat$y, ylim=c(0,100), xlim=c(0,100), 
+ asp=1, frame.plot=FALSE, axes=FALSE,
+ xlab=' ', ylab=' ', col='aquamarine', pch=16,
+ las=1, xaxs="i", yaxs="i")
> axis(1, at=0:10*10)
> axis(2, at=0:10*10, las=1, pos=0)
> axis(3, at=0:10*10, labels=FALSE)
> axis(4, at=0:10*10, labels=FALSE, pos=100)

Here’s the map I got.

Next, I divided the map into 100 squares, each 10 x 10 kilometers:

> #draw gridlines
> for (j in 1:9){
+ lines(c(0,100), j*c(10,10))
+ lines(j*c(10,10), c(0,100))
+ }
>

In order to count the cases in each square, I assigned row and column numbers for each of the squares, and recorded the position (row and column) for every case/dot:

> # row and column numbers and cases positions
> dat$row=0
> dat$col=0
> dat$pos=0
> for (j in 1:nrow(dat)){
+ dat$row=ceiling(dat$y/10)
+ dat$col=ceiling(dat$x/10)
+ dat$pos=10*(dat$row-1)+dat$col
+ }
>

Now I can count the number of points/cases in each square:

> # calculate number of points for each position
> # ppp=points per position
> dat$count=1
> ppp=aggregate(count~pos, dat, sum)
> dat=dat[,-6]

But of course, it is possible that there are squares with zero cases (in fact, the data frame ppp has only 97 rows). Let’s identify them:

> # add positions with zero counts, if any
> npp=nrow(ppp)
> if(npp<100){
+ w=which(!(1:100 %in% ppp$pos))
+ addrows=(npp+1):(npp+length(w))
+ ppp[addrows,1]=w
+ ppp[addrows,2]=0
+ ppp=ppp[order(ppp$pos),]
+ }
>

And now we can get the distribution of the number of cases in each of the 100 squares:

> # distribution of number of points/cases in each position
> tb=table(ppp$count)
> print(tb)
0  1  2  3  4  5  6  7  8  9 11 
3  9 12 21 15 17 13  5  1  3  1 
>

We see that there is one very unlucky cluster with 11 cases, and there are also 3 squares with 9 cases each. Let’s paint them on the map:

> # identify largest cluster
> mx=max(ppp$count)
> loc=which(ppp$count==11)
> clusters=dat[dat$pos %in% loc,]
> points(clusters$x, clusters$y, col='red', pch=16)
> 
> # identify the second largest cluster/s
> loc=which(ppp$count==9)
> clusters=dat[dat$pos %in% loc,]
> points(clusters$x, clusters$y, col='blue', pch=16)
>

Let’s also mark the squares with zero points/cases. In order to do this, we first need to identify the row and column locations of these squares:

> # identify squares without cases
> # find row and column locations
> loc=which(ppp$count==0)
> zeroes=data.frame(loc)
> zeroes$row=ceiling(zeroes$loc/10)
> zeroes$col=zeroes$loc %% 10
> w=which(zeroes$col==0)
> if(length(w)>0){
+ zeroes$col[w]=10
+ }
> print(zeroes)
  loc row col
1   8   1   8
2  31   4   1
3  99  10   9
>

So there is one empty square in the 8th column of the first row, one in the first column of the 4th row, and one in the 9th column of the 10th row. Let’s mark them. To do that, we need to know the coordinates of each of the four vertices of these squares:

> # mark squares with zero cases
> for (j in 1:nrow(zeroes)){
+ h1=(zeroes$col[j]-1)*10
+ h2=h1+10
+ v1=(zeroes$row[j]-1)*10
+ v2=v1+10
+ lines(c(h1,h2), c(v1,v1), lwd=3, col='purple')
+ lines(c(h1,h2), c(v2,v2), lwd=3, col='purple')
+ lines(c(h1,h1), c(v1,v2), lwd=3, col='purple')
+ lines(c(h2,h2), c(v1,v2), lwd=3, col='purple')
+ }

Do you see any pattern?

How well does the data fit the Poisson distribution? We can perform a goodness of fit test.
Let’s do the log-likelihood chi-square test (also known as the G-test):

> # log likelihood chi square to test the goodness of fit 
> # of the poisson distribution to the data
> 
> # the observed data
> observed=as.numeric(tb)
> values=as.numeric(names(tb))
> 
> # estimate the poisson distribution parameter lambda
> # it is the mean number of cases per square
> lambda=nrow(dat)/100
> print(lambda)
[1] 4
> 
> # calculate the expected values according to 
> # a poisson distribution with mean lambda
> expected=100*dpois(values, lambda)
> 
> # view the data for the chi-square test
> poisson_data=data.frame(values, observed, expected)
> print(poisson_data)
   values observed   expected
1       0        3  1.8315639
2       1        9  7.3262556
3       2       12 14.6525111
4       3       21 19.5366815
5       4       15 19.5366815
6       5       17 15.6293452
7       6       13 10.4195635
8       7        5  5.9540363
9       8        1  2.9770181
10      9        3  1.3231192
11     11        1  0.1924537
> 
> # calculate the degrees of freedom
> df=max(values)
> print(df)
[1] 11
> 
> # calculate the test statistic and p-value
> g2=sum(observed*log(observed/expected))
> pvalue=1-pchisq(g2,df)
> log_likelihood_chi_square_test=data.frame(g2, df, pvalue)
> print(log_likelihood_chi_square_test)
        g2 df    pvalue
1 4.934042 11 0.9343187
>

We cannot reject the hypothesis that the data follows the Poisson distribution. This does not imply, of course, that the data follows the Poisson distribution, but we can say that the Poisson model fits the data well.

The delta method and its implementation in R

Suppose that you have a sample of a variable of interest, e.g. the heights of men in a certain population, and for some obscure reason you are interested not in the mean height μ but in its square μ². How would you do inference on μ², e.g. test a hypothesis or calculate a confidence interval? The delta method is the trick you need.

The delta method is a mathematical assertion that yields estimates for the variance of functions of statistics under mild conditions. Loosely speaking, let bₙ be an estimate of β, where n is the sample size, and suppose that the distribution of bₙ is approximately normal with mean β and variance σ²/n.

Then, if g is a function, g(bₙ) is approximately normal with mean g(β) and variance [g′(β)]²⋅σ²/n, provided that the sample size is large. Don’t let the calculus scare you. R’s deltamethod function will handle the calculus for you.

If bₙ is a Maximum Likelihood Estimate (MLE) of β, then under mild conditions it is approximately normal for a large sample size, and g(bₙ) and g′(bₙ) are MLEs for g(β) and g′(β), respectively. Add on top of this an MLE for σ, and you can implement statistical inference.
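
Before turning to real data, here is a minimal sketch of the heights example above, with made-up numbers, using the deltamethod function of the msm package (the function itself is described in more detail below):

# install.packages("msm")
library(msm)

# a made-up sample of men's heights, in centimeters
set.seed(26597)
heights=rnorm(100, mean=175, sd=7)

# estimate mu and the variance of that estimate
x_bar=mean(heights)
var_x_bar=var(heights)/length(heights)

# delta method for g(x)=x^2: the standard error of x_bar^2
# is approximately |g'(x_bar)|*sqrt(var_x_bar) = |2*x_bar|*sqrt(var_x_bar)
se_mu_squared=deltamethod(g=~x1^2, mean=x_bar, cov=var_x_bar)

# approximate 95% confidence interval for mu^2
c(x_bar^2-1.96*se_mu_squared, x_bar^2+1.96*se_mu_squared)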

I will demonstrate the use of the delta method using the Titanic survival data. For the sake of the example I will use only 3 variables: Survived is a 0/1 variable indicating whether a passenger survived the sinking of the Titanic, Age is obviously the passenger’s age, and Parch is the passenger’s number of parents/children aboard. Let’s take a look at some of the data:

> # install.packages("titanic")
> library(titanic)
> titanic=titanic_train[, c(1, 2, 6,8)]
> head(titanic)
  PassengerId Survived Age Parch
1           1        0  22     0
2           2        1  38     0
3           3        1  26     0
4           4        1  35     0
5           5        0  35     0
6           6        0  NA     0
>

Let π be the probability of surviving. Then the odds of survival are π/(1-π). The sample size, n=891, is considered large, so we can apply the Central Limit Theorem to conclude that the sample proportion p is approximately normal with mean π and variance π(1-π)/n. Then π can be estimated by p=0.384, the odds are estimated by p/(1-p)=0.623, and the variance of p is estimated by 0.000265:

> p_survival=mean(titanic$Survived)
> print(p_survival)
[1] 0.3838384
> survival_odds=p_survival/(1-p_survival)
> print(survival_odds)
[1] 0.6229508
> n=nrow(titanic)
> var_p_survival=(p_survival*(1-p_survival))/n
> print(var_p_survival)
[1] 0.0002654394
>

Let g(x)=x/(1-x). Then g′(x)=1/(1-x)² (you can check this at https://www.wolframalpha.com), so the variance of the odds can be estimated by

Var(p/(1-p)) = [g′(p)]²⋅σ²/n = [1/(1-p)²]²⋅[p(1-p)/n].

This can be implemented in the following R code:

> dev_g_odds=1/(1-survival_odds)^2
> var_odds=(dev_g_odds^2)*var_p_survival
> print(var_odds)
[1] 0.01313328
> se_odds=sqrt(var_odds)
> print(se_odds)
[1] 0.1146005
>

But of course, instead of doing all the calculus, you can use the deltamethod function of R’s msm package.

The function has three parameters:

  • g is a formula object representing the transformation g(x). The formula variables must be labeled x1, x2 and so on.
  • mean is the estimate of β (that is, bₙ)
  • cov is the estimated variance of bₙ, which is σ²/n

Using the deltamethod function provides the same result:

> # install.packages("msm")
> library(msm)
> se_odds_delta=deltamethod(g=~x1/(1-x1), mean=survival_odds, cov=var_p_survival)
> print(se_odds_delta)
[1] 0.1146005
>

The second example considers logistic regression. We will model the (logit of the) probability of survival using Age and Parch. Using R’s glm function we can obtain estimates of the logistic regression coefficients, b, and their standard errors, se_b:

> model=glm(Survived ~ Age + Parch, data=titanic, family=binomial(link="logit"))
> model_beta=data.frame(summary(model)$coefficients[2:3, 1:2])
> names(model_beta)=c("b", "se_b")
> print(model_beta)
                 b        se_b
Age   -0.008792681 0.005421838
Parch  0.191531303 0.090643635
>

Since exp(β) is usually interpreted as the odds ratio (OR) of the response variable with regard to the regression’s independent variable, researchers are interested in inference on exp(β). In order to perform such inference, one needs to estimate the standard error of exp(β). Since b is a Maximum Likelihood Estimate for β, it is approximately normal and its variance is estimated by se_b², so the delta method can be applied.

The calculus in this case is easy: g(x)=exp(x), so g’(x)=exp(x). Therefore, the standard error of exp(β) can be estimated by exp(b)⋅ se_b:

> model_beta$OR=exp(model_beta$b)
> model_beta$se_OR=exp(model_beta$b)*model_beta$se_b
> print(model_beta)
                 b        se_b        OR       se_OR
Age   -0.008792681 0.005421838 0.9912459 0.005374374
Parch  0.191531303 0.090643635 1.2111027 0.109778755
>

To use the deltamethod function, we will first use the vcov function to obtain the variance-covariance matrix of the regression coefficient estimates, and the variances will be the inputs of the deltamethod function:

> vc=vcov(model)[2:3, 2:3]
> print(vc)
               Age        Parch
Age   2.939632e-05 9.236876e-05
Parch 9.236876e-05 8.216269e-03

> model_beta$se_OR_delta[1]=deltamethod(~exp(x1), mean=model_beta$b[1], cov=vc[1,1])

> model_beta$se_OR_delta[2]=deltamethod(~exp(x1), mean=model_beta$b[2], cov=vc[2,2])

> print(model_beta)
                 b        se_b        OR       se_OR se_OR_delta
Age   -0.008792681 0.005421838 0.9912459 0.005374374 0.005374374
Parch  0.191531303 0.090643635 1.2111027 0.109778755 0.109778755
>

Of course, we get the same results.