
title: Multistate models and trial data
author: Terry Therneau, Mayo Clinic

```{r, echo=FALSE}
library(survival)
knitr::opts_chunk$set(echo = FALSE, comment="")
table2 < function(...) table(..., useNA="ifany")
palette("OkabeIto")
```
# Introduction
This is an annotated copy of the .Rmd file that generated my slides for
the talk at the Clinical Trials Conference at the University of Pennsylvania,
April 17 2023. It is intended to be useful for anyone who wants to
replicate the analyses done there.
## Backdrop
 "A single yes/no is the least amount of information that can be gleaned
from a patient". Charlie Odoroff
 "A single yes/no pvalue is the least amount that can be learned from a
research study".
 Learn as much as you can
## Multistate models
 Natural extension of the PH model
 They help me to glean more understanding about the disease process under study
 Disease processes are multifactorial, the path is much more interesting than
any single way station.
 Easy to use
The colon cancer data set is found in the survival package.
Each subject has 2 rows, one for time to recurrence, the other for time
to death. Split this into two data sets, and then create the
progression free survival time.
```{r, colon0}
c1 < subset(colon, etype== 1)
c2 < subset(colon, etype== 2)
pfs < ifelse(c1$status==1, 1, c2$status)
pfstime < ifelse(c1$status==1, c1$time, c2$time)
pfit < coxph(Surv(pfstime, pfs)~ rx + strata(extent, node4), c1)
```
## Colon cancer data
 929 patients: control, levamiole, levamisole + 5FU
 time to recurrence and death
 HR for progression free survival (PFS): 1, .97, .62 (.50.78)
```{r, colon1}
psurv < survfit(Surv(pfstime, pfs) ~ rx, c1)
plot(psurv, fun="event", xscale=365.25, lwd=2, col=1:3,
xlab="Years post randomization", ylab="Relapse or death")
legend(1400, .3, c("Control", "Levamisole", "Lev + 5FU"), lwd=2, lty=1,
col=1:3, bty='n')
```
Make the multistate data set.
Each subject is described by a set of rows with variable of subject id,
time1, time2, state, and covariates.
The primary rule for such a data set
is that each the rows for each subject describe a physically possible
path through the states.
 no overlapping intervals (can't be two places at once)
 no gaps in the time course (have to be somewhere)
 if you enter a state, the time in state is >0 (no teleporting)
The tmerge routine is a tool to help create such data sets, but it is not
necessary. Other than the state or "status" variable, the data has the same
form as timeddependent covariate data sets; many have their own process
for creating those.
The survcheck routine checks above conditions, and gives a table of the
observed transitions.
```{r, echo= FALSE}
# make the multistate version
# There are 5 subjects with recurrence coded on the same day as death.
# Make them recur one day earlier, so as to avoid a 0 length sojourn
tied.id < (c1$time == c2$time & c1$status==1 & c2$status==1)
c2$time[tied.id] < c2$time[tied.id] 1
cdata < tmerge(subset(c2, ,c(study, time, status, etype)),
c2, id=id, death= event(time, status),
options=list(tstart="time1", tstop="time2"))
cdata < tmerge(cdata, c1, id= id, recur= event(time, status),
td.recur = tdc(time, status, init=0))
cdata$state < with(cdata, factor(recur+2*death, 0:2,
c("censor", "recur", "death")))
# Add some useful recoded variables
cdata$age10 < cdata$age/10
cdata$male < 1*(cdata$sex ==1)
cdata$lev5 < 1*(cdata$rx == "Lev+5FU")
# Change from days to years (labels on the curves)
cdata$time1 < cdata$time1/365.25
cdata$time2 < cdata$time2/365.25
# double check construction
check < survcheck(Surv(time1, time2, state) ~ 1, cdata, id=id)
check
```
The first table tells the observed transtions in the data, each row of the
second how many subjects made 0, 1, ... transitions into that state.
The statefig routine draws simple state space figures.
Those who want prettier ones can employ one of the many routines for DAGs.
(The annotation below is about as fancy as you can get with this routine.)
```{r, state1}
states < c("Entry\n(929)", "Recur", "Death")
smat < matrix(0, 3, 3, dimnames=list(states, states))
smat[1, 2:3] < 1; smat[2, 3] < 1
oldpar < par(mar=c(1,2,1,2))
xy < statefig(1:2, smat)
center < (xy[c(1,1,2),] + xy[c(2,3,3),])/2
delta < .05 #text offset
text(center + cbind(c(0,0, delta/2), c(delta, delta, 0)),
format(check$transitions[c(1,4,5)]))
par(oldpar)
```
Fit the multistate model, and display the coefficients as a graph.
(Fitting is easy, all the work was setting up the data set.)
For those who know ggplot: feel free to make a much nicer figure.
```{r, forest1}
cfit < coxph(Surv(time1, time2, state) ~ lev5 + age10 + extent + node4,
cdata, id=id)
cfit
temp < summary(cfit)$conf.int[ , c(1,3,4)]
yy < c(outer(4:1, c(10, 0, 5), "+"))
oldpar < par(mar=c(5,7, 1, 1))
plot(c(.5,4), range(yy), log='x', type="n", xlab="Hazard ratio",
ylab="", yaxt= 'n')
axis(2, yy, rep(c("Treat", "Age (10)", "Extent", ">4 Nodes") ,3),
las=2)
points(temp[,1], yy, pch=19)
segments(temp[,2], yy, temp[,3], yy, lwd=1.5)
abline(h=c(5,10), lwd=2)
abline(v=1, lty=3)
text(c(2.5, 2.5, 2.8), c(14, 9, 4),
c("Entry:Recur", "Recur:Death", "Entry:Death"))
par(oldpar)
```
## Absolute risk
 Hazard ratios are not enough

 $p_k(t; x)$ = probability in state $k$ at time $t$
 $E(N_k(t); x)$ = number of visits to state $k$
* P(ever visit state $k$)
 E(sojourn time in each state) up to time $t$
 E(sojourn, per vist)
The set of probability in state curves for a fitted multistate PH model
can be generated for any target set of covariates.
(Predictions are always for a hypothetical *someone*).
Below, I dropped age to make things a bit simpler.
Then there is a trick: if we exclude any observation time after
recurrence (lines with td.recur==1), then the model is a competing risks
model and the probability in state for recurrence will be "ever recurred".
It is an easy way to get the probability of ever recurring.
The objects csurv2 and csurv will each have the curves for 16 subjects,
the number of rows in data set dummy; each curve has information on
3 states of (s0), recurrence and death. (If you don't provide an explicit
name for the initial state, and I didn't, it uses (s0) as a generic
label.) The intermediate printing below was not shown on the slides.
```{r, sojourn, echo=FALSE}
# Drop age to make the fit simpler
cfit2 < coxph(Surv(time1, time2, state) ~ lev5 + extent + node4,
cdata, id=id)
cfit3 < coxph(Surv(time1, time2, state) ~ lev5 + extent + node4,
cdata, id=id, subset= (td.recur==0))
dummy < expand.grid(lev5=0:1, extent=1:4, node4= 0:1)
csurv2 < survfit(cfit2, newdata=dummy)
csurv3 < survfit(cfit3, newdata=dummy)
dim(csurv3)
temp < summary(csurv2, rmean=8)$table
temp
sojourn < matrix(temp[,"rmean"], 16, 3,
dimnames=list(NULL, c("entry", "recur", "death")))
phat < drop(summary(csurv3, time= 8, extend=TRUE)$pstate)
# marginalize over extent of disease (majority are a 3)
epct < as.vector(table(c1$extent)/ nrow(c1))
epct
# knowing 'rowsum' marks me as an ancient Splus user. I'm sure
# you can get weighted averages over 'extent' in a dozen more modern ways
sj2 < rowsum(sojourn * epct[dummy$extent], dummy$lev5 + 2*dummy$node4)
phat2 < rowsum(phat * epct[dummy$extent], dummy$lev5 + 2*dummy$node4)
rownames(sj2) < c("Control, nodes <=4", "Trt, nodes <=4",
"Control, nodes >4", "Trt, nodes >4")
oldpar < par(mar=c(5,9,1,1))
barplot(t(sj2[4:1, 1:3]), horiz= TRUE,
xlab="Sojourn time (out of 8 years)",
col=c("blue", "red", "white"), xlim=c(0,8), las=1)
par(oldpar)
```
Now draw the barplot without collapsing over extent of disease.
```{r, allofthem, fig.height=7, fig.width=8}
rownames(sojourn) < paste(rep(rep(c("T1", "T2", "T3", "T4"), each=2),2),
rep(c("A", "B"), 8))
# put the controls together
xx < rev(c(1,3,5,7,2,4,6,8))
xx < c(xx+8, xx)
barplot(t(sojourn[xx,]), horiz=TRUE, col=c("blue", "red", "white"),
las=1)
oldpar < par(xpd=NA)
abline(h=c(4.9, 9.7,14.5), lwd=2, col=3)
par(oldpar)
```
Make a matrix out of the sojourn times, and print it out.
```{r, sojourn2, echo=FALSE}
temp < cbind(sj2[,1:2], phat2[,2], sj2[,2]/phat2[,2])
dimnames(temp) < list(rownames(sj2),
c("Sojourn(initial)", "Sojourn(recur)","P(recur)",
"recur duration"))
round(temp[c(2,1,3,4),] ,2)
```
For the curves below I got lazy. Over 80% of the subjects had extent=3, so I
draw the curves for extent =3 rather than margining it out.
The yates() routine is designed do the busywork for marginal estimates, but
currently doesn't work with mutltistate fits.
```{r, instate}
dummyx < expand.grid(lev5=0:1, node4=0:1, extent=3)
csurvx < survfit(cfit2, newdata=dummyx)
oldpar < par(mfrow=c(1,2), mar=c(5,5,1,1))
plot(csurvx[1:2,], ylim=c(0, .82), col=1:2, lwd=c(1,1,2,2),
# noplot='',
xlab="Years post enrollment", ylab="P(state)")
text(c(2270, 2670), c(.5, .12), c("Death", "Recurrence"))
text(1, .77, "4 or fewer nodes", cex=1.3, adj=0)
plot(csurvx[3:4,], ylim=c(0, .82), col=1:2, lwd=c(1,1,2,2),
# noplot='',
xlab="Years post enrollment", ylab="P(state)")
text(c(2130, 2400), c(.67, .10), c("Death", "Recurrence"))
text(1, .78, "> 4 nodes", cex=1.3, adj=0)
par(oldpar)
```
## Myeloid Data
 Analysis based on CALGB trial of patients with acute myeloid leukemia with
FLT3 mutation
 Chemo $\pm$ midostaurin arms
 Primary analysis: stratified logrank test
 Modified/blinded dataset
The person below with stem cell transplant on the same day as CR has to be
modified, to prevent a "0 days in the CR state" error. Making CR 1 day earlier
solves this, but is completely arbitrary.
The decision not to count CR after SCT is a more nuanced one. One of
the main discussions in the talk was CR and duration of CR, compraring arms A and
B. If someone has had stem cell transplant, however, should that CR be credited
to the earlier treatment, or is it entirely due to SCT?
Which way to go is much more a medical question than a statistical one.
This data set is also discussed in the main survival vigette as an example of
AalenJohansen curves, and in that section I kept these 18 CR.
```{r, myeloid}
tdata < myeloid # temporary working copy
# One subject had transplant on the same day as CR, move CR back a day
tied < with(tdata, (!is.na(crtime) & !is.na(txtime) & crtime==txtime))
tdata$crtime[tied] < tdata$crtime[tied] 1
# create multistate data set
mdata < tmerge(tdata[,1:3], tdata, id=id, death= event(futime, death),
sct = event(txtime), cr = event(crtime),
relapse = event(rltime), priorsct = tdc(txtime),
priorcr = tdc(crtime), priortx = tdc(txtime),
options= list(tstart="time1", tstop="time2"))
# we have chosen not to count CR after SCT as a cr, at least for assessing
# the effect of treatment (there are 11, 7 in arm B)
temp < with(mdata, cr*(priorsct==0) + 2*sct + 4*relapse + 8*death)
mdata$event < factor(temp, c(0,1,2,4,8),
c("none", "CR", "SCT", "relapse", "death"))
check < survcheck(Surv(time1, time2, event) ~ 1, mdata, id=id)
check
temp1 < with(mdata, ifelse(priorcr, 0, c(0,1,0,0,2)[event]))
mdata$crstat < factor(temp1, 0:2, c("none", "CR", "death"))
temp2 < with(mdata, ifelse(priortx, 0, c(0,0,1,0,2)[event]))
mdata$txstat < factor(temp2, 0:2, c("censor", "SCT", "death"))
temp3 < with(mdata, c(0,0,1,0,2)[event] + priortx)
mdata$tx2 < factor(temp3, 0:3,
c("censor", "SCT", "death w/o SCT", "death after SCT"))
# Draw the ideal transition matrix
mstate < c("Conditioning", "CR", "SCT", "Relapse", "Death")
mmat1 < matrix(0,5,5, dimnames=list(mstate, mstate))
mmat1[1,2] < mmat1[2,3] < mmat1[3,4] < 1
mmat1[1:4,5] < 1
oldpar < par(mar=c(1,1,1,1))
statefig(rbind(4,1), mmat1)
par(oldpar)
```
The variable crstat is used for the analysis that treats CR and death as
competing risks, txstat for treating transplant and death as competing risks,
and tx2 for the second transplant analysis.
One should run survcheck on each endpoint to verify it.
```{r, check}
# change a label and print out the transitions matrix
temp < check$transitions
rownames(temp)[1] < "Entry"
temp
```
The layout() command is an old way to create a montage of plots. I use it
here to allow the state space figures to be on the same page as the
survival plots.
```{r, myeloid2}
# I want curves in months instead of days. Easier to fix it once.
mdata$time1 < mdata$time1 * 12 /365.25
mdata$time2 < mdata$time2 * 12 /365.25
cfit1 < coxph(Surv(time1, time2, event=="death") ~ trt, mdata)
cfit2 < coxph(Surv(time1, time2, crstat) ~ trt, data=mdata, id=id)
cfit3 < coxph(Surv(time1, time2, txstat) ~ trt, data=mdata, id=id)
temphr < exp(c(coef(cfit2), coef(cfit3), coef(cfit1)))[c(1,3,5)]
mdummy < data.frame(trt=c('A', 'B')) # get curves for arms A and B
csurv1 < survfit(cfit1, newdata=mdummy)
csurv2 < survfit(cfit2, newdata=mdummy)
csurv3 < survfit(cfit3, newdata=mdummy)
layout(matrix(c(1,1,1,2,3,4), 3,2), widths=2:1)
oldpar < par(mar=c(5.1, 4.1, 1.1, .1))
mlim < c(0, 48) # and only show the first 4 years
plot(csurv2[,"CR"], xlim=mlim,
lty=3, lwd=2, col=1:2, xaxt='n',
xlab="Months post enrollment", ylab="Fraction with the endpoint")
lines(csurv1, mark.time=FALSE, xlim=mlim,
fun='event', col=1:2, lwd=2)
lines(csurv3[,"SCT"], xlim=mlim, col=1:2,
lty=2, lwd=2)
xtime < c(0, 6, 12, 24, 36, 48)
axis(1, xtime, xtime) #axis marks every year rather than 10 months
temp < outer(c("A", "B"),
c("CR", "transplant", "death"), paste)
temp[7] < ""
temp[c(2,4,6)] < paste(temp[c(2,4,6)], ": HR =", round(temphr, 2))
legend(25, .3, temp[c(1,2,7,3,4,7,5,6,7)], lty=c(3,3,3, 2,2,2 ,1,1,1),
col=c(1,2,0), bty='n', lwd=2)
abline(v=2, lty=2, col=3)
# add the state space diagrams
crisk < function(what, horizontal = TRUE, ...) {
nstate < length(what)
connect < matrix(0, nstate, nstate,
dimnames=list(what, what))
connect[1,1] < 1 # an arrow from state 1 to each of the others
if (horizontal) statefig(c(1, nstate1), connect, ...)
else statefig(matrix(c(1, nstate1), ncol=1), connect, ...)
}
state3 < function(what, horizontal=TRUE, ...) {
if (length(what) != 3) stop("Should be 3 states")
connect < matrix(c(0,0,0, 1,0,0, 1,1,0), 3,3,
dimnames=list(what, what))
if (horizontal) statefig(1:2, connect, ...)
else statefig(matrix(1:2, ncol=1), connect, ...)
}
par(mar=c(4,.1,1,1))
crisk(c("Entry", "CR", "Death"), alty=3)
crisk(c("Entry", "SCT", "Death"), alty=2)
crisk(c("Entry","Death"))
par(oldpar)
layout(1)
```
```{r, myeloid3}
cr2 < mdata$event
cr2[cr2=="SCT"] < "none" # ignore transplants
cfit4 < coxph(Surv(time1, time2, cr2) ~ trt,
data= mdata, id=id)
csurv4 < survfit(cfit4, newdata= mdummy)
cr.sojourn < summary(csurv4, rmean=24)$table[3:4, "rmean"]
cr.prob < summary(csurv2, time=24)$pstate[1,,2]
cr.time < cr.sojourn/cr.prob # time in CR, given you entered
en.sojourn < summary(csurv1, rmean=24)$table[, "rmean"]
# entry state has everyone, so entry sojourn = entry time
if (FALSE) {
# this command only works (as yet) in Therneau's working copy of survival,
# when applied to the survival curve from a Cox model. It does work for
# an AalenJohansen curve
temp < resid(csurv4, type="sojourn", time=24) # infinitesimal jackknife
vtemp < rowsum(temp^2, myeloid$trt) #variances
se.sojourn < sqrt(vtemp[, "CR"])
} else se.sojourn < c(.5, .5) # the actual values, so this will run for others
temp < c("TRT Sojourn P(CR) CR Time",
sprintf(" %s %4.1f(%3.1f) %4.2f %4.1f", c("A", "B"),
cr.sojourn, se.sojourn, cr.prob, cr.time))
layout(matrix(c(1,1,2,3), 2,2), widths=2:1)
oldpar < par(mar=c(5.1, 4.1, 1.1, .1))
plot(csurv2[,2], lty=3, lwd=2, col=1:2, xmax=24, xaxt='n',
xlab="Months", ylab="CR")
axis(1, c(0,6,12, 18, 24), c(0, 6, 12, 18, 24))
lines(csurv4[,2], lty=1, lwd=2, col=1:2, xmax=24)
text(c(6,6,6), c(.2, .17, .14), temp, adj=0, family="mono")
par(mar=c(4, .1, 1, 1))
crisk( c("Entry","CR", "Death"), alty=3)
state3(c("Entry", "CR", "Death/Relapse"))
par(oldpar)
layout(1)
```
```{r, transplant}
event2 < with(mdata, ifelse(event=="SCT" & priorcr==1, 6,
as.numeric(event)))
event2 < factor(event2, 1:6, c(levels(mdata$event), "SCT after CR"))
cfit5 < coxph(Surv(time1, time2, event2) ~ trt, mdata, id=id,
subset=(priortx ==0))
csurv5 < survfit(cfit5, newdata= mdummy)
#dim(csurv5) # number of strata by number of states
#txsurv$states # Names of states
layout(matrix(c(1,1,1,2,2,0),3,2), widths=2:1)
oldpar < par(mar=c(5.1, 4.1, 1,.1))
plot(csurv5[,c(3,6)], col=1:2, lty=c(1,1,2,2), lwd=2, xmax=48,
xaxt='n', xlab="Months", ylab="Transplanted")
axis(1, xtime, xtime)
legend(15, .13, c("A, transplant without CR", "B, transplant without CR",
"A, transplant after CR", "B, transplant after CR"),
col=1:2, lty=c(1,1,2,2), lwd=2, bty='n')
sname < c("Entry", "CR", "Transplant", "Transplant")
pattern < cbind(c(1/2, 3/4, 1/4, 3/4),
c(5/6, 1/2, 1/2, 1/6))
connect < matrix(0,4,4, dimnames=list(sname, sname))
connect[1, 2:3] < 1
connect[2,4] < 1
statefig(pattern, connect)
par(oldpar)
layout(1)
```
## Standard errors
 pstate = $p_k(t; x) =$ P(in state $k$ at time $t$, covariates $x$)
 $J(t; x)$ = infinitesimal jackknife = effect of obs $i$ on
$p(t)$
* (number of observations) x (number of time points) x (number of states)
* std for everything else follows
## Summary
 This is a useful addition to your toolkit
 Take the time to look deeper into your data
* Resist the ``tyranny of the urgent''
 Hazards are important, hazards are not enough
 Software
* all done with the survival package in R
* ongoing work to make it even easier