Title: | Receiver Operating Characteristic (ROC)-Guided Classification and Survival Tree |
---|---|
Description: | Receiver Operating Characteristic (ROC)-guided survival trees and ensemble algorithms are implemented, providing a unified framework for tree-structured analysis with censored survival outcomes. A time-invariant partition scheme on the survivor population was considered to incorporate time-dependent covariates. Motivated by ideas of randomized tests, generalized time-dependent ROC curves were used to evaluate the performance of survival trees and establish the optimality of the target hazard/survival function. The optimality of the target hazard function motivates us to use a weighted average of the time-dependent area under the curve (AUC) on a set of time points to evaluate the prediction performance of survival trees and to guide splitting and pruning. A detailed description of the implemented methods can be found in Sun et al. (2019) <arXiv:1809.05627>. |
Authors: | Yifei Sun [aut], Mei-Cheng Wang [aut], Sy Han Chiou [aut, cre] |
Maintainer: | Sy Han Chiou <schiou@utdallas.edu> |
License: | GPL (>= 3) |
Version: | 1.1.1 |
Built: | 2025-02-19 04:45:01 UTC |
Source: | https://github.com/stc04003/roctree |
The rocTree package uses a Receiver Operating Characteristic (ROC)-guided classification
algorithm to grow and prune survival trees and ensembles.
The rocTree package implements a unified framework for
tree-structured analysis with censored survival outcomes.
Unlike many existing tree-building algorithms,
the rocTree package incorporates time-dependent covariates by constructing
a time-invariant partition scheme on the survivor population.
The partition-based risk prediction function is constructed using an algorithm guided by
the Receiver Operating Characteristic (ROC) curve.
The generalized time-dependent ROC curves for survival trees show that the
target hazard function yields the highest ROC curve.
The optimality of the target hazard function motivates us to use a weighted average of the
time-dependent area under the curve on a set of time points to evaluate the prediction
performance of survival trees and to guide splitting and pruning.
Moreover, the rocTree package offers a novel ensemble algorithm,
in which the ensemble is formed on unbiased martingale estimating equations.
The package contains functions to construct ROC-guided survival trees and ensembles through
the main function rocTree.
Maintainer: Sy Han Chiou schiou@utdallas.edu
Authors:
Yifei Sun ys3072@cumc.columbia.edu
Mei-Cheng Wang mcwang@jhu.edu
Surv function imported from survival
This function is imported from the survival package. See Surv.
Plotting an rocTree object
Plots an rocTree object. The function returns a dgr_graph object that is rendered in the RStudio Viewer, or plots the survival/hazard estimates at the terminal nodes.
## S3 method for class 'rocTree'
plot(
  x,
  output = c("graph", "visNetwork"),
  digits = 4,
  tree = 1L,
  rankdir = c("TB", "BT", "LR", "RL"),
  shape = "ellipse",
  nodeOnly = FALSE,
  savePlot = FALSE,
  file_name = "pic.pdf",
  file_type = "pdf",
  type = c("tree", "survival", "hazard"),
  ...
)
x |
an object of class "rocTree". |
output |
a string specifying the output type; graph (the default) renders the graph using the |
digits |
the number of digits to print. |
tree |
is an optional integer specifying the |
rankdir |
is a character string specifying the direction of the tree flow. The available options are top-to-bottom ("TB"), bottom-to-top ("BT"), left-to-right ("LR"), and right-to-left ("RL"); the default value is "TB". |
shape |
is a character string specifying the shape style. Some of the available options are "ellipse", "oval", "rectangle", "square", "egg", "plaintext", "diamond", and "triangle". The default value is "ellipse". |
nodeOnly |
is a logical value indicating whether to display only the node number; the default value is "FALSE". |
savePlot |
is a logical value indicating whether the plot will be saved (exported); the default value is "FALSE". |
file_name |
is a character string specifying the name of the plot when "savePlot = TRUE". The file name should include its extension. The default value is "pic.pdf". |
file_type |
is a character string specifying the type of file to be exported. Options for graph files are: "png", "pdf", "svg", and "ps". The default value is "pdf". |
type |
is an optional character string specifying the type of plots to produce. The available options are "tree" for plotting survival tree (default),
"survival" for plotting the estimated survival probabilities for the terminal nodes, and "hazard" for plotting the estimated hazard for the terminal nodes.
The |
... |
arguments to be passed to or from other methods. |
See rocTree for creating rocTree objects.
## Not run:
data(simDat)
fit <- rocTree(Surv(Time, death) ~ z1 + z2, id = id, data = simDat, ensemble = FALSE)
## Plot tree
plot(fit)
## Plot survival estimates at terminal nodes
plot(fit, type = "survival")
## Plot hazard estimates at terminal nodes
plot(fit, type = "haz")
## End(Not run)
Predicting from an rocTree model
The function gives predicted values from an rocTree fit.
## S3 method for class 'rocTree'
predict(object, newdata, type = c("survival", "hazard"), control = list(), ...)
object |
is an |
newdata |
is an optional data frame in which to look for variables with which to predict. If omitted, the fitted predictors are used. If the covariate observation time is not supplied, covariates will be treated as at baseline. |
type |
is an optional character string specifying whether to predict the survival probability or the cumulative hazard rate. |
control |
a list of control parameters. See 'details' for important special
features of control parameters. See |
... |
for future developments. |
Returns a data.frame of the predicted survival probabilities or cumulative hazard.
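The two prediction types are linked by the standard identity S(t) = exp(-H(t)) between a survival function and its cumulative hazard. The sketch below only illustrates that identity in base R on a made-up cumulative hazard; the data frame, its column names, and the choice H(t) = t^2 are assumptions for illustration and are not the output format or the internal computation of predict.rocTree.

```r
## Hypothetical cumulative hazard on a time grid: H(t) = t^2 (hazard 2t).
time <- seq(0, 2, by = 0.25)
cum_haz <- time^2

## Textbook identity linking the two quantities: S(t) = exp(-H(t)).
surv <- exp(-cum_haz)

## Illustrative layout only; column names are assumptions.
pred <- data.frame(Time = time, hazard = cum_haz, survival = surv)
print(pred)
```

Reading the table row by row shows the survival probability starting at 1 and decreasing as the cumulative hazard grows.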
data(simDat)
fit <- rocTree(Surv(Time, death) ~ z1 + z2, id = id, data = simDat, ensemble = FALSE)
## testing data
newdat <- data.frame(Time = sort(unique(simDat$Time)), z2 = runif(1))
newdat$z1 <- 1 * (newdat$Time < median(newdat$Time))
head(newdat)
## Predict survival
pred <- predict(fit, newdat)
plot(pred)
## Predict hazard
pred <- predict(fit, newdat, type = "hazard")
plot(pred)
Printing an rocTree object
The function prints an rocTree object. It is a method for the generic function print of class "rocTree".
## S3 method for class 'rocTree'
print(x, digits = 5, tree = NULL, ...)
x |
an |
digits |
the number of digits of numbers to print. |
tree |
an optional integer specifying the |
... |
for future development. |
data(simDat)
## Fitting a pruned survival tree
rocTree(Surv(Time, death) ~ z1 + z2, id = id, data = simDat, ensemble = FALSE)
## Fitting an unpruned survival tree
rocTree(Surv(Time, death) ~ z1 + z2, id = id, data = simDat, ensemble = FALSE,
        control = list(numFold = 0))
## Not run:
## Fitting the ensemble algorithm (default)
rocTree(Surv(Time, death) ~ z1 + z2, id = id, data = simDat, ensemble = TRUE)
## End(Not run)
Fits a "rocTree" model.
rocTree(
  formula,
  data,
  id,
  subset,
  ensemble = TRUE,
  splitBy = c("dCON", "CON"),
  control = list()
)
formula |
is a formula object, with the response on the left of a '~' operator, and the terms on the right. The response must be a survival object returned by the 'Surv' function. |
data |
is an optional data frame in which to interpret the variables occurring in the 'formula'. |
id |
is an optional vector of subject identifiers used to group the longitudinal observations that belong to the same subject. The length of 'id' should be the same as the total number of observations. If 'id' is missing, each row of 'data' is treated as a distinct subject and all covariates are treated as baseline covariates. |
subset |
is an optional vector specifying a subset of observations to be used in the fitting process. |
ensemble |
is an optional logical value. If |
splitBy |
is a character string specifying the splitting algorithm. The available options are 'CON' and 'dCON' corresponding to the splitting algorithm based on the total concordance measure or the difference in concordance measure, respectively. The default value is 'dCON'. |
control |
a list of control parameters. See 'details' for important special features of control parameters. |
The argument "control" defaults to a list with the following values:
tau
is the maximum follow-up time; the default value is the 90th percentile of the unique observed survival times.
maxTree
is the number of survival trees to be used in the ensemble method (when ensemble = TRUE).
maxNode
is the maximum node number allowed in the tree; the default value is 500.
numFold
is the number of folds used in the cross-validation. When numFold > 0, the survival tree will be pruned;
when numFold = 0, the unpruned survival tree will be presented. The default value is 10.
h
is the smoothing parameter used in the kernel; the default value is tau / 20.
minSplitTerm
is the minimum number of baseline observations in each terminal node; the default value is 15.
minSplitNode
is the minimum number of baseline observations in each splittable node; the default value is 30.
disc
is a logical vector specifying whether the covariates in formula are discrete (TRUE) or continuous (FALSE).
The length of disc should be the same as the number of covariates in formula. When not specified, the rocTree() function assumes that all covariates are continuous.
K
is the number of time points on which the concordance measure is computed.
A coarser time grid (smaller K) generally runs faster, but a very small K is not recommended. The default value is 20.
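As a rough illustration of the documented defaults, the sketch below rebuilds a control list by hand in base R. The follow-up times are made up, the object names obs_time and ctrl are hypothetical, and the use of quantile() with its default type is an assumption about how the 90th percentile is taken; rocTree() itself may compute tau differently.

```r
## Hypothetical follow-up times; in practice these come from the data.
set.seed(1)
obs_time <- round(rexp(200, rate = 1), 3)

## tau: 90th percentile of the unique observed times
## (quantile type is an assumption, not taken from the package).
tau <- unname(quantile(unique(obs_time), 0.9))

## Remaining entries copy the defaults listed above; h follows tau / 20.
ctrl <- list(tau = tau, h = tau / 20, numFold = 10, maxNode = 500,
             minSplitTerm = 15, minSplitNode = 30, K = 20)
str(ctrl)
```

A list of this shape could then be supplied through the control argument, for example control = list(numFold = 0) to request an unpruned tree as in the examples.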
An object of S4 class "rocTree" representing the fit, with the following components:
Sun Y. and Wang, M.C. (2018+). ROC-guided classification and survival trees. Technical report.
See print.rocTree and plot.rocTree for printing and plotting an rocTree object, respectively.
data(simDat)
## Fitting a pruned survival tree
rocTree(Surv(Time, death) ~ z1 + z2, id = id, data = simDat, ensemble = FALSE)
## Fitting an unpruned survival tree
rocTree(Surv(Time, death) ~ z1 + z2, id = id, data = simDat, ensemble = FALSE,
        control = list(numFold = 0))
## Not run:
## Fitting the ensemble algorithm (default)
rocTree(Surv(Time, death) ~ z1 + z2, id = id, data = simDat, ensemble = TRUE)
## End(Not run)
A simulated data frame with variables
id
subject identification
Time
observation times
death
event/death indicator; 1 = an event (death) is recorded
z1
baseline covariate generated from a standard uniform distribution
z2
baseline covariate generated from a standard uniform distribution (independent from z1)
A data frame with 5050 rows and 5 variables.
The sample dataset is generated by set.seed(1); simu(100, 0, 1.3).
See simu for details of the simu function.
This function is used to generate simulated data under various settings.
Let $Z$ be a $p$-dimensional vector of possibly time-dependent covariates and
$\beta$ be the vector of regression coefficients.
The survival times ($T$) are generated from hazard functions specified under the following scenarios:
Scenario 1.1: Proportional hazards model.
Scenario 1.2: Proportional hazards model with noise variables.
Scenario 1.3: Proportional hazards model with nonlinear covariate effects.
Scenario 1.4: Accelerated failure time model.
Scenario 1.5: Generalized gamma family, with a Gamma-distributed component.
Scenario 2.1: Dichotomous time-dependent covariate with at most one change in value; the covariate is defined through a Bernoulli variable with equal probability and a uniformly distributed variable.
Scenario 2.2: Dichotomous time-dependent covariate with multiple changes; the covariate is defined through a Bernoulli variable with equal probability and the first three terms of a stationary Poisson process with rate 10.
Scenario 2.3: Proportional hazards model with a continuous time-dependent covariate, defined through independent uniform random variables.
Scenario 2.4: Non-proportional hazards model with a continuous time-dependent covariate, defined through independent uniform random variables.
Scenario 2.5: Non-proportional hazards model with a nonlinear time-dependent covariate, defined through independent uniform random variables.
The censoring times are generated from an independent uniform distribution whose range
was tuned to yield censoring percentages of 25% or 50%.
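The tuning of the censoring distribution can be sketched in base R. This is only an illustration under stated assumptions, not the code used by simu(): the survival times are drawn from a unit-rate exponential distribution rather than from one of the scenarios, censoring is taken as Uniform(0, upper) with a lower limit of 0, and the names surv_time, cens_rate, and upper25 are hypothetical.

```r
set.seed(1)
n <- 20000
surv_time <- rexp(n, rate = 1)  # hypothetical survival times, not a simu() scenario
u <- runif(n)                   # fixed uniforms, so the censoring rate is monotone in 'upper'

## Proportion censored when censoring times are upper * u, i.e. Uniform(0, upper).
cens_rate <- function(upper) mean(upper * u < surv_time)

## Solve for the upper limit that gives roughly 25% censoring.
upper25 <- uniroot(function(upper) cens_rate(upper) - 0.25,
                   interval = c(0.01, 100))$root
c(upper = upper25, censoring = cens_rate(upper25))
```

Reusing the same uniform draws for every candidate limit keeps the empirical censoring rate a non-increasing function of the limit, which is what lets a root finder such as uniroot() locate the target percentage.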
simu(n, cen, scenario, summary = FALSE)
trueHaz(dat)
trueSurv(dat)
n |
an integer value indicating the number of subjects. |
cen |
is a numeric value indicating the censoring percentage; three levels, 0%, 25%, and 50% (supplied as 0, 0.25, and 0.50), are allowed. |
scenario |
can be either a numeric value or a character string. This indicates the simulation scenario noted above. |
summary |
a logical value indicating whether a brief data summary will be printed. |
dat |
is a data.frame prepared by |
simu returns a data.frame.
The returned data.frame consists of columns for the subject id, the observed follow-up time, the death indicator (death = 0 if censored), the possibly time-independent covariates, and the latent variables used to generate $Z_1(t)$ in Scenarios 2.1 to 2.5.
The returned data.frame can be supplied to trueHaz and trueSurv
to generate the true cumulative hazard function and the survival function, respectively.
set.seed(1)
simu(10, 0.25, 1.2, TRUE)
set.seed(1)
simu(10, 0.50, 2.2, TRUE)