Friday, October 5, 2012

5 Ways to Estimate Concordance Index for Cox Models in R, Why Results Aren't Identical?

Harrell's concordance index (c-index) can be used to evaluate the discriminatory power and the predictive accuracy of Cox models. An easy way out as a surrogate for ROC analysis. 
=================================================
Approach 1
-------------------------------------------------------------------------
Use function "rcorrcens" in package "Hmisc"
Limitations:
  • Can only handle un-censored data
  • Roughly handle categorical predictor with more than 2 categories
=================================================
Sample code:
library(survival)
library(Hmisc)
attach(sample.data)
surv <- Surv(survival, censor)
rcorrcens(surv ~ group)

=================================================
Approach 2
-------------------------------------------------------------------------
Direct output from coxph
Require higher version of R, say R 2.15, didn't test with older versions
=================================================
Sample code:
library(survival)
attach(sample.data)
surv <- Surv(survival, censor)
sum.surv <- summary(coxph(surv ~ group))
c_index <- sum.surv$concordance

=================================================
Approach 3
-------------------------------------------------------------------------
Use function "survConcordance"
Result is the same as in Approach 2
=================================================
Sample code:
library(survival)
attach(sample.data)
surv <- Surv(survival, censor)
fit <- coxph( surv ~ group)
survConcordance(surv ~ predict(fit))

=================================================
Approach 4
-------------------------------------------------------------------------
Use package "survcomp" in bioconductor
Different result from Approach 2/3
The disparity is due to two different estimation approaches that are used to handle tied risks (i.e. cases when  two observations have the same survival with the same x). The method that approaches 4/5 use ignores the tied risks,  the other method (approaches 2/3) takes into consideration of the tied risks. In terms of formulation/symbol (for illustration only):

Approaches 4/5 used:
Concordance = #all concordant pairs/#total pairs ignoring ties.

Approaches 2/3 used:
Concordance = (#all concordant pairs + #tied pairs/2)/(#total pairs including ties).

Those #s can be obtained in the output of Approach 3.
=================================================
Sample code:
source("http://bioconductor.org/biocLite.R")
biocLite("survcomp")
library(survcomp)
surv <- Surv(survival, censor)
fit <- coxph(surv ~ group, data= sample.data)
coxPredict <- predict(fit, data=sample.data, type="risk")  
concordance.index(x=coxPredict, surv.time=survival, surv.event=censor, method="noether")


=================================================
Approach 5
-------------------------------------------------------------------------
Use function "cph" in package "rms"
Provide both un-adjusted and bias adjusted c-index
Un-adjusted c-index is the same as the one from Approach 4
=================================================
Sample code:
library(rms)
surv <- Surv(survival, censor)
set.seed(1)
fit.cph <- cph(surv ~ group, data= sample.data, x=TRUE, y=TRUE, surv=TRUE)
  
# Get the Dxy
v <- validate.cph(fit.cph, dxy=TRUE, B=1000)
Dxy = v[rownames(v)=="Dxy", colnames(v)=="index.corrected"]
orig_Dxy = v[rownames(v)=="Dxy", colnames(v)=="index.orig"]
  
# The c-statistic according to Dxy=2(c-0.5)
bias_corrected_c_index  <- abs(Dxy)/2+0.5
orig_c_index <- abs(Orig.Dxy)/2+0.5


=================================================
Combining Approaches 2/3 & 4/5 and Calculate p-value for Testing Two C-indices for Both Estimation Methods
-------------------------------------------------------------------------
Did not find it elsewhere online. Hope someone could find this helpful.
=================================================
Sample code:
surv <- Surv(survival, censor)
c_index <- function(group, ties=TRUE){
  fit <- coxph(surv ~ group, data=sample.data)
  coxPredict <- predict(fit, data=sample.data, type="risk")  
  
  # Approaches 4/5
  if (ties==F) {
  concordance.index(x=coxPredict, surv.time=survival, surv.event=censor, method="noether")
  }
  # Approaches 2/3
  else if (ties==T) {
  survConcordance(surv ~ coxPredict, data=sample.data)
  }
}
c_index_ties1 <- c_index(group=group1, ties=TRUE)
c_index_ties2 <- c_index(group=group2, ties=TRUE)

c_index_no_ties1 <- c_index_ties(group=group1, ties=F)
c_index_no_ties2 <- c_index_ties(group=group2, ties=F)

# p-value of testing two c-indices ignoring ties
round(cindex.comp(c_index_no_ties1, c_index_no_ties2)$p.value,3)

# Function for p-value of testing two c-indices accounting for ties
# t-test for dependent variables is used for significance 
# Input variables are objects obtained from the first function

cindex.p.ties <- function(c_index_ties1, c_index_ties2, c_index_no_ties1, c_index_no_ties2) {
    eps <- 1E-15
    n <- c_index_no_ties1$n
    r <- cor(c_index_no_ties1$data$x, c_index_no_ties2$data$x, use="complete.obs", method="spearman")
    if ((1 - abs(r)) > eps) {
      t.stat <- (c_index_ties1$concordance - c_index_ties2$concordance) / sqrt(c_index_ties1$std.err^2 + c_index_ties2$std.err^2 - 2 * r * c_index_ties1$std.err * c_index_ties2$std.err)
      diff.ci.p <- pt(q=t.stat, df=n - 1, lower.tail=FALSE)
    } else { diff.ci.p <- 1 }
    return(list("p.value"=diff.ci.p))
  }
cindex.p.ties(c_index_ties1=c_index_ties1, c_index_ties2=c_index_ties2, c_index_no_ties1=c_index_no_ties1, c_index_no_ties2=c_index_no_ties2)




Wednesday, August 29, 2012

Pass Parameters from Linux/Unix Shell to R


Suppose there are two parameters (START, END) to be carried from Linux shell (bash) script to R. The trick is to use the argument "--args START=1 END=10" in a SGE jobscript to submit a R job in the BATCH mode: qsub test.sh
R CMD BATCH --no-save --no-restore '--args START=1 END=10' sample.R

Here is a template test.sh
#!/bin/bash 
#$ -cwd 
#$ -N NAME_OF_JOB
#$ -M YOUR_EMAIL_ADDRESS
# email me when the job e-ends,a-aborts,s-suspends
#$ -m eas
#$ -S /bin/sh 
# Combining output/error messages into one file
#$ -j y

R CMD BATCH --no-save --no-restore '--args START=1 END=10' sample.R
 In the sample.R program, it should contain
#======================================================== 
# At the beginning of R, it should contain 
#======================================================== 
## First read in the arguments listed at the command line 
args = (commandArgs(TRUE)) 
## args is now a list of character vectors 
if (length(args)==0) {
                                print("No arguments supplied.") 
                                ## Supply default values 
                                START = 0 
                                END = 0 
                               }
else {
        for (i in 1:length(args)) 
                                            eval(parse(text=args[[i]])) 
                                           } 
       } 
#======================================================== 
print(START) 
print(END) 
#======================================================== 
# Main R program 
#======================================================== 
for (i in START:END) { ... }
Reference Click Me

Handling Large Datasets in R

Handling Large Datasets in R

Thursday, April 5, 2012

Decision Trees in R

Classification and Regression Trees (CART)
R packages: tree and rpart
Sample codes:
library("tree")
tree.result <- tree(Y~., data=mydata)
summary(tree.result)
plot(tree.result); text(tree.result)
library("rpart")
fit <- rpart(Y~., data=mydata, control=rpart.control(minsplit=100, method=class, cp=0.001))
printcp(fit) # display the results
plotcp(fit) # visualize cross-validation results
summary(fit) # detailed summary of splits
rsq.rpart(fit) # generate two plots - r square and relative error 
summary(predict(fit, mydata, type="class"))

Bagging with Regression Trees (BRT)
R package: ipred
Sample codes:
library(ipred)
bag.result <- bagging(Y~., data=mydata, nbagg=30)

Random Forest (RF)
R package: randomForest
library("randomForest")
rf <- randomForest(Y~., data=training.data, ntree=100, mtry=40)
pred <- predict(rf, test.data)
cmatrix <- table(observed=test.data [, "Y"], predicted=pred) # Confusion matrix
# pred <- predict(rf, test.data , type="prob")

Random Fields in R

Package 'RandomFields'
http://cran.r-project.org/web/packages/RandomFields/RandomFields.pdf
Package 'CompRandFld'
http://cran.r-project.org/web/packages/CompRandFld/CompRandFld.pdf

A Trick to Better Search R in Google

Include [R] instead of R as one of the keywords

In addition, in R
library("sos")
findFn("string")
Other helpful websites:

Great News for Lazy Bones Who are Addictive to R

One package: Rcmdr

Now cluster analysis is just several mouse clicks away...