Issue
Hi, I'm trying to optimize the following program for a Monte Carlo power study. I first use an algorithm taken from Efron and Tibshirani (1993) to find the p-value of an upper-tailed permutation test for equality of means. I wrote a function called perm_test(), whose output is a single p-value. I then call this function in another program, power_p(), that simulates 1000 permutation tests (returning 1000 p-values). My power estimate is the proportion of these 1000 p-values that are statistically significant, i.e., less than 0.05. The entire process takes about 8 minutes to run (2020 MacBook Pro). I'm wondering if anyone has suggestions for making it run faster. Many thanks.
library(dplyr)  # needed for arrange()

perm_test <- function(delta) {
  # Permutation test as described in Efron and Tibshirani (1993), Algorithm 15.1, p. 208
  # Draw random samples from a normal distribution
  x <- rnorm(10, mean = delta, sd = 1)
  y <- rnorm(10, mean = 0, sd = 1)
  # Observed difference in means, denoted D_obs
  D_obs <- mean(x) - mean(y)
  # Create a data frame "N" with n_x + n_y observations (20 rows in our case)
  N <- data.frame("v" = c(x, y))
  # Create a group variable "g" indicating which group each observation belongs to
  N$g <- as.factor(c(rep("x", 10), rep("y", 10)))
  # Arrange column "v" in ascending order
  # (notice that column "g" is also re-ordered)
  N <- arrange(N, v)
  ##############################################################################################
  # There are 20 choose 10 (184756) possibilities for the ordering of "g",                     #
  # corresponding to all possible ways of partitioning 20 elements into two subsets of size 10.#
  # We take only a random sample of 5000 from those 184756 possibilities.                      #
  ##############################################################################################
  # Initialize variables
  B <- 5000
  x_mean <- 0
  y_mean <- 0
  D_perm <- rep(0, B)
  # Loop to randomly generate 5000 different orderings of "g"
  for (i in 1:B) {
    # Shuffle the ordering of "g"
    N$g <- sample(N$g)
    # Permuted means of x and y
    x_mean <- tapply(N$v, N$g, mean)[1]
    y_mean <- tapply(N$v, N$g, mean)[2]
    # Permuted difference in means, denoted D_perm
    D_perm[i] <- x_mean - y_mean
  }
  # Find the p-value
  P_perm <- sum(D_perm >= D_obs) / B
  # Output
  return(round(P_perm, digits = 5))
}
Here's the program that simulates 1000 permutation tests:
power_p <- function(numTrial, delta) {
  # Initialize variables
  P_p <- rep(0, numTrial)
  pwr_p <- 0
  # Simulation: run numTrial permutation tests
  P_p <- replicate(n = numTrial, expr = perm_test(delta))
  # Power estimate is the proportion of p-values that are significant (i.e., less than 0.05)
  pwr_p <- sum(P_p < 0.05) / numTrial
  # Output
  return(round(pwr_p, digits = 5))
}
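For reference, the roughly eight-minute workload is a single call like the one below; the delta value is only an illustrative choice, not one given in the question.
# Illustrative call; delta = 0.5 is an assumed effect size
set.seed(123)
power_p(numTrial = 1000, delta = 0.5)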
Solution
perm_test2 <- function(delta) {
  x <- rnorm(10, mean = delta, sd = 1)
  y <- rnorm(10, mean = 0, sd = 1)
  D_obs <- mean(x) - mean(y)
  # Plain vectors instead of a data.frame; a simple integer group vector instead of a factor
  v <- c(x, y)
  g <- rep(1:2, each = 10)
  B <- 5000
  y_mean <- x_mean <- 0
  D_perm <- rep(0, B)
  for (i in 1:B) {
    # Shuffle the group labels; "ii" marks which elements land in group 1
    ii <- sample(g) == 1L
    # Difference in means via sums (both groups have 10 elements)
    D_perm[i] <- (sum(v[ii]) - sum(v[!ii])) / 10
  }
  P_perm <- sum(D_perm >= D_obs) / B
  return(round(P_perm, digits = 5))
}
In comparison to more complicated approaches, I propose the one above, where I have made some simple improvements to your existing code:
- We do not need to use a data.frame; it only wastes time subsetting the vector on each iteration. Instead of a factor group vector we can use a simple integer vector.
- The ordering of the data is not needed.
- We can reduce the mean calculations. In your code the two calls to tapply(N$v, N$g, mean) were the slowest part. mean(x) is slower than sum(x)/n because it does additional checks etc., so in this situation we can use the faster approach, as the inner loop will be executed 1000 x 5000 times (sims x B); a quick comparison of the two is sketched right after this list.
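As a quick, machine-dependent illustration of that last point (not part of the original answer), the two can be compared directly with bench::mark():
x <- rnorm(20)
# check = FALSE because we only care about speed here
bench::mark(mean(x), sum(x) / length(x), check = FALSE)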
bench::mark(perm_test_org(0.5), perm_test2(0.5), iterations = 5, check = F,
            relative = T)[, 1:8]
# A tibble: 2 x 8
#   expression            min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#   <bch:expr>          <dbl>  <dbl>     <dbl>     <dbl>    <dbl> <int> <dbl>
# 1 perm_test_org(0.5)   15.5   15.3      1         1.00     1        5    16
# 2 perm_test2(0.5)       1      1       13.5       1        1.69     5     2
Approximately 15x faster on my system. 1000 iterations took 33.89 seconds.
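If you want to reproduce that rough timing (delta = 0.5 is again only an assumed effect size, and the exact number will differ between machines), something like this works:
# Time 1000 simulated tests with the improved function
system.time(replicate(1000, perm_test2(0.5)))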
Update 2.
We can improve speed even more by:
- replacing sample with sample.int;
- noticing that the g vector isn't needed at all to select two random groups;
- not summing both parts of the vector v inside the loop: we can compute sum(v) once before the loop, do one sum less per iteration, and calculate the resulting values at the end.
perm_test3 <- function(delta) {
  x <- rnorm(10, mean = delta, sd = 1)
  y <- rnorm(10, mean = 0, sd = 1)
  D_obs <- mean(x) - mean(y)
  v <- c(x, y)
  B <- 5000
  # Total sum, computed once outside the loop
  s <- sum(v)
  D_perm2 <- rep(0, B)
  for (i in 1:B) {
    # Pick 10 of the 20 indices at random and sum only that group
    D_perm2[i] <- sum(v[sample.int(20L, 10L)])
  }
  # Group-1 sum minus group-2 sum; dividing by 10 gives the difference in means
  D_perm <- D_perm2 - (s - D_perm2)
  P_perm <- sum(D_perm / 10 >= D_obs) / B
  return(round(P_perm, digits = 5))
}
Runs in roughly 20 seconds for 1000 iterations.
Now the slowest part is the repeated call to sample.int. You can look into faster sampling functions:
https://www.r-bloggers.com/2019/04/fast-sampling-support-in-dqrng/
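As an illustration of that direction (not from the original answer), here is a sketch of perm_test3() with base sample.int() swapped for dqrng::dqsample.int(), assuming the dqrng package is installed; perm_test4 is just an illustrative name:
library(dqrng)

perm_test4 <- function(delta) {
  x <- rnorm(10, mean = delta, sd = 1)
  y <- rnorm(10, mean = 0, sd = 1)
  D_obs <- mean(x) - mean(y)
  v <- c(x, y)
  B <- 5000
  s <- sum(v)
  D_perm2 <- rep(0, B)
  for (i in 1:B) {
    # dqsample.int() draws 10 of the 20 indices without replacement,
    # using dqrng's faster generator instead of base R's sample.int()
    D_perm2[i] <- sum(v[dqsample.int(20L, 10L)])
  }
  D_perm <- D_perm2 - (s - D_perm2)
  P_perm <- sum(D_perm / 10 >= D_obs) / B
  return(round(P_perm, digits = 5))
}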
Answered By - minem