PHPFixing
  • Privacy Policy
  • TOS
  • Ask Question
  • Contact Us
  • Home
  • PHP
  • Programming
  • SQL Injection
  • Web3.0

Sunday, October 9, 2022

[FIXED] How could I optimize this following program in R to boost performance? (Monte Carlo simulation involving compute-intensive permutation test)

 October 09, 2022     for-loop, optimization, performance, r, statistics     No comments   

Issue

Hi I'm trying to optimize the following program for a Monte Carlo Power study. I first use an algorithm taken from Efron and Tibshirani (1993) to find the p-value of a permutation test (upper-tailed) for equality of means. I wrote a function as follows called perm_test(), of which the output is a single p-value. Then, I call this function in another program, power_p(), that simulates 1000 permutation tests (returning 1000 p-values). My power estimate is the proportion of these 1000 p-values that are statistically significant, i.e., < 0.05. The entire process takes me about 8 minutes to run (2020 macBook pro). I'm wondering if anyone has any suggestions in terms of optimizing this process to make it run quicker. Many thnx.

perm_test <- function(delta) {
  
  # Permutation test as described in Efron and Tibshirani (1993), (Algorithm 15.1, p208)
  
  # Draw random samples from a normal distribution
  x <- rnorm(10, mean = delta, sd = 1)
  y <- rnorm(10, mean = 0, sd = 1)
  # observed diff in means, denoted as D_obs 
  D_obs <- mean(x) - mean(y)
  
  # Create a data frame "N" with  n_x + n_y obs (20 rows in our case)
  N <- data.frame("v" = c(x, y))
  # create a group variable "g" indicating which group each observation belongs to
  N$g <- as.factor(c(rep("x", 10), rep("y", 10)))
  # arrange column "v" in ascending order 
  # notice that column "g" is also re-ordered
  N <- arrange(N, v)
  
  ###############################################################################################
  # There are 20 choose 10 (184756) possibilities for the ordering of "g"                       #       
  # corresponding to all possible ways of partitioning 20 elements into two subsets of size 10  #
  # we take only a random sample of 5000 from those 184756 possibilities                        #
  ###############################################################################################
  
  # Initialize variables
  B <- 5000
  x_mean <- 0
  y_mean <- 0
  D_perm <- rep(0, 5000)
  
  # Loop to randomly generate 5000 different orderings of "g"
  for (i in 1:B) {
    
    # Shuffle the ordering of "g"
    N$g <- sample(N$g)
    # Permuted means of x and y
    x_mean <- tapply(N$v, N$g, mean)[1]
    y_mean <- tapply(N$v, N$g, mean)[2]
    # Find permuted diff in means, denoted as D_perm
    D_perm[i] <- x_mean - y_mean 
    }
  
  # Find p-value 
  P_perm <- sum(D_perm >= D_obs)/ B
  
  # Output
  return(round(P_perm, digits = 5))
  
}

Here's the program that simulates 1000 permutation tests:

power_p <- function(numTrial, delta) {
  
  # Initilize variables
  P_p <- rep(0, numTrial) 
  pwr_p <- 0
  
  # Simulation
  P_p <- replicate(n = numTrial, expr = perm_test(delta))
  
  # Power estimates are the proportions of p-values that are significant (i.e. less than 0.05)
  pwr_p <- sum(P_p < 0.05) / numTrial
  
  # Output 
  return(round(pwr_p, digits = 5))

}

Solution

perm_test2 <- function(delta) {
  x <- rnorm(10, mean = delta, sd = 1)
  y <- rnorm(10, mean = 0, sd = 1)
  D_obs <- mean(x) - mean(y)
  v <- c(x, y)
  g <- rep(1:2, each = 10)
  B <- 5000
  y_mean <- x_mean <- 0
  D_perm <- rep(0, B)
  for (i in 1:B) {
    ii <- sample(g) == 1L
    D_perm[i] <- (sum(v[ii]) - sum(v[!ii]))/10 
  }
  P_perm <- sum(D_perm >= D_obs)/ B
  return(round(P_perm, digits = 5))
}

In comparison to more complicated approaches I propose the above one, where I have made some simple improvements to your existing code.

  1. we do not need to use data.frame, that only takes unnecessary time to subset vector each time
  2. instead of factor group vector we can use simple integer vector
  3. the ordering of data is not needed
  4. we can reduce mean calculation. In your code 2 calls of tapply(N$v, N$g, mean) was the slowest part.
  5. mean(x) is slower than sum(x)/n, it does additional checks etc., so in this situation we can use the faster approach. As the inner loop will be executed 1000x5000 times(sims x B).

bench::mark(perm_test_org(0.5), perm_test2(0.5), iterations = 5, check = F,
            relative = T)[, 1:8]
# A tibble: 2 x 8
#   expression           min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#   <bch:expr>         <dbl>  <dbl>     <dbl>     <dbl>    <dbl> <int> <dbl>
# 1 perm_test_org(0.5)  15.5   15.3       1        1.00     1        5    16
# 2 perm_test2(0.5)      1      1        13.5      1        1.69     5     2

Approximately 15x faster on my system. 1000 iterations took 33.89 seconds.

Update 2.

We can improve speed even more by:

  1. replacing sample with sample.int
  2. then we see that the g vector isn't needed at all to select random two groups
  3. in loop we do not need to sum both parts of vector v as we can sum(v) before loop, so we can do one sum less inside the loop and calculate the result values at end.

perm_test3 <- function(delta) {
  x <- rnorm(10, mean = delta, sd = 1)
  y <- rnorm(10, mean = 0, sd = 1)
  D_obs <- mean(x) - mean(y)
  v <- c(x, y)
  B <- 5000
  s <- sum(v)
  D_perm2 <- rep(0, B)
  for (i in 1:B) {
    D_perm2[i] <- sum(v[sample.int(10) < 6])
  }
  D_perm <- D_perm2 - (s - D_perm2)
  P_perm <- sum(D_perm/10 >= D_obs) / B
  return(round(P_perm, digits = 5))
}

Runs in +/- 20 seconds for 1000 iterations. Now the slowest part is repeated call of sample.int. You can look into faster functions: https://www.r-bloggers.com/2019/04/fast-sampling-support-in-dqrng/



Answered By - minem
Answer Checked By - Cary Denson (PHPFixing Admin)
  • Share This:  
  •  Facebook
  •  Twitter
  •  Stumble
  •  Digg
Newer Post Older Post Home

0 Comments:

Post a Comment

Note: Only a member of this blog may post a comment.

Total Pageviews

Featured Post

Why Learn PHP Programming

Why Learn PHP Programming A widely-used open source scripting language PHP is one of the most popular programming languages in the world. It...

Subscribe To

Posts
Atom
Posts
Comments
Atom
Comments

Copyright © PHPFixing