Issue
Hi, I'm trying to optimize the following program for a Monte Carlo power study. I first use an algorithm taken from Efron and Tibshirani (1993) to find the p-value of an upper-tailed permutation test for equality of means. I wrote a function called perm_test(), whose output is a single p-value. I then call this function in another program, power_p(), that simulates 1000 permutation tests (returning 1000 p-values). My power estimate is the proportion of these 1000 p-values that are statistically significant, i.e., less than 0.05. The entire process takes about 8 minutes to run (2020 MacBook Pro). I'm wondering if anyone has suggestions for making it run faster. Many thanks.
library(dplyr)  # needed for arrange()

perm_test <- function(delta) {
  # Permutation test as described in Efron and Tibshirani (1993), Algorithm 15.1, p. 208
  # Draw random samples from a normal distribution
  x <- rnorm(10, mean = delta, sd = 1)
  y <- rnorm(10, mean = 0, sd = 1)
  # Observed difference in means, denoted D_obs
  D_obs <- mean(x) - mean(y)
  # Create a data frame "N" with n_x + n_y observations (20 rows in our case)
  N <- data.frame("v" = c(x, y))
  # Create a group variable "g" indicating which group each observation belongs to
  N$g <- as.factor(c(rep("x", 10), rep("y", 10)))
  # Arrange column "v" in ascending order
  # (notice that column "g" is also re-ordered)
  N <- arrange(N, v)
  ##############################################################################################
  # There are 20 choose 10 (184756) possibilities for the ordering of "g",                     #
  # corresponding to all possible ways of partitioning 20 elements into two subsets of size 10.#
  # We take only a random sample of 5000 from those 184756 possibilities.                      #
  ##############################################################################################
  # Initialize variables
  B <- 5000
  x_mean <- 0
  y_mean <- 0
  D_perm <- rep(0, B)
  # Loop to randomly generate 5000 different orderings of "g"
  for (i in 1:B) {
    # Shuffle the ordering of "g"
    N$g <- sample(N$g)
    # Permuted means of x and y
    x_mean <- tapply(N$v, N$g, mean)[1]
    y_mean <- tapply(N$v, N$g, mean)[2]
    # Permuted difference in means, denoted D_perm
    D_perm[i] <- x_mean - y_mean
  }
  # Find the p-value
  P_perm <- sum(D_perm >= D_obs) / B
  # Output
  return(round(P_perm, digits = 5))
}
Here's the program that simulates 1000 permutation tests:
power_p <- function(numTrial, delta) {
  # Initialize variables
  P_p <- rep(0, numTrial)
  pwr_p <- 0
  # Simulation: run numTrial permutation tests
  P_p <- replicate(n = numTrial, expr = perm_test(delta))
  # Power estimate is the proportion of p-values that are significant (i.e., less than 0.05)
  pwr_p <- sum(P_p < 0.05) / numTrial
  # Output
  return(round(pwr_p, digits = 5))
}
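For reference, the roughly eight-minute workload is a single call like the one below; the delta value is only an illustrative choice, not one given in the question.
# Illustrative call; delta = 0.5 is an assumed effect size
set.seed(123)
power_p(numTrial = 1000, delta = 0.5)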
Solution
perm_test2 <- function(delta) {
  x <- rnorm(10, mean = delta, sd = 1)
  y <- rnorm(10, mean = 0, sd = 1)
  D_obs <- mean(x) - mean(y)
  # Plain vectors instead of a data.frame; a simple integer group vector instead of a factor
  v <- c(x, y)
  g <- rep(1:2, each = 10)
  B <- 5000
  y_mean <- x_mean <- 0
  D_perm <- rep(0, B)
  for (i in 1:B) {
    # Shuffle the group labels; "ii" marks which elements land in group 1
    ii <- sample(g) == 1L
    # Difference in means via sums (both groups have 10 elements)
    D_perm[i] <- (sum(v[ii]) - sum(v[!ii])) / 10
  }
  P_perm <- sum(D_perm >= D_obs) / B
  return(round(P_perm, digits = 5))
}
In comparison to more complicated approaches, I propose the one above, where I have made some simple improvements to your existing code:
- We do not need to use a data.frame; it only wastes time subsetting the vector on each iteration. Instead of a factor group vector we can use a simple integer vector.
- The ordering of the data is not needed.
- We can reduce the mean calculations. In your code the two calls to tapply(N$v, N$g, mean) were the slowest part. mean(x) is slower than sum(x)/n because it does additional checks etc., so in this situation we can use the faster approach, as the inner loop will be executed 1000 x 5000 times (sims x B); a quick comparison of the two is sketched right after this list.
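As a quick, machine-dependent illustration of that last point (not part of the original answer), the two can be compared directly with bench::mark():
x <- rnorm(20)
# check = FALSE because we only care about speed here
bench::mark(mean(x), sum(x) / length(x), check = FALSE)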
bench::mark(perm_test_org(0.5), perm_test2(0.5), iterations = 5, check = F,
            relative = T)[, 1:8]
# A tibble: 2 x 8
#   expression            min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#   <bch:expr>          <dbl>  <dbl>     <dbl>     <dbl>    <dbl> <int> <dbl>
# 1 perm_test_org(0.5)   15.5   15.3      1         1.00     1        5    16
# 2 perm_test2(0.5)       1      1       13.5       1        1.69     5     2
Approximately 15x faster on my system. 1000 iterations took 33.89 seconds.
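If you want to reproduce that rough timing (delta = 0.5 is again only an assumed effect size, and the exact number will differ between machines), something like this works:
# Time 1000 simulated tests with the improved function
system.time(replicate(1000, perm_test2(0.5)))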
Update 2.
We can improve speed even more by:
- replacing sample with sample.int;
- noticing that the g vector isn't needed at all to select two random groups;
- not summing both parts of the vector v inside the loop: we can compute sum(v) once before the loop, do one sum less per iteration, and calculate the resulting values at the end.
perm_test3 <- function(delta) {
  x <- rnorm(10, mean = delta, sd = 1)
  y <- rnorm(10, mean = 0, sd = 1)
  D_obs <- mean(x) - mean(y)
  v <- c(x, y)
  B <- 5000
  # Total sum, computed once outside the loop
  s <- sum(v)
  D_perm2 <- rep(0, B)
  for (i in 1:B) {
    # Pick 10 of the 20 indices at random and sum only that group
    D_perm2[i] <- sum(v[sample.int(20L, 10L)])
  }
  # Group-1 sum minus group-2 sum; dividing by 10 gives the difference in means
  D_perm <- D_perm2 - (s - D_perm2)
  P_perm <- sum(D_perm / 10 >= D_obs) / B
  return(round(P_perm, digits = 5))
}
Runs in roughly 20 seconds for 1000 iterations.
Now the slowest part is the repeated call to sample.int. You can look into faster sampling functions:
https://www.r-bloggers.com/2019/04/fast-sampling-support-in-dqrng/
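As an illustration of that direction (not from the original answer), here is a sketch of perm_test3() with base sample.int() swapped for dqrng::dqsample.int(), assuming the dqrng package is installed; perm_test4 is just an illustrative name:
library(dqrng)

perm_test4 <- function(delta) {
  x <- rnorm(10, mean = delta, sd = 1)
  y <- rnorm(10, mean = 0, sd = 1)
  D_obs <- mean(x) - mean(y)
  v <- c(x, y)
  B <- 5000
  s <- sum(v)
  D_perm2 <- rep(0, B)
  for (i in 1:B) {
    # dqsample.int() draws 10 of the 20 indices without replacement,
    # using dqrng's faster generator instead of base R's sample.int()
    D_perm2[i] <- sum(v[dqsample.int(20L, 10L)])
  }
  D_perm <- D_perm2 - (s - D_perm2)
  P_perm <- sum(D_perm / 10 >= D_obs) / B
  return(round(P_perm, digits = 5))
}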
Answered By - minem