When doing parallel programming in R, the fork approach is only available on Unix-alike systems such as Linux, not on Windows (at least not to my knowledge). So we will be using Ubuntu and a function called mclapply, which is a spiffed-up version of lapply that can spread computations across more than one core.
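As a minimal sketch of the idea (on a Unix-alike system), mclapply can be dropped in wherever lapply fits; the toy function here is just for illustration:
library(parallel)
# square each number, spreading the work across 2 cores
mclapply(1:4, function(x) x^2, mc.cores = 2)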
Let's go over some ways to carry out repetition in R.
First, we will use the iris data frame that comes built into R to draw 10,000 samples with replacement. We will then calculate the column means of each sample, after which we will run a regression on each sample.
set.seed(12756)
data <- iris
head(data)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
reps <- 10000 #number of replications
samp_mean <- matrix(nrow = reps, ncol = 4) #matrix to store mean of each replication
store_coef <- matrix(nrow = reps, ncol = 4) #matrix to store coefficients of each replication
system.time(
  for (i in 1:reps) {
    samp_data <- data[sample(nrow(data), nrow(data), replace = TRUE), ]
    samp_mean[i, ] <- c(mean(samp_data[, 1]), mean(samp_data[, 2]),
                        mean(samp_data[, 3]), mean(samp_data[, 4]))
    reg <- lm(samp_data[, 1] ~ samp_data[, 2] + samp_data[, 3] + samp_data[, 4])
    store_coef[i, ] <- coefficients(reg)
  }
)
## user system elapsed
## 3.602 0.013 3.621
head(samp_mean)
## [,1] [,2] [,3] [,4]
## [1,] 5.888000 3.008667 3.916000 1.275333
## [2,] 5.920667 3.060000 3.926667 1.270000
## [3,] 5.828667 3.031333 3.679333 1.158000
## [4,] 5.703333 3.090667 3.380667 1.019333
## [5,] 5.798000 3.074667 3.656667 1.156000
## [6,] 5.790667 3.078667 3.571333 1.122000
head(store_coef)
## [,1] [,2] [,3] [,4]
## [1,] 2.046484 0.5814508 0.6999910 -0.5089182
## [2,] 1.439517 0.7568004 0.7546554 -0.6282994
## [3,] 2.011329 0.6003815 0.7933668 -0.7959236
## [4,] 2.088826 0.6158748 0.6661083 -0.5305879
## [5,] 1.860476 0.6219353 0.7766159 -0.7046237
## [6,] 2.268994 0.5240293 0.7521915 -0.6933740
set.seed(12756)
rep_count <- seq(1, 1000)
samp_mean <- matrix(nrow = length(rep_count), ncol = 4) # matrix to store mean of each replication
store_coef <- matrix(nrow = length(rep_count), ncol = 4) # matrix to store coefficients of each replication
#declare the function
system.time({
  lm_fun <- function(rep_count) {
    samp_data <- data[sample(nrow(data), nrow(data), replace = TRUE), ]
    mean_fn <- sapply(samp_data[, 1:4], mean) # column means of this sample
    samp_mean[rep_count, ] <- mean_fn # note: this only modifies a copy local to the function
    reg <- lm(samp_data[, 1] ~ samp_data[, 2] + samp_data[, 3] + samp_data[, 4])
    coefficients(reg) # return the coefficient vector
  }
  coef <- lapply(rep_count, lm_fun) # results are stored as a list
  for (i in seq_along(rep_count)) {
    store_coef[i, ] <- unlist(coef[[i]])
  }
  # an easier alternative to the loop: store_coef <- do.call(rbind, coef)
})
## user system elapsed
## 0.584 0.027 0.612
head(store_coef)
## [,1] [,2] [,3] [,4]
## [1,] 2.046484 0.5814508 0.6999910 -0.5089182
## [2,] 1.439517 0.7568004 0.7546554 -0.6282994
## [3,] 2.011329 0.6003815 0.7933668 -0.7959236
## [4,] 2.088826 0.6158748 0.6661083 -0.5305879
## [5,] 1.860476 0.6219353 0.7766159 -0.7046237
## [6,] 2.268994 0.5240293 0.7521915 -0.6933740
hist(store_coef[,1])
As can be seen, using the lapply function instead of conventional looping in R reduces the execution time. Keep in mind, though, that the lapply run above uses 1,000 replications while the loop used 10,000, so the two timings are not directly comparable. Next, let's check the speed with mclapply.
library(parallel)
RNGkind("L'Ecuyer-CMRG")
#M <- 16 ## start M workers
#s <- .Random.seed
#for (i in 1:M) {
# s <- nextRNGStream(s)# send s to worker i as .Random.seed
#}
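For illustration, here is a runnable version of the commented sketch above (a minimal sketch, run at the top level of a session); manually advancing the stream like this is roughly what mclapply does internally for each worker when mc.set.seed = TRUE:
library(parallel)
RNGkind("L'Ecuyer-CMRG")
set.seed(12756)
s <- .Random.seed
for (i in 1:3) {
  s <- nextRNGStream(s) # advance to the next independent stream
  .Random.seed <- s # act as worker i: install its stream as the current seed
  print(runif(3)) # each "worker" draws from its own stream
}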
numcores <- detectCores()
store_coef_par <- matrix(nrow = length(rep_count), ncol = 4) # matrix to store coefficients of each replication using mclapply
set.seed(12756)
system.time({
  coef_par <- mclapply(rep_count, lm_fun, mc.cores = numcores, mc.set.seed = TRUE)
  for (i in seq_along(rep_count)) {
    store_coef_par[i, ] <- unlist(coef_par[[i]])
  }
  # again, do.call(rbind, coef_par) would be an easier alternative
})
## user system elapsed
## 0.754 0.266 0.172
head(store_coef_par)
## [,1] [,2] [,3] [,4]
## [1,] 1.938922 0.6275791 0.7327636 -0.6513998
## [2,] 1.677495 0.6761382 0.8272920 -0.8376547
## [3,] 2.034819 0.6026271 0.6455032 -0.3868924
## [4,] 1.712390 0.6784656 0.7517035 -0.6209716
## [5,] 1.740241 0.6681291 0.7342910 -0.6313829
## [6,] 1.663684 0.7394749 0.5613409 -0.1907460
What is happening is that when lapply() is used, the computation runs on a single core. Setting the seed with set.seed() ensures that the same stream of random numbers is used to draw the samples. This makes the task reproducible every time the code is run. However, setting the seed works differently in parallel computing.
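For example, resetting the seed reproduces exactly the same draws in a serial session:
set.seed(12756)
runif(3)
set.seed(12756)
runif(3) # identical to the draws above, because the generator was reset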
When it comes to parallel computing, the task will be shared between
detectCores()
## [1] 10
workers, so one needs to be careful about seeding when using the fork approach.1 The analogy here is that each core acts as a CPU. When doing something repetitive with, say, the mclapply function, we want to make sure that each worker gets a different seed. This is done by setting the mc.set.seed = TRUE option. If mc.set.seed = FALSE, each worker takes the master seed, which means the same stream of random numbers is used on every worker. This can be demonstrated as below:
mclapply(seq(1,9), function(x) {runif(7)}, mc.cores = 3, mc.set.seed = FALSE)
## [[1]]
## [1] 0.8915028 0.1611878 0.9919948 0.9355318 0.9845871 0.7261969 0.6365444
##
## [[2]]
## [1] 0.8915028 0.1611878 0.9919948 0.9355318 0.9845871 0.7261969 0.6365444
##
## [[3]]
## [1] 0.8915028 0.1611878 0.9919948 0.9355318 0.9845871 0.7261969 0.6365444
##
## [[4]]
## [1] 0.3348094 0.3779876 0.6421789 0.6764229 0.2692946 0.7680416 0.6337699
##
## [[5]]
## [1] 0.3348094 0.3779876 0.6421789 0.6764229 0.2692946 0.7680416 0.6337699
##
## [[6]]
## [1] 0.3348094 0.3779876 0.6421789 0.6764229 0.2692946 0.7680416 0.6337699
##
## [[7]]
## [1] 0.1716602 0.3733823 0.6777974 0.4555469 0.9167586 0.8309814 0.3541566
##
## [[8]]
## [1] 0.1716602 0.3733823 0.6777974 0.4555469 0.9167586 0.8309814 0.3541566
##
## [[9]]
## [1] 0.1716602 0.3733823 0.6777974 0.4555469 0.9167586 0.8309814 0.3541566
Here we generate 9 samples, each containing 7 numbers drawn from a uniform distribution, with the task divided between 3 cores. The generated samples are identical in groups of three, which is not a desirable property. A common starting seed (from the master process) is used by all 3 cores, so each core's first job draws the same numbers, making the first three samples identical. The next draw on each core again starts from the same state, which makes the next three samples identical as well.
This can be taken care of by setting mc.set.seed = TRUE, which ensures that each worker or core gets a distinct seed. However, with an ordinary RNG the generated numbers can still be correlated across streams after a certain number of steps. What we want is a generator whose period is long enough that the probability of such systematic repetition of random numbers across streams is low. This is why we set the "L'Ecuyer-CMRG" RNG from the parallel package above: its streams are designed so that the generated numbers are not correlated across workers.
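Putting the pieces together (a minimal sketch; output omitted), the nine samples below should all be distinct, and rerunning with the same set.seed() call should reproduce them:
library(parallel)
RNGkind("L'Ecuyer-CMRG")
set.seed(12756)
mclapply(seq(1, 9), function(x) runif(7), mc.cores = 3, mc.set.seed = TRUE)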
Here are some links that discuss parallel programming in R in detail:
- Setting seed while parallel programming
- R programming for data science
- Intro to parallel computing in R
- Parallelization of regression
The other is the socket approach, in which each process on each core is unique.↩︎