Normally, it is better to avoid loops in R and to use vectorized operations instead. But for highly specialized tasks, vectorization is not always possible. In that case a loop is needed, provided the problem can be decomposed into separate iterations.
Which kinds of loops exist in R, and which one should you use in which situation?
Almost every programming language has for-loops and while-loops (sometimes also until-loops). These loops run sequentially and, in R, are not particularly fast.
for (i in x) {
  task  # placeholder for the work done in each iteration
}
i <- y
while (i <= x) {
  task
  i <- i + 1
}
Sometimes they are too slow even for prototyping.
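As a minimal sketch of the two loop forms above, the following sums the squares of a small vector once with a for-loop and once with a while-loop. The names x, total_for, total_while, and total_vec are illustrative only and do not come from the original text.

```r
x <- c(1, 2, 3, 4, 5)

# for-loop: iterate directly over the elements of x
total_for <- 0
for (value in x) {
  total_for <- total_for + value^2
}

# while-loop: the index has to be managed by hand
total_while <- 0
i <- 1
while (i <= length(x)) {
  total_while <- total_while + x[i]^2
  i <- i + 1
}

# vectorized equivalent, usually preferable whenever it exists
total_vec <- sum(x^2)
```

All three variants compute the same value; the loops only become necessary when no vectorized form is available.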
But how can you improve the speed?
There are three options in R:
- apply loops
- parallelization
- Rcpp
apply loops:
Normally, apply is used to compute standard statistics over the rows, the columns, or both, of a matrix or data frame. With a small trick, however, you can also use the apply function as a loop. The syntax is:
F <- function(i, x, y, z, ...) {
  task
}
apply(as.data.frame(seq_along(vector)), MARGIN = 1, FUN = F)
In this case the vector is not used for the calculation directly; its positions serve as the index "i" instead.
The sapply function is even faster.
F <- function(i, x, y, z, ...) {
  task
}
sapply(seq_along(vector), FUN = F)
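The index trick can be sketched with a small, made-up example. The function bmi_at and the vectors heights and weights are hypothetical names chosen for illustration; the extra data is handed to the function through the additional arguments of sapply.

```r
heights <- c(1.60, 1.75, 1.82)
weights <- c(55, 72, 90)

# i is the index; the data vectors are passed as extra arguments
bmi_at <- function(i, h, w) {
  w[i] / h[i]^2
}

bmi <- sapply(seq_along(heights), FUN = bmi_at, h = heights, w = weights)

# vapply does the same but also checks that each result is a single number
bmi_checked <- vapply(seq_along(heights), bmi_at, numeric(1),
                      h = heights, w = weights)
```

seq_along(heights) is used instead of 1:length(heights) because it also behaves correctly for an empty vector.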
Parallelization:
Loops and apply functions can also be run in parallel. You need:
library("doParallel")
library("parallel")
library("foreach")
First, define the number of cores to use. Leave at least one core free for the rest of the system:
NumOfCores <- detectCores() - 1
registerDoParallel(NumOfCores)
Either use a foreach loop:
foreach::foreach(x = seq_along(vector), .combine = rbind, .inorder = TRUE, .multicombine = FALSE) %dopar% {
  task
}
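With .combine = rbind, the results of the iterations are bound together row by row, which yields a matrix; use .combine = c if you want a plain vector.
If the order of the results is not important, you can improve performance with .inorder = FALSE. The results are then combined as soon as they arrive instead of in the order of the iterations.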
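A minimal foreach sketch, assuming the foreach and doParallel packages are installed. The vector values and the choice of two workers are illustrative assumptions, not part of the original text.

```r
library(foreach)
library(doParallel)

# two workers keep the sketch small; adjust this to your machine
cl <- parallel::makeCluster(2)
registerDoParallel(cl)

values <- c(2, 4, 6, 8)

# each iteration returns one number; .combine = c collects them into a vector
squares <- foreach(i = seq_along(values), .combine = c) %dopar% {
  values[i]^2
}

# release the workers when the parallel part is finished
parallel::stopCluster(cl)
```

Because .inorder defaults to TRUE, the elements of squares appear in the same order as the elements of values.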
Or use parSapply:
clusters <- makeCluster(NumOfCores)
parSapply(cl = clusters, X = seq_along(vector), FUN = F, x = x, y = y, z = z, …)
stopCluster(clusters)  # release the workers when you are done
In this case you must pass all data the function needs as arguments inside the call (or export it to the workers with clusterExport). The worker processes cannot access your workspace the way the ordinary sapply function can.
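The point about passing data explicitly can be sketched as follows. The function scale_at, the vector values, and the argument multiplier are hypothetical names, and the two workers are chosen only for illustration.

```r
library(parallel)

cl <- makeCluster(2)  # two workers, chosen only for illustration

# everything the function needs arrives through its arguments
scale_at <- function(i, v, multiplier) {
  v[i] * multiplier
}

values <- c(1, 2, 3, 4)

# v and multiplier are shipped to the workers as arguments of the call
scaled <- parSapply(cl = cl, X = seq_along(values), FUN = scale_at,
                    v = values, multiplier = 10)

stopCluster(cl)  # release the workers when done
```

If scale_at referred to values directly instead of receiving it as v, the workers would not find the object unless it had been exported to them first.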
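Rcpp:
First, you need a working C++ compiler; on Windows this means installing Rtools.
library("Rcpp")
Then define a function in C++. Rcpp compiles the code into a shared library and makes the function available in R (for example when you load the file with sourceCpp()).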
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
double NameOfFunction(int i, NumericVector y) {
  // i: index supplied by sapply; y: data vector passed as "y = ..."
  task
}
Then you can call it in R:
sapply(X = seq_along(testVec), FUN = NameOfFunction, y = Vector)
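For a quick experiment you do not need a separate C++ file. The sketch below uses cppFunction() from Rcpp to compile a small function inline; it assumes Rcpp and a C++ compiler are installed. The function sumOfSquares is a made-up example and is not the author's NameOfFunction.

```r
library(Rcpp)

# cppFunction() compiles the C++ source and creates an R function of the same name
cppFunction("
double sumOfSquares(NumericVector v) {
  double total = 0.0;
  for (int k = 0; k < v.size(); ++k) {
    total += v[k] * v[k];   // the loop runs in compiled code
  }
  return total;
}
")

result <- sumOfSquares(c(1, 2, 3))
```

Here the whole loop lives in C++, so R calls the compiled function only once instead of once per iteration.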
But when should you use which kind of loop?
From my experience, I recommend basing the decision on the number of iterations and the cost of each iteration.
| | Not costly | Costly |
|---|---|---|
| Low number of iterations | for-loop, while-loop | Rcpp, foreach |
| Large number of iterations | Rcpp, sapply, apply, lapply, for-loop, while-loop | parSapply, Rcpp |