07 - Statistical Functions
Note
These functions ignore NA
values for now. Adjustments
for handling NA
values are covered in a separate
vignette.
R already provides efficient versions of the functions covered here. This is just to illustrate how to use C++ code.
The explanations and equations used for the functions are taken from Diez et al. (2015) and Hansen (2022). Some examples were adapted from Vaughan, Hester, and Francois (2024).
Sum
For a vector of \(n\) elements \(x_1, x_2, \ldots, x_n\), the sum is calculated as:
\[ \sum_{i=1}^{n} x_i = x_1 + x_2 + \ldots + x_n \]
The following C++ function calculates the sum of a vector’s elements:
[[cpp4r::register]]
double sum_cpp(doubles x) {
int n = x.size();
double total = 0;
for(int i = 0; i < n; ++i) {
total += x[i];
}
return total;
}
Its R equivalent is:
sum_r <- function(x) {
total <- 0
for (i in seq_along(x)) {
total <- total + x[i]
}
total
}
Document the functions and update the package as in the previous vignettes.
To test the functions, you can run the following benchmark code in the R console:
# install.packages("bench")
library(bench)
set.seed(123) # for reproducibility
x <- runif(1e3) # 1,000,000 elements
sum(x)
sum_cpp(x)
sum_r(x)
mark(
sum(x),
sum_cpp(x),
sum_r(x)
)
Arithmetic mean
The arithmetic mean of a vector of \(n\) elements \(x_1, x_2, \ldots, x_n\) is calculated as:
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]
The following C++ function calculates the mean of a vector’s elements:
[[cpp4r::register]]
double mean_cpp(doubles x) {
int n = x.size();
double y = 0;
for(int i = 0; i < n; ++i) {
y += x[i];
}
return y / n;
}
Its R equivalent would be:
mean_r <- function(x) {
sum_r(x) / length(x)
}
Document the functions and update the package as in the previous vignettes.
To test the functions, you can run the following benchmark code in the R console:
mean(x)
mean_cpp(x)
mean_r(x)
mark(
mean(x),
mean_cpp(x),
mean_r(x)
)
Variance
The variance of a vector of \(n\) elements \(x_1, x_2, \ldots, x_n\) is calculated as:
\[ \text{Var}(x) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]
The following C++ function calculates the variance of a vector’s elements:
[[cpp4r::register]]
double var_cpp(doubles x) {
int n = x.size();
double y1 = 0, y2 = 0;
for(int i = 0; i < n; ++i) {
y1 += x[i];
y2 += pow(x[i], 2.0);
}
return (y2 - pow(y1, 2.0) / n) / (n - 1);
}
Its R equivalent would be:
var_r <- function(x) {
mean_r((x - mean_r(x))^2)
}
Document the functions and update the package as in the previous vignettes.
To test the functions, you can run the following benchmark code in the R console:
var(x)
var_cpp(x)
var_r(x)
mark(
var(x),
var_cpp(x),
var_r(x)
)
Root Mean Square Error (RMSE)
The RMSE function measures the differences between observed values and the true value.
For a vector of \(n\) elements \(x_1, x_2, \ldots, x_n\) and a value \(x_0\), the RMSE is calculated as:
\[ \text{RMSE}(x, x_0) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - x_0)^2} \]
The following C++ function calculates the difference of a vector’s elements to a value and returns the square root of the mean of the squared differences:
[[cpp4r::register]]
double rmse_cpp(doubles x, double x0) {
int n = x.size();
double y = 0;
for (int i = 0; i < n; ++i) {
y += pow(x[i] - x0, 2.0);
}
return sqrt(y / n);
}
Its R equivalent would be:
#' Return the root mean square error (R)
#' @param x numeric vector
#' @param x0 numeric value
#' @export
rmse_r <- function(x, x0) {
sqrt(sum((x - x0)^2) / length(x))
}
Document the functions and update the package as in the previous vignettes.
To test the functions, you can run the following benchmark code in the R console:
# create a list with 100 normal distributions with mean 0 and 1 million elements
set.seed(123)
x <- list()
for (i in 1:1e3) {
x[[i]] <- rnorm(1e3)
}
# compute the mean of each distribution
x <- sapply(x, mean)
rmse_cpp(x, 0)
rmse_r(x, 0)
mark(
rmse_cpp(x, 0),
rmse_r(x, 0)
)