load_all()
Statistical Functions
Motivation
This chapter covers the implementation of simple statistical functions in C++ and R. The goal is to show the syntax differences between the two languages and compare their performance.
Fair Warning
These functions ignore NA
values. Adjustments for handling NA
values will be introduced in the sixth chapter.
R already provides efficient versions of the functions covered here. Code optimizations and improvements will be made in later chapters.
Statistical details
The explanations and equations used for the functions are taken from Diez et al. (2015) and Hansen (2022). Some examples were adapted from Vaughan, Hester, and Francois (2024).
Load the Package
I loaded the ece244
package as I added the functions from the next sections to it with the following code:
Additional Packages
I used the bench
package to compare the performance of the functions. The package was loaded with the following code:
library(bench)
Sum of Vector Elements (sum()
)
For a vector of \(n\) elements \(x_1, x_2, \ldots, x_n\), the sum is calculated as:
\[ \sum_{i=1}^{n} x_i = x_1 + x_2 + \ldots + x_n \]
The following C++ function calculates the sum of a vector’s elements:
[[cpp11::register]] double sum_cpp_(doubles x) {
int n = x.size();
double total = 0;
for(int i = 0; i < n; ++i) {
+= x[i];
total }
return total;
}
If the previous function were a cooking recipe, it would be:
- Ingredients: A vector “x” in a container of type “doubles” (
doubles x
). - Preparation:
- Count the vector’s coordinates and store the result in an integer variable “n” (
int n = x.size()
). - Take a mixing bowl “total” of type “double” and verify it is empty (
double total = 0
). - For each element \(x_i\) in the vector, take the element and add it to the total (
total += x[i]
). - After \(x_n\) was added to the total, return the total (
return total
).
- Count the vector’s coordinates and store the result in an integer variable “n” (
Its R equivalent is:
#' Return the sum of the coordinates of a vector (R)
#' @param x numeric vector
#' @export
<- function(x) {
sum_r <- 0
total for (i in seq_along(x)) {
<- total + x[i]
total
}
total }
To document the C++ function, I added the following wrapper to the R code:
#' Return the sum of the coordinates of a vector (C++)
#' @inheritParams sum_r
#' @export
<- function(x) {
sum_cpp sum_cpp_(x)
}
To test the functions, I ran the following benchmark code in the R console:
set.seed(123) # for reproducibility
<- runif(1e3) # 1,000,000 elements
x
sum(x)
[1] 497.2778
sum_cpp(x)
[1] 497.2778
sum_r(x)
[1] 497.2778
mark(
sum(x),
sum_cpp(x),
sum_r(x)
)
# A tibble: 3 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 sum(x) 759.03ns 842.15ns 1061624. 0B 106.
2 sum_cpp(x) 4.69µs 5.22µs 179337. 0B 17.9
3 sum_r(x) 16.45µs 17.5µs 54858. 16.6KB 0
Arithmetic Mean (mean()
)
The arithmetic mean of a vector of \(n\) elements \(x_1, x_2, \ldots, x_n\) is calculated as:
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]
The following C++ function calculates the mean of a vector’s elements:
[[cpp11::register]] double mean_cpp_(doubles x) {
int n = x.size();
double y = 0;
for(int i = 0; i < n; ++i) {
+= x[i];
y }
return y / n;
}
If the previous function were a cooking recipe, it would be:
- Ingredients: A vector “x” in a container of type “doubles” (
doubles x
). - Preparation:
- Count the vector’s coordinates and store the result in an integer variable “n” (
int n = x.size()
). - Take a mixing bowl “y” of type “double” and verify it is empty (
double y = 0
). - For each element \(x_i\) in the vector, take the element and add it to the total (
y += x[i]
). - After \(x_n\) was added to the total, return the total divided by the number of elements (
return y / n
).
- Count the vector’s coordinates and store the result in an integer variable “n” (
Its R equivalent is:
#' Return the mean of a vector (R)
#' @param x numeric vector
#' @export
<- function(x) {
mean_r sum_r(x) / length(x)
}
To document the C++ function, I added the following wrapper to the R code:
#' Return the mean of a vector (C++)
#' @inheritParams mean_r
#' @export
<- function(x) {
mean_cpp mean_cpp_(x)
}
To test the functions, I ran the following benchmark code in the R console:
mean(x)
[1] 0.4972778
mean_cpp(x)
[1] 0.4972778
mean_r(x)
[1] 0.4972778
mark(
mean(x),
mean_cpp(x),
mean_r(x)
)
# A tibble: 3 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 mean(x) 3.17µs 4.69µs 211058. 0B 21.1
2 mean_cpp(x) 5.11µs 5.38µs 172074. 0B 0
3 mean_r(x) 17.1µs 18.16µs 52822. 0B 5.28
Variance (var()
)
The variance of a vector of \(n\) elements \(x_1, x_2, \ldots, x_n\) is calculated as:
\[ \text{Var}(x) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]
The following C++ function calculates the variance of a vector’s elements:
[[cpp11::register]] double var_cpp_(doubles x) {
int n = x.size();
double y1 = 0, y2 = 0;
for(int i = 0; i < n; ++i) {
+= x[i];
y1 += pow(x[i], 2.0);
y2 }
return (y2 - pow(y1, 2.0) / n) / (n - 1);
}
If the previous function were a cooking recipe, it would be:
- Ingredients: A vector “x” in a container of type “doubles” (
doubles x
). - Preparation:
- Count the vector’s coordinates and store the result in an integer variable “n” (
int n = x.size()
). - Take two mixing bowls “y1” and “y2” of type “double” and verify they are empty (
double y1 = 0, y2 = 0
). - For each element \(x_i\) in the vector, take the element, add it to “y1” (
y1 += x[i]
), and then square it and
add it to “y2” (y2 += pow(x[i], 2.0)
). - After \(x_n\) was added to “y1” and “y2”, return the variance of the vector (
return (y2 - pow(y1, 2.0) / n) / (n - 1)
).
- Count the vector’s coordinates and store the result in an integer variable “n” (
Its R equivalent is:
#' Return the variance of a vector (R)
#' @param x numeric vector
#' @export
<- function(x) {
var_r mean_r((x - mean_r(x))^2)
}
To document the C++ function, I added the following wrapper to the R code:
#' Return the variance of a vector (C++)
#' @inheritParams var_r
#' @export
<- function(x) {
var_cpp var_cpp_(x)
}
To test the functions, I ran the following benchmark code in the R console:
var(x)
[1] 0.082647
var_cpp(x)
[1] 0.082647
var_r(x)
[1] 0.082647
mark(
var(x),
var_cpp(x),
var_r(x)
)
# A tibble: 3 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 var(x) 5.51µs 7.12µs 131779. 0B 39.5
2 var_cpp(x) 22.32µs 25.05µs 38863. 0B 0
3 var_r(x) 32.82µs 35.16µs 27834. 42.4KB 0
Root Mean Square Error (RMSE)
The RMSE function measures the differences between observed values and the true value.
For a vector of \(n\) elements \(x_1, x_2, \ldots, x_n\) and a value \(x_0\), the RMSE is calculated as:
\[ \text{RMSE}(x, x_0) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - x_0)^2} \]
The following C++ function calculates the difference of a vector’s elements to a value and returns the square root of the mean of the squared differences:
[[cpp11::register]] double rmse_cpp_(doubles x, double x0) {
int n = x.size();
double y = 0;
for (int i = 0; i < n; ++i) {
+= pow(x[i] - x0, 2.0);
y }
return sqrt(y / n);
}
If the previous function were a cooking recipe, it would be:
- Ingredients: A vector “x” in a container of type “doubles” and a value “x0” in a container of type “double” (
doubles x, double x0
). - Preparation:
- Count the vector’s coordinates and store the result in an integer variable “n” (
int n = x.size()
). - Take a mixing bowl “y” of type “double” and verify it is empty (
double y = 0
). - For each element \(x_i\) in the vector, take the element, subtract the value \(x_0\), square it, and add it to “y” (
y += pow(x[i] - x0, 2.0)
). - After \(x_n\) was added to “y”, return the square root of the mean of the squared differences (
return sqrt(y / n)
).
- Count the vector’s coordinates and store the result in an integer variable “n” (
Its R equivalent is:
#' Return the root mean square error (R)
#' @param x numeric vector
#' @param x0 numeric value
#' @export
<- function(x, x0) {
rmse_r sqrt(sum((x - x0)^2) / length(x))
}
To document the C++ function, I added the following wrapper to the R code:
#' Return the root mean square error (C++)
#' @inheritParams rmse_r
#' @export
<- function(x, x0) {
rmse_cpp rmse_cpp_(x, x0)
}
To test the functions, I ran the following benchmark code in the R console:
# create a list with 100 normal distributions with mean 0 and 1 million elements
set.seed(123)
<- list()
x for (i in 1:1e3) {
<- rnorm(1e3)
x[[i]]
}
# compute the mean of each distribution
<- sapply(x, mean)
x
rmse_cpp(x, 0)
[1] 0.03005874
rmse_r(x, 0)
[1] 0.03005874
mark(
rmse_cpp(x, 0),
rmse_r(x, 0)
)
# A tibble: 2 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 rmse_cpp(x, 0) 19.97µs 20.95µs 45795. 0B 4.58
2 rmse_r(x, 0) 2.59µs 4.52µs 208579. 7.86KB 20.9