07 - Statistical Functions

Notes

These functions ignore NA values for now. Adjustments for handling NA values are covered in a separate vignette.
R already provides efficient versions of the functions covered here. This is just to illustrate how to use C++ code.
The explanations and equations used for the functions are taken from Diez et al. (2015) and Hansen (2022). Some examples were adapted from Vaughan, Hester, and Francois (2024).

Sum

For a vector of \(n\) elements \(x_1, x_2, \ldots, x_n\), the sum is calculated as:

\[ \sum_{i=1}^{n} x_i = x_1 + x_2 + \ldots + x_n \]

The following C++ function calculates the sum of a vector’s elements:

[[cpp4r::register]]
double sum_cpp(doubles x) {
  int n = x.size();
  double total = 0;
  for(int i = 0; i < n; ++i) {
    total += x[i];
  }
  return total;
}

Its R equivalent is:

sum_r <- function(x) {
  total <- 0
  for (i in seq_along(x)) {
    total <- total + x[i]
  }
  total
}

Document the functions and update the package as in the previous vignettes.

To test the functions, you can run the following benchmark code in the R console:

# install.packages("bench")
library(bench)

set.seed(123) # for reproducibility
x <- runif(1e3) # 1,000,000 elements

sum(x)
sum_cpp(x)
sum_r(x)

mark(
  sum(x),
  sum_cpp(x),
  sum_r(x)
)

Arithmetic mean

The arithmetic mean of a vector of \(n\) elements \(x_1, x_2, \ldots, x_n\) is calculated as:

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]

The following C++ function calculates the mean of a vector’s elements:

[[cpp4r::register]]
double mean_cpp(doubles x) {
  int n = x.size();
  double y = 0;

  for(int i = 0; i < n; ++i) {
    y += x[i];
  }
  return y / n;
}

Its R equivalent would be:

mean_r <- function(x) {
  sum_r(x) / length(x)
}

Document the functions and update the package as in the previous vignettes.

To test the functions, you can run the following benchmark code in the R console:

mean(x)
mean_cpp(x)
mean_r(x)

mark(
  mean(x),
  mean_cpp(x),
  mean_r(x)
)

Variance

The variance of a vector of \(n\) elements \(x_1, x_2, \ldots, x_n\) is calculated as:

\[ \text{Var}(x) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

The following C++ function calculates the variance of a vector’s elements:

[[cpp4r::register]]
double var_cpp(doubles x) {
  int n = x.size();
  double y1 = 0, y2 = 0;

  for(int i = 0; i < n; ++i) {
    y1 += x[i];
    y2 += pow(x[i], 2.0);
  }
  return (y2 - pow(y1, 2.0) / n) / (n - 1);
}

Its R equivalent would be:

var_r <- function(x) {
  mean_r((x - mean_r(x))^2)
}

Document the functions and update the package as in the previous vignettes.

To test the functions, you can run the following benchmark code in the R console:

var(x)
var_cpp(x)
var_r(x)

mark(
  var(x),
  var_cpp(x),
  var_r(x)
)

Root Mean Square Error (RMSE)

The RMSE function measures the differences between observed values and the true value.

For a vector of \(n\) elements \(x_1, x_2, \ldots, x_n\) and a value \(x_0\), the RMSE is calculated as:

\[ \text{RMSE}(x, x_0) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - x_0)^2} \]

The following C++ function calculates the difference of a vector’s elements to a value and returns the square root of the mean of the squared differences:

[[cpp4r::register]]
double rmse_cpp(doubles x, double x0) {
  int n = x.size();
  double y = 0;
  for (int i = 0; i < n; ++i) {
    y += pow(x[i] - x0, 2.0);
  }
  return sqrt(y / n);
}

Its R equivalent would be:

#' Return the root mean square error (R)
#' @param x numeric vector
#' @param x0 numeric value
#' @export
rmse_r <- function(x, x0) {
  sqrt(sum((x - x0)^2) / length(x))
}

Document the functions and update the package as in the previous vignettes.

To test the functions, you can run the following benchmark code in the R console:

# create a list with 100 normal distributions with mean 0 and 1 million elements
set.seed(123)
x <- list()
for (i in 1:1e3) {
  x[[i]] <- rnorm(1e3)
}

# compute the mean of each distribution
x <- sapply(x, mean)

rmse_cpp(x, 0)
rmse_r(x, 0)

mark(
  rmse_cpp(x, 0),
  rmse_r(x, 0)
)

References

Diez, David, Mine Cetinkaya-Rundel, Christopher Barr, and OpenIntro. 2015. OpenIntro Statistics. Leanpub. https://leanpub.com/os.

Hansen, Bruce E. 2022. Econometrics. Princeton, New Jersey: Princeton University Press.

Vaughan, Davis, Jim Hester, and Roman Francois. 2024. “Get Started with Cpp11.” https://cpp11.r-lib.org/articles/cpp11.html#intro.