Statistical Functions

Motivation

This chapter covers the implementation of simple statistical functions in C++ and R. The goal is to show the syntax differences between the two languages and compare their performance.

Fair Warning

These functions ignore NA values. Adjustments for handling NA values will be introduced in the sixth chapter.

R already provides efficient versions of the functions covered here. Code optimizations and improvements will be made in later chapters.

Statistical details

The explanations and equations used for the functions are taken from Diez et al. (2015) and Hansen (2022). Some examples were adapted from Vaughan, Hester, and Francois (2024).

Load the Package

I loaded the ece244 package as I added the functions from the next sections to it with the following code:

load_all()

Additional Packages

I used the bench package to compare the performance of the functions. The package was loaded with the following code:

library(bench)

Sum of Vector Elements (sum())

For a vector of \(n\) elements \(x_1, x_2, \ldots, x_n\), the sum is calculated as:

\[ \sum_{i=1}^{n} x_i = x_1 + x_2 + \ldots + x_n \]

The following C++ function calculates the sum of a vector’s elements:

[[cpp11::register]] double sum_cpp_(doubles x) {
  int n = x.size();
  double total = 0;
  for(int i = 0; i < n; ++i) {
    total += x[i];
  }
  return total;
}

If the previous function were a cooking recipe, it would be:

  1. Ingredients: A vector “x” in a container of type “doubles” (doubles x).
  2. Preparation:
    1. Count the vector’s coordinates and store the result in an integer variable “n” (int n = x.size()).
    2. Take a mixing bowl “total” of type “double” and verify it is empty (double total = 0).
    3. For each element \(x_i\) in the vector, take the element and add it to the total (total += x[i]).
    4. After \(x_n\) was added to the total, return the total (return total).

Its R equivalent is:

#' Return the sum of the coordinates of a vector (R)
#' @param x numeric vector
#' @export
sum_r <- function(x) {
  total <- 0
  for (i in seq_along(x)) {
    total <- total + x[i]
  }
  total
}

To document the C++ function, I added the following wrapper to the R code:

#' Return the sum of the coordinates of a vector (C++)
#' @inheritParams sum_r
#' @export
sum_cpp <- function(x) {
  sum_cpp_(x)
}

To test the functions, I ran the following benchmark code in the R console:

set.seed(123) # for reproducibility
x <- runif(1e3) # 1,000,000 elements

sum(x)
[1] 497.2778
sum_cpp(x)
[1] 497.2778
sum_r(x)
[1] 497.2778
mark(
  sum(x),
  sum_cpp(x),
  sum_r(x)
)
# A tibble: 3 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 sum(x)     759.03ns 842.15ns  1061624.        0B    106. 
2 sum_cpp(x)   4.69µs   5.22µs   179337.        0B     17.9
3 sum_r(x)    16.45µs   17.5µs    54858.    16.6KB      0  

Arithmetic Mean (mean())

The arithmetic mean of a vector of \(n\) elements \(x_1, x_2, \ldots, x_n\) is calculated as:

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]

The following C++ function calculates the mean of a vector’s elements:

[[cpp11::register]] double mean_cpp_(doubles x) {
  int n = x.size();
  double y = 0;

  for(int i = 0; i < n; ++i) {
    y += x[i];
  }
  return y / n;
}

If the previous function were a cooking recipe, it would be:

  1. Ingredients: A vector “x” in a container of type “doubles” (doubles x).
  2. Preparation:
    1. Count the vector’s coordinates and store the result in an integer variable “n” (int n = x.size()).
    2. Take a mixing bowl “y” of type “double” and verify it is empty (double y = 0).
    3. For each element \(x_i\) in the vector, take the element and add it to the total (y += x[i]).
    4. After \(x_n\) was added to the total, return the total divided by the number of elements (return y / n).

Its R equivalent is:

#' Return the mean of a vector (R)
#' @param x numeric vector
#' @export
mean_r <- function(x) {
  sum_r(x) / length(x)
}

To document the C++ function, I added the following wrapper to the R code:

#' Return the mean of a vector (C++)
#' @inheritParams mean_r
#' @export
mean_cpp <- function(x) {
  mean_cpp_(x)
}

To test the functions, I ran the following benchmark code in the R console:

mean(x)
[1] 0.4972778
mean_cpp(x)
[1] 0.4972778
mean_r(x)
[1] 0.4972778
mark(
  mean(x),
  mean_cpp(x),
  mean_r(x)
)
# A tibble: 3 × 6
  expression       min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 mean(x)       3.17µs   4.69µs   211058.        0B    21.1 
2 mean_cpp(x)   5.11µs   5.38µs   172074.        0B     0   
3 mean_r(x)     17.1µs  18.16µs    52822.        0B     5.28

Variance (var())

The variance of a vector of \(n\) elements \(x_1, x_2, \ldots, x_n\) is calculated as:

\[ \text{Var}(x) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

The following C++ function calculates the variance of a vector’s elements:

[[cpp11::register]] double var_cpp_(doubles x) {
  int n = x.size();
  double y1 = 0, y2 = 0;

  for(int i = 0; i < n; ++i) {
    y1 += x[i];
    y2 += pow(x[i], 2.0);
  }
  return (y2 - pow(y1, 2.0) / n) / (n - 1);
}

If the previous function were a cooking recipe, it would be:

  1. Ingredients: A vector “x” in a container of type “doubles” (doubles x).
  2. Preparation:
    1. Count the vector’s coordinates and store the result in an integer variable “n” (int n = x.size()).
    2. Take two mixing bowls “y1” and “y2” of type “double” and verify they are empty (double y1 = 0, y2 = 0).
    3. For each element \(x_i\) in the vector, take the element, add it to “y1” (y1 += x[i]), and then square it and
      add it to “y2” (y2 += pow(x[i], 2.0)).
    4. After \(x_n\) was added to “y1” and “y2”, return the variance of the vector (return (y2 - pow(y1, 2.0) / n) / (n - 1)).

Its R equivalent is:

#' Return the variance of a vector (R)
#' @param x numeric vector
#' @export
var_r <- function(x) {
  mean_r((x - mean_r(x))^2)
}

To document the C++ function, I added the following wrapper to the R code:

#' Return the variance of a vector (C++)
#' @inheritParams var_r
#' @export
var_cpp <- function(x) {
  var_cpp_(x)
}

To test the functions, I ran the following benchmark code in the R console:

var(x)
[1] 0.082647
var_cpp(x)
[1] 0.082647
var_r(x)
[1] 0.082647
mark(
  var(x),
  var_cpp(x),
  var_r(x)
)
# A tibble: 3 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 var(x)       5.51µs   7.12µs   131779.        0B     39.5
2 var_cpp(x)  22.32µs  25.05µs    38863.        0B      0  
3 var_r(x)    32.82µs  35.16µs    27834.    42.4KB      0  

Root Mean Square Error (RMSE)

The RMSE function measures the differences between observed values and the true value.

For a vector of \(n\) elements \(x_1, x_2, \ldots, x_n\) and a value \(x_0\), the RMSE is calculated as:

\[ \text{RMSE}(x, x_0) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - x_0)^2} \]

The following C++ function calculates the difference of a vector’s elements to a value and returns the square root of the mean of the squared differences:

[[cpp11::register]] double rmse_cpp_(doubles x, double x0) {
  int n = x.size();
  double y = 0;
  for (int i = 0; i < n; ++i) {
    y += pow(x[i] - x0, 2.0);
  }
  return sqrt(y / n);
}

If the previous function were a cooking recipe, it would be:

  1. Ingredients: A vector “x” in a container of type “doubles” and a value “x0” in a container of type “double” (doubles x, double x0).
  2. Preparation:
    1. Count the vector’s coordinates and store the result in an integer variable “n” (int n = x.size()).
    2. Take a mixing bowl “y” of type “double” and verify it is empty (double y = 0).
    3. For each element \(x_i\) in the vector, take the element, subtract the value \(x_0\), square it, and add it to “y” (y += pow(x[i] - x0, 2.0)).
    4. After \(x_n\) was added to “y”, return the square root of the mean of the squared differences (return sqrt(y / n)).

Its R equivalent is:

#' Return the root mean square error (R)
#' @param x numeric vector
#' @param x0 numeric value
#' @export
rmse_r <- function(x, x0) {
  sqrt(sum((x - x0)^2) / length(x))
}

To document the C++ function, I added the following wrapper to the R code:

#' Return the root mean square error (C++)
#' @inheritParams rmse_r
#' @export
rmse_cpp <- function(x, x0) {
  rmse_cpp_(x, x0)
}

To test the functions, I ran the following benchmark code in the R console:

# create a list with 100 normal distributions with mean 0 and 1 million elements
set.seed(123)
x <- list()
for (i in 1:1e3) {
  x[[i]] <- rnorm(1e3)
}

# compute the mean of each distribution
x <- sapply(x, mean)

rmse_cpp(x, 0)
[1] 0.03005874
rmse_r(x, 0)
[1] 0.03005874
mark(
  rmse_cpp(x, 0),
  rmse_r(x, 0)
)
# A tibble: 2 × 6
  expression          min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 rmse_cpp(x, 0)  19.97µs  20.95µs    45795.        0B     4.58
2 rmse_r(x, 0)     2.59µs   4.52µs   208579.    7.86KB    20.9 

References

Diez, David, Mine Cetinkaya-Rundel, Christopher Barr, and OpenIntro. 2015. OpenIntro Statistics. Leanpub. https://leanpub.next/os.
Hansen, Bruce E. 2022. Econometrics. Princeton, New Jersey: Princeton University Press.
Vaughan, Davis, Jim Hester, and Roman Francois. 2024. “Get Started with Cpp11.” https://cpp11.r-lib.org/articles/cpp11.html#intro.