Statistical Functions with Missing Values

Motivation

The previous functions ignore NA values, which is not ideal for real-world data. This chapter introduces an additional argument and checks to the functions that allow the user to remove missing values (including NaN) from the vector. These examples were adapted from Vaughan, Hester, and Francois (2024).

Load the Package

I loaded the ece244 package as I added the functions from the next sections to it with the following code:

load_all()

Additional Packages

I used the bench package to compare the performance of the functions. The package was loaded with the following code:

library(bench)

Sum of Vector Elements (sum())

The next function returns the sum of the elements of a vector:

[[cpp11::register]] double sum2_cpp_(doubles x, bool na_rm = false) {
  int n = x.size();
  double total = 0;
  for (int i = 0; i < n; ++i) {
    if (na_rm && ISNAN(x[i])) {
      continue;
    } else {
      total += x[i];
    }
  }
  return total;
}

Unlike the sum function from Chapter 3, this function has an additional argument na_rm that allows the user to remove missing values (including NaN) from the vector.

The corresponding auxiliary function for documentation is:

#' Return the sum of the elements of a vector (C++)
#' @inheritParams sum_r
#' @param na_rm logical. Should missing values (including `NaN`) be removed?
#' @export
sum2_cpp <- function(x, na_rm = FALSE) {
  sum2_cpp_(as.double(x), na_rm = na_rm)
}

To test the functions, I ran the following benchmark code in the R console:

set.seed(123) # for reproducibility
x <- runif(1e3) # 1,000 elements
x[sample(1:1e3, 1e2)] <- NA # randomly insert NA values

sum(x, na.rm = FALSE)
[1] NA
sum2_cpp(x, na_rm = FALSE)
[1] NA
sum(x, na.rm = TRUE)
[1] 447.4191
sum2_cpp(x, na_rm = TRUE)
[1] 447.4191
mark(
  sum(x, na.rm = TRUE),
  sum2_cpp(x, na_rm = TRUE)
)
# A tibble: 2 × 6
  expression                     min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>                <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 sum(x, na.rm = TRUE)         830ns  914.1ns   967313.        0B     0   
2 sum2_cpp(x, na_rm = TRUE)   11.5µs   12.1µs    79724.        0B     7.97

Arithmetic Mean (mean())

The next function calculates the mean of the elements of a vector:

[[cpp11::register]] double mean2_cpp_(doubles x, bool na_rm = false) {
  int n = x.size();
  int m = 0;
  
  for (int i = 0; i < n; ++i) {
    if (na_rm && ISNAN(x[i])) {
      continue;
    } else {
      ++m;
    }
  }

  if (m == 0) {
    return NA_REAL;
  }

  double total = 0;
  for (int i = 0; i < n; ++i) {
    if (na_rm && ISNAN(x[i])) {
      continue;
    } else {
      total += x[i];
    }
  }

  return total / m;
}

Unlike the mean function from Chapter 3, this function has an additional argument na_rm that allows the user to remove missing values (including NaN) from the vector.

The corresponding auxiliary function for documentation is:

#' Return the mean of the elements of a vector (C++)
#' @inheritParams sum2_cpp
#' @export
mean2_cpp <- function(x, na_rm = FALSE) {
  mean2_cpp_(as.double(x), na_rm = na_rm)
}

To test the functions, I ran the following benchmark code in the R console:

mean(x)
[1] NA
mean2_cpp(x)
[1] NA
mean(x, na.rm = TRUE)
[1] 0.4971323
mean2_cpp(x, na_rm = TRUE)
[1] 0.4971323
mark(
  mean(x, na.rm = TRUE),
  mean2_cpp(x, na_rm = TRUE)
)
# A tibble: 2 × 6
  expression                      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>                 <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 mean(x, na.rm = TRUE)        6.18µs    7.5µs   110623.    22.5KB     44.3
2 mean2_cpp(x, na_rm = TRUE)  17.53µs   19.1µs    51086.        0B      0  

Variance (var())

The next function calculates the variance of a vector:

[[cpp11::register]] double var2_cpp_(doubles x, bool na_rm = false) {
  int n = x.size();
  int m = 0;
  double total = 0, sq_total = 0;

  for (int i = 0; i < n; ++i) {
    if (na_rm && ISNAN(x[i])) {
      continue;
    } else {
      ++m;
      total += x[i];
      sq_total += pow(x[i], 2);
    }
  }

  if (m <= 1) {
    return NA_REAL;
  }

  return (sq_total - total * total / m) / (m - 1);
}

Unlike the variance function from Chapter 3, this function has an additional argument na_rm that allows the user to remove missing values (including NaN) from the vector.

The corresponding auxiliary function for documentation is:

#' Return the variance of the elements of a vector (C++)
#' @inheritParams sum2_cpp
#' @export
var2_cpp <- function(x, na_rm = FALSE) {
  var2_cpp_(as.double(x), na_rm = na_rm)
}

To test the functions, I ran the following benchmark code in the R console:

var(x)
[1] NA
var2_cpp(x)
[1] NA
var(x, na.rm = TRUE)
[1] 0.08155043
var2_cpp(x, na_rm = TRUE)
[1] 0.08155043
mark(
  var(x, na.rm = TRUE),
  var2_cpp(x, na_rm = TRUE)
)
# A tibble: 2 × 6
  expression                     min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>                <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 var(x, na.rm = TRUE)        6.72µs   8.37µs   112391.    3.95KB     33.7
2 var2_cpp(x, na_rm = TRUE)  39.24µs  44.64µs    21989.        0B      0  

Root Mean Square Error (RMSE)

The next function calculates the root mean square error between observed values (\(x\)) and the true value (\(x_0\)):

[[cpp11::register]] double rmse2_cpp_(doubles x, double x0, bool na_rm = false) {
  int n = x.size();
  int m = 0;
  double total = 0;

  for (int i = 0; i < n; ++i) {
    if (na_rm && ISNAN(x[i])) {
      continue;
    } else {
      ++m;
      total += pow(x[i] - x0, 2);
    }
  }

  if (m == 0) {
    return NA_REAL;
  }

  return sqrt(total / m);
}

Unlike the RMSE function from Chapter 3, this function has an additional argument na_rm that allows the user to remove missing values (including NaN) from the vector.

The corresponding auxiliary function for documentation is:

#' Return the root mean square error (C++)
#' @inheritParams rmse_r
#' @param na_rm logical. Should missing values (including `NaN`) be removed?
#' @export
rmse2_cpp <- function(x, x0, na_rm = FALSE) {
  rmse2_cpp_(as.double(x), as.double(x0), na_rm = na_rm)
}

To test the functions, I ran the following benchmark code in the R console:

# create a list with 100 normal distributions with mean 0 and 1,000 elements each
set.seed(123)
x <- list()
for (i in 1:1e3) {
  x[[i]] <- rnorm(1e3)
}

# compute the mean of each distribution
x <- sapply(x, mean)

# insert NA values at random
x[sample(1:1e3, 1e2)] <- NA

rmse2_cpp(x, 0)
[1] NA
rmse2_cpp(x, 0, na_rm = TRUE)
[1] 0.02992032
mark(
  sqrt(mean((x - 0)^2, na.rm = TRUE)),
  rmse2_cpp(x, 0, na_rm = TRUE)
)
# A tibble: 2 × 6
  expression                            min  median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>                         <bch:> <bch:t>     <dbl> <bch:byt>    <dbl>
1 sqrt(mean((x - 0)^2, na.rm = TRUE…  7.7µs  9.02µs   102173.    30.4KB     61.3
2 rmse2_cpp(x, 0, na_rm = TRUE)      33.2µs 34.68µs    28173.        0B      0  

References

Vaughan, Davis, Jim Hester, and Roman Francois. 2024. “Get Started with Cpp11.” https://cpp11.r-lib.org/articles/cpp11.html#intro.