Statistical Functions with Missing Values

Note

These functions ignore NA values for now. Adjustments for handling NA values are covered in a separate vignette.

R already provides efficient versions of the functions covered here. This is just to illustrate how to use C++ code.

Sum

The following function expands the previous sum_cpp() function to handle missing values.

[[cpp4r::register]]
double sum2_cpp(doubles x, bool na_rm = false) {
  int n = x.size();
  double total = 0;
  for (int i = 0; i < n; ++i) {
    if (na_rm && ISNAN(x[i])) {
      continue;
    } else {
      total += x[i];
    }
  }
  return total;
}

To test the functions, you can run the following benchmark code in the R console:

set.seed(123) # for reproducibility
x <- runif(1e3) # 1,000 elements
x[sample(1:1e3, 1e2)] <- NA # randomly insert NA values

sum(x, na.rm = FALSE)
sum2_cpp(x, na_rm = FALSE)

sum(x, na.rm = TRUE)
sum2_cpp(x, na_rm = TRUE)

mark(
  sum(x, na.rm = TRUE),
  sum2_cpp(x, na_rm = TRUE)
)

Arithmetic mean

The following function expands the previous mean_cpp() function to handle missing values.

[[cpp4r::register]]
double mean2_cpp(doubles x, bool na_rm = false) {
  int n = x.size();
  int m = 0;
  
  for (int i = 0; i < n; ++i) {
    if (na_rm && ISNAN(x[i])) {
      continue;
    } else {
      ++m;
    }
  }

  if (m == 0) {
    return NA_REAL;
  }

  double total = 0;
  for (int i = 0; i < n; ++i) {
    if (na_rm && ISNAN(x[i])) {
      continue;
    } else {
      total += x[i];
    }
  }

  return total / m;
}

To test the functions, you can run the following benchmark code in the R console:

mean(x)
mean2_cpp(x)

mean(x, na.rm = TRUE)
mean2_cpp(x, na_rm = TRUE)

mark(
  mean(x, na.rm = TRUE),
  mean2_cpp(x, na_rm = TRUE)
)

Variance

The following function expands the previous var_cpp() function to handle missing values.

[[cpp4r::register]]
double var2_cpp(doubles x, bool na_rm = false) {
  int n = x.size();
  int m = 0;
  double total = 0, sq_total = 0;

  for (int i = 0; i < n; ++i) {
    if (na_rm && ISNAN(x[i])) {
      continue;
    } else {
      ++m;
      total += x[i];
      sq_total += pow(x[i], 2);
    }
  }

  if (m <= 1) {
    return NA_REAL;
  }

  return (sq_total - total * total / m) / (m - 1);
}

To test the functions, you can run the following benchmark code in the R console:

var(x)
var2_cpp(x)

var(x, na.rm = TRUE)
var2_cpp(x, na_rm = TRUE)

mark(
  var(x, na.rm = TRUE),
  var2_cpp(x, na_rm = TRUE)
)

Root Mean Square Error (RMSE)

The following function expands the previous rmse_cpp() function to handle missing values.

[[cpp4r::register]]
double rmse2_cpp(doubles x, double x0, bool na_rm = false) {
  int n = x.size();
  int m = 0;
  double total = 0;

  for (int i = 0; i < n; ++i) {
    if (na_rm && ISNAN(x[i])) {
      continue;
    } else {
      ++m;
      total += pow(x[i] - x0, 2);
    }
  }

  if (m == 0) {
    return NA_REAL;
  }

  return sqrt(total / m);
}

To test the functions, you can run the following benchmark code in the R console:

# create a list with 100 normal distributions with mean 0 and 1,000 elements each
set.seed(123)
x <- list()
for (i in 1:1e3) {
  x[[i]] <- rnorm(1e3)
}

# compute the mean of each distribution
x <- sapply(x, mean)

# insert NA values at random
x[sample(1:1e3, 1e2)] <- NA

rmse2_cpp(x, 0)
rmse2_cpp(x, 0, na_rm = TRUE)

mark(
  sqrt(mean((x - 0)^2, na.rm = TRUE)),
  rmse2_cpp(x, 0, na_rm = TRUE)
)

Note

Sum

Arithmetic mean

Variance

Root Mean Square Error (RMSE)

References