Statistical Functions with Missing Values
Source:vignettes/09-statistical-functions-2.Rmd
09-statistical-functions-2.Rmd
Note
These functions ignore NA
values for now. Adjustments
for handling NA
values are covered in a separate
vignette.
R already provides efficient versions of the functions covered here. This is just to illustrate how to use C++ code.
Sum
The following function expands the previous sum_cpp()
function to handle missing values.
[[cpp4r::register]]
double sum2_cpp(doubles x, bool na_rm = false) {
int n = x.size();
double total = 0;
for (int i = 0; i < n; ++i) {
if (na_rm && ISNAN(x[i])) {
continue;
} else {
total += x[i];
}
}
return total;
}
To test the functions, you can run the following benchmark code in the R console:
Arithmetic mean
The following function expands the previous mean_cpp()
function to handle missing values.
[[cpp4r::register]]
double mean2_cpp(doubles x, bool na_rm = false) {
int n = x.size();
int m = 0;
for (int i = 0; i < n; ++i) {
if (na_rm && ISNAN(x[i])) {
continue;
} else {
++m;
}
}
if (m == 0) {
return NA_REAL;
}
double total = 0;
for (int i = 0; i < n; ++i) {
if (na_rm && ISNAN(x[i])) {
continue;
} else {
total += x[i];
}
}
return total / m;
}
To test the functions, you can run the following benchmark code in the R console:
Variance
The following function expands the previous var_cpp()
function to handle missing values.
[[cpp4r::register]]
double var2_cpp(doubles x, bool na_rm = false) {
int n = x.size();
int m = 0;
double total = 0, sq_total = 0;
for (int i = 0; i < n; ++i) {
if (na_rm && ISNAN(x[i])) {
continue;
} else {
++m;
total += x[i];
sq_total += pow(x[i], 2);
}
}
if (m <= 1) {
return NA_REAL;
}
return (sq_total - total * total / m) / (m - 1);
}
To test the functions, you can run the following benchmark code in the R console:
Root Mean Square Error (RMSE)
The following function expands the previous rmse_cpp()
function to handle missing values.
[[cpp4r::register]]
double rmse2_cpp(doubles x, double x0, bool na_rm = false) {
int n = x.size();
int m = 0;
double total = 0;
for (int i = 0; i < n; ++i) {
if (na_rm && ISNAN(x[i])) {
continue;
} else {
++m;
total += pow(x[i] - x0, 2);
}
}
if (m == 0) {
return NA_REAL;
}
return sqrt(total / m);
}
To test the functions, you can run the following benchmark code in the R console:
# create a list with 100 normal distributions with mean 0 and 1,000 elements each
set.seed(123)
x <- list()
for (i in 1:1e3) {
x[[i]] <- rnorm(1e3)
}
# compute the mean of each distribution
x <- sapply(x, mean)
# insert NA values at random
x[sample(1:1e3, 1e2)] <- NA
rmse2_cpp(x, 0)
rmse2_cpp(x, 0, na_rm = TRUE)
mark(
sqrt(mean((x - 0)^2, na.rm = TRUE)),
rmse2_cpp(x, 0, na_rm = TRUE)
)