Rolling functions with missing values

Motivation

This chapter expands Chapter 5 by using the same methods from Chapter 5. These examples were adapted from Vaughan, Hester, and Francois (2024).

Load the Package

I loaded the ece244 package as I added the functions from the next sections to it with the following code:

load_all()

Additional Packages

I used the bench package to compare the performance of the functions. The package was loaded with the following code:

library(bench)

Cumulative sum (cumsum())

The next function returns the cumulative sum of the elements of a vector:

[[cpp11::register]] doubles cumsum2_cpp_(doubles x, bool na_rm = false) {
  int n = x.size();

  writable::doubles out(n);
  out[0] = x[0];

  if (na_rm == true) {
    for (int i = 1; i < n; ++i) {
      double y1 = out[i - 1], y2 = x[i];
      if (ISNAN(y2)) {
        out[i] = y1 + 0.0;
      } else {
        if (ISNAN(y1)) {
          out[i] = 0.0 + y2;
        } else {
          out[i] = y1 + y2;
        }
      }
    }
  } else {
    for (int i = 1; i < n; ++i) {
      double y1 = out[i - 1], y2 = x[i];
      if (ISNAN(y2)) {
        out[i] = NA_REAL;
      } else {
        if (ISNAN(y1)) {
          out[i] = NA_REAL;
        } else {
          out[i] = y1 + y2;
        }
      }
    }
  }

  return out;
}

Unlike the cumsum() function from Chapter 5, this function has an additional argument na_rm that allows the user to remove missing values (including NaN) from the vector.

The corresponding auxiliary function for documentation is:

#' Return the cumulative sum of the coordinates of a vector (C++)
#' @param x numeric vector
#' @param na_rm logical. Should missing values (including `NaN`) be removed?
#' @export
cumsum2_cpp <- function(x, na_rm = FALSE) {
  cumsum2_cpp_(as.double(x), na_rm = na_rm)
}

To test the functions, I ran the following benchmark code in the R console:

set.seed(123) # for reproducibility
x <- rpois(1e6, lambda = 2) # 1,000,000 elements

cumsum(c(1, NA, 2, 4))
[1]  1 NA NA NA
cumsum2_cpp(c(1, NA, 2, 4))
[1]  1 NA NA NA
cumsum2_cpp(c(1, NA, 2, 4), na_rm = TRUE)
[1] 1 1 3 7
mark(
  cumsum(x),
  cumsum2_cpp(x)
)
# A tibble: 2 × 6
  expression          min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 cumsum(x)         1.6ms   2.04ms     463.     3.81MB     59.9
2 cumsum2_cpp(x)   29.1ms  32.65ms      31.3   15.26MB     78.3

Cumulative product (cumprod())

The next function calculates the cumulative product of the elements of a vector:

[[cpp11::register]] doubles cumprod2_cpp_(doubles x, bool na_rm = false) {
  int n = x.size();

  writable::doubles out(n);
  out[0] = x[0];

  if (na_rm == true) {
    for (int i = 1; i < n; ++i) {
      double y1 = out[i - 1], y2 = x[i];
      if (ISNAN(y2)) {
        out[i] = y1 * 1.0;
      } else {
        if (ISNAN(y1)) {
          out[i] = 1.0 * y2;
        } else {
          out[i] = y1 * y2;
        }
      }
    }
  } else {
    for (int i = 1; i < n; ++i) {
      double y1 = out[i - 1], y2 = x[i];
      if (ISNAN(y2)) {
        out[i] = NA_REAL;
      } else {
        if (ISNAN(y1)) {
          out[i] = NA_REAL;
        } else {
          out[i] = y1 * y2;
        }
      }
    }
  }

  return out;
}

The corresponding auxiliary function for documentation is:

#' Return the cumulative product of the coordinates of a vector (C++)
#' @inheritParams sum2_cpp
#' @export
cumprod2_cpp <- function(x, na_rm = FALSE) {
  cumprod2_cpp_(as.double(x), na_rm = na_rm)
}

To test the functions, I ran the following benchmark code in the R console:

cumprod(c(1, NA, 2, 4))
[1]  1 NA NA NA
cumprod2_cpp(c(1, NA, 2, 4))
[1]  1 NA NA NA
cumprod2_cpp(c(1, NA, 2, 4), na_rm = TRUE)
[1] 1 1 2 8
mark(
  cumprod(x),
  cumprod2_cpp(x)
)
# A tibble: 2 × 6
  expression           min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 cumprod(x)        4.93ms   5.22ms     190.     15.3MB    190. 
2 cumprod2_cpp(x)  33.54ms  33.76ms      28.3    15.3MB     51.0

Cumulative minimum (cummin())

The next function calculates the cumulative minimum of the elements of a vector:

[[cpp11::register]] doubles cummin2_cpp_(doubles x, bool na_rm = false) {
  int n = x.size();

  writable::doubles out(n);
  out[0] = x[0];

  if (na_rm == true) {
    for (int i = 1; i < n; ++i) {
      double y1 = x[i - 1], y2 = x[i];
      if (ISNAN(y1)) {
        out[i] = y2;
      } else {
        out[i] = std::min(y1, y2);
      }
    }
  } else {
    for (int i = 1; i < n; ++i) {
      double y1 = out[i - 1], y2 = x[i];
      if (ISNAN(y2)) {
        out[i] = NA_REAL;
      } else {
        if (ISNAN(y1)) {
          out[i] = NA_REAL;
        } else {
          out[i] = std::min(y1, y2);
        }
      }
    }
  }

  return out;
}

The corresponding auxiliary function for documentation is:

#' Return the cumulative minimum of the coordinates of a vector (C++)
#' @inheritParams sum2_cpp
#' @export
cummin2_cpp <- function(x, na_rm = FALSE) {
  cummin2_cpp_(as.double(x), na_rm = na_rm)
}

To test the functions, I ran the following benchmark code in the R console:

cummin(c(1, NA, 2, 4))
[1]  1 NA NA NA
cummin2_cpp(c(1, NA, 2, 4))
[1]  1 NA NA NA
cummin2_cpp(c(1, NA, 2, 4), na_rm = TRUE)
[1] 1 1 1 1
mark(
  cummin(x),
  cummin2_cpp(x)
)
# A tibble: 2 × 6
  expression          min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 cummin(x)        1.05ms   1.12ms     837.     3.81MB    166. 
2 cummin2_cpp(x)  29.56ms  30.26ms      32.2   15.26MB     88.6

Cumulative maximum (cummax())

The next function calculates the cumulative maximum of the elements of a vector:

[[cpp11::register]] doubles cummax2_cpp_(doubles x, bool na_rm = false) {
  int n = x.size();

  writable::doubles out(n);
  out[0] = x[0];

  if (na_rm == true) {
    for (int i = 1; i < n; ++i) {
      double y1 = out[i - 1], y2 = x[i];
      if (ISNAN(y1)) {
        out[i] = y2;
      } else {
        out[i] = std::max(y1, y2);
      }
    }
  } else {
    for (int i = 1; i < n; ++i) {
      double y1 = out[i - 1], y2 = x[i];
      if (ISNAN(y2)) {
        out[i] = NA_REAL;
      } else {
        if (ISNAN(y1)) {
          out[i] = NA_REAL;
        } else {
          out[i] = std::max(y1, y2);
        }
      }
    }
  }

  return out;
}

The corresponding auxiliary function for documentation is:

#' Return the cumulative maximum of the coordinates of a vector (C++)
#' @inheritParams sum2_cpp
#' @export
cummax2_cpp <- function(x, na_rm = FALSE) {
  cummax2_cpp_(as.double(x), na_rm = na_rm)
}

To test the functions, I ran the following benchmark code in the R console:

cummax(c(1, NA, 2, 4))
[1]  1 NA NA NA
cummax2_cpp(c(1, NA, 2, 4))
[1]  1 NA NA NA
cummax2_cpp(c(1, NA, 2, 4), na_rm = TRUE)
[1] 1 1 2 4
mark(
  cummax(x),
  cummax2_cpp(x)
)
# A tibble: 2 × 6
  expression          min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 cummax(x)         987µs   1.05ms     886.     3.81MB    111. 
2 cummax2_cpp(x)   28.9ms  29.73ms      33.5   15.26MB     50.2

References

Vaughan, Davis, Jim Hester, and Roman Francois. 2024. “Get Started with Cpp11.” https://cpp11.r-lib.org/articles/cpp11.html#intro.