Motivation
The previous package skeleton left out some essential details, such as testing for memory leaks and debugging. This vignette will cover these aspects in more detail. The references for this vignette are Vaughan, Hester, and Francois (2024), Padgham (2022), Vaughan (2019), and Wickham and Bryan (2023).
Instructional examples
Solving a matrix using the Gauss-Jordan method
This example package covers the following topics:
- Integers
- Doubles
- Matrices
- Conditionals
- Loops (
for
) - Vendoring
See the cpp4rgaussjordan package.
Details
This implementation is a naive approach, but it can be used, for example, to obtain the Ordinary Least Squares (OLS) estimator as shown in the next section.
Vendoring means that the dependency code, the cpp4r
C++
headers, are copied the project’s source tree. This ensures the
dependency code is fixed and stable until it is updated.
The advantage of vendoring is that changes to the cpp4r
package could never break the package’s code. The disadvantage is that
fixes and new features will be available after vendoring the code
again.
Vendoring
You can start with
create_package("~/github/cpp4rgaussjordan")
. You can use
VSCode, but all the steps also apply to RStudio.
After opening ~/github/cpp4rgaussjordan
, you can run
use_cpp4r()
to have a readily available skeleton in your
project.
You can run use_apache_licence()
to have a
LICENSE
file and indicate in DESCRIPTION
that
your package is distributed under the Apache License.
Then you can run cpp_vendor()
to copy the C++ headers
into inst/include
.
Ordinary Least Squares (OLS) estimator
This example package covers the following topics:
- Integers
- Doubles
- Doubles matrices
- Conditionals
See the cpp4rols package.
This implementation is extremely naive, quite similar to the Gauss-Jordan example with extra steps, and it is enough to show how to use C++ code within R.
One approach is to create one function per step, which means creating one function to obtain , another for (e.g, implementing the Gauss-Jordan method to invert a matrix), another for and then calling each of those functions to obtain .
A good challenge would be to implement the QR decomposition used by
the lm()
function in R and use it to obtain the OLS
estimator in C++. Drury (2016) provides a
good starting point, this is not trivial to implement.
It is hard to beat the performance of the lm()
function
in R, which calls compiled C and FORTRAN functions, and these functions
are fast and robust lm()
.
Linear programming (Simplex phase 2)
This example package covers the following topics:
- Integers
- Doubles
- Doubles matrices
- Conditionals
- Loops (
for
andwhile
) - Messages
See the cpp4rsimplex package.
Algorithm
The simplex algorithm is well described in Introduction to Linear Optimization and there is efficient software to solve this, including Berkelaar and Csardi (2023).
A problem written in canonical form is represented by a table such as:
where are the coefficients of the objective function (i.e., costs), are the coefficients of the constraints, and are the right-hand side of the constraints.
The simplex algorithm to solve the problem consists in the next steps:
- If for all , then the current solution is optimal. Basic variables are equal to and non-basic variables are equal to 0.
- If for some , choose it to enter the base. Choose the variable with the most negative , and assume that it is .
- If for all , then the problem is unbounded.
- If for some , choose such that and pivot on , to then go back to step 1.
The coefficients are updated according to:
- for
- for
This algorithm is equivalent to Gauss method to solve linear systems.
Numerical example
A simple example is the following minimization problem:
In canonical form, this problem is:
The initial tableau for the problem is:
The first row is the cost row, the last column is the right-hand side, and the rest is the matrix .
The first step is to pivot on row 2 and column 2:
The second step is to pivot on row 2 and column 1:
Here a stopping criterion is reached: the minimum cost is non-negative, therefore the solution is optimal and is with an optimal value .
Building and testing
You can use devtools
to build and test the package:
# build
devtools::clean_dll()
cpp4r::cpp_register()
devtools::document()
devtools::load_all()
# test
c <- c(-1, -3)
b <- c(3, 2)
A <- matrix(
c(1, -3, 1, 1),
nrow = 2,
ncol = 2,
byrow = FALSE
)
cpp4r_simplex_phase2(c, b, A)
The result should be:
Initial tableau:
-1 -3 0 0 0
1 1 1 0 3
-3 1 0 1 2
Minimum cost: -3
Pivot row: 2
Pivot column: 2
====
New tableau:
-10 0 0 3 6
4 0 1 -1 1
-3 1 0 1 2
Minimum cost: -10
Pivot row: 1
Pivot column: 1
====
New tableau:
0 0 2.5 0.5 8.5
1 0 0.25 -0.25 0.25
0 1 0.75 0.25 2.75
Minimum cost: 0
Optimal solution found in 2 steps !
Using OMP (parallelization)
This example package covers the following topics:
- Integers
- Doubles
- Lists (unnamed and named)
- Loops (
for
) - OpenMP parallelization
See the cpp4romp package.
Motivation
One common phrase that you might find when you need to search how to
do something with cpp4r
is: “cpp4r does not offer OpenMP
support.” This is a myth. cpp4r
does offer OpenMP support.
The requirements are: A processor and C++ compiler that support
OpenMP.
This has been tested on Windows, where you need to install Rtools, and
Ubuntu where you do not need anything special because the
gcc
compiler comes with the operating system and it just
works.
If you are using macOS, you need to install libomp
via
Homebrew in order to extend the clang
compiler, and this is
explained in the OpenBox
documentation (“Enable
OpenMP on macOS
For Building Scikit-Learn -
OpenBox Documentation” n.d.).
Enabling OpenMP
You can create an R package called cpp4romp
with the
following code:
usethis::create_project("cpp4romp")
usethis::use_cpp4r()
Then, you can create the file R/cpp4romp-package.R
with
the following contents:
## usethis namespace: start
#' @useDynLib cpp4romp, .registration = TRUE
## usethis namespace: end
NULL
In order to get the #pragma
instruction to work, you
need to add the following to src/Makevars
:
PKG_CXXFLAGS = $(SHLIB_OPENMP_CXXFLAGS) -DARMA_OPENMP_THREADS=1
PKG_LIBS = $(SHLIB_OPENMP_CXXFLAGS)
CXX_STD = CXX11
Not adding this means that the pragma instruction will be ignored and the functions will run in a single thread.
Unnamed list
You can add a function called squared_unnamed_
in
src/code.cpp
that will square each element in a vector of
doubles, so the file content corresponds to the following:
#include <cpp4r.hpp>
#include <omp.h>
using namespace cpp4r;
[[cpp4r::register]] list squared_unnamed_(doubles x) {
// create vectors y = x^2 and z = thread number
int n = x.size();
writable::doubles y(n);
writable::doubles z(n);
#pragma omp parallel for
for (int i = 0; i < n; ++i) {
y[i] = x[i] * x[i];
z[i] = omp_get_thread_num();
}
//create a list containing y and z
writable::list out;
out.push_back(y);
out.push_back(z);
return out;
}
The previous function returns an unnamed list with two elements: the
squared vector and the thread number. The function is registered with
[[cpp4r::register]]
so that it can be called from R.
C++ is strict with types, so you need to create a wrapper function
that will convert the integers to doubles to avoid accidental errors, it
will go in R/cpp4romp-package.R
:
#' Unnamed list with squared numbers and the threads used
#' @param x A vector of doubles
#' @export
squared_unnamed <- function(x) {
squared_unnamed_(as.double(x))
}
Named list
You can add a function called squared_named_
in
src/code.cpp
that does the same but returns a named list.
The additional content corresponds to the following:
[[cpp4r::register]] list squared_named_(doubles x) {
// create vectors y = x^2 and z = thread number
int n = x.size();
writable::doubles y(n);
writable::doubles z(n);
#pragma omp parallel for
for (int i = 0; i < n; ++i) {
y[i] = x[i] * x[i];
z[i] = omp_get_thread_num();
}
//create a list containing y and z
writable::list out;
out.push_back({"x^2"_nm = y});
out.push_back({"thread"_nm = z});
return out;
}
As in the previous part, you can add a wrapper and documentation:
#' Named list with squared numbers and the threads used
#' @param x A vector of doubles
#' @export
squared_named <- function(x) {
squared_named_(as.double(x))
}
Extensive R/C++ tests
See the cpp4rtest
package. It covers over 1,000 unit tests for cpp4r
features, including:
- Vectors (integers, doubles, logicals, strings)
- Matrices (integers, doubles, logicals, strings)
- Lists (named and unnamed)
- Data frames
- Error handling (stop, warning, message)
- C++ side unit tests with testthat
- Roxygen documentation
Useful R/C++ Integration Examples
Here are some examples of C++ code integration with R using the
cpp4r
package:
-
arrow: An R package
that provides bindings to the Arrow C++ library. Arrow is a
columnar in-memory analytics format that is extremely fast and efficient. - cpp4rarmadillo: An R package that provides bindings to the Armadillo C++ library. Armadiilo is a high-quality linear algebra library with a syntax similar to MATLAB.
- cpp4reigen: An R package that provides bindings to the Eigen C++ library. Eigen is a high-performance linear algebra library with a permisive license.
-
cpp4rtesseract:
An R package that provides bindings to the Tesseract OCR
C++ engine. This package allows to extract text from images. - haven: A package that reads and writes SPSS, Stata, and SAS files in R.
- kendallknight: Implements the Kendall’s correlation coefficient in C++, achieving speedup by using an algorithm with a complexity of instead of in base R.
- mice: A package that imputes missing data using multivariate chained equations.
- redatam: C++ implementation of the Redatam file format, callable from both R and Python.
-
RPostgres: A C++
interface to PostgreSQL using
cpp4r
. - tidyr: A package that uses C++ functions to reshape data frames.