Duncan Temple Lang

Department of Statistics, UC Davis

Table of Contents

Introduction
Limitations of the PACKAGE argument approach
A Better Approach
Packages
Package versions
Summary
Acknowledgements
Bibliography
A. Example Code
NAMESPACE directives
Dynamic Approach
Pre-computed Individual Symbols
Pre-computed List of Symbols

Abstract

We discuss paradigms for dealing with native routines in R packages. The approaches resolve ambiguities created by having routines with the same names in different packages and dynamically loaded libraries (DLLs), while still allowing one to have multiple versions of the same package loaded within an R session. The approaches aim at allowing native routines to be used as regular R objects that can be passed

Introduction

Most high-level interpreted languages provide facilities for loading native code (e.g. dynamically loadable libraries (DLLs) containing compiled C/C++ and Fortran routines). Python and Perl provide a great deal of structure for accessing these routines from those programming languages ([Python C interface], [Perl XS]). Matlab provides a simple way to associate a routine with a variable in Matlab. Languages like R and S-Plus have a great deal of flexibility. This flexibility can result in ambiguous and unexpected behavior unless care is taken.

We focus on R in this paper. There are several interfaces that allow one to call native routines. The primary ones are .C(), .Call(), .External() and .Fortran(). The typical paradigm for using these is that we identify the routine of interest by its name and, preferably, include a PACKAGE argument that identifies the DLL in which the routine is located. Again, the DLL is typically identified by its name. To make things concrete, let's suppose we have a routine named myRoutine which takes a vector of real values (i.e. double *). We compile the code into a DLL (or shared object) named myDLL.so. We can then load the DLL into the R session and invoke the routine with the following code:
dyn.load("myDLL.so")
.C("myRoutine", rnorm(10), PACKAGE = "myDLL")

The key thing to note here is that we use the name of the routine and the name of the DLL.

We should also note that the same mechanism is used in an R package; however, the DLL is typically loaded using the library.dynam() function or indirectly using the useDynLib() directive in the NAMESPACE file for the package, if present.

The PACKAGE argument is optional in R, but its use is strongly recommended. S-Plus does not have a PACKAGE argument for its .C(), .Call() or .Fortran() interfaces. In S-Plus, or if the PACKAGE argument is omitted in R, the system resolves the native routine in much the same manner as regular language variables are found on the search path, that is, by looking through the list of DLLs in the reverse of the order in which they were loaded (most recent first) and using the first match.

The difficulty with this mechanism is reasonably obvious. Suppose there are two DLLs that have routines that share the same name, say myRoutine. Let's suppose these are located in DLLs named A.so and B.so. Now, consider the following sequence of commands in an R (or S-Plus) session.
dyn.load("A.so")
.C("myRoutine", rnorm(10))

dyn.load("B.so")
.C("myRoutine", rnorm(10))

The first call to myRoutine finds the version in A.so. The second call to myRoutine will search the loaded DLLs and find the one in B.so. If this is what we want, that is fine. Otherwise, we will get either incorrect results or crashes due to the signatures of the two routines not being compatible.

By adding the PACKAGE argument to both .C() calls, we can clearly state which version of the routine named myRoutine we want. So if we had intended to perform the initial call again,
.C("myRoutine", rnorm(10), PACKAGE = "A")

would have sufficed.
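The lookup rules described above can be mimicked in plain R without any compiled code. The following is only a toy model: the names loaded, toy_load() and toy_call() are hypothetical and are not part of R, and ordinary R functions stand in for native routines. It shows that a name-only lookup returns whichever matching routine was loaded most recently, while adding a PACKAGE-style filter selects a DLL by name.

```r
# Toy model only: each "DLL" is a named list of R functions standing in for routines.
loaded <- list()   # stands in for R's global table of loaded DLLs

toy_load <- function(name, routines) {
  # Append to the global table, in load order.
  loaded[[length(loaded) + 1]] <<- list(name = name, routines = routines)
  invisible(NULL)
}

toy_call <- function(routine, ..., PACKAGE = NULL) {
  # Search the most recently loaded "DLL" first.
  for (dll in rev(loaded)) {
    if (!is.null(PACKAGE) && dll$name != PACKAGE) next
    f <- dll$routines[[routine]]
    if (!is.null(f)) return(f(...))
  }
  stop("routine not found: ", routine)
}

toy_load("A", list(myRoutine = function(x) "A's myRoutine"))
first <- toy_call("myRoutine", 1)                    # only A is loaded

toy_load("B", list(myRoutine = function(x) "B's myRoutine"))
second <- toy_call("myRoutine", 1)                   # B now shadows A
explicit <- toy_call("myRoutine", 1, PACKAGE = "A")  # name the DLL to get A again
```

The same call, toy_call("myRoutine", 1), gives different answers before and after the second load, which is the behavior that the PACKAGE argument is intended to prevent.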

So far, this information follows the R documentation [RExtensionsManual]. We now illustrate some limitations of this mechanism and then present a variety of programming idioms that circumvent these limitations. We also discuss some extensions to R that would potentially simplify these programming idioms.

Limitations of the PACKAGE argument approach

There is no doubt that the addition of the PACKAGE argument in R represents a significant improvement over the S-Plus interface. It allows the caller to specify their true intent rather than relying on other computations, and their order within the session, to determine which routine will be invoked. This mechanism works in almost all cases, and with minor modifications to one's source code, all known conflicts can be avoided. Unfortunately, it is the future, unknown conflicts that cause problems, and the very rare events that can be so hard to reproduce and diagnose warrant a solution that works in all cases.

What can go wrong with our computations if we use the PACKAGE argument? Firstly, if we want our S language code to be used in both R and S-Plus, we must omit the PACKAGE argument since it is not present in S-Plus. We will discuss two simple approaches to deal with this problem. The second problem arises in the rare situation that we have two different DLLs with the same name and with different versions of the same routine. This occurs by definition when we have two versions of the same package. And when we are trying to compare the characteristics of two versions, e.g. their results or efficiency, it is convenient to be able to load the two versions of the package into the same R session. More specifically, it is awkward and tedious to have to use two different sessions, each with a different version of the package, in order to compare results. In addition to the multiple-version problem, it is also quite possible for two unrelated DLLs to have the same name, such as myLibrary.

Both of these problems are caused by the same fundamental problem. We are using either just the name of the routine, or the name of the routine and the name of the DLL, to identify a routine. When we provide this singleton or pair, the R engine then locates the routine by searching through a list of globally-held resources, that is, the DLLs. As this list is changed by loading and unloading DLLs, the resolution mechanism deals with different states and so the results can vary from call to call. Essentially, we are using global variables and this is bad [Knuth - Global variables considered bad]. Different commands modify these global variables as the session progresses and the result is unpredictable. This flexibility is desirable for interactive use, when the user may want to explicitly control the resolution of symbols. It is like attaching data frames or packages to the R search path to ensure that we find variables in different locations. For more formal code in packages, however, predictable behavior is essential and this is especially true for native routines.
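The search path analogy can be tried directly in R. The following small sketch uses hypothetical names (demoA, demoB and native_demo_value) chosen only for illustration. It attaches two lists that define the same variable and shows that the value found by name depends on what has been attached most recently.

```r
# Two "databases" that define a variable with the same name.
attach(list(native_demo_value = "from A"), name = "demoA", warn.conflicts = FALSE)
first <- get("native_demo_value")    # only demoA defines it so far

attach(list(native_demo_value = "from B"), name = "demoB", warn.conflicts = FALSE)
second <- get("native_demo_value")   # demoB is now earlier on the search path

# Restore the search path.
detach("demoB", character.only = TRUE)
detach("demoA", character.only = TRUE)
```

The lookup expression is identical in both cases; only the global state (the search path) has changed. Name-based resolution of native routines behaves in the same way with respect to the list of loaded DLLs.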

The solution to this problem is to avoid using global variables. There are various paradigms to do this. At its simplest, the goal is to be able to identify a symbol directly within a DLL. Since (exported, accessible)[1] symbols are unique within a DLL, all we need is to be able to uniquely identify the DLL. And thanks to some facilities added to R in 2004, this is quite feasible. Of course, using these R-specific facilities makes the resulting code incompatible with S-Plus.

A Better Approach

If one can forgo the compatibility between R and S-Plus and develop code specifically for R, we can solve the problem of using global variables comprehensively. As we mentioned above, in order to avoid global variables, we need to be able to refer to the correct DLL explicitly and directly, rather than by its name. With this goal, let's see how we can achieve this in the different possible contexts in R.

Let's consider perhaps the simplest case where we have no package, but merely a DLL containing routines that we want to call from R. Packages are similar but more involved and we will look at these later. Again, let us suppose that we have a DLL named myDLL.so and we are interested in the routine named myRoutine. Instead of using the (symbol name, DLL name) pair to identify the symbol, which requires two lookups, we save the value returned from loading the library and then resolve the routine within that object.
dll = dyn.load("myDLL.so")

The resulting object in dll is an object of class DLLInfo. Currently, this is an S3-style or informal class. It is a list with elements that describe the DLL for use in R. For our purposes, we do not need to know about its contents; we only need the object to pass to other functions.
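For readers who want to see such an object without compiling anything, the DLLs that R has already loaded can be inspected with getLoadedDLLs(). The sketch below looks at the entry named "base", which is present in any R session. The element names mentioned in the comments are those of current R versions and are shown only for orientation.

```r
# All DLLs currently loaded in this session, in load order.
dlls <- getLoadedDLLs()

# Each element is a DLLInfo object of the kind returned by dyn.load().
dll <- dlls[["base"]]

class(dll)      # the S3 class of the object
names(dll)      # the descriptive elements, e.g. the DLL's name and path
dll[["name"]]   # the name under which the DLL is registered
```

As the text notes, code that follows the idiom does not need to look inside the object; it only needs to keep the reference and pass it on.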

Now that we have loaded the DLL, we can find a symbol within it using getNativeSymbolInfo(), as in
symbol = getNativeSymbolInfo("myRoutine", dll)

Note that we use the dll object for the PACKAGE argument rather than the name of the DLL. We are no longer performing the lookup of the DLL object by name, and relying on that name to uniquely identify the specific DLL. Instead, we have managed the reference to the DLL directly in R within our session.

The object returned from getNativeSymbolInfo() is of (S3) class NativeSymbolInfo. This is a description of the native symbol, including its name and its location or address in memory. The final step is then to use this value in the call to the native routine. Each of the native interface functions accepts either the name of the routine and the DLL identifier via the PACKAGE argument or, alternatively, a NativeSymbolInfo object. So we can use our symbol variable directly in the invocation of the routine, as
.C(symbol, rnorm(10))

Since this contains the address of the symbol, R can call that routine directly and avoid the lookup again. In this respect, it is faster than the (name, PACKAGE) specification approach. If we want to call the routine again, there is no further lookup. In fact, we no longer need the dll.
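The same three steps (obtain a DLL reference, resolve the symbol within it, call through the symbol) can be tried on a DLL that ships with R. The sketch below uses the DLL of the stats package and the routine behind dnorm(). This relies on internal details of the stats package, namely that it keeps a NativeSymbolInfo object named C_dnorm in its namespace and that the routine takes four arguments. These are assumptions about current R versions, not a documented interface, and are used here only because they avoid the need for a compiler.

```r
# Assumption: stats stores a NativeSymbolInfo named C_dnorm (internal detail).
registered <- stats:::C_dnorm

# Step 1: a direct reference to the DLL, rather than its name.
dll <- getLoadedDLLs()[["stats"]]

# Step 2: resolve the routine within that specific DLL object.
symbol <- getNativeSymbolInfo(registered$name, dll)

# Step 3: call through the resolved symbol; no PACKAGE argument is involved.
# The arguments mirror dnorm(x = 0, mean = 0, sd = 1, log = FALSE).
direct <- .Call(symbol, 0, 0, 1, FALSE)
```

In one's own code, dll would instead be the value returned by dyn.load() or library.dynam(), and the routine name would be known in advance.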

This approach uses extra variables to store the DLL reference and then the symbol. This is not a particular problem, but it does make the invocation less direct. So for interactive use, people may prefer to use the (name, PACKAGE) specification. And in an interactive setting, the user is more likely to be aware of potential conflicts in DLLs or to recognize erroneous answers. It is when we use native routines as part of a larger computation rather than interactively that that knowledge is not available to us.

There are three issues with this approach of resolving the symbol directly in the DLL. Firstly, suppose that we resolve the symbol and assign it to a variable, e.g.
symbol = getNativeSymbolInfo("myRoutine", dll)

in the code above. Later, we unload the associated DLL from the R session. At that point, the contents of symbol identify an invalid symbol that is no longer available to us. If we were to use this in a native interface function, we would most likely crash R. And there is no built-in mechanism to update these variables when the DLL is unloaded. We can arrange to do this ourselves, however. And generally, DLLs are used within packages rather than interactively loaded directly via dyn.load(). In this way, the references to the symbols are contained in the package's environment or namespace. When the package is detached, the DLL is unloaded and the package's variables disappear and so will not be used in future computations. So these invalid references do not cause problems. The DLL is unloaded after calls to unload hooks, i.e. .onUnload() or .Last.lib(), so these functions can make calls to native routines using the approach we describe.

A second issue to keep in mind when using this approach is that the computations to get the symbol references must be done in each session. One cannot use values computed in a different session and reloaded directly into the new session. The reason for this is that the actual addresses of the symbols are external pointers in R and when saved to a file are stored as NULL. If such references are restored and used in a native interface function, the result will be an error.
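The loss of the address can be observed without writing a file, since serialize() and unserialize() use the same representation as saving and reloading. The sketch below again borrows the C_dnorm object from the stats namespace, which is an internal detail of that package and an assumption made purely for illustration.

```r
# Assumption: stats stores a NativeSymbolInfo named C_dnorm (internal detail).
symbol <- stats:::C_dnorm

# Round-trip the object as save()/load() would.
restored <- unserialize(serialize(symbol, NULL))

# The descriptive parts survive the round trip ...
same_name <- identical(symbol$name, restored$name)

# ... but the external pointer holding the address does not: identical()
# compares the addresses held by the two pointers, and the restored one is NULL.
same_address <- identical(symbol$address, restored$address)
```

This is why the symbol references have to be recomputed in every session rather than stored with the package's saved objects.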

Again, most use of native routines occurs in a package. Rather than computing the references in the top-level code of the package, these computations must be done each time the package is loaded. So this should be done within the .First.lib() or .onLoad() functions.

The final issue is that any information about the native routine that is provided explicitly via the registration mechanism is not currently used. This is something that will be fixed internally and will not change the paradigm or programming interface for it.

Packages

In the previous section, we discussed using the DLL reference directly to resolve the native symbol references in the context of interactive use. As we mentioned, this is normally done in R packages and several of the potential consistency issues are easier to deal with in the context of a package. In this section, we will illustrate how to use this approach in the context of a package. The code in the example is available from RDotCall. The package is called RDotCall and it contains three functions: R_call(), R_c() and R_version(). These are simple wrappers for the corresponding C routines. There are also 2 alternative versions of the package within this package that exhibit the different approaches, and that also allow us to test the ability to handle multiple versions of the same package and correctly resolve the native symbols in each.

The basic approach is the same as it was for the interactive loading of the DLL via dyn.load(). We recommend that each package you develop uses a namespace and explicitly exports variables that it wants to make available to others. We will focus on this scenario. Packages without a namespace work in much the same way and we will mention the differences at the end of this section.

A package that has an associated DLL needs to load that DLL into the R session when the package is itself loaded. When using a namespace, this can be achieved using the useDynLib() directive in the NAMESPACE file of the package. This causes R to load the DLL given as the argument to the "function". This is a good way to specify the association between the package and the DLL. Unfortunately, it does not allow us to get a reference to the DLL. So, in addition to this loading of the namespace, we will explicitly load the DLL, much as we did with dyn.load(). For a package, we can do this most easily using library.dynam(), which takes care of finding the DLL in the relevant directory associated with the package. Like dyn.load(), the return value from library.dynam() is an object of class DLLInfo which we can use to access the native symbols in the DLL.

We explicitly load the DLL when the package is itself loaded. To do this, we use the .onLoad() hook that is used for packages with a namespace. This is the namespace equivalent of the .First.lib() function that is called when a regular package without a namespace is loaded. The .onLoad() function is called with two arguments: the name of the directory that holds the collection of packages (i.e. the library) in which the package being loaded is installed, and the name of the package itself. We call library.dynam() with the name of the DLL (without the directory or file extension), the name of the package as given in the second argument to our .onLoad() function, and the library directory as given in the first.
.onLoad =
function(libname, pkgname)
{
  # Load the package's DLL and keep the returned DLLInfo reference.
  dll = library.dynam("RDotCall", pkgname, libname)
  ...
}

Note that the first argument to library.dynam() is usually the same as the pkgname argument in the call to .onLoad(). If this were always the case, we should use that and not use the explicit, literal name "RDotCall". However, the current implementation of versioned packages uses the version-qualified name for the package as the value for pkgname and this will not correspond to the name of the DLL within that package. So, for the present, we use the explicit name of the DLL which is, in general, the unqualified name of the package.

There are now several approaches we can take in order to work with the references to the native routines that we wish to invoke within the code for the package. One is to resolve all the symbols that are used in the package within the body of the .onLoad() function. We do this by calling getNativeSymbolInfo() for each of the routines of interest. getNativeSymbolInfo() in (very recent versions of) R (i.e. R-2.3.0 development) is now vectorized, so it can process multiple symbols in a single call. The result is a list of the NativeSymbolInfo objects, and the elements can be indexed by the name of the routine. So this allows us to get all the routines of interest in one call, i.e.

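symbols = NULL   # defined in the package so that <<- below assigns here

.onLoad =
function(libname, pkgname)
{
  dll = library.dynam("RDotCall", pkgname, libname)
  # Resolve both routines in a single, vectorized call.
  symbols <<- getNativeSymbolInfo(c("R_call", "R_c"), dll)
}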


Notice that we have used the non-local assignment operator <<-. This is because we need these symbols after the .onLoad() completes. So we need to store these for use in the native interface calls. And in order to ensure that the symbols variable is local to our package, we must define it in our package outside of the .onLoad() function.
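The behavior of <<- that motivates this advice can be seen in a small experiment. In the sketch below, pkg is a hypothetical environment standing in for a package namespace, and stray_symbols is a deliberately undeclared variable. The string "resolved" is a placeholder for the value that getNativeSymbolInfo() would return.

```r
pkg <- new.env()   # hypothetical stand-in for a package namespace

local({
  symbols <- NULL  # declared in the "package", so <<- finds it here

  init <- function() {
    symbols <<- "resolved"        # updates the package-level variable
    stray_symbols <<- "resolved"  # never declared: lands in the global environment
  }
}, envir = pkg)

pkg$init()

in_package <- pkg$symbols
leaked <- exists("stray_symbols", envir = globalenv(), inherits = FALSE)
rm(list = "stray_symbols", envir = globalenv())   # tidy up the leaked variable
```

The declared variable is updated where it was defined, while the undeclared one ends up in the user's workspace, which is exactly what a package should avoid.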

Now that we have the symbols, we can use them in what is hopefully the obvious way. We can index the symbols list using the name of the routine of interest as in
 .Call(symbols[["R_call"]], 2, 3)

This gives us the desired results. It passes the NativeSymbolInfo to the .Call() function and there is no need for a PACKAGE argument.

Rather than having a single variable (symbols) that contains all the references to the native routines, one could also create explicit variables for each of these symbols. In the .onLoad() function, we can assign the symbol information to individual variables in at least two ways. The first way is to have a call for each routine of interest, e.g.

R_call_sym = NULL
R_c_sym = NULL

.onLoad =
function(libname, pkgname) {
  dll <- library.dynam("RDotCall", pkgname, libname)

  # Resolve each routine once and store it in its package-level variable.
  R_call_sym <<- getNativeSymbolInfo("R_call", dll)
  R_c_sym <<- getNativeSymbolInfo("R_c", dll)
}


Again, note that we use the non-local assignment operator as we want R_call_sym and R_c_sym to be available to the code in the package after .onLoad() returns and the package is loaded. And again, to ensure that these variables are not assigned to the global environment, we must define them in our package so that the <<- knows where the new values should be assigned.

Alternatively, one can use an explicit call to assign() to put the variable in the package.
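A minimal sketch of the assign() variant follows. It uses a plain environment, pkg, as a hypothetical stand-in for the package's namespace, and character strings as placeholders for the NativeSymbolInfo objects that getNativeSymbolInfo() would return, so that it can be run without a DLL.

```r
pkg <- new.env()   # hypothetical stand-in for the package's namespace

# Placeholders for the list returned by a vectorized getNativeSymbolInfo() call.
symbols <- list(R_call = "symbol for R_call", R_c = "symbol for R_c")

# Create one variable per routine, named <routine>_sym, inside the "package".
for (id in names(symbols)) {
  assign(paste(id, "_sym", sep = ""), symbols[[id]], envir = pkg)
}

created <- ls(pkg)
```

The important detail is the envir argument: without it, assign() would create the variables in the frame of the function that calls it, and they would vanish when that function returns.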

Regardless of which mechanism is used to create these package variables, they can be used in the native interface function calls in the usual manner:
 .Call(R_call_sym, 2, 3)

The final approach that can be used avoids explicitly resolving and assigning the native symbols when the package is loaded. Instead, we can assign the reference to the DLL obtained by the call to library.dynam() and then use this to dynamically resolve the native symbol in the DLL when it is needed. So in the package's .onLoad() function, we assign the loaded DLL to a package variable. Then, we use getNativeSymbolInfo() either directly or indirectly in any of the .Call(), .C(), etc. function calls. So the .onLoad() function is written as

dll = NULL   # package-level variable holding the DLL reference
.onLoad =
function(libname, pkgname) {
  dll <<- library.dynam("RDotCall", pkgname, libname)
}


Then, the wrapper function R_call() can be defined as
R_call =
function(a, b)
{
  .Call(getNativeSymbolInfo("R_call", dll),  a, b)
}

The expression getNativeSymbolInfo("R_call", dll) is a little "ugly", so instead we use the syntactic sugar dll$R_call. The function then looks more succinctly like
R_call =
function(a, b)
{
  .Call(dll$R_call,  a, b)
}

If the package does not have a namespace, the above description applies with only one essential difference: replace the .onLoad() function with a .First.lib() with the same body. The other important difference is that the package level variables such as dll, symbols or R_call_sym now become global variables. So they should be named appropriately. But, really, one should use a namespace. The time spent learning about namespaces is relatively short and worth the effort.

Package versions

The PACKAGE argument helps to remove ambiguities. However, one of the reasons for avoiding it arose in 2004 when Andy Liaw attempted to load two versions of the same package in the same R session. Since the name of the DLL in a package does not depend on the package version, the DLL name specified as the PACKAGE argument does not differentiate between the two versions of the package. Consider what happens when we load version 1.0 and then version 2.0 of a package, say A, into an R session. Both versions may have a call to a routine named, say, foo via code such as
 .C("foo",  PACKAGE = "A")

In this case, R will search for the DLL named A, and then find the one for the most recently loaded package, i.e. 2.0. Accordingly, the old version will end up using the routines in the new version.

Namespaces allow us to provide a solution to this problem. Suppose a function in a package calls one of the native interface functions. If that package has a namespace, we can find the namespace's environment from that function (in typical cases). And the package's DLL(s) is stored in the namespace's environment and so is available to us. Thus, if the PACKAGE argument is omitted in these native interface function calls, the dispatch mechanism that handles calling the native routines can find the implied DLL. And this handles the issue of different package versions. In our example above, the native interface call would be
 .C("foo")

without the PACKAGE argument. And then, in each case, the interface mechanism would find the DLL loaded by that particular version of the package from the package's namespace.
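The difference between looking a DLL up by name and holding a reference to it in the package's own environment can be illustrated with a toy model in plain R. Everything here is hypothetical: dll_table, toy_load(), by_name() and make_version() are illustrative names, closures stand in for package namespaces, and R functions stand in for native routines.

```r
dll_table <- list()   # stands in for R's global table of loaded DLLs

toy_load <- function(name, routines) {
  dll <- list(name = name, routines = routines)
  dll_table[[length(dll_table) + 1]] <<- dll
  dll   # return a direct reference, as dyn.load() and library.dynam() do
}

by_name <- function(routine, PACKAGE) {
  # Most recently loaded DLL with a matching name wins.
  for (dll in rev(dll_table)) {
    if (dll$name == PACKAGE) return(dll$routines[[routine]]())
  }
  stop("no DLL named ", PACKAGE)
}

make_version <- function(version) {
  # Both versions register their DLL under the same name, "A".
  dll <- toy_load("A", list(foo = function() paste("foo from version", version)))
  list(
    foo_by_name = function() by_name("foo", PACKAGE = "A"),
    foo_by_ref  = function() dll$routines[["foo"]]()   # uses this version's own DLL
  )
}

v1 <- make_version("1.0")
v2 <- make_version("2.0")

v1_by_name <- v1$foo_by_name()   # resolved through the shared name
v1_by_ref  <- v1$foo_by_ref()    # resolved through version 1.0's own reference
```

In this model, version 1.0's name-based call ends up in version 2.0's routine, while the reference-based call stays within version 1.0, which is the behavior a namespace-local DLL reference is meant to guarantee.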

One of the difficulties with this approach of removing the PACKAGE argument is that there is now non-trivial computation to resolve the symbol in the DLL. The internals have to find the DLL for the namespace. In many cases, this is not very costly, but it is unnecessary if we are not using multiple versions of a package and if there are no DLLs with the same name. And the computations are performed each time a routine is called. I believe it is important to have a system that is predictable and correct, rather than having the potential for miscalculation that can go unnoticed, regardless of how remote that possibility is. The paradigms described in this paper, and specifically in this section, remove the extra computations repeated in each call to a native routine. If the native routines are called frequently, the computations will in fact be more efficient when amortized over the calls. These paradigms handle different package versions, avoid any ambiguities caused by similarly named DLLs in different packages, and can be used for DLLs other than the default one associated with a package. As R is used in different contexts, such a general and flexible computational model is important to facilitate rather than limit developments.

Summary

We have discussed the ambiguities in identifying native symbols in R and S-Plus. R provides the PACKAGE argument in the native interface functions to assist in removing the ambiguities. However, this does not remove the ambiguities entirely because it still relies on global variables rather than direct references. In this paper, we have outlined a programming idiom to remove all the ambiguities. By dealing directly with references to DLLs containing the native symbols of interest, we avoid the two-step lookup necessitated by using names. The idiom manages the symbols correctly in the case of multiple versions of the same package and of different DLLs with the same name (but different locations). It does not incur a significant computational penalty, and can even improve performance for regularly used routines.

The idiom is described by the following steps
  • Namespace: A package should have a namespace. And the DLL(s) to be loaded should be specified via the useDynLib() directive in the NAMESPACE. And this directive should also include the names of the symbols that are to be resolved in that DLL for use in R, e.g.
    useDynLib(RDotCall, myRoutine, myOtherRoutine)
    

    These names can then be used to directly refer to the native symbols in the interface function calls, i.e.
     .Call(myRoutine, ...)
    

    Since these symbol names often conflict with R variable names, one can specify aliases by which they can be referenced in the R package code. This is done by supplying names for the symbols in the useDynLib() directive
    useDynLib(RDotCall, symbol.myRoutine = myRoutine,  myOtherRoutine)
    

    and using those in the interface calls
     .Call( symbol.myRoutine, ...)
    

    As illustrated in the above example, elements without an explicit name are accessed by the symbol name.
  • .onLoad(): In addition to the useDynLib() directive in the NAMESPACE to load the package's DLL, in the .onLoad() function in the package, we explicitly load the DLL using library.dynam() and assign the return value to a variable. There are then two different approaches.
    • Assign the DLL reference from library.dynam() to a (non-exported) variable in the package (say dll) and have all .C(), .Call(), .Fortran(), .External() calls use the form
       .Call(dll$routine_name, ...)
      

      where routine_name is the name of the native routine. This is the dynamic resolution approach as each call to dll$routine_name will involve resolving the routine.
    • Within the .onLoad() function, resolve all the routines that are used within the code in the package. This can be done with a single call to getNativeSymbolInfo() with the names of all the routines and the DLL reference returned by library.dynam(). The result is a list[2] containing objects that identify each routine. Within this approach, there are again two different ways to use these symbols.
      • One can assign this list to a package variable, say symbols. The elements can then be used in calls to the native interface functions using expressions like
        .Call(symbols[["routine_name"]], ...)
        

      • One can avoid the indexing of the list (symbols[["routine_name"]]) by explicitly assigning the individual elements to individual variables in the package. The commands
        dll = library.dynam("myPackage", pkgname, libname)
        symbols = getNativeSymbolInfo(c("foo", "bar"), dll)
        e = getNamespace("myPackage")
        # Assign each symbol into the package's namespace, not the local frame.
        for(id in names(symbols))
            assign(paste(id, "_sym", sep = ""), symbols[[id]], envir = e)
        

        create variables named foo_sym and bar_sym that identify the routines foo and bar respectively. These variables can then be used in native interface calls with code such as
         .C(foo_sym, ...)
        

The dynamic approach (dll$routine) is probably the simplest to program; you need only to assign the DLL reference, and the .C(), .Call(), ... calls are very similar to the traditional form. This approach causes each native routine to be resolved in the DLL each time it is invoked. Accordingly, if one expects many calls to different routines, there will be a small, unnecessary overhead in resolving these routines each time. Instead, one can resolve the routines when the package is first loaded and then use these pre-resolved symbols in the native interface routines. This will slightly increase the initial time in loading the library, but will speed up repeated calls to routines. The dynamic approach enjoys the advantage of simplicity and less programming. If one adds a new routine to the DLL to be called from R, there is no need to add it to the list of symbols resolved in the .onLoad() function. Instead, the native interface function need only access the existing dll variable.

There are two changes that are needed in R. One is to connect the existing registration information for native routines specifying the number and type of expected arguments to this framework. This is relatively minor and does not change the programming idioms described in this paper. The second requires a change to the quality assurance (QA) tools which validate packages. Currently, the QA tools check for the presence of the PACKAGE argument in the foreign function interface calls. Using the idioms described in this paper makes these tests more complex. Essentially, they are run-time tests rather than static code analysis. We need to support these idioms in the QA tools.

It is possible to assign the symbol information to one or more variables in the package's namespace directly from the C code that is called in the initialization of the DLL when the package is loaded. This initialization routine (R_init_packageName) is often used to register native routines. We can provide additional facilities in the C-level API for R to assign the NativeSymbolInfo objects into the package's namespace. It is not clear that this will actually simplify the development of a package with a DLL relative to using one of the three approaches above. Additionally, rather than implementing them all, we will wait until we determine which of the three different approaches is most widely adopted by R programmers, and then add the facilities for that approach.

Acknowledgements

The original need for a more general mechanism than the PACKAGE argument was brought to my attention by Robert Gentleman and Andy Liaw. And Robert has provided helpful comments on this paper.

Bibliography

[Perl XS] Dean Roehrich and The Perl Porters. perlxs: XS language reference manual. perlxs at perldoc.perl.org.

[RExtensionsManual] R Development Core Team. Writing R Extensions.

[Python C interface] Guido van Rossum. Extending and Embedding the Python Interpreter. Python documentation.

A. Example Code

In this appendix, we present example code for each of the three different idioms we are proposing in this paper.

NAMESPACE directives

This is the simplest and "official" way of using symbols in a DLL.
# Load the package's DLL, keep its DLLInfo in the variable bob, and create the
# R variables R_call_sym, R_c_sym and R_version_sym for the native symbols.
bob = useDynLib(RDotCall, R_call_sym = R_call, R_c_sym = R_c, R_version_sym = R_version)


export(R_call, R_c, R_version)







And the corresponding R code to use these symbols is given as
R_call =
function(a, b)
{
 .Call(R_call_sym, as.numeric(a), as.numeric(b))
}


R_c =
function(a, b)
{
 .C(R_c_sym, a = as.numeric(a), as.numeric(b))$a
}

R_version =
function()
{
 .Call(R_version_sym)
}


Dynamic Approach

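# This is an example of using the dynamic symbol resolution
# in each call to a routine.

if(FALSE) {
  # Disabled: the NAMESPACE file uses the bob = useDynLib() directive, so bob
  # already holds the DLL reference. Without that directive, one would define
  # dll as below and use dll in place of bob in the functions that follow.
dll = NULL

.onLoad =
function(libname, pkgname) {
  dll <<- library.dynam("RDotCall", pkgname, libname)
}
}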

R_call =
function(a, b)
{
 .Call(bob$R_call, as.numeric(a), as.numeric(b))
}


R_c =
function(a, b)
{
 .C(bob$R_c, a = as.numeric(a), as.numeric(b))$a
}

R_version =
function()
{
 .Call(bob$R_version)
}


Pre-computed Individual Symbols

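The listing for this approach is not reproduced in this copy of the paper. The following is a sketch assembled from the fragments in the Packages section and from the wrappers used in the other listings of this appendix; it is not the author's original file. In particular, the R_version_sym variable and the libname argument to library.dynam() are assumptions added for completeness.

```r
# Package-level variables; defined here so that <<- in .onLoad() assigns to them.
R_call_sym <- NULL
R_c_sym <- NULL
R_version_sym <- NULL   # assumed by analogy with the other two

.onLoad <- function(libname, pkgname) {
  dll <- library.dynam("RDotCall", pkgname, libname)

  # Resolve each routine once, when the package is loaded.
  R_call_sym <<- getNativeSymbolInfo("R_call", dll)
  R_c_sym <<- getNativeSymbolInfo("R_c", dll)
  R_version_sym <<- getNativeSymbolInfo("R_version", dll)
}

# Wrappers that use the pre-computed symbols directly.
R_call <- function(a, b) {
  .Call(R_call_sym, as.numeric(a), as.numeric(b))
}

R_c <- function(a, b) {
  .C(R_c_sym, a = as.numeric(a), as.numeric(b))$a
}

R_version <- function() {
  .Call(R_version_sym)
}
```

Evaluating this code only defines the variables and functions. The symbols themselves are resolved when R calls .onLoad() while loading the package, so the wrappers are usable only in that setting.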

Pre-computed List of Symbols

symbols = NULL

.onLoad =
function(libname, pkgname) {

  dll <- library.dynam("RDotCall", pkgname, libname)

  # Resolve all of the routines used by the package in one call.
  symbols <<- getNativeSymbolInfo(c("R_call", "R_c", "R_version"), dll)
}

R_call =
function(a, b)
{
 .Call(symbols[["R_call"]], as.numeric(a), as.numeric(b))
}

R_c =
function(a, b)
{
 .C(symbols[["R_c"]], a = as.numeric(a), as.numeric(b))$a
}


R_version =
function()
{
 .Call(symbols[["R_version"]])
}




[1] A DLL may have several symbols with the same name as long as they are within different namespaces or scopes such as static variables within different C source files.

[2] getNativeSymbolInfo() returns a list if there is more than one routine requested or if unlist is FALSE.