Aspects of S-Plus/R Compatibility

This document outlines some of the main issues to be resolved in writing S language software and supporting C software that will work with both R and S-Plus.

Strategy

The approach to achieving compatibility has to balance R/S-Plus compatibility with back-compatibility within the individual system. Our strategy, as far as possible, is to provide tools in the SLanguage package that can be used in either R or S-Plus to write software that works in the other system, but without breaking existing software within the particular system.

In other words, including the package in R should be transparent as far as possible to working software within R. But to write R functions that also work in S-Plus, the programmer may have to take some extra steps. Similarly, attaching the SLanguage library in S-Plus should not break existing S-Plus software, but the programmer will again need to follow some guidelines to ensure that new functions work in R.

The trade-offs may be clearer after looking at an example. One of the areas where help is needed involves get and other functions that access objects (see below). The R SLanguage package has a definition of get that (we claim) can be used without change as a replacement for the get in package base. The R programmer who also wants software to work in S-Plus (with the SLanguage library attached) may have to do some extra work. In this case, the extra requirement is to name any optional arguments to get. For example,


    get("myData", 2)

means, in R, to search in position 2 of the search list. But in S-Plus, the second argument refers to the evaluation frame, so the call above is not compatible. The programmer needs to write:

    get("myData", pos = 2)

instead. This will work in both systems, but only with the SLanguage library attached (because some of the argument names are different in the standard versions of the functions).

Fixes of this style will go a long way to providing compatibility, especially for S language programming. There remain semantic inconsistencies that must be avoided in the program design, but these tend to be sufficiently obscure that most applications will not suffer from them.

The strategy for programmers using C will be described later. One essential component is a file of C preprocessor macros that hide differences in the data structures and support routines used by the two implementations. Programmers using these macros instead of the macros supplied with R and S-Plus directly will have a substantially better chance of writing compatible code.

Sources of Incompatibility

Life will be pleasanter if, before starting to write software that intends to work in both S-Plus and R, you review some of the main causes of incompatibility, to see how they may be avoided.

There is no standard for the S language. (Not yet, anyway, although we hope that experience with this package and other work will lead to discussions about such a standard.) The language was roughly defined by the books published in 1988 (the ``blue book'', on the language) and in 1992 (the ``white book'', on statistical models). The implementation of S-Plus was based on the S software from Bell Labs described in those books. R is an open-source implementation using (mainly) the description in the books as a starting point.

The situation is further complicated in that S was, essentially, re-implemented in a version described in the 1998 ``green book'', although with a large degree of back-compatibility. The many new tools described in this version, plus new developments in R, throw in many possibilities for extensions.

Beyond smoothing over incompatibilities in the existing common part of the language, a useful, growing standard for the S language needs to incorporate the important extensions. Fortunately, a number of these are already being developed, within the R project and elsewhere.

And S needs to continue to grow as well. There are a number of new developments underway and others that are needed. It would be appropriate for the Omegahat project, and the SLanguage package in particular, to provide support for these. A separate document on extensions discusses some related topics.

Objects and Databases (get, exists, assign, etc.)

There are two problems here: first, the R and S-Plus semantics for storage and for finding objects are slightly different; and, second, the arguments to the functions that access objects are inconsistent between the two systems. The problems are related but the second problem is both easier to handle and of more practical relevance. We'll discuss the first below.

The SLanguage package provides versions of get, exists, assign, remove, and objects that deal with all the arguments of either the S-Plus or R standard implementations. The matching arguments in each system come first and are interpreted consistently with the standard version, so that existing software should continue to work in that system. Since all arguments are provided in both systems, however, naming the optional arguments should make the code portable to the other system, up to the limits imposed by different semantics (we'll get to that in the next section).

Recommendation: To use these functions compatibly, always supply arguments relating to position, frame or environment by name.
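As a minimal sketch of this recommendation, the calls below name every argument that refers to a position. It uses only the standard R versions of the functions, so the argument names shown (pos, where) are those of base R; whether the identical calls run in S-Plus depends on the SLanguage library being attached there, which is not tested here. The object name myData is only an illustration.

```r
# Store an object at position 1 of the search list (the global environment
# in R), naming the position argument instead of passing it by position.
assign("myData", c(1.5, 2.5, 4), pos = 1)

# Check for the object and retrieve it, again with named arguments.
# Base R calls this argument `where` in exists() and `pos` in get().
found <- exists("myData", where = 1)
value <- get("myData", pos = 1)
```

Written this way, no call depends on what the second positional argument happens to mean in a particular system.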

Objects, Names, Environments, and Databases

The two implementations of S differ in the way they implement the association of names with objects. The differences in implementation sometimes leak into the way programming is done. For the most part, the differences are internal, meaning they are not as easy to remove as differences in specific functions. The main message to users is to avoid the use of features that work only in one implementation.

The original S notion was that objects were organized in frames, roughly analogous to a named list. Arguments in a function and local assignments created or changed elements of the list associated with the corresponding name. Each function call had a corresponding frame.
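The frame-as-named-list picture can be made visible with a small sketch. This one is R only, and it calls environment(), which is not something to rely on in portable code; it is here purely as an illustration. The function name showFrame and the variables a and b are hypothetical.

```r
# Return the frame of one function call as an ordinary named list.
showFrame <- function(a) {
    b <- a * 2                 # a local assignment adds an element named "b"
    as.list(environment())     # R only: snapshot of the current frame
}

fr <- showFrame(1)
fr <- fr[order(names(fr))]     # element order is not guaranteed, so sort by name
```

The resulting list has one element for the argument and one for the local assignment, each under its own name.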

When objects were assigned at the top level, the analogous database was originally pictured as a directory, with individual files corresponding (when possible) to objects of the same name. Since functions are first-class objects, libraries of functions were stored identically.

R added to the notion of frames that of an environment, roughly a list of frames (but see comments on Lisp-style object structure). When a function is created (not when it is called), the current environment is stored with it. The main applications of this idea have been to create functions that can share objects in their parent environment(s) (called enclosures in R).
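A minimal sketch of this behavior in R follows; the names makeCounter, count and nextValue are hypothetical. The inner function is created during a call to makeCounter, so it keeps that call's frame as its environment and can update count there from one call to the next. This relies on the R rules and should not be expected to behave the same way in S-Plus.

```r
# makeCounter() returns a function that remembers how often it was called.
makeCounter <- function() {
    count <- 0
    function() {
        # `<<-` assigns in the enclosing environment, where `count` lives.
        count <<- count + 1
        count
    }
}

nextValue <- makeCounter()
first <- nextValue()
second <- nextValue()
```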

One particular environment, the global environment, is associated with top-level assignments (corresponding to the working directory in the terminology of the blue book). But this is not a directory; all the objects in the global environment are kept in memory and copied to disk only by explicit user command or at the end of a session.

Programmers can get involved with all this in two ways:

  1. You want to use the features of environments. This usually arises when you want a function to have memory from one call to the next (say the current element in a list or some current option). The SLanguage library for S-Plus provides a way to set the environment of a function explicitly. Using this function will make the simple use of environments compatible between S-Plus and R. The mechanism is not automatic, of course, and is inefficient if you call the function many times to do only a little computing each time.

    Note that to make the code work without change, both the R and S-Plus versions must set the function's environment explicitly. (Those of us with some reservations about the safety of the mechanism might think of explicit environment settings as a desirable feature anyway.)

  2. You don't want to use environments explicitly and want to ensure that your code doesn't get tripped up by the incompatibilities. Read on in the present document.
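For the first case, the R side of an explicit environment setting might look like the sketch below. It uses the replacement function environment<- from base R; the corresponding SLanguage function for S-Plus is not named in this document, so nothing is assumed about it here. The names nextElement, state, elements and position are hypothetical.

```r
# A function that remembers the current element in a list between calls.
nextElement <- function() {
    position <<- position + 1
    elements[[position]]
}

# Build the environment that holds the function's memory, then attach it
# to the function explicitly rather than relying on where it was defined.
state <- new.env()
assign("elements", list("a", "b", "c"), envir = state)
assign("position", 0, envir = state)
environment(nextElement) <- state

e1 <- nextElement()
e2 <- nextElement()
```

Because the environment is set in one visible step, the function does not depend on the place in the code where it happened to be created.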

The rules for R are described in the Scope of Variables section of the R Language Definition (supplied with each copy of R). The rules for S-Plus are defined in section 5.4 of the blue book; see particularly the diagram on page 118.

For those who like such discussions, the relative advantages of the two approaches are interesting. But our focus here is on avoiding bad results when using software in both systems.

For the R programmer, the first warning is that some features cannot be used if compatibility with S-Plus is a goal. The features mostly center around the explicit use of environments, that is, explicit references to a nested set of frames in which names are bound to objects. In the future, some of the uses of environments and related ideas in R may resurface in compatible form (one of the projects building on the SLanguage package is an implementation of object references, which in R is based on using environments). Meanwhile, environments are off-limits if you want compatible code.

For the S-Plus programmer, the only likely way to trip up would be to use locally defined functions in a nested way that explicitly assumed that different locally defined functions did not see the objects in the parent function. This is unlikely; what happens more often is that programmers initially assume something more like the R semantics, only to discover that it doesn't work.
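The difference can be sketched with two versions of the same computation; all names here (scaleImplicit, scaleExplicit, helper, mult) are hypothetical. In R both versions return the same result. Under the S-Plus rules described above, the helper in the first version would not see mult in the parent function's frame, so only the second version is a reasonable candidate for portable code. The S-Plus behavior is stated from those rules, not tested here.

```r
# Relies on the R rule: helper() finds `mult` in the function that defined it.
scaleImplicit <- function(x) {
    mult <- 10
    helper <- function(v) v * mult
    helper(x)
}

# Depends on neither rule: everything helper() needs is passed as an argument.
scaleExplicit <- function(x) {
    mult <- 10
    helper <- function(v, mult) v * mult
    helper(x, mult)
}

r1 <- scaleImplicit(3)
r2 <- scaleExplicit(3)
```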

The consequences of the different approaches to databases show up more often in relative efficiencies than in incompatibilities. The main potential for incompatibility arises from running more than one S process in the same place (for example, the same working directory) simultaneously. Assignments made in one S-Plus process will be seen by the other; assignments made in one R process will not.

However, this is generally a bad way to work in either system, because it can lead to confusion if done interactively and to bad synchronization errors if done non-interactively.

The efficiency issues are easy to see, but not so easy to deal with in general. Because R holds all objects in memory, it is faster dealing with a number of relatively small objects, but can suffer if there are a moderate or large number of large objects. In principle, S-Plus can suffer from having to read data repeatedly from disk within one session (modern implementations can avoid some of this cost).

At the moment, we don't offer much help here. But at a future stage, using tools such as formal methods and classes (applied to databases) and inter-system interfaces, we may be able to give all S language programmers access to more flexible ways to deal with large amounts of data.

Recommendation: To use environments portably, set them explicitly, and avoid assumptions about inherited environments in R. Basically, keep the use of environments simple and explicit.

To avoid accidental incompatibilities: In R, avoid use of implicitly defined environments or closures. In S-Plus, don't define local functions that override global names, though usually you will get away with it.

Internal Object Structure

The important issues here deal with language objects, objects that represent functions and expressions in the S language itself. It's important in the design of the language that these are first-class objects that can be manipulated, in principle, just the same as any other objects. This allows some powerful techniques for manipulating models and other symbolic objects, for interactive programming tools, and many other applications (still under-utilized by most users).

The catch is that R and S-Plus have fundamentally different ways of representing these objects. Programmers need to avoid using the explicit representation when dealing with language objects. We will provide tools to help; these are currently under construction. The first step defines consistent methods to extract and replace elements of language objects in the two systems. (See function languageEl in the package.)

The original history of S (if you're interested, see the historical note at the end of this document) led to a vector-oriented approach to all objects; that is, all objects were vectors (one-way arrays). Lists and other recursive objects were special only in that the elements of the vector were themselves objects. When functions and expressions became first-class objects, they shared this form, differing only in the ``mode'', which was specific to the kind of language object (function, call, braced list of expressions, etc.). Chapter 11 of the blue book describes a model for the language based on this structure. Current S-Plus retains the same model.

R came partly from a background of the Lisp family of languages, so that recursive objects originally tended to be viewed as linked lists. The fundamental Lisp operations on lists extract the first element or the rest of the list, and form a new list by combining a first element and an existing list. R uses a vector-style form for data lists, but language objects are stored internally in linked list form.

Both implementations have all the tools necessary to manipulate language objects, but the tools are not compatible. In both, the most general approach is to turn language objects into S lists, manipulate the lists, and then turn them back into the desired language objects. But the resulting lists are different, different expressions are used to turn them back, and care has to be taken not to lose information along the way. (In R, the general approach is to use as.list to get the list and as.call to go back, whether or not the language object is conceptually a call. In S-Plus, the reverse of as.list is as.mode for the corresponding mode.)
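The R half of this round trip can be sketched as follows. The example is R only; the S-Plus equivalent would go back through the conversion function for the appropriate mode and is not shown. The expression and the names expr, parts and newExpr are hypothetical.

```r
# Start from an unevaluated call.
expr <- quote(mean(x, trim = 0.1))

# Turn the call into a list: the function name comes first, then the arguments.
parts <- as.list(expr)

# Manipulate the list: change the function and drop the named argument.
parts[[1]] <- as.name("median")
parts$trim <- NULL

# Turn the list back into a call.
newExpr <- as.call(parts)
```

The result is the call median(x), which can be evaluated or manipulated further like any other object.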

One approach to compatibility is to provide some tools at a more abstract level, which wouldn't be a bad idea anyway for readability.

Recommendation: Use the tools supplied as the base for explicit manipulations of the pieces of language objects. Right now, this is mainly languageEl to extract or replace elements; we could make this into methods for the [[ ]] operator, but at the moment we're holding off until the back-compatibility issues are a little clearer.


Historical Note: S Object Structure

The original version of S (circa 1976) was cobbled together from various Fortran-based pieces of software. Among these was a system for managing data as hierarchical structures on disk. (The application, interesting in light of later statistical analysis of such data, was some early data from testing semiconductors.)

The basic approach was hierarchical: data was stored with what we can call an S-style mode, a length and a sequence of values. For recursive objects, the values were themselves similarly defined data. It was all done at a very low level.

When it came time to represent S data internally, the natural approach was to follow a similar style, but with dynamically allocated arrays in Fortran. For this there was already some early software designed to allocate data temporarily as working storage for numerical computations (in what later became the Port library), by chopping up a large array in Fortran common block storage. By using indices into the block as ``pointers'', the S hierarchical structure from disk could be carried over into the internal representation.


John Chambers <jmc@research.bell-labs.com>
Last modified: Sat Mar 10 09:17:35 EST 2001