R/S-Database Interface (RS-DBI) definition proposal. $Id: RS-DBI.proposal,v 1.1 2000/02/08 00:12:54 dj Exp dj $ This is a first draft of a common set of classes and methods for interfacing R/S and relational databases (RDBMS). The idea is that a common interface will simplify data access to any RDBMS. The emphasis is on querying databases, and not so much in a full low-level interface for database development (like in JDBC or ODBC). Also, unlike the JDBC and Perl API's, we want to approach the problem from the "whole-object" perspective so natural to R/S and Python (e.g., by fetching all fields and records simultaneously) yet allowing finer control for users that need it. Currently we envision providing drivers for Oracle, Informix, mySQL,, and Postgres. It's quite possible that we may want to create an interface to UNIX (gzipped) flat file databases. A typical use may be Example: mgr <- dbManager("MySQL") con <- dbConnect(mgr, user = "user", passwd = "passwd") rs <- dbExecStatement(con, "select fld1, fld2, fld3 from MY_TABLE") tbls <- fetch(rs, n = 100) hasCompleted(tbls) [1] T close(rs) rs <- dbExecStatement(con, "select id_name, q25, q50 from liv2") res <- fetch(rs) getRowCount(rs) [1] 73 close(con) The exact same script should work with other RDBMS, (say, Oracle) by replacing the first line with mgr <- dbManager("Oracle") S-DBI Classes: These are the main classes for the interface. dbManager: Virtual class extended by actual database manager, e.g., Oracle, MySQL, Informix. dbConnection: Virtual class that captures a connection to a database instance. dbResult Virtual class that describes the result of an SQL operation dbResultSet: Virtual class, extends dbResult to fully describe the output of a SELECT (or SELECT-like) SQL statement. All these classes should implement the show(), describe(), and getMetaData() methods. show() (print in R) prints a one-line identifying the object. describe() prints a short summary of the meta-data of the specified object (like summary). getMetaData() takes an object of one of the above classes and a string specifying a facility and returns the meta-data associated with it (NULL for unrecognized facilities). The class dbManager may implement two getMetaData() methods: one for the signature that includes dbManager and string, and another for the signature that includes dbManager, dbObject, and string which simply invokes getMetaData() on the second object. Example: mgr <- dbManager("MySQL") getMetaData(mgr, "version") con <- dbConnect(mgr, ...) getMetadata(mgr, con, "type") # is it a TCP or a socket connection? # which is equivalent to getMetaData(con, "type") The reason we implement the meta-data in this way is to simplify the writing of drivers. We don't want to overwhelm the developers of drivers (ourselves, most likely) with hundreds of methods as in JDBC. In addition, the following methods should also be implemented: getDatabases() List all available databases known to the dbManager. getTables() List tables in db. getTableFields() Fields in a table in a database. getTableIndeces() indices defined for a table in a database. these methods may be implemented using the appropriate getMetaData() method above. Class: dbManager This is a virtual class that identifies the relational database management system (RDBMS). It needs to be extended by individual drivers (oracle, postgres, ...). The dbManager class defines the following methods. load() initialization of the driver code. We suggest having the generator, dbManager("driver"), automatically load() the driver. unload() releases whatever resources the driver is using. version() returns the version of the S-DBI currently implemented, plus any other relevant information about the implementation itself and the RDBMS being used. Class: dbConnection This virtual class captures a connection to a RDBMS. It provides access to dynamic SQL, result sets, RDBMS session management (transactions), etc. Note that the dbManager may allow multiple simultaneous dbConnections. dbConnect() opens a connection to the database "dbname". Other likely arguments include "host", "user", and "password". It returns an object that extends "dbConnection" in a driver-specific manner. Note that we could separate the steps of connecting to a RDBMS and opening a database there (i.e., opening an *instance*). I think that for simplicity we should do the 2 steps in this method. If the user needs to open another instance in the same RDBMS, just open a new connection. close() closes the connection (and discards all pending work) in the connection. dbExecStatement() submits one SQL statement. It returns a dbResult object. In the case of a SELECT statement, the object should be a dbResultSet (which extends dbResult); the dbResultSet object will be needed for fetching the actual rows. The result of a non-SELECT statement (e.g., UPDATE, DELETE, CREATE, ALTER, ...) is defined as the number of rows affected (this seems to be common in SQL). commit() commits pending transaction (optional). rollback() Undo current transaction (optional). callProc() Invokes a stored procedure in the RDBMS (tentative). Stored procedures are NOT part of the ANSI SQL standard and possibly vary a lot from one RDBMS to another. Oracle seems to have a fairly decent implementation. SQL Scripts: How should SQL scripts be run? We could execute statements without returning until we encounter a query (SELECT-like) statement and return its result set. The application is then responsible for fetching these rows, and then invoke "dbNextResultSet" which repeats the exec/fetching until we encounter the next query. And so on. The following 2 methods are the initial proposal: dbExec() Submit an SQL "script" (multiple statements). May be implemented by looping with dbExecStatement(). dbNextResultSet() When running SQL scripts (multiple statements), close the current result set in the dbConnection, execute the next statement and return its result set. [How about commitments and rollbacks for SQL scripts?] Class: dbResult This class describes the result of an SQL statement and the state of the operation. Data definition statements (e.g., CREATE, UPDATE, DELETE) set the "completed" state to 1, while SELECT statements to 0. Error conditions set this slot to a negative numbers. The dbResult class defines the following methods: getStatement() returns the SQL statement associated with the result set. getDBConnection() returns the dbConnection associated with the result set. getRowsAffected() returns the number of rows affected by the operation. hasCompleted() was the operation completed? SELECT's, for instance, are never completed and their output needs to be fetch()'ed. getException() extracts the last exception. Class: dbResultSet This class extends dbResult, and it describes additional information from the result of a SELECT statement and the state of the operation. The "completed" state is set to 0 so long as there are pending rows to fetch. The dbResultSet class defines the following methods: getFields() describes the SELECTed fields. The description includes field names, RDBMS internal data types, internal length, internal precision and scale, null flag (i.e., column allows NULL's), and corresponding S class (which can be over-ridden with user-provided classes). setDataMappings() defines a conversion between the internal RDBMS data types and R/S. We expect the default mappings to be by far the most common ones, but users that need more control may specify a class generator for individual fields in the result set. (See next section for details.) getRowCount() rows fetched so far (in the case of SELECT's result sets) getNullOk() returns a logical describing with fields accept NULL's. The current S-Oracle and S-MySQL implementations represent a dbResultSet as a list with the following members: connection -- the connection object associated with this result set; statement -- is a string with the SQL statement being processed; description -- a field description data.frame with as many rows as there are fields in the SELECT output, and columns specifying the "name", "type, "length", "precision", "scale", "Sclass" of the corresponding output field. rowsAffected -- how many rows were affected (non-SELECT statements); rowCount -- number of rows currently fetched; completed -- logical describing whether the operation has completed or not (SELECT statement return an uncompleted result set that is used by fetch(). When completed, "rowsAffected" will equal "rowCount". nullOk -- logical vector specifying whether the corresponding column may take NULL values. The methods above indicated are implemented as accessor functions in the obvious way. In addition to the above methods, dbResultSet objects support the following: close() Closes the result set and frees resources both in S and the RDBMS. fetch() Fetches the next "max.rec" records (-1 means all). S-DBI Data Type mappings: Since S and R have few primitive data types as compared to databases, by default we'll do the "obvious". Any of the many character objects are mapped to CHAR. Numbers are mapped into either doubles (numeric) or long (integer). Dates are mapped to character using the appropriate TO_CHAR() function in the RDBMS (which should take care of any locale information). Some RDBMS's support the type CURRENCY or MONEY which should be mapped to numeric. Large object (character, binary, file, etc. also need to be mapped). User-defined functions may be specified to do the actual conversion as follows: (1) run the query (either dbExec or dbExecStatement): rs <- dbExecStatement(con, "select whatever-You-need") (2) extract the output field definitions with flds <- getFields(rs) (3) replace the class generator in the, say 3rd field, by the user own generator: > flds[3, "Sclass"] # default mapping [1] "character" to flds[3, "Sclass"] <- "myOwnGeneratorFunction" (4) setDataMappings(resutlSet, flds) [TBD] Large objects (up to 4GB, at least in theory) could be fetched by repeatedly invoking a specified R/S function that takes as argument chunks of specified number of raw bytes. In the case of S4 (and Splus5.x) the S-DBI implementation can write into an opened connection for which the user has defined a reader (but can we guarantee that we won't overflow the connection?). In the case of R it is not clear what data type should BLOB's be mapped into. Open Issues: We may need to provide some additional utilities, for instance to convert dates, to strip blanks, etc. Should we provide hooks so this is done at the C level, or should the post-processing be done at the R/S level? What kind of data object is the output of a SQL query. We can implement this as a data.frame, a list, or some class that extends one of these. Perhaps jmc's dataTable? One (simple) issue with data frames is that they automatically re-label the fields according to S syntax, changing the actual RDBMS labels of the variables; the issue of non-numeric data being coerced into factors automatically "at the drop of a hat" (as someone in s-news wrote) is also annoying. What kind of error/exception handling should we implement? Should we print warnings from the C implementation, or return some kind of structure describing the state of the computation (e.g., if there are syntax errors we may want to report where the error occur). How should large object be handled? (See previous section.) Limitations: Currently we allow only one SQL statement at a time (thus, the user is forced to split SQL scripts into individual statements). Transaction management is not fully described. This interface between R/S and RDBMS is heavily biased towards queries, as opposed to general purpose database development. In particular we made no attempt to define "bind variables", that is a mechanism by which the contents S object are implicitly moved to the database during SQL execution. For instance, the following embedded C statement SELECT * from emp_table where emp_id = :sample_employee would take the array "sample_employee" and iterate over each of its elements to get the result. Perhaps S-DBI could at some point in the future do the same.