<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type="text/xsl" href="http://wwww.omegahat.org/XSL/myarticle.xsl" ?>
<!--<?xml-stylesheet type="text/xsl" href="../../../../org/omegahat/Docs/XSL/myarticle.xsl" ?> -->


<article xmlns:s="http://cm.bell-labs.com/stat/S4"
         xmlns:omegahat="http://www.omegahat.org"
  	 xmlns:sh="http://www.shell.org">

<abstract>

</abstract>


<section>
<title>Introduction</title>
The goal here is to provide an alternative to Rd, with a richer format,
for documenting R objects (functions, classes, data) as well as R packages,
tasks, sessions and demonstrations.  Importantly, we want the tools to be
extensible with additional markup or alternative
formatting.  There are various possible approaches.  XML is a natural
one as it is widely used and well designed for documents. Docbook is
one example of a system for technical documentation. MathML is also
available for mathematical markup.  And of course, we can also use an
XML structure containing <LaTeX/> content.  Additionally, we have
several tools already in R to deal with XML and XSL.

<para/>
We want to introduce more markup than currently exists in the Rd
format.  Previously, we tried to develop tools to do this.  However,
they were complicated by the need to parse the Rd format correctly and
in its entirety.  Extensibility is important, so simply continuing with
the Rd format is not enough.  While most users will use the standard tools, a good
flexible infrastructure will encourage others to develop new tools and
push the envelope of what is currently done.  Hopefully, more research
will be fostered by providing tools that allow documentation and
authoring to be done in new ways in the narrow domain of statistical
computing.

<para/>


There are three primary target formats for our documentation of R objects:
<itemizedlist>
<listitem>
text for use within the console
</listitem>
<listitem>
HTML (and compiled HTML)
</listitem>
<listitem>
PDF
</listitem>
</itemizedlist>

</section>



<section>
<title>A Model</title>

The idea is quite simple. We should be able to have documentation for
R objects in a variety of forms and we should be able to move between
them relatively easily.  Rd<footnote><para>We include <LaTeX/> here as
Rd is a restricted form of an extended <LaTeX/>.</para></footnote>
and XML are two obvious rich formats.  HTML, raw text and PDF are not
rich formats: we cannot read them back into R and recover the information
about the documentation objects.  So these are output or target
formats, while Rd and XML are input formats.

<para/>
To move between Rd and XML, we can write appropriate filters.  For
instance, we can write an XSL file that takes an XML file and creates
the corresponding Rd file.  In this way, we can use the same tools as
we currently have, albeit with rendering for a potentially reduced set
of markup elements.  An alternative and perhaps more natural mechanism
is to use R as the language.  We can develop input filters that
transform an XML object into an R object.  And then we can convert
that object into one of the different formats (or back to XML).
Additionally, for help within the R console session, we can use the
help object to dynamically generate the displayed documentation.
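
<para/>
As a minimal sketch of the last step, converting an R help object to Rd
text might look like the following.  The list layout and the function
name <code>toRd</code> are assumptions made for illustration, not an
existing API, and only three simple sections are handled.
<![CDATA[
```r
# A help object as a plain list; the field names are assumptions.
helpObj <- list(
  name = "x11",
  title = "Open a graphics device",
  description = "Produces a new graphics device."
)

# Render each simple section as Rd markup of the form \field{text}.
# Note: this sketch does not escape Rd special characters such as % or \.
toRd <- function(doc, fields = c("name", "title", "description")) {
  lines <- vapply(fields,
                  function(f) sprintf("\\%s{%s}", f, doc[[f]]),
                  character(1))
  paste(lines, collapse = "\n")
}

cat(toRd(helpObj), "\n")
```
]]>
The same object could be passed to other converters (HTML, text for the
console, or back to XML) without re-reading the source file.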

<para/>

The basic structure of a help document is a collection of expected
sections/stanzas each of which either has further structure or is made
up of free-form marked-up text.  This is true for at least the Rd or
XML formats, and probably for others also.  When defining the classes,
we want to allow the specification of the expected elements, and also
allow for new "named" but unexpected elements to be introduced.

<para/>

So we start by defining text sequences with markup.
A parser will typically break the text up based on the markup.
For example, an XML segment such as
<![CDATA[
 The function <r:func>x11</r:func> produces a new 
   <r:concept>graphics device</r:concept>.
]]>
would be available as 5 elements:
<itemizedlist>
<listitem> (text) "The function "
</listitem>
<listitem> (XML node) r:func with the child "x11"
</listitem>
<listitem> (text) " produces a new "
</listitem>
<listitem> (XML node) r:concept with the child "graphics device"
</listitem>
<listitem> (text) "."
</listitem>
</itemizedlist>

We can of course collapse this list back to a single string, but then
we lose the information about the markup.  If we wanted to find all
references to the function x11, we would have to parse the string again.
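
<para/>
The decomposition above can be sketched in R, assuming the XML package
is installed.  The namespace URI bound to the r prefix below is a
placeholder chosen for this illustration; the original fragment does not
specify one.
<![CDATA[
```r
library(XML)  # assumption: the XML package is installed

# The namespace URI bound to the r prefix is a placeholder for illustration.
rns <- c(r = "http://www.r-project.org")
txt <- paste0('<para xmlns:r="', rns[["r"]], '">',
              'The function <r:func>x11</r:func> produces a new ',
              '<r:concept>graphics device</r:concept>.</para>')
doc <- xmlParse(txt, asText = TRUE)

# The children of the root alternate between text nodes and element nodes.
kids <- xmlChildren(xmlRoot(doc))
isText <- vapply(kids, inherits, logical(1),
                 what = c("XMLInternalTextNode", "XMLTextNode"))
for (i in seq_along(kids)) {
  kind <- if (isText[i]) "(text)"
          else paste0("(XML node) ", xmlName(kids[[i]], full = TRUE))
  cat(kind, ": \"", xmlValue(kids[[i]]), "\"\n", sep = "")
}

# Because the markup is retained, references to x11 can be located
# directly, without re-parsing a collapsed string.
refs <- getNodeSet(doc, "//r:func[text() = 'x11']", namespaces = rns)
length(refs)
```
]]>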

<para/>

So marked-up text is a list whose elements are either raw text or
marked-up elements.  In the case of XML, the marked-up elements are
<s:class>XMLNode</s:class> elements.  <note>We need an abstract class
for these to handle C-level/internal and R-level XML node
objects.</note>
This identifies three concepts/classes: MarkedUpText, Text and MarkedUpElement.
(The names are not important at this point.)
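
<para/>
One possible rendering of these three concepts as S4 classes is sketched
below.  The class <code>SimpleElement</code> and the helper
<code>collapse</code> are hypothetical and exist only to make the
example self-contained; in practice the XML node classes would extend
the virtual class instead.
<![CDATA[
```r
library(methods)

# Raw text is just a character string.
setClass("Text", contains = "character")

# Virtual class so that different node representations (e.g. C-level and
# R-level XML nodes) can be treated uniformly.
setClass("MarkedUpElement", representation("VIRTUAL"))

# Hypothetical concrete element used only for this illustration.
setClass("SimpleElement",
         representation(name = "character", content = "character"),
         contains = "MarkedUpElement")

# Marked-up text is a list whose elements are raw text or marked-up elements.
setClass("MarkedUpText", contains = "list",
         validity = function(object) {
           ok <- vapply(object@.Data,
                        function(el) is(el, "Text") || is(el, "MarkedUpElement"),
                        logical(1))
           if (all(ok)) TRUE else "elements must be Text or MarkedUpElement"
         })

mt <- new("MarkedUpText", list(
  new("Text", "The function "),
  new("SimpleElement", name = "r:func", content = "x11"),
  new("Text", " produces a new "),
  new("SimpleElement", name = "r:concept", content = "graphics device"),
  new("Text", ".")
))

# Collapsing to a single string discards the markup.
collapse <- function(x) {
  paste(vapply(x@.Data,
               function(el) if (is(el, "Text")) el@.Data else el@content,
               character(1)),
        collapse = "")
}
collapse(mt)
```
]]>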

<para/>


As mentioned earlier, the upper levels of the document typically have
more structure and are elements of a hierarchy.




<para/>
A document may contain help for multiple "topics". This might be in
the current Rd format where we have aliases.  Alternatively, we may
have multiple documentation elements, e.g. help for one function,
followed by help for another function.  Also, it is convenient to be
able to share common elements of documentation descriptions within a
file.  For example, suppose we have a file documenting two
functions. The author, references and keyword elements may be the same
and we may explicitly want them to be the same by contract,
i.e. if one changes, the other also changes.
Then the author element in the second description might
refer to a node in the previous description within the file.
Alternatively, the two descriptions might link to a common 
node, either in the document or in a different document.





<section>
<title>Generating the templates</title>

It would be nice to have a semi-automated mechanism for creating the
template of a document.  If we had a class definition for, say,
function documentation, and if this were very strongly typed, then
we could serialize the R object using the slots in the class.  For
this we want virtual classes that give us the style of an interface so
that we can maintain extensibility.  <note>We have to determine how to
get extensibility in an XML schema for validation. Derived
types are feasible.</note>

</section>

<section>
<title>To R Objects</title>

The usage section of the Rd object probably arose from the fact that
one manipulated the documentation outside of R, i.e. in Perl scripts.
As such, the document must describe all aspects of the function
without being able to go back to the definition.  When we have an
object in R representing the documentation for a function, we should
be able to locate the actual object to which the documentation
corresponds, i.e. the function.  In that case, we don't need to insert
the usage section into the documentation object.

<para/>

However, when we serialize the object to an editable form (i.e. a file
the user edits using an external application), then the author might
want to provide a modified usage, e.g. to format it more
appropriately or to modify the links within it.  In that case, we must
recognize that the content has changed from the default and store
that.  Otherwise, we should maintain it as a dynamically generated
value. This allows us to regenerate it correctly when the definition
changes.  If the user has explicitly supplied a version of that field,
then we must maintain it and potentially show them both when providing
the material for editing.
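
<para/>
The distinction between a dynamically generated usage and one supplied
by the author can be sketched as follows.  The list layout of the
documentation object and the function names <code>defaultUsage</code>
and <code>usageFor</code> are assumptions for illustration only.
<![CDATA[
```r
# Build a default usage string from the function definition itself.
defaultUsage <- function(name, fun) {
  sig <- deparse(args(fun))                 # the last line is the NULL body
  sig <- paste(trimws(sig[-length(sig)]), collapse = " ")
  sub("^function ", name, sig)
}

# A documentation object as a list (layout assumed for illustration);
# usage = NULL means "generate from the current definition".
usageFor <- function(doc) {
  if (is.null(doc$usage)) defaultUsage(doc$name, doc$fun) else doc$usage
}

f <- function(x, y = 2) x + y
dynamic <- list(name = "f", fun = f, usage = NULL)
edited  <- list(name = "f", fun = f,
                usage = "f(x, y = 2)  # y is an offset")

usageFor(dynamic)   # regenerated from the definition of f
usageFor(edited)    # the author's explicit version is kept
```
]]>
When presenting the material for editing, both values could be shown
side by side by calling <code>defaultUsage</code> directly in addition
to <code>usageFor</code>.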


<para/>

Given the documentation object in R, we can dynamically perform
validation on the documentation object, e.g. comparing the names of the
parameters and those documented in the relevant arguments section of
the documentation.  If we use conventions for where the documentation
is stored (i.e.  which environment, list and the name by which it is
identified), then when a new definition of an object is assigned, we
can validate the documentation and using a simple GUI, we can flag the
ones that need to be updated.  This would be more helpful than using 
<sh:command>R CMD check</sh:command> at regular intervals. Instead, it would encourage developers
to update the documentation when changes are made but not require it,
e.g. in cases where potentially transient changes are being performed
experimentally.
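
<para/>
The parameter-name check described above can be sketched in a few lines
of R.  The function name <code>checkArgDocs</code> and the form of its
result are hypothetical; how the documented names are obtained from the
documentation object is left open here.
<![CDATA[
```r
# Compare the formal arguments of a function with the documented names.
checkArgDocs <- function(fun, documented) {
  actual <- names(formals(fun))
  list(undocumented = setdiff(actual, documented),
       stale = setdiff(documented, actual))
}

area <- function(width, height, units = "cm") width * height
res <- checkArgDocs(area, documented = c("width", "length"))
res$undocumented   # parameters missing from the documentation
res$stale          # documented names that are no longer parameters
```
]]>
A GUI could run such a check whenever a new definition is assigned and
flag the result rather than signal an error.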


</section>
</section>

<section>
<title>Software Tools</title>

<section>
<title>Authoring Documents</title>
I use Emacs and nXML and flyspell mode to write XML documents.
XML mode for editing SGML documents is also a possibility.

<para/>
XMLSpy is a commercial tool for editing XML.

<para/>
Using the RDCOMClient and RDCOMEvents packages,
we can develop tools to write templates
of documentation directly to Word
and read them back to R again.
Additionally, we can provide XSL filters for
Word to save documents in appropriate form
and to convert them for rendering in Word.

<para/>
And we can develop simpler GUIs within
R to allow editing of documentation.
We can have simple GUI tools for this 
to provide information about 
objects that are within the package
of the object being documented.
</section>

<section><title>XSL Processors</title>
xsltproc is a good XSL processor.
<ulink url="http://xml.apache.org/xalan-j">Xalan</ulink> is another.

<para/>

By providing the XSL files locally, we can display the XML files
directly in HTML browsers such as Firefox and IE.
<note>We need to deal with security limitations of referencing
XSL files from other domains. We cannot use catalogs here.</note>

</section>

<section>
<title>PDF</title>

<ulink url="http://xmlgraphics.apache.org/fop/">FOP</ulink> from Apache can be used to generate PDF.
</section>


</section>


<section>
<title>Notes on XML: Writing XML</title>
<itemizedlist>
<listitem>
When including code (R, C, etc.) within
a document, it is often simplest to enclose it within
a CDATA block (&lt;![CDATA[.....]]&gt;).  This ensures that 
special characters such as &lt; and &gt; are not
interpreted by the XML parser. Alternatively, one can use entities.
</listitem>

<listitem>
When including code via CDATA constructs, put the CDATA
on the same line as the enclosing XML element,
e.g.
<code>
&lt;output&gt;&lt;![CDATA[
  text
 ]]&gt;&lt;/output&gt;
</code>
This will avoid extra spaces before and after the text.
</listitem>

<listitem>
Attribute values within elements must be enclosed in quotes.
</listitem>


<listitem>
It is important to match the XML/XSL namespaces correctly.
Make certain to use exactly the same URI in each namespace declaration.
</listitem>


<listitem>
We can use a default namespace or explicitly qualify each element with a prefix such as <ns>rhelp</ns>.
To make rhelp, say, the default namespace for all elements,
you can use the idiom
<![CDATA[
 <function xmlns="http://www.r-project.org/rhelp">
   ...
 </function>
]]>
The nodes descending from function will then all have this namespace.
If you need to refer to an element from a different namespace,
e.g. func, then you need to qualify this explicitly, i.e.
&lt;r:func&gt;.
 
</listitem>
</itemizedlist>
</section>

<section>
<title>Running XSLT</title>

 Use catalogs to locate the XSL files locally.  The files we provide
 refer to other XSL files via a URI, e.g. http://www.omegahat.org/XSL.
 <sh:command>xsltproc</sh:command> will go to the Web and load the
 contents of the URI. However, this is time consuming and of course
 requires that one has network access.  Using a catalog file, one can
 have xsltproc rewrite these URIs to refer to local files and then
 avoid the network download.

</section>



<section>
<title>Possible Enhancements</title>
<itemizedlist>
<listitem>
The examples section currently is one large
block of code used to manage
multiple, separate examples.
Each example has (optionally) some comments,
initialization code, cleanup code.
The expected output is not shown.
The user must run the example to see it.
</listitem>

<listitem>
Using an R object to represent the documentation, 
the programmer can modify the documentation directly
within R, either manually or programmatically.
She can add annotations to it also to track 
thoughts, modifications, etc.
</listitem>
<listitem>
Type information can be synchronized from the documentation
or from the function or class. This can be bi-directional.
</listitem>


<listitem>
Using <omegahat:package>RGtkHTML</omegahat:package> or other suitable
GUI tools connected to R such as Internet Explorer via <omegahat:package>RDCOMEvents</omegahat:package>, 
we can display the rendered documentation within R and 
interactively run examples, resolve links, etc.
</listitem>

<listitem>
We can differentiate between meaning and appearance by introducing
markup for different concepts. 
This might seem like a small point, but it turns out to be useful.
Of course, we can do this in the Rd format also. But we haven't.
We do not differentiate between 
\\code{x} and \\arg{x}. When modifying the documentation,
this is important. The reader has to look up
the name in the parameter list to see if it is a reference
to something local or not.
</listitem>

<listitem>
Generation of the rendered target can 
be extended to provide ways to dynamically
follow links to other documentation.
We can introduce new rendering facilities via "simple"
XSL specifications.
This includes introducing new namespaces, such as omegahat, bioconductor,
etc., and then defining different rules for these.
</listitem>


<listitem>
We have the ability to put in conditionally included material.
For example, we can have material that is not rendered
such as implementation notes.
We can also have documentation in different languages
within the same document.
We can tag these in different, flexible  ways
that are specified by XSL manipulations.
</listitem>

<listitem>
Using documentation objects
and parseable documentation,
we can allow arbitrary
objects to be documented,
and not just the ones that
are in packages.
We would like to encourage users
to document reused objects, whether
they are in packages or not.
And we would like to facilitate the 
evolution from writing commands
to writing functions to writing
packages. This is a step in
making this easier.
<para/>

We would also like to allow the development of functions and packages
within an XML document that could be read into R.  This is already
feasible using XML tools but needs to be improved.  The
documentation can then live within the document, alongside the function
definition, in the spirit of literate programming.

</listitem>

</itemizedlist>

</section>


<acknowledgments>
Robert Gentleman, Byron Ellis
</acknowledgments>

<bibliography>
<title>Bibliography</title>

<biblioentry xreflabel="OfficeXML">
 <authorgroup>
 <author><firstname>Simon</firstname><surname>St. Laurent</surname></author>
  <author><firstname>Evan</firstname><surname>Lenz</surname></author>
  <author><firstname>Mary</firstname><surname>McRae</surname></author>
 </authorgroup>
 <title>Office 2003 XML
 </title>
 <publisher>O'Reilly and Associates</publisher>
</biblioentry>

<biblioentry xreflabel="Docbook">
 <authorgroup>
    <author><firstname>Norman</firstname><surname>Walsh</surname></author>
    <author><firstname>Leonard</firstname><surname>Muellner</surname></author>
 </authorgroup>
 <title>Docbook</title>
 <subtitle>The Definitive Guide</subtitle>
 <publisher>O'Reilly and Associates</publisher>

</biblioentry>


<biblioentry xreflabel="RExtensionsManual">
 <author>
  <firstname>
  </firstname>
  <surname>
  The R Core Group
  </surname>
 </author>
 <title>Writing R Extensions
 </title>
 <isbn>3-900051-11-9</isbn>
</biblioentry>
</bibliography>

</article>