Todo list for R-level XML Parser

  • There appears to be an oddity on Solaris with event-driven parsing.
          source("dataFrameHandler.R")
          z <- xmlEventParse("../DTDs/Examples/mtcars.xml", handler=handler())
    
    This produces an incorrect number of elements in the third record: the value 22.8 is read as 2 and then 2.8. Removing some of the spaces before the 22.8 at the beginning of the record makes the problem go away. Need to investigate further.
  • Develop DTDs for basic types.
  • Additional chapter/package for writing XML.
    Handle standard types such as data frames, time series, factors, graphics/plots, etc.

    One can simply cat() or paste() the output, but we can do more to ensure that documents are well-formed and valid relative to a DTD. Have a filter that knows which DTD, or collection of DTDs, to use and how to ensure that individual calls do the correct thing in context. So basically keep a cursor.
    DTDs can already be read within this package, and the filter can be built from that. See Writing XML.
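
    The "keep a cursor" idea can be sketched in base R as follows. This is only an illustration under stated assumptions, not the package's writer: xmlWriterSketch() and its openTag/text/closeTag/value functions are hypothetical names, and the sketch checks element nesting only, not validity against a DTD.

```r
# Hypothetical writer that tracks the open elements (the "cursor") in a closure.
xmlWriterSketch <- function() {
  out   <- character()   # accumulated output fragments
  stack <- character()   # names of the currently open elements

  # Escape the characters that would break well-formedness in text content.
  escape <- function(x) {
    x <- gsub("&", "&amp;", x, fixed = TRUE)   # must be done first
    x <- gsub("<", "&lt;", x, fixed = TRUE)
    gsub(">", "&gt;", x, fixed = TRUE)
  }

  list(
    openTag = function(name) {
      out   <<- c(out, paste0("<", name, ">"))
      stack <<- c(stack, name)
      invisible(NULL)
    },
    text = function(x) {
      if (length(stack) == 0) stop("text outside of the root element")
      out <<- c(out, escape(as.character(x)))
      invisible(NULL)
    },
    closeTag = function() {
      if (length(stack) == 0) stop("no open element to close")
      # The cursor knows which element is open, so the caller cannot mismatch tags.
      out   <<- c(out, paste0("</", stack[length(stack)], ">"))
      stack <<- stack[-length(stack)]
      invisible(NULL)
    },
    value = function() {
      if (length(stack) > 0)
        stop("unclosed element(s): ", paste(stack, collapse = ", "))
      paste(out, collapse = "")
    }
  )
}

w <- xmlWriterSketch()
w$openTag("record")
w$openTag("mpg")
w$text(22.8)
w$closeTag()
w$closeTag()
doc <- w$value()
```

    Because closeTag() takes no name, mismatched end tags cannot be written, and value() refuses to return a document that still has open elements. A DTD-aware filter would add a check in openTag() against the content model of the element at the top of the stack.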

  • Facility for dynamically modifying the user-level handler functions for a parser from the body of one of these handlers.
    For example, the document may contain its own functions for a particular language and we would see these in the preamble and switch to using them.
  • Add a facility for stopping the parse midway, via a call to stop() or something similar, without it being treated as an error.
    Exceptions may work when Robert finishes these.
  • We can make this significantly more class-based, i.e., object-oriented.
  • Process external entities.
    These are not currently being seen by the event mechanism. Probably a switch needs to be turned on.
    Fixed now!
    At present, internal references are substituted directly. See test.xml in the Docs directory.
          h <- .Call("R_XMLParse", "Docs/test.xml", xmlHandler(), FALSE, FALSE)
    See replaceEntities in xmlTreeParse().
  • We could kill off the children element in a node if there aren't any.
  • Add [ and [<- methods for the different types of nodes, and also functions such as those in the W3C spec for nodes: getElementsByTagName, etc.
  • Also add a [[ method for accessing children, avoiding the need for $children[[]].
  • Could kill off the attributes and/or children for certain node types, such as comment and text nodes.
  • Handle the namespaces.
    Done, for libxml. Added a field to the XMLNode.
  • Support S, at least for the document/tree parser without the callbacks.
    The callbacks require the driver mechanism used in the CORBA and Java interfaces to provide mutable state.
    All done, except mutable state. See the interface drivers in S4.
  • Add the contextual information to the function calls.
    Depth, last node, node path, etc.
    Done.
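
    As an illustration of what contextual information such as depth and node path might look like from a handler's point of view, here is a minimal base R sketch. The node() constructor, the walk() function and the handler signature (node, depth, path) are all assumptions made for this example; they are not the parser's actual calling convention.

```r
# Hypothetical node representation: list(name = , children = list(...)).
node <- function(name, ...) list(name = name, children = list(...))

# Walk the tree, calling handler(node, depth, path) for every node.
# depth is 0 for the root; path holds the element names from the root down.
walk <- function(x, handler, depth = 0L, path = character()) {
  path <- c(path, x$name)
  handler(x, depth, path)
  for (child in x$children) walk(child, handler, depth + 1L, path)
  invisible(NULL)
}

tree <- node("dataset", node("record", node("mpg"), node("cyl")))

# Example handler: record "depth:path" for each node visited.
seen <- character()
walk(tree, function(n, depth, path) {
  seen <<- c(seen, sprintf("%d:%s", depth, paste(path, collapse = "/")))
})
```

    With the path available, a handler can treat an element differently depending on where it occurs, for example an mpg element inside a record versus one elsewhere.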

  • Allow XML text to be specified rather than treating it as a file.
    Done for libxml parser. Done for Expat.
  • Call the user level functions in the document parser.
    Done.
    If a handler returns NULL, the node is removed from the tree (or, more precisely, never added).
    Pass in additional information.
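
    The NULL convention described above can be sketched in base R as follows. This is an illustration only: node() and applyHandlers() are hypothetical names, the handlers are looked up by element name as an assumption, and the tree is rebuilt after the fact rather than during parsing as the document parser would do.

```r
# Hypothetical node representation: list(name = , children = list(...)).
node <- function(name, ...) list(name = name, children = list(...))

# Rebuild a tree bottom-up, passing each node to the handler named after it.
applyHandlers <- function(x, handlers) {
  kids <- lapply(x$children, applyHandlers, handlers = handlers)
  # Children for which a handler returned NULL are simply not added.
  x$children <- Filter(Negate(is.null), kids)
  h <- handlers[[x$name]]
  if (is.null(h)) x else h(x)   # nodes without a handler are kept unchanged
}

tree <- node("doc", node("comment"), node("para"), node("comment"))
handlers <- list(comment = function(n) NULL)   # discard every comment node

result <- applyHandlers(tree, handlers)
childNames <- vapply(result$children, function(n) n$name, character(1))
```

    Here both comment nodes are dropped and only the para child remains. A handler that returns a modified node instead of NULL would have that node added in place of the original.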

  • Duncan Temple Lang <duncan@research.bell-labs.com>
    Last modified: Wed Dec 22 12:14:56 EST 1999