
Sablotron 0.44

Tomas Kaiser (Ginger Alliance)

September 13, 2000

Abstract

This is a description of the current version of the XSLT processor called Sablotron, including an overview of its limitations as compared to the XSLT specification.

Contents

 1  This text
 2  Changes from the last release
 3  Introduction
 4  The sources
 5  Implementation. Supported instructions and functions
 6  Other implementation-related notes
 7  The C interface
 8  The command line interface
 9  References

1  This text

The HTML form of this description was compiled by Sablotron from the XML source Sablot-0-44.xml.

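The following sections cover the changes from the last release, an introduction to XSLT and Sablotron, the sources, the implementation status, the C interface and the command line interface.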

2  Changes from the last release

Please see the RELEASE file.

3  Introduction

3.1  XSLT

XSLT is a language for transforming given XML data (the input) according to a stylesheet. XSLT stylesheets are themselves XML documents; that is, all instructions of the language are expressed in the form of XML elements. The output, i.e. the result of the processing, is typically an XML document as well, although the syntactic requirements can be relaxed to allow the creation of an HTML document (one that contains unclosed tags and the like), or even plain text.

XSLT was designed by the World Wide Web Consortium (W3C) as a part of the XSL stylesheet language, where it is complemented by a powerful set of formatting instructions. The most precise information about XSLT can be found in the W3C Recommendation [XSLT]. In particular, Appendix B of the Recommendation contains a handy syntax table. A good tutorial is [XMLBible14].

Other W3C Recommendations one often needs to consult are [XML] (for the definition of the XML language) and [XPath] (for details on XPath, the language used to form expressions in XSLT and elsewhere).

An excellent source of information about XSLT (indeed, about anything related to XML and SGML) is [Cover]; see also [XSLINFO] and [XMLorg].

3.2  On Sablotron

Sablotron is an XSLT processor (though not yet fully conforming; see below) written in C++. Since the machines where it is meant to run include various small mobile clients, the main objectives of its design are the following:

  • portability,
  • compact code,
  • as much independence from other resources (Java etc.) as possible.

Sablotron is a single shared library (sablot.dll or libsablot.so.0.44). It can also be used from the command line via the simple interface called sabcmd. See section 8 for more information.

The only other files you will need are the two shared libraries that make up expat, the XML parser by James Clark. Their Windows names are xmlparse.dll and xmltok.dll; on Linux, they are libxmlparse.so.1.0 and libxmltok.so.1.0.

For information on the available interfaces to Python (experimental) and Perl, see www.gingerall.com.

4  The sources

Sablotron is written in C++. The source files compile under Win32 (using MS Visual C++ 4.2) and on Solaris and Linux (using g++ 2.95.2) without change.

4.1  Getting the sources

The source or binary distributions of Sablotron can be downloaded from www.gingerall.com. For instructions on how to build the sources (if any), refer to the accompanying INSTALL file.

If you have access to the Ginger Alliance CVS server, you can get the working version of Sablotron in the CVS module ga.

In case you wish to get the latest source files, but have no access to the CVS server, please contact the authors.

The source distribution of expat is included with Sablotron. For more information on expat, check James Clark's page.

4.2  Joining the development

Sablotron is an open source project and all volunteers are most welcome! The documentation of the sources is still somewhat sparse but we will try to improve it. If you find the invitation to work on Sablotron with us interesting, please contact the authors. There is also a mailing list available; see www.gingerall.com.

5  Implementation. Supported instructions and functions

The instruction set supported by this version of Sablotron is already sufficient for many transformation tasks (e.g. the task of formatting this document). On the other hand, comparing it to the XSLT specification [XSLT] shows that much is still to be done. The purpose of the following sections is to describe the varying degree of support for the elements of the XSLT language.

It may be helpful to refer to the syntax table in Appendix B of [XSLT]. Any instruction or attribute that is not listed as unsupported can be expected to be implemented. The authors will appreciate being told about any omissions found in the following description.

For readability, I sometimes omit the xsl: prefix from the instruction names.

5.1  Templates

template, apply-templates, call-template

Implemented. Lacking features:

  • xsl:sort inside apply-templates is not supported

Predicates in match patterns are supported since release 0.42.

5.2  Conditional processing

if, choose, when, otherwise

Fully implemented.

5.3  Loops

for-each

Implemented except for sorting with xsl:sort.

5.4  Variables and parameters

variable, param, with-param

Fully implemented. Top-level variables and parameters are read in the document order, so no forward references are resolved. This is a harmless deviation from the spec.

5.5  Element creation

element, attribute, text, comment, processing-instruction, attribute-set

xsl:attribute-set is not implemented. For the rest, name is the only recognized attribute (where applicable). Literal result elements work.

5.6  Global definitions

stylesheet, transform, output

For stylesheet and transform, the only recognized attribute is version. xsl:output should work (see below for notes on the encoding attribute).

5.7  Values and copying

value-of, copy, copy-of

copy-of and value-of are fully implemented. copy is implemented except for the use-attribute-sets attribute.

5.8  Namespace processing

namespace-alias

Namespaces should be processed correctly. The namespace-alias instruction is now supported (patch by Major).

5.9  Sorting

sort

xsl:sort is not implemented yet. Contexts are sorted in document order as prescribed by the specification.

5.10  Whitespace stripping

strip-space, preserve-space

Only the default whitespace stripping is done. That is, all whitespace-only text nodes in any stylesheet, not appearing inside an xsl:text, are removed. The two instructions for whitespace stripping and preservation are unsupported.

5.11  Includes

include, import, apply-imports

Only xsl:include is implemented. Processing involving multiple documents works, but needs more testing, e.g. with respect to generate-id().

5.12  Other unimplemented instructions

  • xsl:key,
  • xsl:number,
  • xsl:fallback.

5.13  Output conformance

The output mechanism is much closer to the spec than in the versions prior to 0.4. The following issues remain for the html method:

  • Output the boolean attributes correctly.
  • Disable the escaping inside <SCRIPT> and <STYLE>.

5.14  XPath expressions

Almost all features of XPath are fully implemented. This means there should be no problems with expressions of any kind.

One exception relates to axes. The following 3 axes are not implemented yet: following, preceding, namespace.

Another possible exception is numbers; we have not yet done a thorough test of rounding, NaNs, infinity, etc.

5.15  Built-in functions

The implementation of the standard function library has been substantially extended. Only a few functions remain unimplemented:

  • id(),
  • lang() (accepted but always returns true),
  • key(),
  • format-number(),
  • current(),
  • unparsed-entity-uri().

As for the functions that are implemented, the following is a list of differences from the spec:

  • document() only accepts one argument, always getting the base URI from the stylesheet URI.
  • string-length() returns the byte length of the UTF-8 representation of the string. This differs from the number of characters whenever the string contains non-ASCII characters.
  • generate-id() might fail to generate unique identifiers when several input documents are present (giving the same id to nodes from different documents).
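
The string-length() difference can be illustrated outside Sablotron. The following is a minimal sketch, not Sablotron code: the helper name utf8_char_count is hypothetical, and the input is assumed to be valid UTF-8. It counts characters the way XPath defines string length, while strlen() gives the byte length that this version reports.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts the characters (Unicode code points) in a
   null-terminated UTF-8 string. Continuation bytes always have the bit
   pattern 10xxxxxx, so every byte that does not match that pattern
   starts a new character. Assumes the input is valid UTF-8. */
size_t utf8_char_count(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For the string "caf\xC3\xA9" (the word "café" in UTF-8), strlen() yields 5 while utf8_char_count() yields 4; for pure ASCII input the two agree.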

6  Other implementation-related notes

6.1  Handlers

It is possible for the user to supply the following handlers to Sablotron:

  • message handler (to bypass the default way of displaying error and warning messages and logging),
  • scheme handler (to retrieve documents whose URIs use an unsupported scheme),
  • streaming handler (an expat-like interface to the XML document which is the result of the processing),
  • 'miscellaneous' handler (which will probably serve as a collection of odd callbacks).

The handlers are set using SablotRegHandler(). For details concerning the interface of these handlers, consult the header files sablot.h and shandler.h.

6.2  Encodings

Thanks to the abilities of expat, Sablotron accepts input encoded as UTF-8, UTF-16, ISO-8859-1 or US-ASCII. In addition, windows-1250 is recognized. A document can specify its encoding, as usual, in the encoding attribute of the XML declaration, e.g.

<?xml version='1.0' encoding='windows-1250'?>

The output is in UTF-8 by default. If the iconv library is present on the system (it seems to be a standard part of glibc2, and a Win32 implementation is available), then any of the above encodings can be specified for output, using the 'encoding' attribute of xsl:output. Note that this new feature [by Sven Neumann] still needs testing.

6.3  URIs

Sablotron can handle two URI schemes natively: 'file' and 'arg' (see below). Moreover, it is possible to use the function SablotRegSchemeHandler to register an external scheme handler which will receive requests in all other schemes. See the documentation in sablot.h and shandler.h.

Relative URI references are resolved in conformance with RFC 2396. The base URI is well defined when the relative reference appears inside an XML document; when invoking sabcmd, the base URI is taken to correspond to the current working directory.

When specifying filenames, the following rules are in effect:

  • specify the "file:" scheme for any standard files, i.e. refer to stdin as file://stdin etc.
  • slashes and backslashes work equally well, in Windows as well as Linux.
  • to include a drive letter under Windows (e.g. C:\doc.xml), it is necessary to say file://c:/doc.xml.

6.4  Named buffers

Sablotron introduces a URI scheme 'arg:' which enables one to use strings in named memory buffers. The buffer names can have a tree-like structure so that a relative reference from a document in a buffer can be resolved as pointing to another buffer.

For instance, if we invoke Sablotron specifying that a buffer named /mybuf/1 contains the string "<a>contents</a>", then the expression

document('arg:/mybuf/1')/a

has string-value "contents". If the document in arg:/mybuf/1 contained a relative URI reference "../theirbuf/2" then this would be resolved as pointing to "arg:/theirbuf/2".
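
The resolution of such a reference can be sketched as follows. This is a simplified illustration of the RFC 2396 path-merging steps, not Sablotron's actual resolver: the function name resolve_arg_ref is hypothetical, and the sketch only handles bases of the form "scheme:/path" and plain path references (no query, fragment or authority part).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: resolves the path reference `rel` against the
   absolute URI `base` and stores the result in `out`.
   Returns 0 on success, -1 on unsupported input or lack of space. */
int resolve_arg_ref(const char *base, const char *rel, char *out, size_t outsize)
{
    char merged[512];
    const char *colon, *p;
    size_t scheme_len, dir_len, n;
    int ends_in_dot = 0;

    assert(base != NULL && rel != NULL && out != NULL);
    colon = strchr(base, ':');
    if (colon == NULL || colon[1] != '/')
        return -1;
    scheme_len = (size_t)(colon - base) + 1;          /* e.g. "arg:" */

    /* Step 1: drop the last segment of the base path ("/mybuf/1" becomes
       "/mybuf/") and append the reference. A reference that starts with
       a slash replaces the base path entirely. */
    dir_len = (rel[0] == '/') ? 0 : (size_t)(strrchr(colon, '/') - colon);
    if (dir_len + strlen(rel) + 1 > sizeof merged || scheme_len + 2 > outsize)
        return -1;
    memcpy(merged, colon + 1, dir_len);
    strcpy(merged + dir_len, rel);

    /* Step 2: copy the scheme, then rebuild the path segment by segment,
       dropping "." and letting ".." cancel the preceding segment. */
    memcpy(out, base, scheme_len);
    n = scheme_len;
    for (p = merged; *p != '\0'; ) {
        size_t seg_len;

        ++p;                                          /* skip the '/' */
        seg_len = strcspn(p, "/");
        if (seg_len == 1 && p[0] == '.') {
            ends_in_dot = 1;                          /* stay at this level */
        } else if (seg_len == 2 && p[0] == '.' && p[1] == '.') {
            ends_in_dot = 1;
            while (n > scheme_len && out[--n] != '/')
                ;                                     /* remove last segment */
        } else {
            if (n + seg_len + 3 > outsize)
                return -1;
            out[n++] = '/';
            memcpy(out + n, p, seg_len);
            n += seg_len;
            ends_in_dot = 0;
        }
        p += seg_len;
    }
    if (ends_in_dot)
        out[n++] = '/';       /* a trailing "." or ".." names a directory */
    out[n] = '\0';
    return 0;
}
```

Called with the base "arg:/mybuf/1" and the reference "../theirbuf/2", the sketch produces "arg:/theirbuf/2", matching the example above.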

6.5  Error and log messages

By default, Sablotron writes error and warning messages to stderr, and does no logging. By a call to SablotSetLog(), you can specify the name of the log file to be used.

Besides, you can use SablotRegHandler() to override the default message handling. The handler you register will receive all messages in a structured form that's easy to process and filter. For details, see the documentation in sablot.h and shandler.h.

7  The C interface

This section describes the functions exported from the Sablotron library. All of them have a return type of 'int' and return an error flag (nonzero signals an error). Errors are reported to the user by Sablotron itself.

7.1  Shortcuts

We'll first describe the 'shortcuts' that do the whole processing in one call.

int SablotProcess(char *sheetURI, char *inputURI, char *resultURI, char **params, char **arguments, char **resultArg);

This is the basic function. The first three of its arguments are the URIs of the XSLT stylesheet, the XML source and the resulting document, respectively. For some notes on specifying file names, see above.

params is an array of pointers to the names and contents of the top-level stylesheet parameters. Thus, params[0] is a pointer to the null-terminated name of the first parameter, params[1] points to the (null-terminated) contents of the first parameter. The following two array items do the same for the second parameter, etc. The whole array is terminated by a NULL pointer in place of the name. If no parameters are to be passed, you can specify NULL for params itself.

arguments is a similar array of named buffers to be passed to the stylesheet. (They can be referred to via the 'arg:' scheme, see above.) Again, the array is a sequence of (name, value) pairs terminated by NULL in place of a name. If no named buffers are to be passed, you can specify NULL for arguments itself.
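
The layout shared by params and arguments can be made concrete with a small sketch that does not call Sablotron at all. The array contents and the helper name pair_lookup are hypothetical; only the layout (alternating names and values, NULL in place of the final name, NULL for "no pairs") follows the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical example data: two stylesheet parameters laid out as
   name, value, name, value, ..., NULL. */
static char *example_params[] = {
    "use_defaults", "1",
    "title",        "Report",
    NULL
};

/* Hypothetical helper: walks an array of (name, value) pairs terminated
   by NULL in place of a name and returns the value stored under `name`.
   Returns NULL if the name is absent or if `pairs` itself is NULL. */
const char *pair_lookup(char **pairs, const char *name)
{
    size_t i;

    assert(name != NULL);
    if (pairs == NULL)
        return NULL;
    for (i = 0; pairs[i] != NULL; i += 2) {
        if (strcmp(pairs[i], name) == 0)
            return pairs[i + 1];
    }
    return NULL;
}
```

Here pair_lookup(example_params, "title") returns "Report". An array such as example_params is what would be passed as the params argument of SablotProcess().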

resultArg enables one to access the resulting document in case the output went to a named buffer. In that situation, *resultArg points to the resulting null-terminated string, allocated by Sablotron. You can pass NULL for resultArg if the output is sure to go to a file.

Note: When you are done processing the string pointed to by *resultArg, free it using SablotFree(); never use free(). The latter is guaranteed to produce a segmentation fault under Linux.

int SablotProcessFiles(char *styleSheetName, char *inputName, char *resultName);

A wrapper for SablotProcess() working on files. The parameters are the null-terminated file names of the XSLT stylesheet, the XML input and the result, respectively. Sablotron opens these files itself and closes them after the processing is complete. Values like "file://stdin" are allowed.

int SablotProcessStrings(char *styleSheetStr, char *inputStr, char **resultStr);

Another wrapper for SablotProcess(), this time for accessing named buffers (i.e. user-allocated memory blocks) only. Thus, the first parameter is a null-terminated string containing the whole stylesheet; the second parameter is a null-terminated string containing the XML input. Sablotron allocates the buffer for the resulting string and returns a pointer to it in *resultStr. Hence, invoking puts(*resultStr) after having called SablotProcessStrings() sends the result to stdout. The buffer allocated must be freed by calling the function SablotFree() described below.

7.2  Basic functions

The above shortcuts just call the basic, lower-level functions described below. Note that if you need to set options for logging etc., you may need to use the low-level functions.

A typical processing session may look like this:

          SablotHandle p;
          char *my_buf;
          SablotCreateProcessor(&p);
          SablotSetLog(p, ...);
          /* ...set other instance-specific options here... */
          SablotRunProcessor(p, ...);
          SablotGetResultArg(p, "arg:/somename", &my_buf);
          /* ...do something with my_buf... */
          SablotFree(my_buf);   /* release the copy; never use free() */
          /* can run the processor again if necessary */
          SablotRunProcessor(p, ...);
          SablotDestroyProcessor(p);

int SablotCreateProcessor(SablotHandle *processorPtr);

Creates an instance of Sablotron and returns a pointer to it in *processorPtr. This pointer is passed on all subsequent calls to this instance. Note that it is currently not possible to have multiple processor instances at one time; however, the present interface was designed to facilitate supporting the possibility in a future version.

int SablotDestroyProcessor(SablotHandle processor_);

Destroys an instance of the processor, deallocating all the memory used up by it.

int SablotRunProcessor(SablotHandle processor_, char *sheetURI, char *inputURI, char *resultURI, char **params, char **arguments);

Processes documents using the given processor instance and given params and args definitions. See SablotProcess().

int SablotGetResultArg(SablotHandle processor_, char *argURI, char **argValue);

Copies the result 'arg' buffer with the given URI, returning a pointer to the newly-allocated block in *argValue. If no such buffer exists, returns NULL in *argValue.

This function is necessary because a result document that is output to memory would otherwise be lost when SablotDestroyProcessor() is called. When deallocating the copy obtained from SablotGetResultArg(), use SablotFree() (never free()).

int SablotFreeResultArgs(SablotHandle processor_);

Removes the Sablotron-internal copies of the 'arg' buffers from the last Sablotron run. Normally, there should be no reason to call this function as it is called automatically on both SablotRunProcessor() and SablotDestroyProcessor().

int SablotFree(char *resultBuf);

This function frees a buffer allocated by a previous call to SablotProcess(), SablotProcessStrings() or SablotGetResultArg(). Calling it with an invalid pointer will cause a crash.

int SablotRegHandler( SablotHandle processor_, HandlerType type, void *handler, void *userData);

Registers an external handler. type can be HLR_MESSAGE, HLR_SCHEME or HLR_SAX. handler points to the callback vector of the appropriate type. userData is a data item to be passed to all callbacks of this particular handler. For details, check the sablot.h and shandler.h header files.

int SablotUnregHandler( SablotHandle processor_, HandlerType type, void *handler, void *userData);

Unregisters the given external handler. For details, check the sablot.h and shandler.h header files.

int SablotSetLog( SablotHandle processor_, const char *logFilename, int logLevel);

Sets the log filename. The logLevel parameter is currently not used. Pass NULL for logFilename to turn logging off (default).

The other functions published by sablot.h have been included for experimental reasons or for compatibility, and it is better not to use them.

int SablotClearError(SablotHandle processor_);

Clears the 'pending error' flag for this instance of Sablotron.

8  The command line interface

Sablotron comes with a command-line interface to the shared library, which is a program named sabcmd. At present, sabcmd is invoked as follows:

sabcmd [options] stylesheet [input [result]] [assignments]

The arguments are the URIs of the XSLT stylesheet, the XML input document, and the resulting document, respectively. The default for input is file://stdin (meaning plain old stdin); result defaults to file://stdout. Filenames have to include the extension (if any).

You can display the list of available options by typing sabcmd --help. Among the more useful ones are --log-file (for setting the log file) and --measure (measures and outputs the total processing time).

The rules for filenames are the same as with SablotProcess().

assignments is a series of definitions of the form:

name1=value1 name2=value2 ...

assigning values to top-level stylesheet parameters and to named buffers. These two cases are distinguished by a leading '$' in the name of a stylesheet parameter. The names of the buffers do not start with "arg:". They may start with a slash; if they don't, the slash is prepended.
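
The rules for a single assignment can be sketched in C. This is an illustration of the rules as stated here, not the parser used by sabcmd: the names parse_assignment and assignment_kind are hypothetical, and splitting at the first '=' is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum assignment_kind { ASSIGN_PARAM, ASSIGN_BUFFER };

/* Hypothetical helper: splits "name=value" at the first '=' (assumed).
   A leading '$' marks a stylesheet parameter and is not part of its
   name. A buffer name gets a '/' prepended unless it already starts
   with one. On success, stores the name in `name`, points *value at
   the text after '=', and returns 0. Returns -1 if the text is
   malformed or the name does not fit into `namesize` bytes. */
int parse_assignment(const char *text, enum assignment_kind *kind,
                     char *name, size_t namesize, const char **value)
{
    const char *eq;
    size_t len;
    int is_param, need_slash;

    assert(text != NULL && kind != NULL && name != NULL && value != NULL);
    eq = strchr(text, '=');
    if (eq == NULL || eq == text)
        return -1;
    is_param = (text[0] == '$');
    if (is_param)
        ++text;                         /* skip the '$' marker */
    len = (size_t)(eq - text);
    need_slash = !is_param && text[0] != '/';
    if (len == 0 || len + (size_t)need_slash + 1 > namesize)
        return -1;
    if (need_slash)
        *name++ = '/';
    memcpy(name, text, len);
    name[len] = '\0';
    *kind = is_param ? ASSIGN_PARAM : ASSIGN_BUFFER;
    *value = eq + 1;
    return 0;
}
```

With this sketch, "$use_defaults=1" is classified as the parameter use_defaults with value "1", and "the_input=<a/>" as the buffer /the_input with contents "<a/>".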

Note: In most cases, it will be necessary to quote the individual assignments. Which quotes to use depends on the shell: single quotes work for bash (and keep it from expanding a leading '$' as a shell variable), double quotes work in Windows.

If the result URI refers to a named buffer, the output would normally remain buried in memory. Sabcmd dumps the buffer to standard output instead.

To sum up and give an example, the following would be a valid invocation of sabcmd (shown with the double quotes used in Windows):

sabcmd sheet.xsl arg:/the_input "the_input=<a/>" "$use_defaults=1"

This processes the document passed in the buffer named the_input, using a stylesheet found in file "sheet.xsl" in the working directory. We assign 1 to the top-level parameter called "use_defaults". The output goes to stdout by default.

9  References

[XSLT]
XSL Transformations (XSLT) Version 1.0
[XPath]
XML Path Language (XPath) Version 1.0
[XML]
Extensible Markup Language (XML) 1.0
[Cover]
The XML Cover Pages
[XMLorg]
XML.org
[XSLINFO]
XSLINFO.com
[XMLBible14]
Harold, E. R.: XML Bible, Chapter 14 (online presentation)

(c) 2000 Ginger Alliance s.r.o.