Tomas Kaiser (Ginger Alliance)
September 13, 2000
The HTML form of this description was compiled by Sablotron from the XML source Sablot-0-44.xml.
XSLT is a language for transforming XML data (the input) according to a stylesheet. XSLT stylesheets are themselves XML documents; that is, all instructions of the language are expressed in the form of XML elements. The output, i.e. the result of the processing, is typically an XML document as well, although the syntactic requirements can be relaxed to allow the creation of an HTML document (one that contains unclosed tags and the like), or even plain text.
XSLT was designed by the World Wide Web Consortium (W3C) as a part of the XSL stylesheet language, where it is complemented by a powerful set of formatting instructions. The most precise information about XSLT can be found in the W3C Recommendation [XSLT]. In particular, Appendix B of the Recommendation contains a handy syntax table. A good tutorial is [XMLBible14].
Other W3C Recommendations one often needs to consult are [XML] (for the definition of the XML language) and [XPath] (for details on XPath, the language used to form expressions in XSLT and elsewhere).
An excellent source of information about XSLT (indeed, about anything related to XML and SGML) is [Cover]; see also [XSLINFO] and [XMLorg].
Sablotron is an XSLT processor (though not yet fully conforming; see below) written in C++. Since the machines where it is meant to run include various small mobile clients, the main objectives of its design are the following:
Sablotron is a single shared library (sablot.dll or libsablot.so.0.44). It can also be used from the command line via the simple interface called sabcmd. See the description of sabcmd below for more information.
The only other files you will need are the two shared libraries that make up expat, the XML parser by James Clark. Their Windows names are xmlparse.dll and xmltok.dll; on Linux, these are libxmlparse.so.1.0 and libxmltok.so.1.0.
For information on the available interfaces to Python (experimental) and Perl, see www.gingerall.com.
Sablotron is written in C++. The source files compile under Win32 (using MS Visual C++ 4.2) and on Solaris and Linux (using g++ 2.95.2) without change.
The source or binary distributions of Sablotron can be downloaded from www.gingerall.com. For instructions on how to build the sources (if any), refer to the accompanying INSTALL file.
If you have access to the Ginger Alliance CVS server, you can get the working version of Sablotron in the CVS module ga.
In case you wish to get the latest source files, but have no access to the CVS server, please contact the authors.
The source distribution of expat is included with Sablotron. For more information on expat, check James Clark's page.
Sablotron is an open source project and all volunteers are most welcome! The documentation of the sources is still somewhat sparse but we will try to improve it. If you find the invitation to work on Sablotron with us interesting, please contact the authors. There is also a mailing list available, see www.gingerall.com.
The instruction set supported by this version of Sablotron is already sufficient for many transformation tasks (e.g. the task of formatting this document). On the other hand, a comparison of it to the XSLT specification [XSLT] shows that much is still to be done. The purpose of the following sections is to describe the varying degree of support for the elements of the XSLT language.
It may be helpful to refer to the syntax table in Appendix B of [XSLT]. Any instruction or attribute that is not listed below as unsupported should work. The authors will appreciate being told about any omissions found in the following description.
For readability, I sometimes omit the xsl: prefix from the instruction names.
template, apply-templates, call-template
Implemented. Lacking features: xsl:sort inside apply-templates is not supported. Predicates in match patterns are supported since release 0.42.
variable, param, with-param
Fully implemented. Top-level variables and parameters are read in the document order, so no forward references are resolved. This is a harmless deviation from the spec.
element, attribute, text, comment, processing-instruction, attribute-set
xsl:attribute-set is not implemented. For the rest, name is the only recognized attribute (where applicable). Literal result elements work.
stylesheet, transform, output
For stylesheet and transform, the only recognized attribute is version. xsl:output should work (see below for notes on the encoding attribute).
value-of, copy, copy-of
copy-of and value-of are fully implemented. copy is implemented except for the use-attribute-sets attribute.
namespace-alias
Namespaces should be processed correctly. The namespace-alias instruction is now supported (patch by Major).
sort
xsl:sort is not implemented yet. Contexts are sorted in document order as prescribed by the specification.
strip-space, preserve-space
Only the default whitespace stripping is done. That is, all whitespace-only text nodes in any stylesheet, not appearing inside an xsl:text, are removed. The two instructions for whitespace stripping and preservation are unsupported.
include, import, apply-imports
Only xsl:include is implemented. Processing involving multiple documents works, but needs more testing, e.g. with respect to generate-id().
The output mechanism is much closer to the spec than in the versions prior to 0.4. The following issues remain for the html method:
<SCRIPT> and <STYLE>
Almost all features of XPath are fully implemented, so most expressions should work without problems.
One exception relates to axes. The following 3 axes are not implemented yet: following, preceding, namespace.
Another possible exception may be numbers; we did not yet do a thorough test of rounding, NaNs, infinity, etc.
The implementation of the standard function library has been substantially extended. Only a few functions remain unimplemented:
id(), lang() (accepted but always returns true), key(), format-number(), current(), unparsed-entity-uri().
As for the functions that are implemented, the following is a list of differences from the spec:
document() only accepts one argument, always getting the base URI from the stylesheet URI.
string-length() returns the byte length of the UTF-8 representation of the string. This will typically differ from the actual length.
generate-id() might fail to generate unique identifiers when several input documents are present (giving the same id to nodes from different documents).
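To make the string-length() deviation concrete, here is a minimal C sketch contrasting the byte length of a UTF-8 string with its character count. The helper name utf8_char_count is hypothetical and is not part of the Sablotron API; it only illustrates why the two numbers differ for non-ASCII text.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of Sablotron: counts the characters
   (code points) in a UTF-8 string. Continuation bytes always have the
   bit pattern 10xxxxxx, so every byte that does NOT match that pattern
   starts a new character. */
size_t utf8_char_count(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For the four-character word "café" encoded as UTF-8 (the bytes "caf\xC3\xA9"), strlen() reports 5 bytes while utf8_char_count() reports 4 characters. Per the description above, string-length() in this version would give the byte count.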
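It is possible for the user to supply the following handlers to Sablotron: a message handler, a scheme handler, and a SAX handler (handler types HLR_MESSAGE, HLR_SCHEME and HLR_SAX, respectively). The handlers are set using SablotRegHandler(). For details concerning the interface of these handlers, consult the header files sablot.h and shandler.h.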
Thanks to the abilities of expat, Sablotron accepts input encoded as UTF-8, UTF-16, ISO-8859-1 or US ASCII. In addition, windows-1250 is recognized. A document can specify its encoding, as usual, in the encoding attribute of the XML declaration, e.g.
<?xml version='1.0' encoding='windows-1250'?>
The output is in UTF-8 by default. If the iconv library is present on the system (it seems to be a standard part of glibc2, and a Win32 implementation is available), then any of the above encodings can be specified for output, using the 'encoding' attribute of xsl:output. Note that this new feature [by Sven Neumann] still needs testing.
Sablotron can handle two URI schemes natively: 'file' and 'arg' (see below). Moreover, it is possible to use the function SablotRegSchemeHandler to register an external scheme handler which will receive requests in all other schemes. See the documentation in sablot.h and shandler.h.
Relative URI references are resolved in conformance to RFC 2396. The base URI is well defined when the relative reference appears inside an XML document; when invoking sabcmd, the base URI is taken to correspond to the current working directory.
When specifying filenames, the following rules are in effect:
Standard streams are referred to by name: stdin as file://stdin, etc.
To refer to a file by a path with a drive letter (e.g. C:\doc.xml), it is necessary to say file://c:/doc.xml.
Sablotron introduces a URI scheme 'arg:' which enables one to use strings in named memory buffers. The buffer names can have a tree-like structure so that a relative reference from a document in a buffer can be resolved as pointing to another buffer.
For instance, if we invoke Sablotron specifying that a buffer named /mybuf/1 contains the string "<a>contents</a>", then the expression document('arg:/mybuf/1')/a has string-value "contents". If the document in arg:/mybuf/1 contained a relative URI reference "../theirbuf/2" then this would be resolved as pointing to "arg:/theirbuf/2".
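The resolution of the relative reference in this example can be sketched in C as follows. This is a simplified illustration of RFC 2396-style path merging for 'arg:' names only, not Sablotron's actual resolver. The function name resolve_arg_ref, its signature and the 512-byte scratch buffer are all assumptions made for the sketch. It handles relative paths and absolute paths, but not references that carry their own scheme.

```c
#include <stddef.h>
#include <string.h>

#define ARG_PREFIX "arg:"

/* Hypothetical helper, not part of Sablotron: resolves 'rel' against an
   'arg:' base URI and writes the result to 'out'.
   Returns 0 on success, -1 on a bad base or insufficient space. */
int resolve_arg_ref(const char *base, const char *rel,
                    char *out, size_t out_size)
{
    char merged[512];               /* scratch space (size is an assumption) */
    const char *base_path, *seg;
    size_t dir_len = 0, out_len;

    if (strncmp(base, ARG_PREFIX "/", 5) != 0 || out_size < 6)
        return -1;
    base_path = base + 4;           /* points at the leading '/' */

    /* A relative path replaces the last segment of the base path. */
    if (rel[0] != '/')
        dir_len = (size_t)(strrchr(base_path, '/') - base_path) + 1;
    if (dir_len + strlen(rel) >= sizeof merged)
        return -1;
    memcpy(merged, base_path, dir_len);
    strcpy(merged + dir_len, rel);

    /* Copy segment by segment, dropping "." and applying "..". */
    strcpy(out, ARG_PREFIX);
    out_len = 4;
    for (seg = merged; *seg != '\0'; ) {
        size_t seg_len;
        while (*seg == '/')
            ++seg;
        seg_len = strcspn(seg, "/");
        if (seg_len == 0)
            break;
        if (seg_len == 2 && seg[0] == '.' && seg[1] == '.') {
            /* remove the last segment already written, if any */
            while (out_len > 4 && out[out_len - 1] != '/')
                --out_len;
            if (out_len > 4)
                --out_len;
        } else if (!(seg_len == 1 && seg[0] == '.')) {
            if (out_len + 1 + seg_len >= out_size)
                return -1;
            out[out_len++] = '/';
            memcpy(out + out_len, seg, seg_len);
            out_len += seg_len;
        }
        seg += seg_len;
    }
    if (out_len == 4)
        out[out_len++] = '/';       /* everything cancelled out: root */
    out[out_len] = '\0';
    return 0;
}
```

Called with the base "arg:/mybuf/1" and the reference "../theirbuf/2", the sketch produces "arg:/theirbuf/2", matching the example above. A plain reference such as "2" yields "arg:/mybuf/2".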
By default, Sablotron writes error and warning messages to stderr, and does no logging. By a call to SablotSetLog(), you can specify the name of the log file to be used.
Besides, you can use SablotRegHandler() to override the default message handling. The handler you register will receive all messages in a structured form that's easy to process and filter. For details, see the documentation in sablot.h and shandler.h.
This section describes the functions exported from the Sablotron library. All of them have a return type of 'int' and return an error flag (nonzero signals an error). Errors are reported to the user by Sablotron itself.
We'll first describe the 'shortcuts' that do the whole processing in one call.
int SablotProcess(char *sheetURI, char *inputURI, char *resultURI,
char **params, char **arguments, char **resultArg);
This is the basic function. The first three of its arguments are the URIs of the XSLT stylesheet, the XML source and the resulting document, respectively. For some notes on specifying file names, see above.
params is an array of pointers to the names and contents of the top-level stylesheet parameters. Thus, params[0] is a pointer to the null-terminated name of the first parameter, params[1] points to the (null-terminated) contents of the first parameter. The following two array items do the same for the second parameter, etc. The whole array is terminated by a NULL pointer in place of the name. If no parameters are to be passed, you can specify NULL for params itself.
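The array layout just described can be illustrated with a short C sketch. The helpers count_pairs and find_param and the sample array example_params are hypothetical and exist only for this illustration; they are not Sablotron functions. The same walk applies to the arguments array, which uses the identical (name, value) layout.

```c
#include <stddef.h>
#include <string.h>

/* Sample array in the layout described above: two parameters,
   "lang" = "en" and "debug" = "1", then NULL in place of a name. */
char *example_params[] = { "lang", "en", "debug", "1", NULL };

/* Hypothetical helper: counts the (name, value) pairs. A NULL array
   means "no parameters", as permitted for params itself. */
size_t count_pairs(char **params)
{
    size_t n = 0;
    if (params == NULL)
        return 0;
    while (params[2 * n] != NULL)
        ++n;
    return n;
}

/* Hypothetical helper: returns the value stored for 'name', or NULL
   when the array is NULL or holds no such name. */
const char *find_param(char **params, const char *name)
{
    size_t i;
    if (params == NULL)
        return NULL;
    for (i = 0; params[i] != NULL; i += 2) {
        if (strcmp(params[i], name) == 0)
            return params[i + 1];
    }
    return NULL;
}
```

Note that only the name slot is checked for NULL; the terminator takes the place of a name, so the array always has an odd number of entries.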
arguments is a similar array of named buffers to be passed to the stylesheet. (They can be referred to via the 'arg:' scheme, see above.) Again, the array is a sequence of (name, value) pairs terminated by NULL in place of a name. If no named buffers are to be passed, you can specify NULL for arguments itself.
resultArg enables one to access the resulting document in case the output went to a named buffer. In that situation, *resultArg points to the resulting null-terminated string, allocated by Sablotron. You can pass NULL for resultArg if the output is sure to go to a file.
Note: When you are done processing the string pointed to by *resultArg, free it using SablotFree(); never use free(). The latter is guaranteed to produce a segmentation fault under Linux.
int SablotProcessFiles(char *styleSheetName,
char *inputName,
char *resultName);
A wrapper for SablotProcess() working on files. The parameters are the null-terminated file names of the XSLT stylesheet, the XML input and the result, respectively. Sablotron opens these files itself and closes them after the processing is complete. Values like "file://stdin" are allowed.
int SablotProcessStrings(char *styleSheetStr, char *inputStr,
char **resultStr);
Another wrapper for SablotProcess(), this time for accessing named buffers (i.e. user-allocated memory blocks) only. Thus, the first parameter is a null-terminated string containing the whole stylesheet; the second parameter is a null-terminated string containing the XML input. Sablotron allocates the buffer for the resulting string and returns a pointer to it in resultStr. Hence, invoking puts(*resultStr) after having called SablotProcessStrings sends the result to stdout. The buffer allocated must be freed by calling the function SablotFree described next.
The above shortcuts just call the basic, lower-level functions described below. Note that if you need to set options for logging etc., you may need to use the low-level functions.
A typical processing session may look like this:
SablotHandle p;
char *my_buf;

SablotCreateProcessor(&p);
SablotSetLog(p, ...);
/* ...set other instance-specific options here... */
SablotRunProcessor(p, ...);
SablotGetResultArg(p, "arg:/somename", &my_buf);
/* ...do something with my_buf... */
SablotFree(my_buf); /* release the copy; never use free() */
/* can run the processor again if necessary */
SablotRunProcessor(p, ...);
SablotDestroyProcessor(p);
int SablotCreateProcessor(SablotHandle *processorPtr);
Creates an instance of Sablotron and returns a pointer to it in *processorPtr. This pointer is passed on all subsequent calls to this instance. Note that it is currently not possible to have multiple processor instances at one time; however, the present interface was designed to facilitate supporting the possibility in a future version.
int SablotDestroyProcessor(SablotHandle processor_);
Destroys an instance of the processor, deallocating all the memory used up by it.
int SablotRunProcessor(SablotHandle processor_,
char *sheetURI,
char *inputURI,
char *resultURI,
char **params,
char **arguments);
Processes documents using the given processor instance and the given params and arguments definitions. See SablotProcess().
int SablotGetResultArg(SablotHandle processor_,
char *argURI,
char **argValue);
Copies the result 'arg' buffer with the given URI, returning a pointer to the newly-allocated block in *argValue. If no such buffer exists, returns NULL in *argValue.
This function is necessary because, if the result document is output to memory, it would be lost when SablotDestroyProcessor() is called. When deallocating the copy obtained from SablotGetResultArg(), use SablotFree (never free()).
int SablotFreeResultArgs(SablotHandle processor_);
Removes the Sablotron-internal copies of the 'arg' buffers from the last Sablotron run. Normally, there should be no reason to call this function as it is called automatically on both SablotRunProcessor() and SablotDestroyProcessor().
int SablotFree(char *resultBuf);
This function frees a result buffer allocated by Sablotron, such as the one returned by a previous call to SablotProcessStrings, SablotProcess or SablotGetResultArg. Calling it with an invalid pointer will cause a crash.
int SablotRegHandler(
SablotHandle processor_,
HandlerType type,
void *handler,
void *userData);
Registers an external handler. type can be HLR_MESSAGE, HLR_SCHEME or HLR_SAX. handler points to the callback vector of the appropriate type. userData is a data item to be passed to all callbacks of this particular handler. For details, check the sablot.h and shandler.h header files.
int SablotUnregHandler(
SablotHandle processor_,
HandlerType type,
void *handler,
void *userData);
Unregisters the given external handler. For details, check the sablot.h and shandler.h header files.
int SablotSetLog(
SablotHandle processor_,
const char *logFilename,
int logLevel);
Sets the log filename. The logLevel parameter is currently not used. Pass NULL for logFilename to turn logging off (default).
The other functions published by sablot.h have been included for experimental reasons or for compatibility, and it is better not to use them.
int SablotClearError(SablotHandle processor_);
Clears the 'pending error' flag for this instance of Sablotron.
Sablotron comes with a command-line interface to the shared library, which is a program named sabcmd. At present, sabcmd is invoked as follows:
sabcmd [options] stylesheet [input [result]] [assignments]
The arguments are the URIs of the XSLT stylesheet, the XML input document, and the resulting document, respectively. The default for input is file://stdin (meaning plain old stdin); result defaults to file://stdout. Filenames have to include the extension (if any).
You can display the list of available options by typing sabcmd --help. Among the more useful ones are --log-file (for setting the log file) and --measure (measures and outputs the total processing time).
The rules for filenames are the same as with SablotProcess().
assignments is a series of definitions of the form:
name1=value1 name2=value2 ...
assigning values to top-level stylesheet parameters and to named buffers. These two cases are distinguished by a leading '$' in the name of a stylesheet parameter. The names of the buffers do not start with "arg:". They may start with a slash; if they don't, the slash is prepended.
Note: In most cases, it will be necessary to quote the individual assignments. Whether to use single or double quotes depends on the shell: single quotes work in bash (where double quotes would let the shell expand a name starting with '$'), and double quotes work in Windows.
If the result URI refers to a named buffer, the output would normally remain buried in memory. Sabcmd dumps the buffer to standard output instead.
To sum up and give an example, the following would be a valid invocation of sabcmd:
sabcmd sheet.xsl arg:/the_input "the_input=<a/>" "$use_defaults=1"
This processes the document passed in the buffer named the_input, using a stylesheet found in file "sheet.xsl" in the working directory. We assign 1 to the top-level parameter called "use_defaults". The output goes to stdout by default.
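The rules for telling parameter assignments from buffer assignments can be sketched in C as below. This is only an illustration of the rules stated above, not sabcmd's actual parsing code. The names AssignKind and classify_assignment are hypothetical, and rejecting an assignment with an empty name or no '=' is an assumption of the sketch.

```c
#include <stddef.h>
#include <string.h>

typedef enum { ASSIGN_INVALID, ASSIGN_PARAM, ASSIGN_BUFFER } AssignKind;

/* Hypothetical helper, not part of sabcmd: splits "name=value" and
   classifies it. A leading '$' marks a stylesheet parameter (the '$'
   itself is not part of the name); anything else is a named buffer,
   whose name gets a '/' prepended when it does not start with one.
   The name is copied to 'name'; '*value' points into 'assignment'. */
AssignKind classify_assignment(const char *assignment,
                               char *name, size_t name_size,
                               const char **value)
{
    const char *eq = strchr(assignment, '=');
    const char *src = assignment;
    size_t len, add_slash = 0;
    AssignKind kind = ASSIGN_BUFFER;

    if (eq == NULL)
        return ASSIGN_INVALID;      /* assumption: '=' is required */

    if (*src == '$') {
        kind = ASSIGN_PARAM;
        ++src;                      /* skip the '$' marker */
    } else if (*src != '/') {
        add_slash = 1;              /* buffer names are rooted at '/' */
    }

    len = (size_t)(eq - src);
    if (len == 0 || add_slash + len >= name_size)
        return ASSIGN_INVALID;      /* empty name or name too long */

    if (add_slash)
        name[0] = '/';
    memcpy(name + add_slash, src, len);
    name[add_slash + len] = '\0';
    *value = eq + 1;
    return kind;
}
```

Applied to the invocation above, "the_input=<a/>" is classified as a buffer named "/the_input" with the value "<a/>", and "$use_defaults=1" as the stylesheet parameter "use_defaults" with the value "1".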
(c) 2000 Ginger Alliance s.r.o.