<!DOCTYPE html PUBLIC
	'-//W3C//DTD XHTML 1.0 Transitional//EN'
	'http://www.w3.org/TR/xhtml1/DTD/transitional.dtd'>

<html><head>
    <title>package overview</title>
<!--
/*
 * Copyright (c) 1999-2001 by David Brownell.
 * This file is distributed under the GPL.
 *
 * $Id: package.html,v 1.1 2003-02-01 02:10:13 cbj Exp $
 */
-->
</head><body>

<p> This package contains &AElig;lfred2, which includes an
enhanced SAX2-compatible version of the &AElig;lfred
non-validating XML parser, a modular (and hence optional) 
DTD validating parser, and modular (and hence optional)
JAXP glue to those.
Use these like any other SAX2 parsers. </p>
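<p> As a minimal sketch of what "like any other SAX2 parser" means in
practice, the example below counts elements using only standard SAX2 and
JAXP calls.  It assumes the parser is obtained through the JAXP factory,
which returns whichever implementation is configured on the class path;
the class name <code>Sax2Usage</code> and its method are hypothetical
and not part of this package. </p>

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only.
final class Sax2Usage {
    /** Counts the elements in a document using only standard SAX2 calls. */
    static int countElements(String xml) {
        final int[] count = {0};
        try {
            // Assumption: whichever parser JAXP is configured to return.
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    count[0]++;
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        // One <config> element plus two <item> elements.
        System.out.println(countElements("<config><item/><item/></config>")); // prints 3
    }
}
```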

<ul>
    <li><a href="#about">About &AElig;lfred</a><ul>
	<li><a href="#principles">Design Principles</a></li>
	<li><a href="#name">About the Name &AElig;lfred</a></li>
	<li><a href="#encodings">Character Encodings</a></li>
	<li><a href="#violations">Known Conformance Violations</a></li>
	<li><a href="#licensing">Licensing</a></li>
	</ul></li>

    <li><a href="#changes">Changes Since the Last Microstar Release</a><ul>
	<li><a href="#sax2">SAX2 Support</a></li>
	<li><a href="#validation">Validation</a></li>
	<li><a href="#smaller">You Want Smaller?</a></li>
	<li><a href="#bugfixes">Bugs Fixed</a></li>
	</ul></li>

</ul>

<p> Some of the documentation below was modified from the original
&AElig;lfred README.txt file.  All of it has been updated. </p>


<h2><a name="about">About &AElig;lfred</a></h2>

<p>&AElig;lfred is a Java-based XML parser originally from
Microstar Software Limited (no longer in existence) and
more or less placed into the public domain.</p>


<h3><a name="principles">Design Principles</a></h3>

<p>In most Java applets and applications, XML should not be the central
feature; instead, XML is the means to another end, such as loading
configuration information, reading meta-data, or parsing transactions.</p>

<p> When an XML parser is only a single component of a much larger
program, it cannot be large, slow, or resource-intensive.  With Java
applets, in particular, code size is a significant issue.  The standard
modem is still not operating at 56 Kbaud, or sometimes even with data
compression.  Assuming an uncompressed 28.8 Kbaud modem, only about
3 KBytes can be downloaded in one second; compression often doubles
that speed, but a V.90 modem may not provide another doubling.  When
used with embedded processors, similar size concerns apply.  </p>

<p> &AElig;lfred is designed for easy and efficient use over the Internet,
based on the following principles: </p> <ol>

<li> &AElig;lfred must be as small as possible, so that it doesn't add too
   much to an applet's download time. </li>

<li> &AElig;lfred must use as few class files as possible, to minimize the
   number of HTTP connections necessary.  (The use of JAR files has made this
   less of a concern.) </li>

<li> &AElig;lfred must be compatible with most or all Java implementations
   and platforms. (Write once, run anywhere.) </li>

<li> &AElig;lfred must use as little memory as possible, so that it does
   not take away resources from the rest of your program.  (It doesn't force
   you to use DOM or a similar costly data structure API.)</li>

<li> &AElig;lfred must run as fast as possible, so that it does not slow down
   the rest of your program. </li>

<li> &AElig;lfred must produce correct output for well-formed and valid
   documents, but need not reject every document that is not valid or
   not well-formed. (In &AElig;lfred2, correctness was a bigger concern
   than in the original version; and a validation option is available.) </li>

<li> &AElig;lfred must provide full internationalization from the first
    release.  (&AElig;lfred2 now automatically handles all encodings
    supported by the underlying JVM; previous versions handled only
    UTF-8, UTF-16, ASCII, and ISO-8859-1.)</li>

</ol>

<p>As you can see from this list, &AElig;lfred is designed for production
use, but neither validation nor perfect conformance was a requirement.
Good validating parsers exist, including one in this package,
and you should use them as appropriate.  (See conformance reviews
available at <a href="http://www.xml.com/">http://www.xml.com</a>.)
</p>

<p> One of the main goals of &AElig;lfred2 was to significantly improve
conformance, while not significantly affecting the other goals stated above.
Since the only use of this parser is with SAX, some classes could be
removed, and so the overall size of &AElig;lfred was actually reduced.
Subsequent performance work produced a notable speedup (over twenty
percent on larger files).  That is, the tradeoffs between speed, size, and
conformance were re-targeted towards conformance and support of newer APIs
(SAX2), with a positive performance impact. </p>

<p> The role anticipated for this version of &AElig;lfred is as a
lightweight Free Software SAX parser that can be used in essentially every
Java program where the handful of conformance violations (noted below)
are acceptable.
That certainly includes applets, and
nowadays one must also mention embedded systems as being even more
size-critical.
At this writing, all parsers that are more conformant are
significantly larger, even when counting the optional
validation support in this version of &AElig;lfred. </p>


<h3><a name="name">About the Name <em>&AElig;lfred</em></a></h3>

<p>&AElig;lfred the Great (AElfred in ASCII) was King of Wessex, and
some say King of England, at the time of his death in 899 AD.
&AElig;lfred introduced a wide-spread literacy program in the hope that
his people would learn to read English, at least, if Latin was too
difficult for them.  This &AElig;lfred hopes to bring another sort of
literacy to Java, using XML, at least, if full SGML is too difficult.</p>

<p>The initial &AElig; ligature ("AE") is also a reminder that XML is
not limited to ASCII.</p>


<h3><a name="encodings">Character Encodings</a></h3>

<p> The &AElig;lfred parser currently builds in support for a handful
of input encodings.  Of course these include UTF-8 and UTF-16, which
all XML parsers are required to support:</p> <ul>

    <li> UTF-8 ... the standard eight bit encoding, used unless
    you provide an encoding declaration or a MIME charset tag.</li>

    <li> US-ASCII ... an extremely common seven bit encoding,
    which happens to be a subset of UTF-8 and ISO-8859-1 as well
    as many other encodings.  XHTML web pages using US-ASCII
    (without an encoding declaration) are probably more
    widely interoperable than those in any other encoding. </li>

    <li> ISO-8859-1 ... includes accented characters used in
    much of western Europe (but excluding the Euro currency
    symbol).</li>

    <li> UTF-16 ... with several variants, this encodes each
    sixteen bit Unicode character in sixteen bits of output.
    Variants include UTF-16BE (big endian, no byte order mark),
    UTF-16LE (little endian, no byte order mark), and
    ISO-10646-UCS-2 (an older and less used encoding, using a
    version of Unicode without surrogate pairs).  This is
    essentially the native encoding used by Java.  </li>

    <li> ISO-10646-UCS-4 ... a seldom-used four byte encoding,
    also known as UTF-32BE.  Four byte order variants are supported,
    including one known as UTF-32LE.  Some operating systems
    standardized on UCS-4 despite its significant size penalty,
    in anticipation that Unicode (even with surrogate pairs)
    would eventually become limiting.  UCS-4 permits encoding
    of non-Unicode characters, which Java can't represent (and
    XML doesn't allow).  
    </li>

    </ul>
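<p> The size differences between these encodings can be seen from the
JVM itself.  The sketch below encodes the &AElig; ligature (U+00C6) and
reports how many bytes each encoding produces; the class name
<code>EncodingSizes</code> is hypothetical.  Note that Java's plain
"UTF-16" encoder writes a two byte big endian byte order mark, which the
BE and LE variants omit. </p>

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class EncodingSizes {
    /** Number of bytes the JVM emits for the text in the given charset. */
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String text = "\u00C6"; // the AElig ligature
        System.out.println(encodedLength(text, StandardCharsets.ISO_8859_1)); // prints 1
        System.out.println(encodedLength(text, StandardCharsets.UTF_8));      // prints 2
        System.out.println(encodedLength(text, StandardCharsets.UTF_16BE));   // prints 2
        System.out.println(encodedLength(text, StandardCharsets.UTF_16LE));   // prints 2
        System.out.println(encodedLength(text, StandardCharsets.UTF_16));     // prints 4 (BOM + character)
    }
}
```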

<p> If you use any encoding other than UTF-8 or UTF-16 you should
make sure to label your data appropriately: </p>

<blockquote>
&lt;?xml version="1.0" encoding="<b>ISO-8859-15</b>"?&gt;
</blockquote>

<p> Encodings accessed through <code>java.io.InputStreamReader</code>
are now fully supported for both external labels (such as MIME types)
and internal types (as shown above).
There is one limitation in the support for internal labels:
the encodings must be derived from the US-ASCII encoding;
the EBCDIC family of encodings is not recognized.
Note that Java defines its
own encoding names, which don't always correspond to the standard
Internet encoding names defined by the IETF/IANA, and that Java
may even <em>require</em> use of nonstandard encoding names.
Please report
such problems; some of them can be worked around in this parser,
and many can be worked around by using external labels.
</p>
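<p> One way to supply an external label is to decode the bytes yourself
and hand the parser a character stream.  The following is a sketch under
that assumption: it wraps the input in a
<code>java.io.InputStreamReader</code> with an explicitly named encoding
and passes the reader to a SAX2 parser obtained through JAXP.  The class
name <code>ExternalLabel</code> is hypothetical. </p>

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only.
final class ExternalLabel {
    /** Decodes the bytes with the named encoding and returns the text content. */
    static String textContent(byte[] document, String encodingName) {
        final StringBuilder text = new StringBuilder();
        try {
            // The encoding name here acts as the external label.
            Reader reader = new InputStreamReader(
                new ByteArrayInputStream(document), encodingName);
            XMLReader parser =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            parser.setContentHandler(new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    text.append(ch, start, length);
                }
            });
            parser.parse(new InputSource(reader));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-15\"?>"
            + "<price>\u20AC5</price>";
        byte[] bytes = xml.getBytes(Charset.forName("ISO-8859-15"));
        // The Euro sign survives the round trip.
        System.out.println(textContent(bytes, "ISO-8859-15").equals("\u20AC5")); // prints true
    }
}
```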

<p>Note that if you are using the Euro symbol with a fixed-length
eight bit encoding, you should probably be using the encoding label
<em>iso-8859-15</em> or, with a Microsoft OS, <em>cp-1252</em>.
Of course, UTF-8 and UTF-16 handle the Euro symbol directly.
</p>
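<p> The reason the label matters can be checked with a single byte.  In
this sketch the byte 0xA4 decodes to the generic currency sign under
ISO-8859-1 but to the Euro sign under ISO-8859-15, while the Microsoft
code page places the Euro sign at 0x80.  The class name
<code>EuroByte</code> is hypothetical, and "windows-1252" is the name
assumed here for the Microsoft code page. </p>

```java
import java.nio.charset.Charset;

// Hypothetical helper, for illustration only.
final class EuroByte {
    /** Decodes a single byte under the named encoding. */
    static String decode(int byteValue, String encodingName) {
        return new String(new byte[] {(byte) byteValue},
                          Charset.forName(encodingName));
    }

    public static void main(String[] args) {
        // Print the Unicode code point each byte maps to.
        System.out.println(Integer.toHexString(decode(0xA4, "ISO-8859-1").codePointAt(0)));   // prints a4
        System.out.println(Integer.toHexString(decode(0xA4, "ISO-8859-15").codePointAt(0)));  // prints 20ac
        System.out.println(Integer.toHexString(decode(0x80, "windows-1252").codePointAt(0))); // prints 20ac
    }
}
```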


<h3><a name="violations">Known Conformance Violations</a></h3>

<p>Known conformance issues should be of negligible importance for 
most applications, and include: </p><ul>
    
    <li> Rather than following the voluminous "Appendix B" rules about
    what characters may appear in names (and name tokens), the Unicode
    rules embedded in <em>java.lang.Character</em> are used.
    This means mostly that some names are inappropriately accepted,
    though a few are inappropriately rejected.  (It's much simpler
    to avoid that much special case code.  Recent OASIS/NIST test
    cases may have these rules be realistically testable.) </li>

    <li> Text containing "]]&gt;" is not rejected unless it fully resides
    in an internal buffer ... which is, thankfully, the typical case.  This
    text is illegal, but sometimes appears in illegal attempts to
    nest CDATA sections.  (Not catching that boundary condition
    substantially simplifies parsing text.) </li>

    <li> Surrogate characters that aren't correctly paired are ignored
    rather than rejected, unless they were encoded using UTF-8.  (This
    simplifies parsing text.)  Unicode 3.1 assigned the first characters
    to those character codes, in early 2001, so few documents (or tools)
    use such characters in any case. </li>

    <li> Declarations following references to an undefined parameter
    entity reference are not ignored. (Not maintaining and using state
    about this validity error simplifies declaration handling; few
    XML parsers address this constraint in any case.) </li>

    <li> Well formedness constraints for general entity references
    are not enforced.  (The code to handle the "content" production
    is merged with the element parsing code, making it hard to reuse
    for this additional situation.) </li>

</ul>
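<p> To illustrate the first item, a name check built on
<code>java.lang.Character</code> might look roughly like the sketch
below.  This is not the parser's actual code: the class
<code>NameCheck</code>, its methods, and the exact choice of
<code>Character</code> predicates are assumptions made for illustration,
and the real rules include further adjustments. </p>

```java
// Hypothetical sketch, not the parser's implementation.
final class NameCheck {
    /** Approximates the XML "name start" rule using Unicode letter data. */
    static boolean isNameStart(char c) {
        return Character.isLetter(c) || c == '_' || c == ':';
    }

    /** Approximates the XML "name character" rule. */
    static boolean isNameChar(char c) {
        return Character.isLetterOrDigit(c)
            || c == '_' || c == ':' || c == '-' || c == '.';
    }

    /** True if the string looks like an XML name under these approximations. */
    static boolean looksLikeName(String s) {
        if (s.isEmpty() || !isNameStart(s.charAt(0))) {
            return false;
        }
        for (int i = 1; i < s.length(); i++) {
            if (!isNameChar(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeName("doc-1.a"));      // prints true
        System.out.println(looksLikeName("\u00C6lfred"));  // prints true
        System.out.println(looksLikeName("1doc"));         // prints false
    }
}
```

<p> Because the Unicode character classes do not line up exactly with
the tables in the XML specification, a check of this shape accepts some
names the specification forbids and rejects a few it allows. </p>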

<p> When tested against the July 12, 1999 version of the OASIS
XML Conformance test suite, an earlier version passed 1057 of 1067 tests.
That contrasts with the original version, which passed 867.  The
current parser is top-ranked in terms of conformance, as is its
validating sibling (which has some additional conformance violations
imposed on it by SAX2 API deficiencies as well as some of the more
curious SGML layering artifacts found in the XML specification). </p>

<p> The XML 1.0 specification itself was not without problems,
and after some delays the W3C has come out with a revised
"second edition" specification.  While that doesn't resolve all
the problems identified in the XML specification, many of the most
egregious problems have been resolved.  (You still need to drink
magic Kool-Aid before some DTD-related issues make sense.)
To the extent possible, this parser conforms to that second
edition specification, and does well against corrected versions
of the OASIS/NIST XML conformance test cases.  See <a href=
"http://xmlconf.sourceforge.net">http://xmlconf.sourceforge.net</a>
for more information about SAX2/XML conformance testing. </p>


<h3><a name="licensing">Licensing</a></h3>

<p> As noted above, the original distribution was more or less
public domain.  The license had the constraint that modifications
be clearly documented, as has been done here.  </p>

<p> This version is Copyright (c) 1999-2001 by David Brownell,
and all the modifications are distributed under the GNU General
Public License (GPL).  It is subject to the "Library Exception",
supporting use in some environments (such as embedded systems where
dynamic linking may not be available) by proprietary code without
necessarily requiring all code to be licensed under the GPL.
</p>


<h2><a name="changes">Changes Since the Last Microstar Release</a></h2>

<p> As noted above, Microstar has not updated this parser since
the summer of 1998, when it released version 1.2a on its web site.
This release is intended to benefit the developer community by
refocusing the API on SAX2, and improving conformance to the extent
that most developers should not need to use another XML parser.  </p>

<p> The code has been cleaned up (referring to the XML 1.0 spec in
all the production numbers in 
comments, rather than some preliminary draft, for one example) and
has been sped up a bit as well.
JAXP support has been added, although developers are still
strongly encouraged to use the SAX2 APIs directly.  </p>


<h3><a name="sax2">SAX2 Support</a></h3>

<p> The original version of &AElig;lfred did not support the
SAX2 APIs. </p>

<p> This version supports the SAX2 APIs, exposing the standard
boolean feature descriptors.  It supports the "DeclHandler" property
to provide access to all DTD declarations not already exposed
through the SAX1 API.  The "LexicalHandler" property is supported,
exposing entity boundaries (including the unnamed external subset) and
things like comments and CDATA boundaries.  SAX1 compatibility is
currently provided.</p>
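<p> As a sketch of how these two handler properties are typically used,
the example below registers one <code>DefaultHandler2</code> object
under the standard SAX2 property URIs and records a few of the events it
receives.  The parser is obtained through JAXP, so it is whichever
implementation is configured; the class name
<code>ExtensionHandlers</code> is hypothetical. </p>

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical helper, for illustration only.
final class ExtensionHandlers {
    /** Parses the document and returns the lexical and DTD events seen. */
    static List<String> events(String xml) {
        final List<String> seen = new ArrayList<>();
        DefaultHandler2 handler = new DefaultHandler2() {
            @Override
            public void startDTD(String name, String publicId, String systemId) {
                seen.add("startDTD " + name);
            }
            @Override
            public void elementDecl(String name, String model) {
                seen.add("elementDecl " + name);
            }
            @Override
            public void comment(char[] ch, int start, int length) {
                seen.add("comment " + new String(ch, start, length).trim());
            }
            @Override
            public void startCDATA() {
                seen.add("startCDATA");
            }
        };
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            // Standard SAX2 property names for the two extension handlers.
            reader.setProperty(
                "http://xml.org/sax/properties/lexical-handler", handler);
            reader.setProperty(
                "http://xml.org/sax/properties/declaration-handler", handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return seen;
    }

    public static void main(String[] args) {
        String xml = "<!DOCTYPE doc [<!ELEMENT doc (#PCDATA)>]>"
            + "<doc><!-- hi --><![CDATA[x]]></doc>";
        for (String event : events(xml)) {
            System.out.println(event);
        }
    }
}
```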


<h3><a name="validation">Validation</a></h3>

<p> In the 'pipeline' package in this same software distribution is an
<a href="../pipeline/ValidationConsumer.html">XML Validation component</a>
using any full SAX2 event stream (including all document type declarations)
to validate.  There is now an <a href="XmlReader.html">XmlReader</a> class
which combines that class and this enhanced &AElig;lfred parser, creating
an optionally validating SAX2 parser. </p>

<p> As noted in the documentation for that validating component, certain
validity constraints can't reliably be tested by a layered validator.
These include all constraints relying on
layering violations (exposing XML at the level of tokens or below,
required since XML isn't a context-free grammar), some that
SAX2 doesn't support, and a few others.  The resulting validating
parser is conformant enough for most applications that aren't doing
strange SGML tricks with DTDs.
Moreover, that validating filter can be used without
a parser ... any application component that emits SAX event streams
can DTD-validate its output on demand. </p>
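<p> From the application's side, validity errors arrive as recoverable
errors on the SAX <code>ErrorHandler</code>.  The sketch below shows
that pattern using the generic JAXP validation switch as a stand-in; it
does not use or reproduce the pipeline classes described above, and the
class name <code>DtdValidation</code> is hypothetical. </p>

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only.
final class DtdValidation {
    /** Parses with DTD validation enabled and counts reported validity errors. */
    static int validityErrors(String xml) {
        final int[] errors = {0};
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true); // request a DTD-validating parser
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    // Validity errors are recoverable; count and continue.
                    errors[0]++;
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return errors[0];
    }

    public static void main(String[] args) {
        String valid = "<!DOCTYPE doc [<!ELEMENT doc (#PCDATA)>]><doc>ok</doc>";
        String invalid = "<!DOCTYPE doc [<!ELEMENT doc EMPTY>]><doc>text</doc>";
        System.out.println(validityErrors(valid) == 0);  // prints true
        System.out.println(validityErrors(invalid) > 0); // prints true
    }
}
```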

<h3><a name="smaller">You Want Smaller?</a></h3>

<p> You'll have noticed that the original version of &AElig;lfred
had small size as a top goal.  &AElig;lfred2 normally includes a
DTD validation layer, but you can package without that.
Similarly, JAXP factory support is available but optional.
Then the main added cost due to this revision is for
supporting the SAX2 API itself; DTD validation is as
cleanly layered as allowed by SAX2.</p>

<h3><a name="bugfixes">Bugs Fixed</a></h3>

<p> Bugs fixed in &AElig;lfred2 include: </p>

<ol>
    <li> Originally &AElig;lfred didn't close file descriptors, which
    led to file descriptor leakage on programs which ran for any
    length of time. </li>

    <li> NOTATION declarations without system identifiers are
    now handled correctly. </li>

    <li> DTD events are now reported for all invocations of a
    given parser, not just the first one. </li>

    <li> More correct character handling: <ul>

	<li> Rejects out-of-range characters, both in text and in
	character references. </li>

	<li> Correctly handles character references that expand to
	surrogate pairs. </li>

	<li> Correctly handles UTF-8 encodings of surrogate pairs. </li>

	<li> Correctly handles Unicode 3.1 rules about illegal UTF-8
	encodings: there is only one legal encoding per character. </li>

	<li> PUBLIC identifiers are now rejected if they have illegal
	characters. </li>

	<li> The parser is more correct about what characters are allowed
	in names and name tokens.  Uses Unicode rules (built in to Java)
	rather than the voluminous XML rules, although some extensions
	have been made to match XML rules more closely.</li>

	<li> Line ends are now normalized to newlines in all known
	cases. </li>

	</ul></li>

    <li> Certain validity errors were previously treated as well
    formedness violations. <ul>

	<li> Repeated declarations of an element type are no
	longer fatal errors. </li>

	<li> Undeclared parameter entity references are no longer
	fatal errors. </li>

	</ul></li>

    <li> Attribute handling is improved: <ul>

	<li> Whitespace must exist between attributes. </li>

	<li> Only one value for a given attribute is permitted. </li>

	<li> ATTLIST declarations don't need to declare attributes. </li>

	<li> Attribute values are normalized when required. </li>

	<li> Tabs in attribute values are normalized to spaces. </li>

	<li> Attribute values containing a literal "&lt;" are rejected. </li>

	</ul></li>

    <li> More correct entity handling: <ul>

	<li> Whitespace must precede NDATA when declaring unparsed
	entities.</li>

	<li> Parameter entity declarations may not have NDATA annotations. </li>

	<li> The XML specification has a bug in that it doesn't specify
	that certain contexts exist within which parameter entity
	expansion must not be performed.  Lacking an official erratum,
	this parser now disables such expansion inside comments,
	processing instructions, ignored sections, public identifiers,
	and parts of entity declarations. </li>

	<li> Entity expansions that include quote characters no longer
	confuse parsing of strings using such expansions. </li>

	<li> Whitespace in the values of internal entities is not mapped
	to space characters. </li>

	<li> General Entity references in attribute defaults within the
	DTD now cause fatal errors when the entity is not defined at the
	time it is referenced. </li>

	<li> Malformed general entity references in entity declarations are
	now detected.  </li>

	</ul></li>

    <li> Neither conditional sections
    nor parameter entity references within markup declarations
    are permitted in the internal subset. </li>

    <li> Processing instructions whose target names are "XML"
    (ignoring case) are now rejected. </li>

    <li> Comments may not include "--".</li>

    <li> Most "]]&gt;" sequences in text are rejected. </li>

    <li> Correct syntax for standalone declarations is enforced. </li>

    <li> Setting a locale for diagnostics only produces an exception
    if the language of that locale isn't English. </li>

    <li> Some more encoding names are recognized.  These include the
    Unicode 3.0 variants of UTF-16 (UTF-16BE, UTF-16LE) as well as
    US-ASCII and a few commonly seen synonyms. </li>

    <li> Text (from character content, PIs, or comments) large enough
    not to fit into internal buffers is now handled correctly even in
    some cases which were originally handled incorrectly.</li>

    <li> Content is now reported for element types for which attributes
    have been declared, but no content model is known.  (Such documents
    are invalid, but may still be well formed.) </li>

</ol>
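<p> Several of the items above are well formedness rules that any
conforming parser should enforce, so they can be spot-checked with a few
small documents.  The following is a sketch of such a check, assuming a
parser obtained through JAXP; the class name
<code>WellFormednessChecks</code> and the sample documents are
hypothetical. </p>

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only.
final class WellFormednessChecks {
    /** True if the parser reports a fatal error for the document. */
    static boolean isRejected(String xml) {
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            // DefaultHandler.fatalError() rethrows the exception it is given.
            reader.setErrorHandler(new DefaultHandler());
            reader.parse(new InputSource(new StringReader(xml)));
            return false;
        } catch (SAXParseException e) {
            return true;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] malformed = {
            "<a b=\"1\"c=\"2\"/>",       // no whitespace between attributes
            "<a b=\"1\" b=\"2\"/>",      // attribute given two values
            "<a b=\"<\"/>",              // literal "<" in an attribute value
            "<a><!-- x -- y --></a>",    // "--" inside a comment
            "<a><?XML x?></a>",          // reserved processing instruction target
        };
        for (String xml : malformed) {
            System.out.println(isRejected(xml) + "  " + xml);
        }
        System.out.println(isRejected("<a b=\"1\" c=\"2\"/>")); // prints false
    }
}
```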

<p> Other bugs may also have been fixed. </p>

<p> For better overall validation support, some of the validity
constraints that can't be verified using the SAX2 event stream
are now reported directly by &AElig;lfred2. </p>

</body></html>