================================
 Docutils: Multiple Input Files
================================

:Author: Lea Wiemann <LeWiemann@gmail.com>
:Date: $Date$
:Revision: $Revision$
:Copyright: This document has been placed in the public domain.

.. sectnum::
.. contents::


Introduction
============

We would like to support documents whose source text comes from
multiple files.  For instance, the Docutils documentation could be
considered a single large document; parsing all files into one single
document tree would enable us to do cross-linking between parts of the
documentation (our current way to cross-link between files is to link
to HTML files and fragments, which is limited to HTML).  Another
example is a reference manual for a customized software system.  The
manual is built from a set of sub-documents that may differ from
installation to installation.

Note that this issue is separate from that of output to multiple
files; after implementing support for multiple input files, all we
will be able to do is to generate a single huge output file.

This is a collection of notes and semi-random thoughts (many of which
are credited to David, from IM conversations).  Feel free to add yours
and/or make changes as you see fit!

You can also discuss this proposal on the Docutils-develop_ mailing
list, or reach us individually via email or Jabber/Google Talk at
LeWiemann@gmail.com and dgoodger@gmail.com, respectively.

.. _Docutils-develop:
   http://docutils.sf.net/docs/user/mailing-lists.html#docutils-develop


Terminology
===========

Right now, we are using the following terminology: A document which
includes other documents (using the ``subdocument`` directive
described below) is called a *super-document*.  The included documents
are called *sub-documents*.  Sub-documents can in turn include other
documents and can thus be super-documents themselves.  Any top-most
super-document in the hierarchy, which is not included by any other
document, is called a *compound document*.

The set of all documents that can be part of a compound document is
the *document set* (or *docset*).  The directory that is ancestor to
all documents in the document set is the *docset root*.


The ``subdocument`` Directive
=============================

* The "include" directive is not usable for this because we want to
  have independent parsing contexts (for instance, section title
  adornment should not have to be consistent across input files).

* So create a "subdocument" directive (syntax: ".. subdocument::
  file.txt").  This directive causes the referenced file to be parsed
  and its document tree to be inserted in place.

  - The sub-document must only consist of top-level sections and
    transitions, or it must have a document title.  (The document
    title will become a section title when the sub-document is
    inserted using the ``subdocument`` directive.)  We only support
    per-document docinfos -- if a sub-document contains multiple
    top-level sections, don't touch field lists at all.

    This restriction may be lifted later; there is no theoretical
    reason that prevents sub-documents from containing arbitrary text
    (within the limits of the DTD, e.g. no body elements may end up
    after section elements) -- it is merely somewhat harder to
    implement.

  - As long as the above restrictions apply, the subdocument directive can
    be treated like a section at parse-time; that is, no elements
    except for more sections or transitions may follow.
    
  - In order to facilitate assembling a large number of hierarchical
    files into a large document, the subdocument directive should
    allow specifying any number of files, like this::

        .. subdocument::

           chapter1.txt
               chapter1-section1.txt
               chapter1-section2.txt
           chapter2.txt
               chapter2-section1.txt
               ...

    Specifying an indented file (like chapter1-section1.txt) is
    equivalent to inserting ".. subdocument:: chapter1-section1.txt"
    at the end of chapter1.txt.

    Lists of files should be required to be directive content, not
    parameters, because file lists as parameters would be prone to
    uncaught user errors.  In this example, the indentation of
    "chapter1-section1.txt" would be stripped by the directive parser,
    which is contrary to what the user expects::

        .. subdocument:: chapter1.txt
               chapter1-section1.txt

* The sub-document files should each be processable stand-alone
  (without the other files), each forming a document on its own.

* What to do with docinfos in subdocuments:

  - Allow for section infos by generalizing the existing docinfo node.

  - Add an option to either strip or leave docinfos, specifiable as an
    option to the "subdocument" directive:

    Uniform handling::

        .. subdocument:: :leave-docinfo:

           chapter1.txt
           chapter2.txt
           chapter3.txt

    Non-uniform handling::

        .. subdocument:: chapter1.txt
        .. subdocument:: chapter2.txt
           :leave-docinfo:
        .. subdocument:: chapter3.txt

  - There is currently no way to have per-section "section infos" in
    reStructuredText files, as opposed to only file-wide docinfos.  So
    the only way to get section infos in a document is to use
    sub-documents.  This might be just fine though.

  - For a first implementation, just go the easy route and strip all
    docinfos in sub-documents.

* In order to facilitate multi-file output that parallels the input
  file structure, add "source" attributes to section nodes for
  sections that come from different input files.

* Restriction: Do not allow sub-documents without a top-level section,
  or with body elements in front of the first section.  IOW, the
  sub-document may only contain PreBibliographic elements, sections,
  and transitions.  The PreBibliographic elements in front of the
  first section get moved into the section.

  David says we shouldn't have this restriction -- for instance you
  might want to have an introductory paragraph in front of the first
  section of the first sub-document.  Lea doesn't mind the restriction
  and says you could use the "include" directive.  Since having the
  restriction makes the implementation somewhat easier, we agreed on
  having this restriction, waiting until the first user presents a
  good use case to remove it, and calling it a YAGNI until then.

* Silently drop header and footer in sub-documents.  (Document this in
  directives.txt though.)

* To do: Explore alternatives besides "subdocument" for the directive
  name.

  [DG] "subdoc" is a good alias.  "Subdocument" implies a single item;
  perhaps "manifest" instead, for multiple items?

* ``docutils.conf`` files in the sub-documents' respective directories
  are honored.

* You may want to read some insightful remarks by Joaquim Baptista on
  how `files should be expected to be part of different compound
  documents`__.

  .. _different compound documents:
  __ http://article.gmane.org/gmane.text.docutils.devel/4043
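
The nested file list proposed for the directive content can be turned
into a tree by tracking indentation.  The following is only a minimal
sketch under stated assumptions, not Docutils code: the function name
``parse_file_list`` and the ``(filename, children)`` tuple layout are
hypothetical, and tabs are not handled.

```python
def parse_file_list(content):
    """Turn an indented list of file names into a nested tree.

    Each node is a (filename, children) tuple.  A file indented more
    deeply than the file above it becomes that file's child, mirroring
    the rule that an indented entry is appended to its parent as a
    sub-document.  (Hypothetical helper, for illustration only.)
    """
    root = []
    # Stack of (indent, children_list); the -1 sentinel is the top level.
    stack = [(-1, root)]
    for line in content.splitlines():
        if not line.strip():
            continue  # blank lines carry no structure
        indent = len(line) - len(line.lstrip())
        # Leave every level that is at least as deep as this line.
        while stack[-1][0] >= indent:
            stack.pop()
        node = (line.strip(), [])
        stack[-1][1].append(node)
        stack.append((indent, node[1]))
    return root


manifest = """\
chapter1.txt
    chapter1-section1.txt
    chapter1-section2.txt
chapter2.txt
    chapter2-section1.txt
"""
tree = parse_file_list(manifest)
```

Here ``tree`` holds two top-level entries, each carrying its indented
files as children, which corresponds to the order in which the
sub-documents would be inserted.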


.. _xrefs:

Cross-References
================

.. note:: You may need to read the `reST spec`_ in order to understand
   the terminology (targets, references).  In this section, an
   "*external named reference*" means a reference whose target is
   outside of the current file, but within the current docset.

   .. _reST spec: http://docutils.sf.net/docs/ref/rst/restructuredtext.html

A major issue to think about is how to do **cross-references**
(colloquially known as **xrefs**) between files.  Things like
local references or substitutions should not be shared between files
(their definitions can simply be loaded using the "include"
directive).  However, sharing *external* targets and thereby allowing
cross-references between files is one of the major features of
an architecture that supports multiple input files.

(Implementation note: For this to work, we need to apply transforms to
sub-documents; basically, all transforms but the one resolving
external references should be applied.  This means that a new reader
instance must be created.  r5266 makes applying transforms to
sub-documents possible by pulling the responsibility for applying
transforms out of the Publisher.)

Issues arise once we think about how to group target names into
namespaces.  Unfortunately, simply putting all targets into a global,
document-wide namespace is bound to cause collisions; files that were
processable stand-alone are no longer processable when used in
conjunction with other files because they share common target names.

Since linking to targets outside the scope of the current sub-document
has significant disadvantages, we will need some form of qualifiers.


Namespace Identifiers
---------------------

This makes it necessary to add a notion of *namespace identifiers*.

.. sidebar:: Why headers are a bad idea

   One of the appealing features of reStructuredText, compared to
   LaTeX, is that creating a new file does not require writing a
   header.  Just type the title, some text, run rst2html, and you're
   done.  Writing a stand-alone LaTeX document on the other hand
   typically begins with declaring the \\documentclass, loading all
   the packages you need for your document, setting some options, and
   finally \\begin{document}.

   While it may not be possible to go *entirely* without any explicit
   markup, it is certainly a worthwhile goal to keep the amount of
   such markup to a minimum.

It is possible to always name the namespace of the current file (as it
is done in C++).  For instance, "``.. namespace:: frob``" at the
beginning of a file could declare that the namespace of the current
file is called "frob".  However, this is a little verbose as it adds a
line at the top of each file (see the sidebar).  Also, it removes the
reader's ability to easily look up the referenced files (you might not
know which file(s) declare the "frob" namespace).

On the other hand, namespace names could also be derived from paths
and file names.  (Note though that these two options need not be
mutually exclusive.)  Since using only the file name would cause
ambiguity, it is necessary to include its path in the namespace name.
For instance, the file ``docs/dev/todo.txt`` could be referenced by
the implicit namespace identifier ``docs/dev/todo.txt``; a reference
would look like ```<docs/dev/todo.txt> large documents`_``.  Using
paths relative to the current file makes it hard to move files or
document parts.  Therefore, we need to establish the notion of a
*docset root* which path names are relative to.

The docset root could be specified using a "docset-root" directive at
the top of each sub-document that uses external named references.  On
the other hand, perhaps we do not need to know the docset root until
we process the compound document (in which case it can be implicitly
derived from the location of the master file).  So let's hold off on
implementing a "docset-root" directive until the need arises.
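
As an illustration of path-derived namespace identifiers, here is a
minimal sketch that assumes POSIX-style paths.  The helper names
``namespace_for`` and ``docset_root_for`` and the ``/project`` paths
are invented for this example and are not part of Docutils.

```python
from pathlib import PurePosixPath


def namespace_for(file_path, docset_root):
    """Return the implicit namespace identifier of a file.

    The identifier is the file's path relative to the docset root.
    A ValueError is raised for files outside the root.
    (Hypothetical helper, for illustration only.)
    """
    relative = PurePosixPath(file_path).relative_to(PurePosixPath(docset_root))
    return relative.as_posix()


def docset_root_for(master_file):
    """Derive the docset root implicitly from the master file's location."""
    return PurePosixPath(master_file).parent.as_posix()


root = docset_root_for("/project/index.txt")
identifier = namespace_for("/project/docs/dev/todo.txt", root)
```

Because the identifier is computed relative to the root rather than to
the referencing file, moving the referencing file does not change it.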


Namespace Aliases
-----------------

A general disadvantage of using paths as namespace identifiers is that
changes in the directory structure cause a massive amount of changes
in the reStructuredText files, because all the paths need to be
updated.  This is not any worse than the current situation.  However,
to improve maintainability it would be desirable to make the namespace
of an often-referenced file known under a shorter name.  (The shorter
namespace identifier should only be valid within the file where it is
declared, and possibly its sub-documents.)  For instance, one could make
"docs/ref/restructuredtext.txt" known as "spec" using the following
syntax::

    .. namespaces::

       :spec: docs/ref/restructuredtext.txt

Namespace aliases can also be used to make one namespace refer to
different physical files depending on the super-document.  Namespace
definitions should therefore be inherited from super-documents to
sub-documents.  The "namespaces" directive overrides namespace
definitions inherited from super-documents, unless the *:inherit:*
option is specified.  (The :inherit: option thus makes it possible to
provide default paths for namespace aliases, which can still be
overridden in super-documents.)
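
The precedence rule can be sketched as a merge of two mappings.  This
is an assumption-laden illustration: the function
``effective_aliases``, its parameters, and the
``docs/ref/custom-spec.txt`` path are hypothetical.

```python
def effective_aliases(inherited, declared, inherit=False):
    """Combine aliases from super-documents with a file's own declarations.

    inherited -- alias-to-path mapping handed down by the super-documents
    declared  -- alias-to-path mapping from this file's "namespaces" directive
    inherit   -- True if the directive carries the :inherit: option
    (Hypothetical helper, for illustration only.)
    """
    if inherit:
        # Local declarations are only defaults; super-documents win.
        return {**declared, **inherited}
    # Local declarations override whatever was inherited.
    return {**inherited, **declared}


inherited = {"spec": "docs/ref/custom-spec.txt"}
declared = {"spec": "docs/ref/restructuredtext.txt",
            "todo": "docs/dev/todo.txt"}
overriding = effective_aliases(inherited, declared)
defaulting = effective_aliases(inherited, declared, inherit=True)
```

In both cases an alias that only one side defines, such as ``todo``
here, is passed through unchanged.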


Qualifier Syntax
----------------

Angled brackets::

    `<namespace> target`_

This is similar to the syntax for embedded URIs (```target
<URI>`_``).  It fits well into the existing syntax.
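
Splitting such a reference into its qualifier and target name could
look roughly like this.  It is a sketch only: the regular expression
and the name ``split_reference`` are assumptions, and the input is
taken to be the reference text with the surrounding backquotes and
trailing underscore already removed.

```python
import re

# Optional "<namespace>" prefix followed by whitespace, then the target name.
QUALIFIED_REFERENCE = re.compile(
    r"^(?:<(?P<namespace>[^<>]+)>\s+)?(?P<target>.+)$")


def split_reference(text):
    """Split reference text into (namespace, target).

    namespace is None for an unqualified, file-local reference.
    (Hypothetical helper, for illustration only.)
    """
    match = QUALIFIED_REFERENCE.match(text.strip())
    if match is None:
        raise ValueError("empty reference text")
    return match.group("namespace"), match.group("target")


qualified = split_reference("<docs/dev/todo.txt> large documents")
local = split_reference("large documents")
```

Text in the existing embedded-URI form, where the angle brackets come
last, does not match the prefix and is returned as an unqualified
target, so the two syntaxes stay distinguishable.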


Implementation
==============

In order to parse sub-documents, we need to create new parser
instances.

For now, we'll instantiate them by calling ``parser.__class__()``; in
the long run the reader, parser, and writer parameters of the Publisher
should be turned into classes (or callbacks) instead of instances.

The Parser must know about the Reader (or about the Publisher) and
calls ``Reader.parse_[sub]document`` in order to parse sub-documents.


Dumpster
========

You can stop reading now.  This section is only to archive sections
we're no longer interested in.


Rejected Proposal: Local and Global Namespace, no Qualifiers
------------------------------------------------------------

An obvious solution would be to add a notion of a file-local and a
global namespace.  When trying to resolve a reference, first the
target name is looked up in the local namespace of the current file;
if no suitable target is found there, the target name is searched for
document-wide, in the global namespace; if the target name exists and
is unique within the compound document, the reference can be resolved.
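
The two-step lookup of this rejected proposal can be sketched as
follows.  The data layout (a per-file mapping plus a document-wide
mapping to lists of candidates) and the name ``resolve`` are
assumptions made for illustration.

```python
def resolve(name, local_targets, global_targets):
    """Resolve a target name: local namespace first, then the global one.

    local_targets  -- mapping of target name to ID for the current file
    global_targets -- mapping of target name to a list of (file, ID)
                      pairs collected from the whole compound document
    Returns (file, ID), with file set to None for a local hit, or None
    if the name is unknown or not unique.
    (Hypothetical helper, for illustration only.)
    """
    if name in local_targets:
        return None, local_targets[name]
    candidates = global_targets.get(name, [])
    if len(candidates) == 1:
        return candidates[0]
    return None  # missing or ambiguous: cannot be resolved


local_targets = {"introduction": "id1"}
global_targets = {
    "large documents": [("docs/dev/todo.txt", "id7")],
    "introduction": [("a.txt", "id1"), ("b.txt", "id1")],
    "usage": [("a.txt", "id2"), ("b.txt", "id4")],
}
```

With these mappings, ``introduction`` resolves locally even though the
name is ambiguous document-wide, while ``usage`` cannot be resolved at
all.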

.. sidebar:: Why independent references are a good idea

   While the requirement that the compound document be processed in
   order to resolve external named references makes implementation
   easier, it is certainly worthwhile to provide for a means to
   resolve external named references without a re-run of the compound
   document for speed reasons.

   Since authoring can involve an edit-process-edit-process cycle, it
   should be possible to process files individually, rather than the
   compound document (which can be very slow).  Of course, as long as
   external named references are marked in the source file as such,
   they can, in a stand-alone pass, always be marked as "unresolvable"
   (e.g. in red) in the output, and only be resolved when the compound
   document is processed.  However, it would be even better to be able
   to actually resolve the references.

If references to the global namespace are not marked up as such,
however, the individual files are no longer processable stand-alone
because they contain unresolvable references.  While it may be
acceptable that external named cross-references do not (fully) work
any longer when a file is processed stand-alone, it would be nice to
be able to handle unresolved external references somehow (at least by
marking them as "unresolvable" in the output), rather than simply
throwing an error (see the sidebar).

This can be solved by marking external references as such, like this::

    `local target name`_
    `-> global target name`_

where "local target name" must be a unique target name within the
current file, and "global target name" must be a unique target name
within the current compound document.

(We would need to explicitly establish a notion of "stand-alone" vs.
"full document" processing in this case.  But since this proposal is
being rejected, I'm not going to explore this further.)


Drawbacks
~~~~~~~~~

This approach turns out to have a major drawback though: External
named references depend on the context of the containing
super-document.  However, as Joaquim `pointed out`__, files should be
expected to be part of several super-documents.  This means that once
a file is put into the context of a new document, its external named
references might point to non-existing or duplicate targets.  This
seems like a maintenance problem for complex (large) collections of
documentation.

__ `different compound documents`_

Another peculiarity of this system is that, as long as a file is
processed stand-alone, external named references are not associated
with the file that defines the target.  This brings the advantage that
renaming and moving files won't invalidate reference names.  On the
downside, it lacks clarity for the reader because the file containing
a target is often not inferable from the target name (try to guess
which file ```-> html4css1`_`` links to) -- this may be significant
since reStructuredText should be readable in its source form.


Importing Namespaces
--------------------

While namespaces should generally be available without explicitly
importing them (in order to avoid lengthy headers), it would probably
be handy to have a means of inserting all targets of another namespace
into the current one (similar to Python's "from module import \*").
The disadvantage is that it may cause confusion.

Contenders for the syntax::

    .. import:: namespace   (Pythonic)

    .. import-targets:: namespace   (more verbose)

    .. using:: namespace    (like C++)

Or, provided that we use "``.. namespace:: short-name <- namespace``",
and "```namespace -> target`_``" as reference syntax, this would be a
logical fit::

    .. namespace:: <- namespace
    `-> target`_                   (instead of `namespace -> target`_)

The advantage of this syntax is that we can prohibit importing more
than one namespace, which might cause confusion.  Importing only one
namespace might be a handy shortcut though.


Caching
-------

In order to be able to regenerate the whole compound document in a
timely manner after changing a single file, it is necessary to
implement a caching system.

Processing a document is done in the following steps:

1. For each file in the docset, parse it and turn the target names
   into file-local IDs (this includes error handling for duplicate
   target names).  Cache the parse tree, the name-to-ID mapping, and
   the list of all files included using the "include" directive.  Skip
   this step for files whose cache entry date-stamp is newer than the
   file's mtime and ctime, and all included files' mtimes and ctimes.

   This means that the "subdocument" directive must be resolved at
   transform time (and not at parse time), because otherwise we cannot
   store the doctree before the sub-document has been inserted.

2. For each file, run transforms, resolving external named references
   using the cached name-to-ID mappings of other files.

3. Write out the resulting document (currently a single file).  (The
   writer needs to turn namespace/ID pairs into output-file-local
   IDs.)

Processing a file stand-alone is done in the same way, except that
steps 1 and 2 are only performed for the file being processed, not for
each file in the docset.  If other files' cached name-to-ID mappings
are not up-to-date (when being accessed in step 2), they should be
automatically updated.
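
The freshness test from step 1 could be written along these lines.
This is a sketch under assumptions: the name ``is_cache_fresh`` and
its parameters are hypothetical, and the cache date-stamp is taken to
be a number of seconds since the epoch.

```python
import os


def is_cache_fresh(cache_stamp, source_path, included_paths=()):
    """Return True if a cache entry is newer than every file it depends on.

    Both mtime and ctime are checked for the source file and for each
    file it pulled in with the "include" directive.
    (Hypothetical helper, for illustration only.)
    """
    for path in (source_path, *included_paths):
        try:
            info = os.stat(path)
        except OSError:
            return False  # a missing dependency invalidates the entry
        if cache_stamp <= max(info.st_mtime, info.st_ctime):
            return False
    return True
```

Treating an unreadable file as stale errs on the side of re-parsing,
which then reports the missing file in the normal way.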

All cache entries should be stored in a docset cache file, in order to
avoid LaTeX-like creation of many junk files.  Possible names include
docutils.cache, docutils.aux, or either of them with a leading dot.  The
file is stored in the docset root and contains a header and a large
pickle string (reading and writing even large strings of pickled data
is reasonably fast).  In the header of the cache file, store
``sys.version`` and ``docutils.__version__``; discard cache files that
have the wrong version.
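
One possible layout for such a file is a two-line text header followed
by the pickle.  The sketch below is not a specification: the header
format, the function names, and the ``DOCUTILS_VERSION`` placeholder
(standing in for ``docutils.__version__``) are assumptions.

```python
import pickle
import sys

DOCUTILS_VERSION = "0.0-example"  # placeholder for docutils.__version__
PYTHON_VERSION = sys.version.replace("\n", " ")  # keep the header on one line


def dump_cache(entries, stream):
    """Write a two-line version header followed by the pickled entries."""
    stream.write(PYTHON_VERSION.encode("utf-8") + b"\n")
    stream.write(DOCUTILS_VERSION.encode("utf-8") + b"\n")
    stream.write(pickle.dumps(entries))


def load_cache(stream):
    """Return the cached entries, or None if the header names other versions."""
    python_version = stream.readline().rstrip(b"\n").decode("utf-8")
    docutils_version = stream.readline().rstrip(b"\n").decode("utf-8")
    if python_version != PYTHON_VERSION or docutils_version != DOCUTILS_VERSION:
        return None  # discard: written by a different Python or Docutils
    return pickle.loads(stream.read())
```

The header is compared before anything is unpickled, so a cache file
written by a different version is dropped without loading its payload.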

Potential security issue: Since unpickling is unsafe, an attacker
could provide a carefully crafted cache file, which is then
automatically picked up by Docutils.  Remedies: Insert some
unguessable system-specific key (generated randomly and stored in
~/.docutils.cache.key), and automatically discard cache files that
have the wrong key.  Or simply place a big warning in the
documentation not to accept cache files from strangers.
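
A variation on the key remedy, not the scheme as literally described
above, is to store a keyed hash (HMAC) of the payload instead of the
key itself, so that the key never appears in the cache file.  The
sketch uses Python's standard ``hmac``, ``hashlib``, and ``secrets``
modules; the function names and the tag-then-payload layout are
assumptions.

```python
import hashlib
import hmac
import pickle
import secrets

TAG_LENGTH = hashlib.sha256().digest_size  # 32 bytes for SHA-256


def new_key():
    """Generate an unguessable key, to be stored once per user."""
    return secrets.token_bytes(32)


def seal(entries, key):
    """Pickle entries and prefix the payload with its HMAC-SHA256 tag."""
    payload = pickle.dumps(entries)
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return tag + payload


def unseal(blob, key):
    """Return the entries, or None if the tag does not match the key."""
    tag, payload = blob[:TAG_LENGTH], blob[TAG_LENGTH:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        return None  # wrong key or modified file: discard without unpickling
    return pickle.loads(payload)
```

A cache file produced on another system, or altered after it was
written, fails the comparison and is never passed to ``pickle.loads``.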

No caching is done if no docset-root is defined (which means that the
file being processed is independent and not part of any compound
document).


Implementation
--------------

As described in section Caching_, when processing files stand-alone
and resolving their external named references, it may be necessary to
process (or re-process) referenced files.  Since this is during
transform-time, the parser instance is no longer available; it is
therefore necessary to create a new instance.

All requests for doctree and name-to-ID mappings should go through the
caching system.  In case of a miss, the caching system instantiates a
parser and (re-)parses the requested file.

In fact, all calls by the standalone reader to the reStructuredText
parser should go through the cache.  In the case of independent files
which are not part of a larger docset, the system always assumes a
cache miss.
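
Routing every parse request through one object might look like this.
It is a simplified sketch: ``DoctreeCache`` and its ``parse`` callable
are hypothetical, and the date-stamp checks described in the Caching
section are left out for brevity.

```python
class DoctreeCache:
    """Single entry point for all requests to parse a file."""

    def __init__(self, parse, docset_root=None):
        self._parse = parse              # callable: path -> parsed result
        self._docset_root = docset_root  # None for an independent file
        self._entries = {}

    def get(self, path):
        """Return the parsed result for path, parsing it on a miss."""
        # Independent files (no docset root) are always treated as a miss.
        if self._docset_root is not None and path in self._entries:
            return self._entries[path]
        result = self._parse(path)       # miss: (re-)parse the requested file
        if self._docset_root is not None:
            self._entries[path] = result
        return result


calls = []


def fake_parse(path):
    """Stand-in for instantiating a parser and parsing the file."""
    calls.append(path)
    return "doctree of " + path


docset_cache = DoctreeCache(fake_parse, docset_root="/project")
docset_cache.get("a.txt")
docset_cache.get("a.txt")   # served from the cache, no second parse

standalone = DoctreeCache(fake_parse)
standalone.get("b.txt")
standalone.get("b.txt")     # no docset root: parsed again
```

After these calls, ``calls`` records one parse of ``a.txt`` and two of
``b.txt``.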