1 files changed, 272 insertions, 0 deletions
diff --git a/doc/developers/bundle-format4.txt b/doc/developers/bundle-format4.txt
new file mode 100644
index 0000000..0af1489
--- /dev/null
+++ b/doc/developers/bundle-format4.txt
@@ -0,0 +1,272 @@
+============================================
+Merge Directive format 2 and Bundle format 4
+============================================
+
+:Date: 2007-06-21
+
+Motivation
+----------
+Merge Directive format 2 represents a request to perform a certain merge.  It
+provides access to all the data necessary to perform that merge, by including
+a branch URL or a bundle payload.  It typically will include a preview of
+what applying the patch would do.
+
+Bundle Format 4 is designed to be a compact format for storing revision
+metadata that can be generated quickly and installed into a repository
+efficiently.  It is not intended to be human-readable.
+
+Note
+----
+These two formats, taken together, can be viewed as the successor of Bundle
+format 0.9, so their specifications are combined.  It is expected that in the
+future, bundle and merge-directive formats will vary independently.
+
+
+Bundle Format Name
+------------------
+This is the fourth bundle format to see public use.  Previous versions were
+0.7, 0.8, and 0.9.  Only 0.7's version number was aligned with a Bazaar
+release.
+
+
+Dependencies
+------------
+- Container format 1
+- Multiparent diffs
+- Bencode
+- Patch-RIO
+
+
+Description
+-----------
+Merge Directives fulfil the role previous bundle formats had of requesting a
+merge to be performed, but are a more flexible way of doing so.  With the
+introduction of these two formats, there is a clear split between "directive",
+which is a request to merge (and therefore signable), and "bundle", which is
+just data.
+
+Merge Directive format 2 may provide a patch preview of the change being
+requested.  If a preview is supplied, the receiving client will verify that
+the actual change matches the preview.
+
+Merge Directive format 2 also includes a testament hash, to ensure that if a
+branch is used, the branch cannot be subverted to cause the wrong changes to be
+applied.
+
+Bundle format 4 is designed to trade human-readability for speed and
+compactness.  It does not contain a human-readable "prelude" patch.
+
+Merge Directive 2 Contents
+--------------------------
+This format consists of three sections, in the following order.
+
+
+Patch-RIO command section
+~~~~~~~~~~~~~~~~~~~~~~~~~
+This section is identical to the corresponding section in Format 1 merge
+directives, except as noted below.  It is mandatory.  It is terminated by a
+line reading ``#`` that is not preceded by a line ending with ``\``.
+
+In order to support cherry-picking and patch comparison, this format adds a new
+piece of information, the ``base_revision_id``.  This is a suggested base
+revision for merging.  It may be supplied by the user.  If not, it is
+calculated using the standard merge base algorithm, with the ``revision_id``
+and target branch's ``last_revision`` as its inputs.
+
+When merging, clients should use the ``base_revision_id`` when it is not
+already present in the ancestry of the ``last_revision`` of the target branch.
+If it is already present, clients should calculate a merge base in the normal
+way.
+
+
+Patch preview section
+~~~~~~~~~~~~~~~~~~~~~
+This section is optional.  It begins with the line ``# Begin patch``.  It is
+terminated by the end-of-file or by the beginning of a bundle section.
+
+Its contents are a unified diff, as per the ``bzr diff`` command.  The FROM
+revision is the ``base_revision_id`` specified in the Patch-RIO section.
+
+
+Bundle section
+~~~~~~~~~~~~~~
+This section is optional, but if it is not supplied, a ``source_branch`` must be
+supplied.  It begins with the line ``# Begin bundle``, and is terminated by the
+end-of-file.
+
+The contents are a base-64 encoded bundle.  This may be any bundle format, but
+formats 4+ are strongly recommended.  The base revision is the newest revision
+in the source branch that is an ancestor of all revisions not present in the
+target that are ancestors of ``revision_id``.
+
+This base revision may or may not be the same as the ``base_revision_id``.  In
+particular, the ``base_revision_id`` may specify a cherry-pick, but all the
+ancestors of the ``base_revision_id`` should be installed in the target
+repository before performing such a merge.
+
+
+Bundle 4 Contents
+-----------------
+Bazaar revision bundles begin with a format marker that reads
+``# Bazaar revision bundle v4`` in plaintext.  The remainder of the file is a
+``Bazaar pack format 1`` container.  The container is compressed using bzip2.
+
+Putting the format marker in plaintext ensures that old clients will give good
+diagnostics, but renders the file unreadable by standard bzip2 utilities.
+
+Serialization
+~~~~~~~~~~~~~
+Format 4 records revision and inventory records in their repository
+serialization format.  This minimizes translation and compression costs
+in the common case, where the sender and receiver use the same serialization
+format for their repository. Steps have been taken to ensure a faithful
+conversion when serialization formats are mismatched.
+
+
+Bundle Records
+~~~~~~~~~~~~~~
+The bundle format creates a single bundle-level record out of two container
+records.  The first container record contains metainfo as a Bencoded dict.  The
+second container record contains the body.
+
+The bundle record name is associated with the metainfo record.  The body record
+is anonymous.
+
+
+Record metainfo
+~~~~~~~~~~~~~~~
+
+:record_kind: The storage strategy of the record.  May be ``fulltext`` (the
+    record body contains the full text of the value), ``mpdiff`` (the record
+    body contains a multi-parent diff of the value), or ``header`` (no record
+    body).
+:parents: Used in fulltext and mpdiff records.  The revisions that should be
+    noted as parents of this revision in the repository.  For mpdiffs, this is
+    also the list of build-parents.
+:sha1: Used in mpdiff records.  The sha-1 hash of the full-text value.
+
+
+Bundle record naming
+~~~~~~~~~~~~~~~~~~~~~
+All bundle records have a single name, which is associated with the metainfo
+container record.  Records are named according to the body's content-kind,
+revision-id, and file-id.
+
+Content-kind may be one of:
+
+:file: a version of a user file
+:inventory: the tree inventory
+:revision: the revision metadata for a revision
+:signature: the revision signature for a revision
+
+Names are constructed like so: ``content-kind/revision-id/file-id``.  Values
+are interpreted left-to-right, so if two values are present, they are
+content-kind and revision-id.
+A record has a file-id if-and-only-if it is a file record.
+Info records have no revision or file-id.
+Inventory, revision and signature all have content-kind and revision-id, but
+no file-id.
+
+Layout
+~~~~~~
+The first record is an info/header record.
+
+The subsequent records are mpdiff file records.  They are ordered first by file
+id, then in topological order by revision-id.
+
+The next records are mpdiff inventory records.  They are topologically sorted.
+
+The next records are revision and signature fulltexts.  They are interleaved
+and topologically sorted.
+
+Info record
+~~~~~~~~~~~
+The info record has type ``header``.  It has no revision_id or file_id.
+Its metadata contains:
+
+:serializer: A string describing the serialization format used for inventory
+    and revision data.  May be ``xml5``, ``xml6`` or ``xml7``.
+:supports_rich_root: 1 if the source repository supports rich roots,
+    0 otherwise.
+
+
+Implementation notes
+~~~~~~~~~~~~~~~~~~~~
+- knit deltas contain almost enough information to extract the original
+  SequenceMatcher.get_matching_blocks() call used to produce them.  Combining
+  that information with the relevant fulltexts allows us to avoid performing
+  sequence matching on any fulltexts for which we have deltas.
+
+- MultiParent deltas contain ``get_matching_blocks`` output almost verbatim,
+  but if there is more than one parent, the information about the leftmost
+  parent may be incomplete.  However, for single-parent multiparent diffs, we
+  can extract the ``SequenceMatcher.get_matching_blocks`` output, and therefore
+  the ``SequenceMatcher.get_opcodes`` output used to create knit deltas.
+
+
+Installing data across serialization mismatches
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+In practice, there cannot be revision serialization mismatches, because the
+serialization of revisions has been consistent in serializations 5-7.
+
+If there is a mismatch in inventory serialization formats, the receiver can:
+
+  1. extract the inventory objects for the parents
+  2. serialize them using the bundle serializer
+  3. apply the mpdiff
+  4. calculate the fulltext sha1
+  5. compare the calculated sha1 to the expected sha1
+  6. deserialize using the bundle serializer
+  7. serialize using the repository serializer
+  8. add to the repository
+
+This is much slower, of course.  But since the fulltext is verified
+at step 5, it should be just as safe as any other conversion.
+
+Model differences
+~~~~~~~~~~~~~~~~~
+
+Note that there may be model differences requiring additional changes.  These
+differences are described by the "supports_rich_root" value in the info record.
+
+A subset of xml6 and xml7 records are compatible with xml5 (i.e. those that
+were converted from xml5 originally).
+
+When installing from a bundle whose serializer supports tree references to a
+repository that does not support tree references, clients should halt if they
+encounter a record containing a tree reference.
+
+When installing from a supports_rich_root bundle to a repository that does not
+support rich roots, clients should halt if they encounter an inventory record
+whose root directory revision-id does not match the inventory revision id.
+
+When installing from a bundle that does not support rich roots to a repository
+that does, additional knits should be added for the root directory, with a
+revision for each inventory revision.
+
+Validating preview patches
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+When applying a merge directive that includes a preview, clients should
+verify that the preview matches the changes requested by the merge directive.
+
+In order to do this, the client should generate a diff from the
+``base_revision_id`` to the ``revision_id``.  This diff should be compared
+against the preview patch, making allowances for the fact that whitespace
+munging may have occurred.
+
+One form of whitespace munging that has been observed is line-ending
+conversion.  Certain mail clients such as Evolution do not respect the
+line-endings of text attachments.  Since line-ending conversion is unlikely to
+alter the meaning of a patch, it seems safe to ignore line endings when
+comparing the preview patch.
+
+Another form of whitespace munging that has been observed is
+trailing-whitespace stripping.  Again, it seems unlikely that stripping
+trailing whitespace could alter the meaning of a patch.  Such a distinction
+is also invisible to readers, so ignoring it does not create a new threat.  So
+it seems reasonable to ignore trailing whitespace when comparing the patches.
+
+Other mungings are possible, but it is recommended not to implement support
+for them until they have been observed.  Each of these changes makes the
+comparison more approximate, and the more approximate it becomes, the easier it
+is to provide a preview patch that does not match the requested changes.