summaryrefslogtreecommitdiff
path: root/subversion/libsvn_fs_fs/structure-indexes
diff options
context:
space:
mode:
Diffstat (limited to 'subversion/libsvn_fs_fs/structure-indexes')
-rw-r--r--subversion/libsvn_fs_fs/structure-indexes352
1 files changed, 352 insertions, 0 deletions
diff --git a/subversion/libsvn_fs_fs/structure-indexes b/subversion/libsvn_fs_fs/structure-indexes
new file mode 100644
index 0000000..25490c7
--- /dev/null
+++ b/subversion/libsvn_fs_fs/structure-indexes
@@ -0,0 +1,352 @@
+This file describes the design, data model, and storage formats of FSFS
+index data.
+
+
+Design
+======
+
+Each pack and each rev file using logical addressing contains exactly
+two index sections. One, the log-to-phys index, maps the (rev, item_index)
+pairs to absolute file offsets. The other, phys-to-log, is a reverse
+index that gives basic information on any file location. This is enough
+to read and cache any data without traversing DAGs.
+
+Rev and pack files are immutable, so the same is true for index data.
+During a transaction or while packing a file, a proto index file gets
+written (actually, one log-to-phys and one phys-to-log). Its format is
+a simple concatenation of runtime structs and as such, an implementation
+detail subject to change. A proto index basically aggregates all the
+information that must later be transformed into the final index.
+
+
+General design concerns
+-----------------------
+
+In Subversion, there is no limit to the size of a revision; even practical
+limits are in the order of millions of changes at least. Index data for
+these would be multiple megabytes in size with pack file indexes possibly
+approaching 1 GB. To ensure we still get roughly O(1) access time, we
+need a hierarchical data structure.
+
+Therefore, the indexes will start with a header containing an array of
+references to sub-sections or pages. The length of these pages varies
+but is limited to a size configurable in fsfs.conf.
+
+Finally, it is assumed that whole pages can be cached efficiently and
+with a high cache hit rate. So, although a page may have a thousand or
+more entries, the access time is still amortized O(1) in many scenarios.
+
+
+Items and item types
+--------------------
+
+The index implementation treats item_index and item type as simple ints,
+except for SVN_FS_FS__ITEM_INDEX_UNUSED and SVN_FS_FS__ITEM_TYPE_UNUSED.
+Since they have been defined as 0, the code may test for "used" etc.
+by simply comparing with 0.
+
+See section "addressing modes" in structure to a list of item types
+and pre-defined item_index values.
+
+
+Encoding
+--------
+
+The final index data format is tuned for space and decoding efficiency.
+Indexes are stored as a sequence of variable integers. The encoding is
+as follows:
+
+* Unsigned integers are stored in little endian order with a variable
+ length 7b/8b encoding. If most significant bit a byte has been set,
+ the next byte has also belongs to the same value.
+
+ 0x00 .. 0x7f -> 0x00 .. 0x7f ( 7 bits stored in 8 bits)
+ 0x80 .. 0xff -> 0x80 0x01 .. 0xff 0x01 (14 bits stored in 16 bits)
+ 0x100 .. 0x3fff -> 0x80 0x02 .. 0xff 0x7f (14 bits stored in 16 bits)
+ 0x100000000 -> 0x80 0x80 0x80 0x80 0x10 (35 bits stored in 40 bits)
+
+ Technically, we can represent integers of arbitrary lengths. Currently,
+ we only generate and parse up to 64 bits.
+
+* Signed integers are mapped onto the unsigned value space as follows:
+
+ x >= 0 -> 2 * x
+ x < 0 -> -2 * x - 1
+
+ Again, we can represent arbitrary length numbers that way but the code
+ is currently restricted to 64 bits.
+
+Most data is unsigned by nature but will be stored differentially using
+signed integers.
+
+
+Encoding in proto-index files
+-----------------------------
+
+These have a much simpler encoding. Throughout the files, all records have
+the same length (but different between L2P and P2L). All records contain
+unsigned 64 bit integers only, stored in little endian byte order.
+
+
+Log-to-phys index
+=================
+
+This index has to map (rev, item_index) -> offset. It assumes that the
+item_index values per revision are dense and start at 0. There may be
+unused item_index values, though; the data structure simply gets less
+space-efficient when the more sparse the value space gets.
+
+
+Index data model
+----------------
+
+hierarchy:
+
+ header -> per-revision info -> page -> offset
+
+ There is one entry per revision in the header. Per revision there are
+ one or more pages (exclusive to that revision) containing up to a known,
+ fixed limit of entries (= page size). The total access path is:
+
+ pages = header->pages[revision];
+ offsets = page = pages[item_index / page_size];
+ offset = offsets[item_index % page_size];
+
+ Different log-to-phys indexes in the same repository may have different
+ page sizes but within any given index, the page size is the same and
+ immutable.
+
+header:
+
+ <first revision> ... first revision covered by this index
+ <revision count> ... number of revision covered by this index
+ <page size> ... maximum number of entries per page
+ <page table index> ... array, for each revision containing the index in
+ <page table> of the first page that belongs to
+ this revision. This has <revision count>+1
+ entries to terminate the last revision.
+ <page table> ... array of page headers. It has
+ <page table index>[<revision count>] entries.
+
+page table:
+
+ <offset> ... absolute position of the page contents within the
+ index
+ <entry count> ... number of offset entries in the page.
+ Must match <header>.<page size> unless this is
+ the last page for the respective revision.
+ <size> ... length in bytes of the on-disk page description.
+ Note that this is redundant with the <offset>.
+
+page:
+
+ <entry count> ... number of offset entries in the page.
+ Must match <header>.<page size> unless this is
+ the last page for the respective revision.
+ Redundant with <page table>.<entry count>
+ <offsets> ... array of absolute file positions within the rev /
+ pack file. This has <entry count> entries.
+
+
+Index on-disk format
+--------------------
+
+ index := "L2P-INDEX\n" header revisions pages offsets
+
+ header := u(<header>.<first revision>) \
+ u(<header>.<page size>) \
+ u(<header>.<revision count>) \
+ u(s(<header>.<page table>))
+
+ revisions := u( <header>.<page table index>[k+1]
+ - <header>.<page table index>[k]),
+ for k in 0 .. <header>.<revision count>-1
+
+ pages := u(<header>.<page table>[k].<size>) \
+ u(<header>.<page table>[k].<entry count>),
+ for k in 0 .. s(<header>.<page table>)-1
+
+ offsets := page(k),
+ for k in 0 .. s(<header>.<page table>)-1
+
+ page(k) := i(<header>.<page table>[k].<offsets>[0]) \
+ i( <header>.<page table>[k].<offsets>[l] \
+ - <header>.<page table>[k].<offsets>[l - 1]),
+ for l in 1 .. s(<header>.<page table>[k].<entry count>)-1
+
+ u(x) ... unsigned int x in 7b/8b encoding
+ i(x) ... signed int x in 7b/8b encoding
+ s(x) ... number of entries in array x
+
+
+Proto index file format
+-----------------------
+
+The index will be created from a "revision-less" proto index file
+containing (<offset><item_index>) pairs only.
+
+All mappings belonging to the same revision must be written in one go but
+there is no restriction on the order of those entries. To signify that
+a new revision begins, a (0, 0) mapping must be written. A (0, 0) entry
+at the beginning of the file is optional and will be ignored.
+
+ <bof> /* begin of proto index file for revision r and following */
+ (0, 0) /* mark start of revision r, optional for first rev */
+ (off, item)* /* zero to many mappings in random order */
+ (0, 0) /* mark start of revision r + 1 */
+ (off, item)* /* zero to many mappings in random order */
+ (0, 0) /* mark start of revision r + 2 */
+ (off, item)* /* zero to many mappings in random order */
+ ...
+ <eof> /* end of file. */
+
+All entries are pairs of 64 bit unsigned integers in little endian order.
+
+
+Phys-to-log index
+=================
+
+This index has to map offset -> (rev, item_index, type, len, checksum).
+
+
+Index data model
+----------------
+
+hierarchy:
+
+ header -> page -> item info
+
+ Logically, the index splits up index rev / pack file into pages of a
+ fixed size. That page size may differ from the FS's block size. The
+ index will describe each rev / pack file page with one index page.
+
+ page = header->pages[offset % page_size];
+ item info = binary_search(page.data, offset)
+
+ Note that the current implementation will not return any data if the
+ offset is does not match any actual item start. To simplify the lookup,
+ the last index page will have an "unused item" entry for the section
+ behind EOF. Holes aren't allowed as well, i.e. every byte of the rev /
+ pack is expected to be covered by the index.
+
+ Also, there may be items stretching across page borders or even over
+ multiple pages. The data model solves this issue by storing the item
+ descriptions as a "primary array" and then representing the pages as
+ ranges within that array. Thus multiple pages may reference the same
+ item description.
+
+header:
+
+ <first revision> ... first revision covered by this index
+ <file size> ... size of the rev / pack file in bytes
+ <page size> ... number of bytes in the rev / pack file covered by
+ each index page
+ <page count> ... number of pages
+ <offsets> ... array of page offsets, i.e. locations the page
+ data within the index.
+ This array has <page count> + 1 entries.
+
+page:
+
+ <entries> ... array of item descriptions, ordered by offset.
+ First and last entry may cross page boundaries.
+
+entry:
+
+ <offset> ... absolute position in the pack / rev file
+ <size> ... on-disk size of the item in the pack / rev file
+ <type> ... item type
+ <FNV checksum> ... modified 32 bit FNV-1a checksum of that section
+ of the pack / rev file (see below). 0 for empty
+ or zero-length items
+ <revision> ... revision that this item belongs to
+ <item_index> ... item_index within that revision
+
+
+Index on-disk format
+--------------------
+
+ index := "P2L-INDEX\n" header pages items
+
+ header := u(<header>.<first revision>) \
+ u(<header>.<file size>) \
+ u(<header>.<page size>) \
+ u(<header>.<page count>)
+
+ pages := u(<header>.<offsets>[k+1] - <header>.<offsets>[k]),
+ for k in 0 .. <header>.<page count> -1
+
+ items := u(<items in page k>[0].<offset>) \
+ u(<items in page k>[l].<size>) \
+ i(c(<items in page k>[l]) - c(<items of page k>[l-1])) \
+ i( <items in page k>[l].<revision>
+ - <items in page k>[l-1].<revision>), \
+ u(FNV checksum)
+ for l in 0 .. s(<items in page k>)-1,
+ for k in 0 .. <header>.<page count>-1
+
+ u(x) ... unsigned int x in 7b/8b encoding
+ i(x) ... signed int x in 7b/8b encoding
+ s(x) ... number of entries in collection x
+ c(x) ... compound function := x.<item_index> * 8 + x.<type>
+
+ Access to negative indexes gives a 0 value.
+
+ <Items in page k> are in strict ascending offset order. Items that
+ started after the begin of a given page and overlap with the next page
+ will not be stored in the start page. The runtime representation will
+ duplicate items overlapping page boundaries; the on-disk representation
+ will not. Thus, pages inside large items will have zero entries on disk.
+
+
+Proto index file format
+-----------------------
+
+The index will be created from a proto index file containing simple
+instances of svn_fs_fs__p2l_entry_t with the following element order:
+
+ item offset as uint64
+ item size as uint64
+ item type as uint64
+ modified FNV1a checksum as uint64
+ revision as uint64, with SVN_INVALID_REVNUM mapped to 0
+ and revisions >= 0 stored as rev+1
+ item index as uint64
+
+All values are stored in little endian order.
+
+Page table and header information, except start revision and page size,
+can easily be derived from that information.
+
+All entries must be written in strict offset order. Overlapping entries
+are not allowed; zero-length items are.
+
+In transactions, the final revision number may not be known when writing
+the proto index file (e.g. while still writing the proto rev file). Items
+with revision set to SVN_INVALID_REVNUM will therefore be automatically
+updated when creating the final index. This is possible in conjunction
+with rev files but not for pack files.
+
+
+FNV checksum
+------------
+
+FNV-1a can be found here: http://www.isthe.com/chongo/tech/comp/fnv/
+For performance reasons we use a modified version:
+
+* split the input byte stream [b0 .. bN] into 4 sub-streams of equal
+ length and up to 3 remnants:
+
+ [b0 b4 b8 ..], [b1 b5 b9 ..], [b2 b6 b10 ..], [b3 b7 b11 ..], [remnant]
+
+* calculate 32 bit FNV-1a checksums for the 4 substreams:
+
+ h0 = fnv_1a([b0 b4 b8 ..]), ..., h3 = fnv_1a([b3 b7 b11 ..])
+
+* combine the big endian representation of these checksums plus the
+ remnant of the original stream into a 12 to 15 byte long intermediate
+
+ [i0 .. iK], 12 <= K+1 <= 15
+
+* FNV checksum = fnv_1a([i0 .. iK]) in big endian representation
+