Diffstat (limited to 'subversion/libsvn_fs_x/TODO')
-rw-r--r-- | subversion/libsvn_fs_x/TODO | 270 |
1 files changed, 270 insertions, 0 deletions
diff --git a/subversion/libsvn_fs_x/TODO b/subversion/libsvn_fs_x/TODO
new file mode 100644
index 0000000..4daf45b
--- /dev/null
+++ b/subversion/libsvn_fs_x/TODO
@@ -0,0 +1,270 @@
+
+TODO (see also DONE section below)
+==================================
+
+Internal API cleanup
+--------------------
+
+During refactoring, some functions had to be declared in header files
+to make them available to other fsfs code. We need to revisit those
+function definitions to turn them into a proper API that may be useful
+to other code (such as fsfs tools).
+
+
+Checksum all metadata elements
+------------------------------
+
+All elements of an FS-X repository shall be guarded by checksums. That
+includes indexes, noderevs etc. Larger data structures, such as index
+files, should have checksummed sub-elements such that corrupted parts
+may be identified and potentially repaired / circumvented in a meaningful
+way.
+
+Those checksums may be quite simple, such as Adler32, because that
+metadata can be cross-verified with other parts as well and acts only as a
+fallback to narrow down the affected parts.
+
+'svnadmin verify' shall check consistency based on those checksums.
+
+
+Port existing FSFS tools
+------------------------
+
+fsfs-stats, fsfsverify.py and possibly others should have equivalents
+in the FS-X world.
+
+
+Optimize data ordering during pack
+----------------------------------
+
+I/O optimized copy algorithms are yet to be implemented. The current
+code is relatively slow as it performs quasi-random I/O on the
+input stream.
+
+
+TxDelta v2
+----------
+
+Version 1 of txdelta turns out to be limited in its effectiveness for
+larger files when data gets inserted or removed. For typical office
+documents (zip files), deltification often becomes ineffective.
+
+Version 2 shall introduce the following changes:
+
+- increase the delta window from 100kB to 1MB
+- use a sliding window instead of a fixed-sized one
+- use a slightly more efficient instruction encoding
+
+When introducing it, we will make it an option at the txdelta interfaces
+(e.g. a format number). The version will be indicated in the 'SVN\x1' /
+'SVN\x2' stream header. While at it, (try to) fix the layering violations
+where those prefixes are being read or written.
+
+
+Large file storage
+------------------
+
+Even most source code repositories contain large, hard to compress,
+hard to deltify binaries. Reconstructing their content becomes very I/O
+intense and it "dilutes" the data in our pack files. The latter makes
+e.g. caching, prefetching and packing less efficient.
+
+Once a representation exceeds a certain configured threshold (16M default),
+the fulltext of that item will be stored in a separate file. This will
+be marked in the representation_t by an extra flag and future reps will
+not be deltified against it. From that location, the data can be forwarded
+directly via SendFile and the fulltext caches will not be used for it.
+
+Note that by making the decision contingent upon the size of the deltified
+and packed representation, all large data that benefit from these (i.e.
+have smaller increments) will still be stored within the rev and pack files.
+If a future representation is smaller than the threshold, it may be
+
+/* danielsh: so if we have a file which is 20MB over many revisions, it'll
+be stored in fulltext every single time unless the configured threshold is
+changed? Wondering if that's the best solution... */
+
+
+Sorted binary directory representations
+---------------------------------------
+
+Lookup of entries in a directory is a frequent operation when following
+cached paths. The represents directories as arrays sorted by entry name
+to allow for binary search during that lookup.
+However, all external
+representations use hashes and the conversion is expensive.
+
+FS-X shall store directory representations sorted by element names and
+use that array representation internally wherever appropriate. This
+will minimize the conversion overhead for long directories, especially
+during transaction building.
+
+Moreover, switch from the key/value representation to a slightly tighter
+and easier to process binary representation (validity is already guaranteed
+by checksums).
+
+
+Star-Deltification
+------------------
+
+Current implementation is incomplete. TODO: actually support & use base
+representations, optimize instruction table.
+
+Combine this with Txdelta 2 such that the corresponding windows from
+all representations get stored in a common star-delta container.
+
+
+Multiple pack stages
+--------------------
+
+FSFS only knows one packing level - the shard. For repositories with
+a large number of revisions, it may be more efficient to start with small
+packs (10-ish) and later pack them into larger and larger ones.
+
+
+Open fewer files when opening a repository
+------------------------------------------
+
+Opening a repository reads numerous files in db/ (besides several more in
+../conf): uuid, current, format, fs-type, fsfs.conf, min-unpacked-rev, ...
+
+Combine most of them into one or two files (e.g. uuid|format(|fs-type?),
+current|min-unpacked-revprop).
+
+
+Sharded transaction directories
+-------------------------------
+
+Transaction directories contain 3 OS files per FS file modified in the
+transaction. That doesn't scale well; find something better.
+
+
+DONE
+====
+
+Turn into separate FS
+---------------------
+
+Make FS-X a separate file system alongside BDB and FSFS. Rip out all
+FSFS compatibility code.
+
+
+Logical addressing
+------------------
+
+To allow for moving data structures around within the repository, we must
+replace the current absolute addressing using file offsets with a logical
+one.
+All references will now take the form of (revision, index) pairs and
+a replacement for the format 6 manifest files will map that to actual file
+offsets.
+
+Having the need to map revision-local offsets to pack-file global offsets
+today already gives us some localized address mapping code that simply
+needs to be replaced.
+
+
+Optimize data ordering during pack
+----------------------------------
+
+Replace today's simple concatenating shard packing process with one
+placing fragments (representations and noderevs) from various revisions
+close to each other if they are likely to be needed to serve the same request.
+
+We will optimize on a per-shard basis. The general strategy is
+
+* place all change lists at the beginning of the pack file
+  - strict revision order
+  - place newest ones first
+* place all file properties reps next
+  - place newer reps first
+* place all directory properties next
+  - place newer reps first
+* place all root nodes and root directories
+  - ordered newest rev -> oldest rev
+  - place rep delta chains 'en bloc'
+  - place root node in front of rep, if that rep has not already
+    been placed as part of a rep delta chain
+* place remaining content as follows:
+  - place node revs directly in front of their reps (where they have one)
+  - start with the latest root directory not yet placed
+  - recurse into sub-folders first, sorted by name
+  - per folder, place files in naming order
+  - place rep deltification chains in deltification order (new->old)
+* no fragments should be left but if they are, put them at the end
+
+
+Index pack files
+----------------
+
+In addition to the manifest we need for the (revision, index) -> offset
+mapping, we also introduce an offset -> (revision, index, type) index
+file. This will allow us to parse any data in a pack file without walking
+the DAG top down.
+
+
+Data prefetch
+-------------
+
+This builds on the previous.
+The idea is that whenever a cache lookup
+fails, we will not just read the single missing fragment but parse all
+data within the APR file buffer and put that into the cache.
+
+For maximum efficiency, we will align the data blocks being read to
+multiples of the block size and allow that buffer size to be configured
+(where supported by APR). The default block size will be raised to 64kB.
+
+
+Extend 'svnadmin verify'
+------------------------
+
+Format 7 provides many extra chances to verify contents and contains
+extra indexes that must be consistent with the pack / rev files. We
+must extend the tests to cover all that.
+
+
+Containers
+----------
+
+Extend the index format to support containers, i.e. map a logical item index
+to (file offset, sub-index) pairs. The whole container will be read and
+cached and the specific item later accessed from the whole structure.
+
+Use these containers for reps, noderevs and changes. Provide specific
+data container types for each of these item types; different item
+types cannot be put into the same container. Containers are binaries,
+i.e. there is no textual representation of their contents.
+
+This allows for significant space savings on disk due to deltification
+amongst e.g. revprops. More importantly, it reduces the size of the
+runtime data structures within the cache *and* reduces the number of
+cache entries (the cache can't handle items < 500 bytes very well).
+
+
+Packed change lists
+-------------------
+
+Change lists tend to be large, in some cases >20% of the repo. Due to the
+new ordering of pack data, the change lists can be the largest part of
+data to read for svn log. Use our standard compression method to save
+70 .. 80% of the disk space.
+
+Packing will only be applied to binary representations of change lists
+to keep the number of possible combinations low.
+
+
+Star-Deltification
+------------------
+
+Most node contents are smaller than 500k, i.e. less than a Txdelta 2 window.
+Those contents shall be aggregated into star-delta containers upon pack.
+This will save significant amounts of disk space, particularly in case
+of heavy branching. Also, the data extraction is independent of the
+number of deltas (i.e. delta chain length) within the same container.
+
+
+Support for arbitrary chars in path names
+-----------------------------------------
+
+FSFS's textual item representations break when path names contain
+newlines. FS-X revisions shall escape all control chars (e.g. < 0x20)
+in path names when using them in textual item representations.
+
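The last TODO item (escaping control chars in path names) can be illustrated with a minimal sketch. The TODO does not say which escape scheme FS-X uses, so everything here is an assumption made for illustration: the backslash as escape character, the "\xNN" two-digit hex form, and the function names escape_path / unescape_path are hypothetical and not taken from the Subversion code base. The sketch only shows that escaping every char below 0x20, plus the escape character itself, gives a newline-free text form that converts back unambiguously.

```python
# Hypothetical escape scheme for illustration only; the TODO does not
# specify the encoding FS-X actually uses.
ESCAPE_CHAR = "\\"


def escape_path(path: str) -> str:
    """Replace control chars (< 0x20) and the escape char with \\xNN."""
    out = []
    for ch in path:
        if ord(ch) < 0x20 or ch == ESCAPE_CHAR:
            # Two lowercase hex digits keep every escape at a fixed width.
            out.append(f"{ESCAPE_CHAR}x{ord(ch):02x}")
        else:
            out.append(ch)
    return "".join(out)


def unescape_path(text: str) -> str:
    """Invert escape_path; raise ValueError on a malformed escape."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == ESCAPE_CHAR:
            # An escape is exactly 4 chars: backslash, 'x', two hex digits.
            if len(text) < i + 4 or text[i + 1] != "x":
                raise ValueError(f"malformed escape at offset {i}")
            out.append(chr(int(text[i + 2:i + 4], 16)))
            i += 4
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```

Escaping the escape character itself is what makes the mapping reversible: a path that literally contains the four characters backslash, x, 0, a comes back unchanged rather than being turned into a newline.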