summaryrefslogtreecommitdiff
Commit message (Collapse)AuthorAgeFilesLines
* Merge branch 'lt/gitlink'Junio C Hamano2007-04-2118-120/+639
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * lt/gitlink: Tests for core subproject support Expose subprojects as special files to "git diff" machinery Fix some "git ls-files -o" fallout from gitlinks Teach "git-read-tree -u" to check out submodules as a directory Teach git list-objects logic to not follow gitlinks Fix gitlink index entry filesystem matching Teach "git-read-tree -u" to check out submodules as a directory Teach git list-objects logic not to follow gitlinks Don't show gitlink directories when we want "other" files Teach git-update-index about gitlinks Teach directory traversal about subprojects Fix thinko in subproject entry sorting Teach core object handling functions about gitlinks Teach "fsck" not to follow subproject links Add "S_IFDIRLNK" file mode infrastructure for git links Add 'resolve_gitlink_ref()' helper function Avoid overflowing name buffer in deep directory structures diff-lib: use ce_mode_from_stat() rather than messing with modes manually
| * Tests for core subproject supportAlex Riesen2007-04-181-0/+85
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The following tests available: - create subprojects: create a directory in the superproject, initialize a git repo in it, and try adding it in super project. Make a commit in superproject - check if fsck ignores the subprojects: it just should give no errors - check if commit in a subproject detected: make a commit in subproject, git-diff-files in superproject should detect it - check if a changed subproject HEAD can be committed: try "git-commit -a" in superproject. It should commit changed HEAD of a subproject - check if diff-index works for subproject elements: compare the index (changed by previuos tests) with the initial commit (which created two subprojects). Should show a change for the recently changed subproject - check if diff-tree works for subproject elements: do the same, just use git-diff-tree. This test is somewhat redundant, I just added it for completeness (diff, diff-files, and diff-index are already used) - check if git diff works for subproject elements: try to limit the diff for the name of a subproject in superproject: git diff HEAD^ HEAD -- subproject - check if clone works: try a clone of superproject and compare "git ls-files -s" output in superproject and cloned repo - removing and adding subproject: rename test. Currently implemented as "git-update-index --force-remove", "mv" and "git-add". - checkout in superproject: try to checkout the initial commit Signed-off-by: Alex Riesen <raa.lkml@gmail.com> Signed-off-by: Junio C Hamano <junkio@cox.net>
| * Expose subprojects as special files to "git diff" machineryLinus Torvalds2007-04-151-0/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The same way we generate diffs on symlinks as the the diff of text of the symlink, we can generate subproject diffs (when not recursing into them!) as the diff of the text that describes the subproject. Of course, since what descibes a subproject is just the SHA1, that's what we'll use. Add some pretty-printing to make it a bit more obvious what is going on, and we're done. So with this, we can get both raw diffs and "textual" diffs of subproject changes: - git diff --raw: :160000 160000 2de597b5ad348b7db04bd10cdd38cd81cbc93ab5 0000000... M sub-A - git diff: diff --git a/sub-A b/sub-A index 2de597b..e8f11a4 160000 --- a/sub-A +++ b/sub-A @@ -1 +1 @@ -Subproject commit 2de597b5ad348b7db04bd10cdd38cd81cbc93ab5 +Subproject commit e8f11a45c5c6b9e2fec6d136d3fb5aff75393d42 NOTE! We'll also want to have the ability to recurse into the subproject and actually diff it recursively, but that will involve a new command line option (I'd suggest "--subproject" and "-S", but the latter is in use by pickaxe), and some very different code. But regardless of ay future recursive behaviour, we need the non-recursive version too (and it should be the default, at least in the absense of config options, so that large superprojects don't default to something extremely expensive). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
| * Fix some "git ls-files -o" fallout from gitlinksLinus Torvalds2007-04-141-7/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since "git ls-files" doesn't really pass down any details on what it really wants done to the directory walking code, the directory walking code doesn't really know whether the caller wants to know about gitlink directories, or whether it wants to just know about ignored files. So the directory walking code will return those gitlink directories unless the caller has explicitly told it not to ("dir->show_other_directories" tells the directory walker to only show "other" directories). This kind of confuses "git ls-files -o", because - it didn't really expect to see entries listed that were already in the index, unless they were unmerged, and would die on that unexpected setup, rather than just "continue". - it didn't know how to match directory entries with the final "/" This trivial change updates the "show_other_files()" function to handle both of these issues gracefully. There really was no reason to die, when the obviously correct thing for the function was to just ignore files it already knew about (that's what "other" means here!). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
| * Replace a pair of patches with updated ones for subproject support.Junio C Hamano2007-04-140-0/+0
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This series of three patches is a *replacement* for the patch series of two patches (plus one-liner fixup) I sent yesterday. It fixes the issue I noted with "git status" incorrectly claiming that a non-checked out subproject wasn't clean - that was just a total thinko in the code (we were checking the filesystem mode against S_IFDIRLNK, which obviously cannot work, since S_IFDIRLINK is a git-internal state, not a filesystem state). It then re-sends the two patches on top of that, with the fix for checking out superprojects (we should *not* mess up any existing subproject directories, certainly not remove them - if we already have a directory in the place where we now want a subproject, we should leave it well alone!) The first one really is a fix, and it makes the commit commentary about a remaining bug in the patch I sent out yesterday go away.
| | * Teach "git-read-tree -u" to check out submodules as a directoryLinus Torvalds2007-04-121-13/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This actually allows us to check out a supermodule after cloning, although the submodules will obviously not be checked out, and will just be an empty subdirectory. [ Side note: this also shows that we currently don't correctly handle such subprojects that aren't checked out correctly yet. They should always show up as not being modified, but failing to resolve the gitlink HEAD does not properly trigger the "not modified" logic in all places it needs to.. So more work to be done, but that's a separate issue, unrelated to the action of checking out the superproject. ] The bulk of this patch is simply because we need to check the type of the index entry *before* we try to read the object it points to, and that meant that the code needed some re-organization. So I moved some of the code in common to both symlinks and files to be a trivial helper function. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
| | * Teach git list-objects logic not to follow gitlinksLinus Torvalds2007-04-121-0/+34
| | | | | | | | | | | | | | | | | | | | | | | | This allows us to pack superprojects and thus clone them (but not yet check them out on the receiving side - that's the next patch) Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
| * | Teach "git-read-tree -u" to check out submodules as a directoryLinus Torvalds2007-04-141-13/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This actually allows us to check out a supermodule after cloning, although the submodules themselves will obviously not be checked out, and will just be empty directories. Checking out the submodules will be up to higher levels - we may not even want to! Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
| * | Teach git list-objects logic to not follow gitlinksLinus Torvalds2007-04-141-0/+34
| | | | | | | | | | | | | | | | | | | | | | | | This allows us to pack superprojects and thus clone them (but not yet check them out on the receiving side.. That's the next patch) Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
| * | Fix gitlink index entry filesystem matchingLinus Torvalds2007-04-141-4/+4
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The code to match up index entries with the filesystem was stupidly broken. We shouldn't compare the filesystem stat() information with S_IFDIRLNK, since that's purely a git-internal value, and not what the filesystem uses (on the filesystem, it's just a regular directory). Also, don't bother to make the stat() time comparisons etc for DIRLNK entries in ce_match_stat_basic(), since we do an exact match for these things, and the hints in the stat data simply doesn't matter. This fixes "git status" with submodules that haven't been checked out in the supermodule. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
| * Don't show gitlink directories when we want "other" filesLinus Torvalds2007-04-121-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | When "show_other_directories" is set, that implies that we are looking for untracked files, which obviously means that we should ignore directories that are marked as gitlinks in the index. This fixes "git status" in a superproject, that would otherwise always report that subprojects were "Untracked files:" Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
| * Teach git-update-index about gitlinksLinus Torvalds2007-04-121-62/+138
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I finally got around to looking at Alex' patch to teach update-index about gitlinks too, so that "git commit -a" along with any other explicit update-index scripts can work. I don't think there was anything wrong with Alex' patch, but the code he patched I felt was just so ugly that the added cases just pushed it over the edge. Especially as I don't think that patch necessarily did the right thing for a gitlink entry that already existed in the index, but that wasn't actually a real git repository in the working tree (just an empty subdirectory or a non-git snapshot because it hadn't wanted to track that particular subproject). So I ended up deciding to clean up the git-update-index handling the same way I tackled the directory traversal used by git-add earlier: by splitting the different cases up into multiple smaller functions, and just making the code easier to read (and adding more comments about the different cases). So this replaces the old "process_file()" with a new "process_path()" function that then just calls out to different helper functions depending on what kind of path it is. Processing a nondirectory ends up being just one of the simpler cases. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
| * Teach directory traversal about subprojectsLinus Torvalds2007-04-113-19/+121
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is the promised cleaned-up version of teaching directory traversal (ie the "read_directory()" logic) about subprojects. That makes "git add" understand to add/update subprojects. It now knows to look at the index file to see if a directory is marked as a subproject, and use that as information as whether it should be recursed into or not. It also generally cleans up the handling of directory entries when traversing the working tree, by splitting up the decision-making process into small functions of their own, and adding a fair number of comments. Finally, it teaches "add_file_to_cache()" that directory names can have slashes at the end, since the directory traversal adds them to make the difference between a file and a directory clear (it always did that, but my previous too-ugly-to-apply subproject patch had a totally different path for subproject directories and avoided the slash for that case). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
| * Fix thinko in subproject entry sortingLinus Torvalds2007-04-111-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This fixes a total thinko in my original series: subprojects do *not* sort like directories, because the index is sorted purely by full pathname, and since a subproject shows up in the index as a normal NUL-terminated string, it never has the issues with sorting with the '/' at the end. So if you have a subproject "proj" and a file "proj.c", the subproject sorts alphabetically before the file in the index (and must thus also sort that way in a tree object, since trees sort as the index). In contrast, it you have two files "proj/file" and "proj.c", the "proj.c" will sort alphabetically before "proj/file" in the index. The index itself, of course, does not actually contain an entry "proj/", but in the *tree* that gets written out, the tree entry "proj" will sort after the file entry "proj.c", which is the only real magic sorting rule. In other words: the magic sorting rule only affects tree entries, and *only* affects tree entries that point to other trees (ie are of the type S_IFDIR). Anyway, that thinko just means that we should remove the special case to make S_ISDIRLNK entries sort like S_ISDIR entries. They don't. They sort like normal files. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
| * Teach core object handling functions about gitlinksLinus Torvalds2007-04-104-6/+54
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This teaches the really fundamental core SHA1 object handling routines about gitlinks. We can compare trees with gitlinks in them (although we can not actually generate patches for them yet - just raw git diffs), and they show up as commits in "git ls-tree". We also know to compare gitlinks as if they were directories (ie the normal "sort as trees" rules apply). [jc: amended a cut&paste error] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
| * Teach "fsck" not to follow subproject linksLinus Torvalds2007-04-102-2/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | Since the subprojects don't necessarily even exist in the current tree, much less in the current git repository (they are totally independent repositories), we do not want to try to follow the chain from one git repository to another through a gitlink. This involves teaching fsck to ignore references to gitlink objects from a tree and from the current index. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
| * Add "S_IFDIRLNK" file mode infrastructure for git linksLinus Torvalds2007-04-101-1/+19
| | | | | | | | | | | | | | | | | | | | | | | | This just adds the basic helper functions to recognize and work with git tree entries that are links to other git repositories ("subprojects"). They still aren't actually connected up to any of the code-paths, but now all the infrastructure is in place. The next commit will start actually adding actual subproject support. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
| * Add 'resolve_gitlink_ref()' helper functionLinus Torvalds2007-04-102-0/+83
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This new function resolves a ref in *another* git repository. It's named for its intended use: to look up the git link to a subproject. It's not actually wired up to anything yet, but we're getting closer to having fundamental plumbing support for "links" from one git directory to another, which is the basis of subproject support. [jc: amended a FILE* leak] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
| * Avoid overflowing name buffer in deep directory structuresLinus Torvalds2007-04-091-0/+3
| | | | | | | | | | | | | | | | | | This just makes sure that when we do a read_directory(), we check that the filename fits in the buffer we allocated (with a bit of slop) Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
| * diff-lib: use ce_mode_from_stat() rather than messing with modes manuallyLinus Torvalds2007-04-091-12/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The diff helpers used to do the magic mode canonicalization and all the other special mode handling by hand ("trust executable bit" and "has symlink support" handling). That's bogus. Use "ce_mode_from_stat()" that does this all for us. This is also going to be required when we add support for links to other git repositories. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
* | Merge branch 'np/pack'Junio C Hamano2007-04-2121-498/+1010
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * np/pack: (27 commits) document --index-version for index-pack and pack-objects pack-objects: remove obsolete comments pack-objects: better check_object() performances add get_size_from_delta() pack-objects: make in_pack_header_size a variable of its own pack-objects: get rid of create_final_object_list() pack-objects: get rid of reuse_cached_pack pack-objects: clean up list sorting pack-objects: rework check_delta_limit usage pack-objects: equal objects in size should delta against newer objects pack-objects: optimize preferred base handling a bit clean up add_object_entry() tests for various pack index features use test-genrandom in tests instead of /dev/urandom simple random data generator for tests validate reused pack data with CRC when possible allow forcing index v2 and 64-bit offset treshold pack-redundant.c: learn about index v2 show-index.c: learn about index v2 sha1_file.c: learn about index version 2 ...
| * | document --index-version for index-pack and pack-objectsNicolas Pitre2007-04-192-0/+10
| | | | | | | | | | | | | | | Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
| * | pack-objects: remove obsolete commentsNicolas Pitre2007-04-191-11/+3
| | | | | | | | | | | | | | | | | | | | | The sorted-by-sha ans sorted-by-type arrays are no more. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
| * | pack-objects: better check_object() performancesNicolas Pitre2007-04-161-80/+126
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With large amount of objects, check_object() is really trashing the pack sliding map and the filesystem cache. It has a completely random access pattern especially with old objects where delta replay jumps back and forth all over the pack. This patch improves things by: 1) sorting objects by their offset in pack before calling check_object() so the pack access pattern is linear; 2) recording the object type at add_object_entry() time since it is already known in most cases; 3) recording the pack offset even for preferred_base objects; 4) avoid calling sha1_object_info() if all possible. This limits pack accesses to the bare minimum and makes them perfectly linear. In the process check_object() was made more clear (to me at least). Note: I thought about walking the sorted_by_offset list backward in get_object_details() so if a pack happens to be larger than the available file cache, then the cache would have been populated with useful data from the beginning of the pack already when find_deltas() is called. Strangely, testing (on Linux) showed absolutely no performance difference. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
| * | add get_size_from_delta()Nicolas Pitre2007-04-162-34/+40
| | | | | | | | | | | | | | | | | | | | | | | | ... which consists of existing code split out of packed_delta_info() for other callers to use it as well. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
| * | pack-objects: make in_pack_header_size a variable of its ownNicolas Pitre2007-04-161-14/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It currently aliases delta_size on the principle that reused deltas won't go through the whole delta matching loop hence delta_size was unused. This is not true if given delta doesn't find its base in the pack though. But we need that information even for whole object data reuse. Well in short the current state looks awful and is prone to bugs. It just works fine now because try_delta() tests trg_entry->delta before using trg_entry->delta_size, but that is a bit subtle and I was wondering for a while why things just worked fine... even if I'm guilty of having introduced this abomination myself in the first place. Let's do the sensible thing instead with no ambiguity, which is to have a separate variable for in_pack_header_size. This might even help future optimizations. While at it, let's reorder some struct object_entry members so they all align well with their own width, regardless of the architecture or the size of off_t. Some memory saving is to be expected with this alone. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
| * | pack-objects: get rid of create_final_object_list()Nicolas Pitre2007-04-161-55/+72
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Because we don't have to know the SHA1 h(hence the name) of the pack up front anymore, let's get rid of yet another global sorted object list and sort them only in write_index_file(), then compute the object list SHA1 on the fly. This has the advantage of saving another chunk of memory, and the sorted list SHA1 won't be computed needlessly on servers during a fetch. Of course the cunning plan is also to make write_index_file() much like the function with the same name in index-pack.c for an eventual easy sharing. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
| * | pack-objects: get rid of reuse_cached_packNicolas Pitre2007-04-161-72/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This capability is practically never useful, and therefore never tested, because it is fairly unlikely that the requested pack will be already available. Furthermore it is of little gain over the ability to reuse existing pack data. In fact the ability to change delta type on the fly when reusing delta data is a nice thing that has almost no cost and allows greater backward compatibility with a client's capabilities than if the client is blindly sent a whole pack without any discrimination. And this "feature" is simply in the way of other cleanups. Let's get rid of it. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
| * | pack-objects: clean up list sortingNicolas Pitre2007-04-161-31/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Get rid of sort_comparator() as it impose a run time double indirect function call for little compile time type checking gain. Also get rid of create_sorted_list() as it only has one user which would as well be just fine doing its sorting locally. Eventually the list of deltifiable objects might be shorter than the whole object list. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
| * | pack-objects: rework check_delta_limit usageNicolas Pitre2007-04-161-44/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Objects that have delta "children" from pack data reuse must consider the depth of their deepest child when they try to deltify themselves for those children not to become too deep. However, in the context of a "thin" pack, the delta children depth was skipped entirely on the presumption that the pack was always going to be exploded on the receiving end, hence the delta length wasn't an issue. Now that we keep received packs as is and reuse pack data when repacking, those packs do contain delta chains that are longer than expected. Worse, those delta chain may even grow longer when the pack is further repacked into another thin pack for a subsequent transmission. So this patch restores strict delta length even for thin packs, and it moves check_delta_limit() usage directly in the delta loop where it is needed. This way the delta_limit can be removed from struct object_entry as well. Oh and the initial value was wrong too. The progress_interval() function was moved to a more logical location in the process. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
| * | pack-objects: equal objects in size should delta against newer objectsNicolas Pitre2007-04-161-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Before finding best delta combinations, we sort objects by name hash, then by size, then by their position in memory. Then we walk the list backwards to test delta candidates. We hope that a bigger size usually means a newer objects. But a bigger address in memory does not mean a newer object. So the last comparison must be reversed. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
| * | pack-objects: optimize preferred base handling a bitNicolas Pitre2007-04-161-15/+12
| | | | | | | | | | | | | | | | | | | | | | | | Let's avoid some cycles when there is no base to test against, and avoid unnecessary object lookups. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
| * | clean up add_object_entry()Nicolas Pitre2007-04-111-27/+25
| | | | | | | | | | | | | | | | | | | | | | | | This function used to call locate_object_entry_hash() _twice_ per added object while only once should suffice. Let's reorganize that code a bit. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
| * | tests for various pack index featuresNicolas Pitre2007-04-111-0/+146
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is a fairly complete list of tests for various aspects of pack index versions 1 and 2. Tests on index v2 include 32-bit and 64-bit offsets, as well as a nice demonstration of the flawed repacking integrity checks that index version 2 intend to solve over index version 1 with the per object CRC. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
| * | use test-genrandom in tests instead of /dev/urandomNicolas Pitre2007-04-111-1/+1
| | | | | | | | | | | | | | | | | | This way tests are completely deterministic and possibly more portable. Signed-off-by: Nicolas Pitre <nico@cam.org>
| * | simple random data generator for testsNicolas Pitre2007-04-113-2/+40
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Reliance on /dev/urandom produces test vectors that are, well, random. This can cause problems impossible to track down when the data is different from one test invokation to another. The goal is not to have random data to test, but rather to have a convenient way to create sets of large files with non compressible and non deltifiable data in a reproducible way. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
| * | validate reused pack data with CRC when possibleNicolas Pitre2007-04-101-12/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | This replaces the inflate validation with a CRC validation when reusing data from a pack which uses index version 2. That makes repacking much safer against corruptions, and it should be a bit faster too. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
| * | allow forcing index v2 and 64-bit offset tresholdNicolas Pitre2007-04-102-6/+32
| | | | | | | | | | | | | | | | | | | | | | | | This is necessary for testing the new capabilities in some automated way without having an actual 4GB+ pack. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
| * | pack-redundant.c: learn about index v2Nicolas Pitre2007-04-101-20/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Initially the conversion was made using nth_packed_object_sha1() which made this file completely index version agnostic. Unfortunately the overhead was quite significant so I went back to raw index walking but with selectable base and step values which brought back similar performances as the original. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
| * | show-index.c: learn about index v2Nicolas Pitre2007-04-101-9/+59
| | | | | | | | | | | | | | | | | | | | | | | | When index v2 is encountered, the CRC32 of each object is also displayed in parenthesis at the end of the line. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
| * | sha1_file.c: learn about index version 2Nicolas Pitre2007-04-101-29/+89
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With this patch, packs larger than 4GB are usable, even on a 32-bit machine (at least on Linux). If off_t is not large enough to deal with a large pack then die() is called instead of attempting to use the pack and producing garbage. This was tested with a 8GB pack specially created for the occasion on a 32-bit machine. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
| * | index-pack: learn about pack index version 2Nicolas Pitre2007-04-101-9/+57
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Like previous patch but for index-pack. [ There is quite some code duplication between pack-objects and index-pack for generating a pack index (and fast-import as well I suppose). This should be reworked into a common function eventually. But not now. ] Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
| * | pack-objects: learn about pack index version 2Nicolas Pitre2007-04-101-12/+87
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pack index version 2 goes as follows: - 8 bytes of header with signature and version. - 256 entries of 4-byte first-level fan-out table. - Table of sorted 20-byte SHA1 records for each object in pack. - Table of 4-byte CRC32 entries for raw pack object data. - Table of 4-byte offset entries for objects in the pack if offset is representable with 31 bits or less, otherwise it is an index in the next table with top bit set. - Table of 8-byte offset entries indexed from previous table for offsets which are 32 bits or more (optional). - 20-byte SHA1 checksum of sorted object names. - 20-byte SHA1 checksum of the above. The object SHA1 table is all contiguous so future pack format that would contain this table directly won't require big changes to the code. It is also tighter for slightly better cache locality when looking up entries. Support for large packs exceeding 31 bits in size won't impose an index size bloat for packs within that range that don't need a 64-bit offset. And because newer objects which are likely to be the most frequently used are located at the beginning of the pack, they won't pay the 64-bit offset lookup at run time either even if the pack is large. Right now an index version 2 is created only when the biggest offset in a pack reaches 31 bits. It might be a good idea to always use index version 2 eventually to benefit from the CRC32 it contains when reusing pack data while repacking. [jc: with the "oops" fix to keep track of the last offset correctly] Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
| * | compute object CRC32 with index-packNicolas Pitre2007-04-101-3/+13
| | | | | | | | | | | | | | | | | | | | | Same as previous patch but for index-pack. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
| * | compute a CRC32 for each object as stored in a packNicolas Pitre2007-04-103-0/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The most important optimization for performance when repacking is the ability to reuse data from a previous pack as is and bypass any delta or even SHA1 computation by simply copying the raw data from one pack to another directly. The problem with this is that any data corruption within a copied object would go unnoticed and the new (repacked) pack would be self-consistent with its own checksum despite containing a corrupted object. This is a real issue that already happened at least once in the past. In some attempt to prevent this, we validate the copied data by inflating it and making sure no error is signaled by zlib. But this is still not perfect as a significant portion of a pack content is made of object headers and references to delta base objects which are not deflated and therefore not validated when repacking actually making the pack data reuse still not as safe as it could be. Of course a full SHA1 validation could be performed, but that implies full data inflating and delta replaying which is extremely costly, which cost the data reuse optimization was designed to avoid in the first place. So the best solution to this is simply to store a CRC32 of the raw pack data for each object in the pack index. This way any object in a pack can be validated before being copied as is in another pack, including header and any other non deflated data. Why CRC32 instead of a faster checksum like Adler32? Quoting Wikipedia: Jonathan Stone discovered in 2001 that Adler-32 has a weakness for very short messages. He wrote "Briefly, the problem is that, for very short packets, Adler32 is guaranteed to give poor coverage of the available bits. Don't take my word for it, ask Mark Adler. :-)" The problem is that sum A does not wrap for short messages. The maximum value of A for a 128-byte message is 32640, which is below the value 65521 used by the modulo operation. An extended explanation can be found in RFC 3309, which mandates the use of CRC32 instead of Adler-32 for SCTP, the Stream Control Transmission Protocol. In the context of a GIT pack, we have lots of small objects, especially deltas, which are likely to be quite small and in a size range for which Adler32 is dimed not to be sufficient. Another advantage of CRC32 is the possibility for recovery from certain types of small corruptions like single bit errors which are the most probable type of corruptions. OK what this patch does is to compute the CRC32 of each object written to a pack within pack-objects. It is not written to the index yet and it is obviously not validated when reusing pack data yet either. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
| * | add overflow tests on pack offset variablesNicolas Pitre2007-04-103-16/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Change a few size and offset variables to more appropriate type, then add overflow tests on those offsets. This prevents any bad data to be generated/processed if off_t happens to not be large enough to handle some big packs. Better be safe than sorry. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
| * | make overflow test on delta base offset work regardless of variable sizeNicolas Pitre2007-04-105-4/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch introduces the MSB() macro to obtain the desired number of most significant bits from a given variable independently of the variable type. It is then used to better implement the overflow test on the OBJ_OFS_DELTA base offset variable with the property of always working correctly regardless of the type/size of that variable. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
| * | get rid of num_packed_objects()Nicolas Pitre2007-04-107-22/+17
| |/ | | | | | | | | | | | | | | | | | | | | | | | | The coming index format change doesn't allow for the number of objects to be determined from the size of the index file directly. Instead, Let's initialize a field in the packed_git structure with the object count when the index is validated since the count is always known at that point. While at it let's reorder some struct packed_git fields to avoid padding due to needed 64-bit alignment for some of them. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
* | Merge branch 'jp/refs'Junio C Hamano2007-04-211-21/+89
|\ \ | | | | | | | | | | | | * jp/refs: refs.c: add a function to sort a ref list, rather then sorting on add
| * | refs.c: add a function to sort a ref list, rather then sorting on addJulian Phillips2007-04-181-21/+89
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Rather than sorting the refs list while building it, sort in one go after it is built using a merge sort. This has a large performance boost with large numbers of refs. It shouldn't happen that we read duplicate entries into the same list, but just in case sort_ref_list drops them if the SHA1s are the same, or dies, as we have no way of knowing which one is the correct one. Signed-off-by: Julian Phillips <julian@quantumfyre.co.uk> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <junkio@cox.net>