| Commit message (Collapse) | Author | Age | Files | Lines |
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
* lt/gitlink:
Tests for core subproject support
Expose subprojects as special files to "git diff" machinery
Fix some "git ls-files -o" fallout from gitlinks
Teach "git-read-tree -u" to check out submodules as a directory
Teach git list-objects logic to not follow gitlinks
Fix gitlink index entry filesystem matching
Teach "git-read-tree -u" to check out submodules as a directory
Teach git list-objects logic not to follow gitlinks
Don't show gitlink directories when we want "other" files
Teach git-update-index about gitlinks
Teach directory traversal about subprojects
Fix thinko in subproject entry sorting
Teach core object handling functions about gitlinks
Teach "fsck" not to follow subproject links
Add "S_IFDIRLNK" file mode infrastructure for git links
Add 'resolve_gitlink_ref()' helper function
Avoid overflowing name buffer in deep directory structures
diff-lib: use ce_mode_from_stat() rather than messing with modes manually
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The following tests available:
- create subprojects: create a directory in the superproject,
initialize a git repo in it, and try adding it in super project.
Make a commit in superproject
- check if fsck ignores the subprojects: it just should give no errors
- check if commit in a subproject detected: make a commit in
subproject, git-diff-files in superproject should detect it
- check if a changed subproject HEAD can be committed: try
"git-commit -a" in superproject. It should commit changed
HEAD of a subproject
- check if diff-index works for subproject elements: compare the index
(changed by previuos tests) with the initial commit (which created
two subprojects). Should show a change for the recently changed subproject
- check if diff-tree works for subproject elements: do the same, just use
git-diff-tree. This test is somewhat redundant, I just added it for
completeness (diff, diff-files, and diff-index are already used)
- check if git diff works for subproject elements: try to limit
the diff for the name of a subproject in superproject:
git diff HEAD^ HEAD -- subproject
- check if clone works: try a clone of superproject and compare
"git ls-files -s" output in superproject and cloned repo
- removing and adding subproject: rename test. Currently implemented
as "git-update-index --force-remove", "mv" and "git-add".
- checkout in superproject: try to checkout the initial commit
Signed-off-by: Alex Riesen <raa.lkml@gmail.com>
Signed-off-by: Junio C Hamano <junkio@cox.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The same way we generate diffs on symlinks as the the diff of text of the
symlink, we can generate subproject diffs (when not recursing into them!)
as the diff of the text that describes the subproject.
Of course, since what descibes a subproject is just the SHA1, that's what
we'll use. Add some pretty-printing to make it a bit more obvious what is
going on, and we're done.
So with this, we can get both raw diffs and "textual" diffs of subproject
changes:
- git diff --raw:
:160000 160000 2de597b5ad348b7db04bd10cdd38cd81cbc93ab5 0000000... M sub-A
- git diff:
diff --git a/sub-A b/sub-A
index 2de597b..e8f11a4 160000
--- a/sub-A
+++ b/sub-A
@@ -1 +1 @@
-Subproject commit 2de597b5ad348b7db04bd10cdd38cd81cbc93ab5
+Subproject commit e8f11a45c5c6b9e2fec6d136d3fb5aff75393d42
NOTE! We'll also want to have the ability to recurse into the subproject
and actually diff it recursively, but that will involve a new command line
option (I'd suggest "--subproject" and "-S", but the latter is in use by
pickaxe), and some very different code.
But regardless of ay future recursive behaviour, we need the non-recursive
version too (and it should be the default, at least in the absense of
config options, so that large superprojects don't default to something
extremely expensive).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Since "git ls-files" doesn't really pass down any details on what it
really wants done to the directory walking code, the directory walking
code doesn't really know whether the caller wants to know about gitlink
directories, or whether it wants to just know about ignored files.
So the directory walking code will return those gitlink directories unless
the caller has explicitly told it not to ("dir->show_other_directories"
tells the directory walker to only show "other" directories).
This kind of confuses "git ls-files -o", because
- it didn't really expect to see entries listed that were already in the
index, unless they were unmerged, and would die on that unexpected
setup, rather than just "continue".
- it didn't know how to match directory entries with the final "/"
This trivial change updates the "show_other_files()" function to handle
both of these issues gracefully. There really was no reason to die, when
the obviously correct thing for the function was to just ignore files it
already knew about (that's what "other" means here!).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
|
| |\
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This series of three patches is a *replacement* for the patch series of
two patches (plus one-liner fixup) I sent yesterday.
It fixes the issue I noted with "git status" incorrectly
claiming that a non-checked out subproject wasn't clean - that
was just a total thinko in the code (we were checking the
filesystem mode against S_IFDIRLNK, which obviously cannot work,
since S_IFDIRLINK is a git-internal state, not a filesystem
state).
It then re-sends the two patches on top of that, with the fix
for checking out superprojects (we should *not* mess up any
existing subproject directories, certainly not remove them - if
we already have a directory in the place where we now want a
subproject, we should leave it well alone!)
The first one really is a fix, and it makes the commit
commentary about a remaining bug in the patch I sent out
yesterday go away.
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This actually allows us to check out a supermodule after cloning, although
the submodules will obviously not be checked out, and will just be an
empty subdirectory.
[ Side note: this also shows that we currently don't correctly handle
such subprojects that aren't checked out correctly yet. They should
always show up as not being modified, but failing to resolve the
gitlink HEAD does not properly trigger the "not modified" logic in all
places it needs to..
So more work to be done, but that's a separate issue, unrelated to
the action of checking out the superproject. ]
The bulk of this patch is simply because we need to check the type of the
index entry *before* we try to read the object it points to, and that
meant that the code needed some re-organization. So I moved some of the
code in common to both symlinks and files to be a trivial helper function.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This allows us to pack superprojects and thus clone them (but not yet
check them out on the receiving side - that's the next patch)
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This actually allows us to check out a supermodule after cloning, although
the submodules themselves will obviously not be checked out, and will just
be empty directories.
Checking out the submodules will be up to higher levels - we may not even
want to!
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This allows us to pack superprojects and thus clone them (but not yet
check them out on the receiving side.. That's the next patch)
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
|
| |/
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The code to match up index entries with the filesystem was stupidly
broken. We shouldn't compare the filesystem stat() information with
S_IFDIRLNK, since that's purely a git-internal value, and not what the
filesystem uses (on the filesystem, it's just a regular directory).
Also, don't bother to make the stat() time comparisons etc for DIRLNK
entries in ce_match_stat_basic(), since we do an exact match for these
things, and the hints in the stat data simply doesn't matter.
This fixes "git status" with submodules that haven't been checked out in
the supermodule.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
When "show_other_directories" is set, that implies that we are looking
for untracked files, which obviously means that we should ignore
directories that are marked as gitlinks in the index.
This fixes "git status" in a superproject, that would otherwise always
report that subprojects were "Untracked files:"
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
I finally got around to looking at Alex' patch to teach update-index about
gitlinks too, so that "git commit -a" along with any other explicit
update-index scripts can work.
I don't think there was anything wrong with Alex' patch, but the code he
patched I felt was just so ugly that the added cases just pushed it over
the edge. Especially as I don't think that patch necessarily did the right
thing for a gitlink entry that already existed in the index, but that
wasn't actually a real git repository in the working tree (just an empty
subdirectory or a non-git snapshot because it hadn't wanted to track that
particular subproject).
So I ended up deciding to clean up the git-update-index handling the same
way I tackled the directory traversal used by git-add earlier: by
splitting the different cases up into multiple smaller functions, and just
making the code easier to read (and adding more comments about the
different cases).
So this replaces the old "process_file()" with a new "process_path()"
function that then just calls out to different helper functions depending
on what kind of path it is. Processing a nondirectory ends up being just
one of the simpler cases.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This is the promised cleaned-up version of teaching directory traversal
(ie the "read_directory()" logic) about subprojects. That makes "git add"
understand to add/update subprojects.
It now knows to look at the index file to see if a directory is marked as
a subproject, and use that as information as whether it should be recursed
into or not.
It also generally cleans up the handling of directory entries when
traversing the working tree, by splitting up the decision-making process
into small functions of their own, and adding a fair number of comments.
Finally, it teaches "add_file_to_cache()" that directory names can have
slashes at the end, since the directory traversal adds them to make the
difference between a file and a directory clear (it always did that, but
my previous too-ugly-to-apply subproject patch had a totally different
path for subproject directories and avoided the slash for that case).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This fixes a total thinko in my original series: subprojects do *not* sort
like directories, because the index is sorted purely by full pathname, and
since a subproject shows up in the index as a normal NUL-terminated
string, it never has the issues with sorting with the '/' at the end.
So if you have a subproject "proj" and a file "proj.c", the subproject
sorts alphabetically before the file in the index (and must thus also sort
that way in a tree object, since trees sort as the index).
In contrast, it you have two files "proj/file" and "proj.c", the "proj.c"
will sort alphabetically before "proj/file" in the index. The index
itself, of course, does not actually contain an entry "proj/", but in the
*tree* that gets written out, the tree entry "proj" will sort after the
file entry "proj.c", which is the only real magic sorting rule.
In other words: the magic sorting rule only affects tree entries, and
*only* affects tree entries that point to other trees (ie are of the type
S_IFDIR).
Anyway, that thinko just means that we should remove the special case to
make S_ISDIRLNK entries sort like S_ISDIR entries. They don't. They sort
like normal files.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This teaches the really fundamental core SHA1 object handling routines
about gitlinks. We can compare trees with gitlinks in them (although we
can not actually generate patches for them yet - just raw git diffs),
and they show up as commits in "git ls-tree".
We also know to compare gitlinks as if they were directories (ie the
normal "sort as trees" rules apply).
[jc: amended a cut&paste error]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Since the subprojects don't necessarily even exist in the current tree,
much less in the current git repository (they are totally independent
repositories), we do not want to try to follow the chain from one git
repository to another through a gitlink.
This involves teaching fsck to ignore references to gitlink objects from
a tree and from the current index.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This just adds the basic helper functions to recognize and work with git
tree entries that are links to other git repositories ("subprojects").
They still aren't actually connected up to any of the code-paths, but
now all the infrastructure is in place.
The next commit will start actually adding actual subproject support.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This new function resolves a ref in *another* git repository. It's
named for its intended use: to look up the git link to a subproject.
It's not actually wired up to anything yet, but we're getting closer to
having fundamental plumbing support for "links" from one git directory
to another, which is the basis of subproject support.
[jc: amended a FILE* leak]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This just makes sure that when we do a read_directory(), we check
that the filename fits in the buffer we allocated (with a bit of
slop)
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The diff helpers used to do the magic mode canonicalization and all the
other special mode handling by hand ("trust executable bit" and "has
symlink support" handling).
That's bogus. Use "ce_mode_from_stat()" that does this all for us.
This is also going to be required when we add support for links to other
git repositories.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
* np/pack: (27 commits)
document --index-version for index-pack and pack-objects
pack-objects: remove obsolete comments
pack-objects: better check_object() performances
add get_size_from_delta()
pack-objects: make in_pack_header_size a variable of its own
pack-objects: get rid of create_final_object_list()
pack-objects: get rid of reuse_cached_pack
pack-objects: clean up list sorting
pack-objects: rework check_delta_limit usage
pack-objects: equal objects in size should delta against newer objects
pack-objects: optimize preferred base handling a bit
clean up add_object_entry()
tests for various pack index features
use test-genrandom in tests instead of /dev/urandom
simple random data generator for tests
validate reused pack data with CRC when possible
allow forcing index v2 and 64-bit offset treshold
pack-redundant.c: learn about index v2
show-index.c: learn about index v2
sha1_file.c: learn about index version 2
...
|
| | |
| | |
| | |
| | |
| | | |
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
The sorted-by-sha ans sorted-by-type arrays are no more.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
With large amount of objects, check_object() is really trashing the pack
sliding map and the filesystem cache. It has a completely random access
pattern especially with old objects where delta replay jumps back and
forth all over the pack.
This patch improves things by:
1) sorting objects by their offset in pack before calling check_object()
so the pack access pattern is linear;
2) recording the object type at add_object_entry() time since it is
already known in most cases;
3) recording the pack offset even for preferred_base objects;
4) avoid calling sha1_object_info() if all possible.
This limits pack accesses to the bare minimum and makes them perfectly
linear.
In the process check_object() was made more clear (to me at least).
Note: I thought about walking the sorted_by_offset list backward in
get_object_details() so if a pack happens to be larger than the available
file cache, then the cache would have been populated with useful data from
the beginning of the pack already when find_deltas() is called. Strangely,
testing (on Linux) showed absolutely no performance difference.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
... which consists of existing code split out of packed_delta_info()
for other callers to use it as well.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
It currently aliases delta_size on the principle that reused deltas won't
go through the whole delta matching loop hence delta_size was unused.
This is not true if given delta doesn't find its base in the pack though.
But we need that information even for whole object data reuse.
Well in short the current state looks awful and is prone to bugs. It just
works fine now because try_delta() tests trg_entry->delta before using
trg_entry->delta_size, but that is a bit subtle and I was wondering for a
while why things just worked fine... even if I'm guilty of having
introduced this abomination myself in the first place.
Let's do the sensible thing instead with no ambiguity, which is to have
a separate variable for in_pack_header_size. This might even help future
optimizations.
While at it, let's reorder some struct object_entry members so they all
align well with their own width, regardless of the architecture or the
size of off_t. Some memory saving is to be expected with this alone.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Because we don't have to know the SHA1 h(hence the name) of the pack
up front anymore, let's get rid of yet another global sorted object list
and sort them only in write_index_file(), then compute the object list
SHA1 on the fly.
This has the advantage of saving another chunk of memory, and the sorted
list SHA1 won't be computed needlessly on servers during a fetch.
Of course the cunning plan is also to make write_index_file() much like
the function with the same name in index-pack.c for an eventual easy
sharing.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This capability is practically never useful, and therefore never tested,
because it is fairly unlikely that the requested pack will be already
available. Furthermore it is of little gain over the ability to reuse
existing pack data.
In fact the ability to change delta type on the fly when reusing delta
data is a nice thing that has almost no cost and allows greater backward
compatibility with a client's capabilities than if the client is blindly
sent a whole pack without any discrimination.
And this "feature" is simply in the way of other cleanups.
Let's get rid of it.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Get rid of sort_comparator() as it impose a run time double indirect
function call for little compile time type checking gain.
Also get rid of create_sorted_list() as it only has one user which would
as well be just fine doing its sorting locally. Eventually the list of
deltifiable objects might be shorter than the whole object list.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Objects that have delta "children" from pack data reuse must consider the
depth of their deepest child when they try to deltify themselves for those
children not to become too deep.
However, in the context of a "thin" pack, the delta children depth was
skipped entirely on the presumption that the pack was always going to be
exploded on the receiving end, hence the delta length wasn't an issue.
Now that we keep received packs as is and reuse pack data when repacking,
those packs do contain delta chains that are longer than expected. Worse,
those delta chain may even grow longer when the pack is further repacked
into another thin pack for a subsequent transmission.
So this patch restores strict delta length even for thin packs, and it
moves check_delta_limit() usage directly in the delta loop where it is
needed. This way the delta_limit can be removed from struct object_entry
as well. Oh and the initial value was wrong too.
The progress_interval() function was moved to a more logical location in
the process.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Before finding best delta combinations, we sort objects by name hash,
then by size, then by their position in memory. Then we walk the list
backwards to test delta candidates.
We hope that a bigger size usually means a newer objects. But a bigger
address in memory does not mean a newer object. So the last comparison
must be reversed.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Let's avoid some cycles when there is no base to test against, and avoid
unnecessary object lookups.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This function used to call locate_object_entry_hash() _twice_ per added
object while only once should suffice. Let's reorganize that code a bit.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This is a fairly complete list of tests for various aspects of pack
index versions 1 and 2.
Tests on index v2 include 32-bit and 64-bit offsets, as well as a nice
demonstration of the flawed repacking integrity checks that index
version 2 intend to solve over index version 1 with the per object CRC.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
|
| | |
| | |
| | |
| | |
| | |
| | | |
This way tests are completely deterministic and possibly more portable.
Signed-off-by: Nicolas Pitre <nico@cam.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Reliance on /dev/urandom produces test vectors that are, well, random.
This can cause problems impossible to track down when the data is
different from one test invokation to another.
The goal is not to have random data to test, but rather to have a
convenient way to create sets of large files with non compressible and
non deltifiable data in a reproducible way.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This replaces the inflate validation with a CRC validation when reusing
data from a pack which uses index version 2. That makes repacking much
safer against corruptions, and it should be a bit faster too.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This is necessary for testing the new capabilities in some automated
way without having an actual 4GB+ pack.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Initially the conversion was made using nth_packed_object_sha1() which
made this file completely index version agnostic. Unfortunately the
overhead was quite significant so I went back to raw index walking but
with selectable base and step values which brought back similar
performances as the original.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
When index v2 is encountered, the CRC32 of each object is also displayed
in parenthesis at the end of the line.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
With this patch, packs larger than 4GB are usable, even on a 32-bit machine
(at least on Linux). If off_t is not large enough to deal with a large
pack then die() is called instead of attempting to use the pack and
producing garbage.
This was tested with a 8GB pack specially created for the occasion on
a 32-bit machine.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Like previous patch but for index-pack.
[ There is quite some code duplication between pack-objects and index-pack
for generating a pack index (and fast-import as well I suppose). This
should be reworked into a common function eventually. But not now. ]
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Pack index version 2 goes as follows:
- 8 bytes of header with signature and version.
- 256 entries of 4-byte first-level fan-out table.
- Table of sorted 20-byte SHA1 records for each object in pack.
- Table of 4-byte CRC32 entries for raw pack object data.
- Table of 4-byte offset entries for objects in the pack if offset is
representable with 31 bits or less, otherwise it is an index in the next
table with top bit set.
- Table of 8-byte offset entries indexed from previous table for offsets
which are 32 bits or more (optional).
- 20-byte SHA1 checksum of sorted object names.
- 20-byte SHA1 checksum of the above.
The object SHA1 table is all contiguous so future pack format that would
contain this table directly won't require big changes to the code. It is
also tighter for slightly better cache locality when looking up entries.
Support for large packs exceeding 31 bits in size won't impose an index
size bloat for packs within that range that don't need a 64-bit offset.
And because newer objects which are likely to be the most frequently used
are located at the beginning of the pack, they won't pay the 64-bit offset
lookup at run time either even if the pack is large.
Right now an index version 2 is created only when the biggest offset in a
pack reaches 31 bits. It might be a good idea to always use index version
2 eventually to benefit from the CRC32 it contains when reusing pack data
while repacking.
[jc: with the "oops" fix to keep track of the last offset correctly]
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Same as previous patch but for index-pack.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
The most important optimization for performance when repacking is the
ability to reuse data from a previous pack as is and bypass any delta
or even SHA1 computation by simply copying the raw data from one pack
to another directly.
The problem with this is that any data corruption within a copied object
would go unnoticed and the new (repacked) pack would be self-consistent
with its own checksum despite containing a corrupted object. This is a
real issue that already happened at least once in the past.
In some attempt to prevent this, we validate the copied data by inflating
it and making sure no error is signaled by zlib. But this is still not
perfect as a significant portion of a pack content is made of object
headers and references to delta base objects which are not deflated and
therefore not validated when repacking actually making the pack data reuse
still not as safe as it could be.
Of course a full SHA1 validation could be performed, but that implies
full data inflating and delta replaying which is extremely costly, which
cost the data reuse optimization was designed to avoid in the first place.
So the best solution to this is simply to store a CRC32 of the raw pack
data for each object in the pack index. This way any object in a pack can
be validated before being copied as is in another pack, including header
and any other non deflated data.
Why CRC32 instead of a faster checksum like Adler32? Quoting Wikipedia:
Jonathan Stone discovered in 2001 that Adler-32 has a weakness for very
short messages. He wrote "Briefly, the problem is that, for very short
packets, Adler32 is guaranteed to give poor coverage of the available
bits. Don't take my word for it, ask Mark Adler. :-)" The problem is
that sum A does not wrap for short messages. The maximum value of A for
a 128-byte message is 32640, which is below the value 65521 used by the
modulo operation. An extended explanation can be found in RFC 3309,
which mandates the use of CRC32 instead of Adler-32 for SCTP, the
Stream Control Transmission Protocol.
In the context of a GIT pack, we have lots of small objects, especially
deltas, which are likely to be quite small and in a size range for which
Adler32 is dimed not to be sufficient. Another advantage of CRC32 is the
possibility for recovery from certain types of small corruptions like
single bit errors which are the most probable type of corruptions.
OK what this patch does is to compute the CRC32 of each object written to
a pack within pack-objects. It is not written to the index yet and it is
obviously not validated when reusing pack data yet either.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Change a few size and offset variables to more appropriate type, then
add overflow tests on those offsets. This prevents any bad data to be
generated/processed if off_t happens to not be large enough to handle
some big packs.
Better be safe than sorry.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This patch introduces the MSB() macro to obtain the desired number of
most significant bits from a given variable independently of the variable
type.
It is then used to better implement the overflow test on the OBJ_OFS_DELTA
base offset variable with the property of always working correctly
regardless of the type/size of that variable.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
|
| |/
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The coming index format change doesn't allow for the number of objects
to be determined from the size of the index file directly. Instead, Let's
initialize a field in the packed_git structure with the object count when
the index is validated since the count is always known at that point.
While at it let's reorder some struct packed_git fields to avoid padding
due to needed 64-bit alignment for some of them.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
|
|\ \
| | |
| | |
| | |
| | | |
* jp/refs:
refs.c: add a function to sort a ref list, rather then sorting on add
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Rather than sorting the refs list while building it, sort in one
go after it is built using a merge sort. This has a large
performance boost with large numbers of refs.
It shouldn't happen that we read duplicate entries into the same
list, but just in case sort_ref_list drops them if the SHA1s are
the same, or dies, as we have no way of knowing which one is the
correct one.
Signed-off-by: Julian Phillips <julian@quantumfyre.co.uk>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
|