summaryrefslogtreecommitdiff
path: root/commit-graph.h
Commit message (Collapse)AuthorAgeFilesLines
* commit-graph: fix memory leak in misused string_list APIÆvar Arnfjörð Bjarmason2022-03-041-1/+1
| | | | | | | | | | | | | | | | | | | | | | When this code was migrated to the string_list API in d88b14b3fd6 (commit-graph: use string-list API for input, 2018-06-27) it was made to use used both STRING_LIST_INIT_NODUP and a strbuf_detach() pattern. Those should not be used together if string_list_clear() is expected to free the memory, instead we need to either use STRING_LIST_INIT_DUP with a string_list_append_nodup(), or a STRING_LIST_INIT_NODUP and manually fiddle with the "strdup_strings" member before calling string_list_clear(). Let's do the former. Since "strdup_strings = 1" is set now other code might be broken by relying on "pack_indexes" not to duplicate it strings, but that doesn't happen. When we pass this down to write_commit_graph() that code uses the "struct string_list" without modifying it. Let's add a "const" to the variable to have the compiler enforce that assumption. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* revision: avoid hitting packfiles when commits are in commit-graphPatrick Steinhardt2021-08-091-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When queueing references in git-rev-list(1), we try to optimize parsing of commits via the commit-graph. To do so, we first look up the object's type, and if it is a commit we call `repo_parse_commit()` instead of `parse_object()`. This is quite inefficient though given that we're always uncompressing the object header in order to determine the type. Instead, we can opportunistically search the commit-graph for the object ID: in case it's found, we know it's a commit and can directly fill in the commit object without having to uncompress the object header. Expose a new function `lookup_commit_in_graph()`, which tries to find a commit in the commit-graph by ID, and convert `get_reference()` to use this function. This provides a big performance win in cases where we load references in a repository with lots of references pointing to commits. The following has been executed in a real-world repository with about 2.2 million refs: Benchmark #1: HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev Time (mean ± σ): 4.458 s ± 0.044 s [User: 4.115 s, System: 0.342 s] Range (min … max): 4.409 s … 4.534 s 10 runs Benchmark #2: HEAD: rev-list --unsorted-input --objects --quiet --not --all --not $newrev Time (mean ± σ): 3.089 s ± 0.015 s [User: 2.768 s, System: 0.321 s] Range (min … max): 3.061 s … 3.105 s 10 runs Summary 'HEAD: rev-list --unsorted-input --objects --quiet --not --all --not $newrev' ran 1.44 ± 0.02 times faster than 'HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev' Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* commit-graph: use config to specify generation typeDerrick Stolee2021-02-251-1/+0
| | | | | | | | | | | | | | | | | | | | | | We have two established generation number versions: 1: topological levels 2: corrected commit dates The corrected commit dates are enabled by default, but they also write extra data in the GDAT and GDOV chunks. Services that host Git data might want to have more control over when this feature rolls out than just updating the Git binaries. Add a new "commitGraph.generationVersion" config option that specifies the intended generation number version. If this value is less than 2, then the GDAT chunk is never written _or read_ from an existing file. This can replace our use of the GIT_TEST_COMMIT_GRAPH_NO_GDAT environment variable in the test suite. Remove it. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* commit-reach: use corrected commit dates in paint_down_to_common()Abhishek Kumar2021-01-181-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 091f4cf (commit: don't use generation numbers if not needed, 2018-08-30) changed paint_down_to_common() to use commit dates instead of generation numbers v1 (topological levels) as the performance regressed on certain topologies. With generation number v2 (corrected commit dates) implemented, we no longer have to rely on commit dates and can use generation numbers. For example, the command `git merge-base v4.8 v4.9` on the Linux repository walks 167468 commits, taking 0.135s for committer date and 167496 commits, taking 0.157s for corrected committer date respectively. While using corrected commit dates, Git walks nearly the same number of commits as commit date, the process is slower as for each comparision we have to access a commit-slab (for corrected committer date) instead of accessing struct member (for committer date). This change incidentally broke the fragile t6404-recursive-merge test. t6404-recursive-merge sets up a unique repository where all commits have the same committer date without a well-defined merge-base. While running tests with GIT_TEST_COMMIT_GRAPH unset, we use committer date as a heuristic in paint_down_to_common(). 6404.1 'combined merge conflicts' merges commits in the order: - Merge C with B to form an intermediate commit. - Merge the intermediate commit with A. With GIT_TEST_COMMIT_GRAPH=1, we write a commit-graph and subsequently use the corrected committer date, which changes the order in which commits are merged: - Merge A with B to form an intermediate commit. - Merge the intermediate commit with C. While resulting repositories are equivalent, 6404.4 'virtual trees were processed' fails with GIT_TEST_COMMIT_GRAPH=1 as we are selecting different merge-bases and thus have different object ids for the intermediate commits. As this has already causes problems (as noted in 859fdc0 (commit-graph: define GIT_TEST_COMMIT_GRAPH, 2018-08-29)), we disable commit graph within t6404-recursive-merge. Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com> Reviewed-by: Taylor Blau <me@ttaylorr.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* commit-graph: use generation v2 only if entire chain doesAbhishek Kumar2021-01-181-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since there are released versions of Git that understand generation numbers in the commit-graph's CDAT chunk but do not understand the GDAT chunk, the following scenario is possible: 1. "New" Git writes a commit-graph with the GDAT chunk. 2. "Old" Git writes a split commit-graph on top without a GDAT chunk. If each layer of split commit-graph is treated independently, as it was the case before this commit, with Git inspecting only the current layer for chunk_generation_data pointer, commits in the lower layer (one with GDAT) whould have corrected commit date as their generation number, while commits in the upper layer would have topological levels as their generation. Corrected commit dates usually have much larger values than topological levels. This means that if we take two commits, one from the upper layer, and one reachable from it in the lower layer, then the expectation that the generation of a parent is smaller than the generation of a child would be violated. It is difficult to expose this issue in a test. Since we _start_ with artificially low generation numbers, any commit walk that prioritizes generation numbers will walk all of the commits with high generation number before walking the commits with low generation number. In all the cases I tried, the commit-graph layers themselves "protect" any incorrect behavior since none of the commits in the lower layer can reach the commits in the upper layer. This issue would manifest itself as a performance problem in this case, especially with something like "git log --graph" since the low generation numbers would cause the in-degree queue to walk all of the commits in the lower layer before allowing the topo-order queue to write anything to output (depending on the size of the upper layer). Therefore, When writing the new layer in split commit-graph, we write a GDAT chunk only if the topmost layer has a GDAT chunk. This guarantees that if a layer has GDAT chunk, all lower layers must have a GDAT chunk as well. Rewriting layers follows similar approach: if the topmost layer below the set of layers being rewritten (in the split commit-graph chain) exists, and it does not contain GDAT chunk, then the result of rewrite does not have GDAT chunks either. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com> Reviewed-by: Taylor Blau <me@ttaylorr.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* commit-graph: implement generation data chunkAbhishek Kumar2021-01-181-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As discovered by Ævar, we cannot increment graph version to distinguish between generation numbers v1 and v2 [1]. Thus, one of pre-requistes before implementing generation number v2 was to distinguish between graph versions in a backwards compatible manner. We are going to introduce a new chunk called Generation DATa chunk (or GDAT). GDAT will store corrected committer date offsets whereas CDAT will still store topological level. Old Git does not understand GDAT chunk and would ignore it, reading topological levels from CDAT. New Git can parse GDAT and take advantage of newer generation numbers, falling back to topological levels when GDAT chunk is missing (as it would happen with a commit-graph written by old Git). We introduce a test environment variable 'GIT_TEST_COMMIT_GRAPH_NO_GDAT' which forces commit-graph file to be written without generation data chunk to emulate a commit-graph file written by old Git. To minimize the space required to store corrrected commit date, Git stores corrected commit date offsets into the commit-graph file, instea of corrected commit dates. This saves us 4 bytes per commit, decreasing the GDAT chunk size by half, but it's possible for the offset to overflow the 4-bytes allocated for storage. As such overflows are and should be exceedingly rare, we use the following overflow management scheme: We introduce a new commit-graph chunk, Generation Data OVerflow ('GDOV') to store corrected commit dates for commits with offsets greater than GENERATION_NUMBER_V2_OFFSET_MAX. If the offset is greater than GENERATION_NUMBER_V2_OFFSET_MAX, we set the MSB of the offset and the other bits store the position of corrected commit date in GDOV chunk, similar to how Extra Edge List is maintained. We test the overflow-related code with the following repo history: F - N - U / \ U - N - U N \ / N - F - N Where the commits denoted by U have committer date of zero seconds since Unix epoch, the commits denoted by N have committer date of 1112354055 (default committer date for the test suite) seconds since Unix epoch and the commits denoted by F have committer date of (2 ^ 31 - 2) seconds since Unix epoch. The largest offset observed is 2 ^ 31, just large enough to overflow. [1]: https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/ Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com> Reviewed-by: Taylor Blau <me@ttaylorr.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* commit-graph: return 64-bit generation numberAbhishek Kumar2021-01-181-2/+2
| | | | | | | | | | | | | | | | | | | | In a preparatory step for introducing corrected commit dates, let's return timestamp_t values from commit_graph_generation(), use timestamp_t for local variables and define GENERATION_NUMBER_INFINITY as (2 ^ 63 - 1) instead. We rename GENERATION_NUMBER_MAX to GENERATION_NUMBER_V1_MAX to represent the largest topological level we can store in the commit data chunk. With corrected commit dates implemented, we will have two such *_MAX variables to denote the largest offset and largest topological level that can be stored. Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com> Reviewed-by: Taylor Blau <me@ttaylorr.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* commit-graph: add a slab to store topological levelsAbhishek Kumar2021-01-181-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In a later commit we will introduce corrected commit date as the generation number v2. Corrected commit dates will be stored in the new seperate Generation Data chunk. However, to ensure backwards compatibility with "Old" Git we need to continue to write generation number v1 (topological levels) to the commit data chunk. Thus, we need to compute and store both versions of generation numbers to write the commit-graph file. Therefore, let's introduce a commit-slab `topo_level_slab` to store topological levels; corrected commit date will be stored in the member `generation` of struct commit_graph_data. The macros `GENERATION_NUMBER_INFINITY` and `GENERATION_NUMBER_ZERO` mark commits not in the commit-graph file and commits written by a version of Git that did not compute generation numbers respectively. Generation numbers are computed identically for both kinds of commits. A "slab-miss" should return `GENERATION_NUMBER_INFINITY` as the commit is not in the commit-graph file. However, since the slab is zero-initialized, it returns 0 (or rather `GENERATION_NUMBER_ZERO`). Thus, we no longer need to check if the topological level of a commit is `GENERATION_NUMBER_INFINITY`. We will add a pointer to the slab in `struct write_commit_graph_context` and `struct commit_graph` to populate the slab in `fill_commit_graph_info` if the commit has a pre-computed topological level as in case of split commit-graphs. Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com> Reviewed-by: Taylor Blau <me@ttaylorr.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* Merge branch 'tb/bloom-improvements'Junio C Hamano2020-09-291-6/+11
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | "git commit-graph write" learned to limit the number of bloom filters that are computed from scratch with the --max-new-filters option. * tb/bloom-improvements: commit-graph: introduce 'commitGraph.maxNewFilters' builtin/commit-graph.c: introduce '--max-new-filters=<n>' commit-graph: rename 'split_commit_graph_opts' bloom: encode out-of-bounds filters as non-empty bloom/diff: properly short-circuit on max_changes bloom: use provided 'struct bloom_filter_settings' bloom: split 'get_bloom_filter()' in two commit-graph.c: store maximum changed paths commit-graph: respect 'commitGraph.readChangedPaths' t/helper/test-read-graph.c: prepare repo settings commit-graph: pass a 'struct repository *' in more places t4216: use an '&&'-chain commit-graph: introduce 'get_bloom_filter_settings()'
| * builtin/commit-graph.c: introduce '--max-new-filters=<n>'Taylor Blau2020-09-181-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce a command-line flag to specify the maximum number of new Bloom filters that a 'git commit-graph write' is willing to compute from scratch. Prior to this patch, a commit-graph write with '--changed-paths' would compute Bloom filters for all selected commits which haven't already been computed (i.e., by a previous commit-graph write with '--split' such that a roll-up or replacement is performed). This behavior can cause prohibitively-long commit-graph writes for a variety of reasons: * There may be lots of filters whose diffs take a long time to generate (for example, they have close to the maximum number of changes, diffing itself takes a long time, etc). * Old-style commit-graphs (which encode filters with too many entries as not having been computed at all) cause us to waste time recomputing filters that appear to have not been computed only to discover that they are too-large. This can make the upper-bound of the time it takes for 'git commit-graph write --changed-paths' to be rather unpredictable. To make this command behave more predictably, introduce '--max-new-filters=<n>' to allow computing at most '<n>' Bloom filters from scratch. This lets "computing" already-known filters proceed quickly, while bounding the number of slow tasks that Git is willing to do. Helped-by: Junio C Hamano <gitster@pobox.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
| * commit-graph: rename 'split_commit_graph_opts'Taylor Blau2020-09-171-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | In the subsequent commit, additional options will be added to the commit-graph API which have nothing to do with splitting. Rename the 'split_commit_graph_opts' structure to the more-generic 'commit_graph_opts' to encompass both. Likewise, rename the 'flags' member to instead be 'split_flags' to clarify that it only has to do with the behavior implied by '--split'. Suggested-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
| * commit-graph: pass a 'struct repository *' in more placesTaylor Blau2020-09-091-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | In a future commit, some commit-graph internals will want access to 'r->settings', but we only have the 'struct object_directory *' corresponding to that repository. Add an additional parameter to pass the repository around in more places. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
| * commit-graph: introduce 'get_bloom_filter_settings()'Taylor Blau2020-09-091-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Many places in the code often need a pointer to the commit-graph's 'struct bloom_filter_settings', in which case they often take the value from the top-most commit-graph. In the non-split case, this works as expected. In the split case, however, things get a little tricky. Not all layers in a chain of incremental commit-graphs are required to themselves have Bloom data, and so whether or not some part of the code uses Bloom filters depends entirely on whether or not the top-most level of the commit-graph chain has Bloom filters. This has been the behavior since Bloom filters were introduced, and has been codified into the tests since a759bfa9ee (t4216: add end to end tests for git log with Bloom filters, 2020-04-06). In fact, t4216.130 requires that Bloom filters are not used in exactly the case described earlier. There is no reason that this needs to be the case, since it is perfectly valid for commits in an earlier layer to have Bloom filters when commits in a newer layer do not. Since Bloom settings are guaranteed in practice to be the same for any layer in a chain that has Bloom data, it is sufficient to traverse the '->base_graph' pointer until either (1) a non-null 'struct bloom_filter_settings *' is found, or (2) until we are at the root of the commit-graph chain. Introduce a 'get_bloom_filter_settings()' function that does just this, and use it instead of purely dereferencing the top-most graph's '->bloom_filter_settings' pointer. While we're at it, add an additional test in t5324 to guard against code in the commit-graph writing machinery that doesn't correctly handle a NULL 'struct bloom_filter *'. Co-authored-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* | maintenance: add commit-graph taskDerrick Stolee2020-09-171-0/+1
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | The first new task in the 'git maintenance' builtin is the 'commit-graph' task. This updates the commit-graph file incrementally with the command git commit-graph write --reachable --split By writing an incremental commit-graph file using the "--split" option we minimize the disruption from this operation. The default behavior is to merge layers until the new "top" layer is less than half the size of the layer below. This provides quick writes most of the time, with the longer writes following a power law distribution. Most importantly, concurrent Git processes only look at the commit-graph-chain file for a very short amount of time, so they will verly likely not be holding a handle to the file when we try to replace it. (This only matters on Windows.) If a concurrent process reads the old commit-graph-chain file, but our job expires some of the .graph files before they can be read, then those processes will see a warning message (but not fail). This could be avoided by a future update to use the --expire-time argument when writing the commit-graph. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* Merge branch 'ds/commit-graph-bloom-updates' into masterJunio C Hamano2020-07-301-1/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Updates to the changed-paths bloom filter. * ds/commit-graph-bloom-updates: commit-graph: check all leading directories in changed path Bloom filters revision: empty pathspecs should not use Bloom filters revision.c: fix whitespace commit-graph: check chunk sizes after writing commit-graph: simplify chunk writes into loop commit-graph: unify the signatures of all write_graph_chunk_*() functions commit-graph: persist existence of changed-paths bloom: fix logic in get_bloom_filter() commit-graph: change test to die on parse, not load commit-graph: place bloom_settings in context
| * commit-graph: persist existence of changed-pathsDerrick Stolee2020-07-011-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The changed-path Bloom filters were released in v2.27.0, but have a significant drawback. A user can opt-in to writing the changed-path filters using the "--changed-paths" option to "git commit-graph write" but the next write will drop the filters unless that option is specified. This becomes even more important when considering the interaction with gc.writeCommitGraph (on by default) or fetch.writeCommitGraph (part of features.experimental). These config options trigger commit-graph writes that the user did not signal, and hence there is no --changed-paths option available. Allow a user that opts-in to the changed-path filters to persist the property of "my commit-graph has changed-path filters" automatically. A user can drop filters using the --no-changed-paths option. In the process, we need to be extremely careful to match the Bloom filter settings as specified by the commit-graph. This will allow future versions of Git to customize these settings, and the version with this change will persist those settings as commit-graphs are rewritten on top. Use the trace2 API to signal the settings used during the write, and check that output in a test after manually adjusting the correct bytes in the commit-graph file. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
| * commit-graph: change test to die on parse, not loadDerrick Stolee2020-06-231-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 43d3561 (commit-graph write: don't die if the existing graph is corrupt, 2019-03-25) introduced the GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD environment variable. This was created to verify that commit-graph was not loaded when writing a new non-incremental commit-graph. An upcoming change wants to load a commit-graph in some valuable cases, but we want to maintain that we don't trust the commit-graph data when writing our new file. Instead of dying on load, instead die if we ever try to parse a commit from the commit-graph. This functionally verifies the same intended behavior, but allows a more advanced feature in the next change. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* | Merge branch 'sg/commit-graph-cleanups' into masterJunio C Hamano2020-07-301-3/+3
|\ \ | |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The changed-path Bloom filter is improved using ideas from an independent implementation. * sg/commit-graph-cleanups: commit-graph: simplify write_commit_graph_file() #2 commit-graph: simplify write_commit_graph_file() #1 commit-graph: simplify parse_commit_graph() #2 commit-graph: simplify parse_commit_graph() #1 commit-graph: clean up #includes diff.h: drop diff_tree_oid() & friends' return value commit-slab: add a function to deep free entries on the slab commit-graph-format.txt: all multi-byte numbers are in network byte order commit-graph: fix parsing the Chunk Lookup table tree-walk.c: don't match submodule entries for 'submod/anything'
| * commit-graph: clean up #includesSZEDER Gábor2020-06-081-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Our CodingGuidelines says that it's sufficient to include one of 'git-compat-util.h' and 'cache.h', but both 'commit-graph.c' and 'commit-graph.h' include both. Let's include only 'git-compat-util.h' to loose a bunch of unnecessary dependencies; but include 'hash.h', because 'commit-graph.h' does require the definition of 'struct object_id'. 'commit-graph.h' explicitly includes 'repository.h' and 'string-list.h', but only needs the declaration of a few structs from them. Drop these includes and forward-declare the necessary structs instead. 'commit-graph.c' includes 'dir.h', but doesn't actually use anything from there, so let's drop that #include as well. Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* | commit-graph: introduce commit_graph_data_slabAbhishek Kumar2020-06-171-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The struct commit is used in many contexts. However, members `generation` and `graph_pos` are only used for commit-graph related operations and otherwise waste memory. This wastage would have been more pronounced as we transition to generation number v2, which uses 64-bit generation number instead of current 32-bits. As they are often accessed together, let's introduce struct commit_graph_data and move them to a commit_graph_data slab. While the overall test suite runs just as fast as master, (series: 26m48s, master: 27m34s, faster by 2.87%), certain commands like `git merge-base --is-ancestor` were slowed by 40% as discovered by Szeder Gábor [1]. After minimizing commit-slab access, the slow down persists but is closer to 20%. Derrick Stolee believes the slow down is attributable to the underlying algorithm rather than the slowness of commit-slab access [2] and we will follow-up in a later series. [1]: https://lore.kernel.org/git/20200607195347.GA8232@szeder.dev/ [2]: https://lore.kernel.org/git/13db757a-9412-7f1e-805c-8a028c4ab2b1@gmail.com/ Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* | commit-graph: drop COMMIT_GRAPH_WRITE_CHECK_OIDS flagTaylor Blau2020-05-181-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since 7c5c9b9c57 (commit-graph: error out on invalid commit oids in 'write --stdin-commits', 2019-08-05), the commit-graph builtin dies on receiving non-commit OIDs as input to '--stdin-commits'. This behavior can be cumbersome to work around in, say, the case of piping 'git for-each-ref' to 'git commit-graph write --stdin-commits' if the caller does not want to cull out non-commits themselves. In this situation, it would be ideal if 'git commit-graph write' wrote the graph containing the inputs that did pertain to commits, and silently ignored the remainder of the input. Some options have been proposed to the effect of '--[no-]check-oids' which would allow callers to have the commit-graph builtin do just that. After some discussion, it is difficult to imagine a caller who wouldn't want to pass '--no-check-oids', suggesting that we should get rid of the behavior of complaining about non-commit inputs altogether. If callers do wish to retain this behavior, they can easily work around this change by doing the following: git for-each-ref --format='%(objectname) %(objecttype) %(*objecttype)' | awk ' !/commit/ { print "not-a-commit:"$1 } /commit/ { print $1 } ' | git commit-graph write --stdin-commits To make it so that valid OIDs that refer to non-existent objects are indeed an error after loosening the error handling, perform an extra lookup to make sure that object indeed exists before sending it to the commit-graph internals. Helped-by: Jeff King <peff@peff.net> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* | Merge branch 'ds/blame-on-bloom'Junio C Hamano2020-05-011-0/+9
|\ \ | |/ | | | | | | | | | | | | | | | | | | | | | | "git blame" learns to take advantage of the "changed-paths" Bloom filter stored in the commit-graph file. * ds/blame-on-bloom: test-bloom: check that we have expected arguments test-bloom: fix some whitespace issues blame: drop unused parameter from maybe_changed_path blame: use changed-path Bloom filters tests: write commit-graph with Bloom filters revision: complicated pathspecs disable filters
| * tests: write commit-graph with Bloom filtersDerrick Stolee2020-04-161-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The GIT_TEST_COMMIT_GRAPH environment variable updates the commit- graph file whenever "git commit" is run, ensuring that we always have an updated commit-graph throughout the test suite. The GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS environment variable was introduced to write the changed-path Bloom filters whenever "git commit-graph write" is run. However, the GIT_TEST_COMMIT_GRAPH trick doesn't launch a separate process and instead writes it directly. To expand the number of tests that have commits in the commit-graph file, add a helper method that computes the commit-graph and place that helper inside "git commit" and "git merge". In the helper method, check GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS to ensure we are writing changed-path Bloom filters whenever possible. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* | Merge branch 'gs/commit-graph-path-filter'Junio C Hamano2020-05-011-1/+8
|\ \ | |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce an extension to the commit-graph to make it efficient to check for the paths that were modified at each commit using Bloom filters. * gs/commit-graph-path-filter: bloom: ignore renames when computing changed paths commit-graph: add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag t4216: add end to end tests for git log with Bloom filters revision.c: add trace2 stats around Bloom filter usage revision.c: use Bloom filters to speed up path based revision walks commit-graph: add --changed-paths option to write subcommand commit-graph: reuse existing Bloom filters during write commit-graph: write Bloom filters to commit graph file commit-graph: examine commits by generation number commit-graph: examine changed-path objects in pack order commit-graph: compute Bloom filters for changed paths diff: halt tree-diff early after max_changes bloom.c: core Bloom filter implementation for changed paths. bloom.c: introduce core Bloom filter constructs bloom.c: add the murmur3 hash implementation commit-graph: define and use MAX_NUM_CHUNKS
| * commit-graph: add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flagGarima Singh2020-04-061-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag to the test setup suite in order to toggle writing Bloom filters when running any of the git tests. If set to true, we will compute and write Bloom filters every time a test calls `git commit-graph write`, as if the `--changed-paths` option was passed in. The test suite passes when GIT_TEST_COMMIT_GRAPH and GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS are enabled. Helped-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Garima Singh <garima.singh@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
| * commit-graph: write Bloom filters to commit graph fileGarima Singh2020-04-061-0/+5
| | | | | | | | | | | | | | | | | | | | | | Update the technical documentation for commit-graph-format with the formats for the Bloom filter index (BIDX) and Bloom filter data (BDAT) chunks. Write the computed Bloom filters information to the commit graph file using this format. Helped-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Garima Singh <garima.singh@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
| * commit-graph: compute Bloom filters for changed pathsGarima Singh2020-03-301-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Add new COMMIT_GRAPH_WRITE_CHANGED_PATHS flag that makes Git compute Bloom filters for the paths that changed between a commit and it's first parent, for each commit in the commit-graph. This computation is done on a commit-by-commit basis. We will write these Bloom filters to the commit-graph file, to store this data on disk, in the next change in this series. Helped-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Garima Singh <garima.singh@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* | commit-graph: close descriptors after mmapJeff King2020-04-241-4/+1
| | | | | | | | | | | | | | | | | | | | We don't ever refer to the descriptor after mmap-ing it. And keeping it open means we can run out of descriptors in degenerate cases (e.g., thousands of split chain files). Let's close it as soon as possible. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* | commit-graph.h: replace 'commit_hex' with 'commits'Taylor Blau2020-04-151-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The 'write_commit_graph()' function takes in either a string list of pack indices, or a string list of hexadecimal commit OIDs. These correspond to the '--stdin-packs' and '--stdin-commits' mode(s) from 'git commit-graph write'. Using a string_list of hexadecimal commit IDs is not the most efficient use of memory, since we can instead use the 'struct oidset', which is more well-suited for this case. This has another benefit which will become apparent in the following commit. This is that we are about to disambiguate the kinds of errors we produce with '--stdin-commits' into "non-hex input" and "hex-input, but referring to a non-commit object". By having 'write_commit_graph' take in a 'struct oidset *' of commits, we place the burden on the caller (in this case, the builtin) to handle the first case, and the commit-graph machinery can handle the second case. Suggested-by: Jeff King <peff@peff.net> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* | builtin/commit-graph.c: introduce split strategy 'replace'Taylor Blau2020-04-151-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When using split commit-graphs, it is sometimes useful to completely replace the commit-graph chain with a new base. For example, consider a scenario in which a repository builds a new commit-graph incremental for each push. Occasionally (say, after some fixed number of pushes), they may wish to rebuild the commit-graph chain with all reachable commits. They can do so with $ git commit-graph write --reachable but this removes the chain entirely and replaces it with a single commit-graph in 'objects/info/commit-graph'. Unfortunately, this means that the next push will have to move this commit-graph into the first layer of a new chain, and then write its new commits on top. Avoid such copying entirely by allowing the caller to specify that they wish to replace the entirety of their commit-graph chain, while also specifying that the new commit-graph should become the basis of a fresh, length-one chain. This addresses the above situation by making it possible for the caller to instead write: $ git commit-graph write --reachable --split=replace which writes a new length-one chain to 'objects/info/commit-graphs', making the commit-graph incremental generated by the subsequent push relatively cheap by avoiding the aforementioned copy. In order to do this, remove an assumption in 'write_commit_graph_file' that chains are always at least two incrementals long. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* | builtin/commit-graph.c: introduce split strategy 'no-merge'Taylor Blau2020-04-151-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the previous commit, we laid the groundwork for supporting different splitting strategies. In this commit, we introduce the first splitting strategy: 'no-merge'. Passing '--split=no-merge' is useful for callers which wish to write a new incremental commit-graph, but do not want to spend effort condensing the incremental chain [1]. Previously, this was possible by passing '--size-multiple=0', but this no longer the case following 63020f175f (commit-graph: prefer default size_mult when given zero, 2020-01-02). When '--split=no-merge' is given, the commit-graph machinery will never condense an existing chain, and it will always write a new incremental. [1]: This might occur when, for example, a server administrator running some program after each push may want to ensure that each job runs proportional in time to the size of the push, and does not "jump" when the commit-graph machinery decides to trigger a merge. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* | builtin/commit-graph.c: support for '--split[=<strategy>]'Taylor Blau2020-04-151-0/+5
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | With '--split', the commit-graph machinery writes new commits in another incremental commit-graph which is part of the existing chain, and optionally decides to condense the chain into a single commit-graph. This is done to ensure that the asymptotic behavior of looking up a commit in an incremental chain is not dominated by the number of incrementals in that chain. It can be controlled by the '--max-commits' and '--size-multiple' options. In the next two commits, we will introduce additional splitting strategies that can exert additional control over: - when a split commit-graph is and isn't written, and - when the existing commit-graph chain is discarded completely and replaced with another graph To prepare for this, make '--split' take an optional strategy (as in '--split[=<strategy>]'), and add a new enum to describe which strategy is being used. For now, no strategies are given, and the only enumerated value is 'COMMIT_GRAPH_SPLIT_UNSPECIFIED', indicating the absence of a strategy. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* commit-graph.h: use odb in 'load_commit_graph_one_fd_st'Taylor Blau2020-02-041-1/+2
| | | | | | | | | | | | | | Apply a similar treatment as in the previous patch to pass a 'struct object_directory *' through the 'load_commit_graph_one_fd_st' initializer, too. This prevents a potential bug where a pointer comparison is made to a NULL 'g->odb', which would cause the commit-graph machinery to think that a pair of commit-graphs belonged to different alternates when in fact they do not (i.e., in the case of no '--object-dir'). Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* commit-graph.c: remove path normalization, comparisonTaylor Blau2020-02-041-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As of the previous patch, all calls to 'commit-graph.c' functions which perform path normalization (for e.g., 'get_commit_graph_filename()') are of the form 'ctx->odb->path', which is always in normalized form. Now that there are no callers passing non-normalized paths to these functions, ensure that future callers are bound by the same restrictions by making these functions take a 'struct object_directory *' instead of a 'const char *'. To match, replace all calls with arguments of the form 'ctx->odb->path' with 'ctx->odb' To recover the path, functions that perform path manipulation simply use 'odb->path'. Further, avoid string comparisons with arguments of the form 'odb->path', and instead prefer raw pointer comparisons, which accomplish the same effect, but are far less brittle. This has a pleasant side-effect of making these functions much more robust to paths that cannot be normalized by 'normalize_path_copy()', i.e., because they are outside of the current working directory. For example, prior to this patch, Valgrind reports that the following uninitialized memory read [1]: $ ( cd t && GIT_DIR=../.git valgrind git rev-parse HEAD^ ) because 'normalize_path_copy()' can't normalize '../.git' (since it's relative to but above of the current working directory) [2]. By using a 'struct object_directory *' directly, 'get_commit_graph_filename()' does not need to normalize, because all paths are relative to the current working directory since they are always read from the '->path' of an object directory. [1]: https://lore.kernel.org/git/20191027042116.GA5801@sigill.intra.peff.net. [2]: The bug here is that 'get_commit_graph_filename()' returns the result of 'normalize_path_copy()' without checking the return value. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* commit-graph.h: store object directory in 'struct commit_graph'Taylor Blau2020-02-041-2/+3
| | | | | | | | | | | | | | | | | | | In a previous patch, the 'char *object_dir' in 'struct commit_graph' was replaced with a 'struct object_directory'. This patch applies the same treatment to 'struct commit_graph', which is another intermediate step towards getting rid of all path normalization in 'commit-graph.c'. Instead of taking a 'char *object_dir', functions that construct a 'struct commit_graph' now take a 'struct object_directory *'. Any code that needs an object directory path use '->path' instead. This ensures that all calls to functions that perform path normalization are given arguments which do not themselves require normalization. This prepares those functions to drop their normalization entirely, which will occur in the subsequent patch. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* commit-graph.h: store an odb in 'struct write_commit_graph_context'Taylor Blau2020-02-041-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are lots of places in 'commit-graph.h' where a function either has (or almost has) a full 'struct object_directory *', accesses '->path', and then throws away the rest of the struct. This can cause headaches when comparing the locations of object directories across alternates (e.g., in the case of deciding if two commit-graph layers can be merged). These paths are normalized with 'normalize_path_copy()' which mitigates some comparison issues, but not all [1]. Replace usage of 'char *object_dir' with 'odb->path' by storing a 'struct object_directory *' in the 'write_commit_graph_context' structure. This is an intermediate step towards getting rid of all path normalization in 'commit-graph.c'. Resolving a user-provided '--object-dir' argument now requires that we compare it to the known alternates for equality. Prior to this patch, an unknown '--object-dir' argument would silently exit with status zero. This can clearly lead to unintended behavior, such as verifying commit-graphs that aren't in a repository's own object store (or one of its alternates), or causing a typo to mask a legitimate commit-graph verification failure. Make this error non-silent by 'die()'-ing when the given '--object-dir' does not match any known alternate object store. [1]: In my testing, for example, I can get one side of the commit-graph code to fill object_dir with "./objects" and the other with just "objects". Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* upload-pack: disable commit graph more gently for shallow traversalJeff King2019-09-121-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When the client has asked for certain shallow options like "deepen-since", we do a custom rev-list walk that pretends to be shallow. Before doing so, we have to disable the commit-graph, since it is not compatible with the shallow view of the repository. That's handled by 829a321569 (commit-graph: close_commit_graph before shallow walk, 2018-08-20). That commit literally closes and frees our repo->objects->commit_graph struct. That creates an interesting problem for commits that have _already_ been parsed using the commit graph. Their commit->object.parsed flag is set, their commit->graph_pos is set, but their commit->maybe_tree may still be NULL. When somebody later calls repo_get_commit_tree(), we see that we haven't loaded the tree oid yet and try to get it from the commit graph. But since it has been freed, we segfault! So the root of the issue is a data dependency between the commit's lazy-load of the tree oid and the fact that the commit graph can go away mid-process. How can we resolve it? There are a couple of general approaches: 1. The obvious answer is to avoid loading the tree from the graph when we see that it's NULL. But then what do we return for the tree oid? If we return NULL, our caller in do_traverse() will rightly complain that we have no tree. We'd have to fallback to loading the actual commit object and re-parsing it. That requires teaching parse_commit_buffer() to understand re-parsing (i.e., not starting from a clean slate and not leaking any allocated bits like parent list pointers). 2. When we close the commit graph, walk through the set of in-memory objects and clear any graph_pos pointers. But this means we also have to "unparse" any such commits so that we know they still need to open the commit object to fill in their trees. So it's no less complicated than (1), and is more expensive (since we clear objects we might not later need). 3. Stop freeing the commit-graph struct. Continue to let it be used for lazy-loads of tree oids, but let upload-pack specify that it shouldn't be used for further commit parsing. 4. Push the whole shallow rev-list out to its own sub-process, with the commit-graph disabled from the start, giving it a clean memory space to work from. I've chosen (3) here. Options (1) and (2) would work, but are non-trivial to implement. Option (4) is more expensive, and I'm not sure how complicated it is (shelling out for the actual rev-list part is easy, but we do then parse the resulting commits internally, and I'm not clear which parts need to be handling shallow-ness). The new test in t5500 triggers this segfault, but see the comments there for how horribly intimate it has to be with how both upload-pack and commit graphs work. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* commit-graph: error out on invalid commit oids in 'write --stdin-commits'SZEDER Gábor2019-08-051-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While 'git commit-graph write --stdin-commits' expects commit object ids as input, it accepts and silently skips over any invalid commit object ids, and still exits with success: # nonsense $ echo not-a-commit-oid | git commit-graph write --stdin-commits $ echo $? 0 # sometimes I forgot that refs are not good... $ echo HEAD | git commit-graph write --stdin-commits $ echo $? 0 # valid tree OID, but not a commit OID $ git rev-parse HEAD^{tree} | git commit-graph write --stdin-commits $ echo $? 0 $ ls -l .git/objects/info/commit-graph ls: cannot access '.git/objects/info/commit-graph': No such file or directory Check that all input records are indeed valid commit object ids and return with error otherwise, the same way '--stdin-packs' handles invalid input; see e103f7276f (commit-graph: return with errors during write, 2019-06-12). Note that it should only return with error when encountering an invalid commit object id coming from standard input. However, '--reachable' uses the same code path to process object ids pointed to by all refs, and that includes tag object ids as well, which should still be skipped over. Therefore add a new flag to 'enum commit_graph_write_flags' and a corresponding field to 'struct write_commit_graph_context', so we can differentiate between those two cases. Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com> Acked-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* commit-graph: turn a group of write-related macro flags into an enumSZEDER Gábor2019-08-051-5/+8
| | | | | | Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com> Acked-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* commit-graph: verify chains with --shallow modeDerrick Stolee2019-06-191-2/+4
| | | | | | | | | | | | | | | If we wrote a commit-graph chain, we only modified the tip file in the chain. It is valuable to verify what we wrote, but not waste time checking files we did not write. Add a '--shallow' option to the 'git commit-graph verify' subcommand and check that it does not read the base graph in a two-file chain. Making the verify subcommand read from a chain of commit-graphs takes some rearranging of the builtin code. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* commit-graph: create options for split filesDerrick Stolee2019-06-191-2/+10
| | | | | | | | | | | | The split commit-graph feature is now fully implemented, but needs some more run-time configurability. Allow direct callers to 'git commit-graph write --split' to specify the values used in the merge strategy and the expire time. Update the documentation to specify these values. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* commit-graph: allow cross-alternate chainsDerrick Stolee2019-06-191-0/+1
| | | | | | | | | | | | | | | | | | | | | | | In an environment like a fork network, it is helpful to have a commit-graph chain that spans both the base repo and the fork repo. The fork is usually a small set of data on top of the large repo, but sometimes the fork is much larger. For example, git-for-windows/git has almost double the number of commits as git/git because it rebases its commits on every major version update. To allow cross-alternate commit-graph chains, we need a few pieces: 1. When looking for a graph-{hash}.graph file, check all alternates. 2. When merging commit-graph chains, do not merge across alternates. 3. When writing a new commit-graph chain based on a commit-graph file in another object directory, do not allow success if the base file has of the name "commit-graph" instead of "commit-graphs/graph-{hash}.graph". Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* commit-graph: write commit-graph chainsDerrick Stolee2019-06-191-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Extend write_commit_graph() to write a commit-graph chain when given the COMMIT_GRAPH_SPLIT flag. This implementation is purposefully simplistic in how it creates a new chain. The commits not already in the chain are added to a new tip commit-graph file. Much of the logic around writing a graph-{hash}.graph file and updating the commit-graph-chain file is the same as the commit-graph file case. However, there are several places where we need to do some extra logic in the split case. Track the list of graph filenames before and after the planned write. This will be more important when we start merging graph files, but it also allows us to upgrade our commit-graph file to the appropriate graph-{hash}.graph file when we upgrade to a chain of commit-graphs. Note that we use the eighth byte of the commit-graph header to store the number of base graph files. This determines the length of the base graphs chunk. A subtle change of behavior with the new logic is that we do not write a commit-graph if we our commit list is empty. This extends to the typical case, which is reflected in t5318-commit-graph.sh. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* commit-graph: add base graphs chunkDerrick Stolee2019-06-191-0/+1
| | | | | | | | | | | | | | | To quickly verify a commit-graph chain is valid on load, we will read from the new "Base Graphs Chunk" of each file in the chain. This will prevent accidentally loading incorrect data from manually editing the commit-graph-chain file or renaming graph-{hash}.graph files. The commit_graph struct already had an object_id struct "oid", but it was never initialized or used. Add a line to read the hash from the end of the commit-graph file and into the oid member. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* commit-graph: prepare for commit-graph chainsDerrick Stolee2019-06-191-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | To prepare for a chain of commit-graph files, augment the commit_graph struct to point to a base commit_graph. As we load commits from the graph, we may actually want to read from a base file according to the graph position. The "graph position" of a commit is given by concatenating the lexicographic commit orders from each of the commit-graph files in the chain. This means that we must distinguish two values: * lexicographic index : the position within the lexicographic order in a single commit-graph file. * graph position: the position within the concatenated order of multiple commit-graph files Given the lexicographic index of a commit in a graph, we can compute the graph position by adding the number of commits in the lower-level graphs. To find the lexicographic index of a commit, we subtract the number of commits in lower-level graphs. While here, change insert_parent_or_die() to take a uint32_t position, as that is the type used by its only caller and that makes more sense with the limits in the commit-graph format. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* commit-graph: use raw_object_store when closingDerrick Stolee2019-06-121-1/+1
| | | | | | | | | The close_commit_graph() method took a repository struct, but then only uses the raw_object_store within. Change the function prototype to make the method more flexible. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* commit-graph: collapse parameters into flagsDerrick Stolee2019-06-121-3/+5
| | | | | | | | | | | | | The write_commit_graph() and write_commit_graph_reachable() methods currently take two boolean parameters: 'append' and 'report_progress'. As we update these methods, adding more parameters this way becomes cluttered and hard to maintain. Collapse these parameters into a 'flags' parameter, and adjust the callers to provide flags as necessary. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* commit-graph: return with errors during writeDerrick Stolee2019-06-121-5/+11
| | | | | | | | | | | | | | | | | | | | The write_commit_graph() method uses die() to report failure and exit when confronted with an unexpected condition. This use of die() in a library function is incorrect and is now replaced by error() statements and an int return type. Return zero on success and a negative value on failure. Now that we use 'goto cleanup' to jump to the terminal condition on an error, we have new paths that could lead to uninitialized values. New initializers are added to correct for this. The builtins 'commit-graph', 'gc', and 'commit' call these methods, so update them to check the return value. Test that 'git commit-graph write' returns a proper error code when hitting a failure condition in write_commit_graph(). Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* commit-graph write: don't die if the existing graph is corruptÆvar Arnfjörð Bjarmason2019-04-011-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When the commit-graph is written we end up calling parse_commit(). This will in turn invoke code that'll consult the existing commit-graph about the commit, if the graph is corrupted we die. We thus get into a state where a failing "commit-graph verify" can't be followed-up with a "commit-graph write" if core.commitGraph=true is set, the graph either needs to be manually removed to proceed, or core.commitGraph needs to be set to "false". Change the "commit-graph write" codepath to use a new parse_commit_no_graph() helper instead of parse_commit() to avoid this. The latter will call repo_parse_commit_internal() with use_commit_graph=1 as seen in 177722b344 ("commit: integrate commit graph with commit parsing", 2018-04-10). Not using the old graph at all slows down the writing of the new graph by some small amount, but is a sensible way to prevent an error in the existing commit-graph from spreading. Just fixing the current issue would be likely to result in code that's inadvertently broken in the future. New code might use the commit-graph at a distance. To detect such cases introduce a "GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD" setting used when we do our corruption tests, and test that a "write/verify" combo works after every one of our current test cases where we now detect commit-graph corruption. Some of the code changes here might be strictly unnecessary, e.g. I was unable to find cases where the parse_commit() called from write_graph_chunk_data() didn't exit early due to "item->object.parsed" being true in repo_parse_commit_internal() (before the use_commit_graph=1 has any effect). But let's also convert those cases for good measure, we do not have exhaustive tests for all possible types of commit-graph corruption. This might need to be re-visited if we learn to write the commit-graph incrementally, but probably not. Hopefully we'll just start by finding out what commits we have in total, then read the old graph(s) to see what they cover, and finally write a new graph file with everything that's missing. In that case the new graph writing code just needs to continue to use e.g. a parse_commit() that doesn't consult the existing commit-graphs. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* commit-graph: don't pass filename to load_commit_graph_one_fd_st()Ævar Arnfjörð Bjarmason2019-04-011-2/+1
| | | | | | | | | | | | | | | | | | | An earlier change implemented load_commit_graph_one_fd_st() in a way that was bug-compatible with earlier code in terms of the "graph file %s is too small" error message printing out the path to the commit-graph (".git/objects/info/commit-graph"). But change that, because: * A function that takes an already-open file descriptor also needing the filename isn't very intuitive. * The vast majority of errors we might emit when loading the graph come from parse_commit_graph(), which doesn't report the filename. Let's not do that either in this case for consistency. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>