path: root/src/backend/access/heap

* Fix typos in comments (Michael Paquier, 2023-05-02; 4 files, -6/+5)

    The changes done in this commit impact comments with no direct user-visible changes, with fixes for incorrect function, variable or structure names.

    Author: Alexander Lakhin
    Discussion: https://postgr.es/m/e8c38840-596a-83d6-bd8d-cebc51111572@gmail.com

* Fix assertion failure in heap_vacuum_rel (Daniel Gustafsson, 2023-04-28; 1 file, -1/+5)

    Commit 7d71d3dd08 changed resetting the VacuumFailsafeActive flag to an assertion since the flag is reset before starting vacuuming a relation. This however failed to take recursive calls of vacuum_rel() and vacuum of TOAST tables into consideration. Fix by reverting back to resetting the flag.

    Author: Masahiko Sawada <sawada.mshk@gmail.com>
    Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
    Reported-by: John Naylor <john.naylor@enterprisedb.com>
    Discussion: https://postgr.es/m/CAFBsxsFz=GqaG5Ens5aNgVYoV2Y+pfMUijX0ku+CCkWfALwiqg@mail.gmail.com

* Fix xl_heap_lock WAL record field's data type. (Peter Geoghegan, 2023-04-11; 1 file, -3/+3)

    Make xl_heap_lock's infobits_set field of type uint8, not int8. Using int8 isn't appropriate given that the field just holds status bits. This fixes an oversight in commit 0ac5ad5134.

    In passing rename the nearby TransactionId field to "xmax" to make things consistent with related records, such as xl_heap_lock_updated.

    Deliberately avoid a bump in XLOG_PAGE_MAGIC. No backpatch, either.

    Author: Peter Geoghegan <pg@bowt.ie>
    Discussion: https://postgr.es/m/CAH2-WzkCd3kOS8b7Rfxw7Mh1_6jvX=Nzo-CWR1VBTiOtVZkWHA@mail.gmail.com
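
    The resulting record layout is approximately the following (a sketch based on the description above, not verbatim from heapam_xlog.h):

        typedef struct xl_heap_lock
        {
            TransactionId xmax;         /* renamed to "xmax" by this commit */
            OffsetNumber offnum;        /* locked tuple's offset on page */
            uint8       infobits_set;   /* status bits: uint8, was int8 */
            uint8       flags;
        } xl_heap_lock;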

* Handle logical slot conflicts on standby (Andres Freund, 2023-04-08; 1 file, -0/+3)

    During WAL replay on the standby, when a conflict with a logical slot is identified, invalidate such slots. There are two sources of conflicts:
    1) Using the information added in 6af1793954e, logical slots are invalidated if required rows are removed
    2) wal_level on the primary server is reduced to below logical

    Uses the infrastructure introduced in the prior commit. FIXME: add commit reference.

    Change InvalidatePossiblyObsoleteSlot() to use a recovery conflict to interrupt use of a slot, if called in the startup process. The new recovery conflict is added to pg_stat_database_conflicts, as confl_active_logicalslot.

    See 6af1793954e for an overall design of logical decoding on a standby.

    Bumps catversion for the addition of the pg_stat_database_conflicts column. Bumps PGSTAT_FILE_FORMAT_ID for the same reason.

    Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
    Author: Andres Freund <andres@anarazel.de>
    Author: Amit Khandekar <amitdkhan.pg@gmail.com> (in an older version)
    Reviewed-by: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
    Reviewed-by: Andres Freund <andres@anarazel.de>
    Reviewed-by: Robert Haas <robertmhaas@gmail.com>
    Reviewed-by: Fabrízio de Royes Mello <fabriziomello@gmail.com>
    Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
    Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
    Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
    Discussion: https://postgr.es/m/20230407075009.igg7be27ha2htkbt@awork3.anarazel.de

* Introduce PG_IO_ALIGN_SIZE and align all I/O buffers. (Thomas Munro, 2023-04-08; 1 file, -1/+1)

    In order to have the option to use O_DIRECT/FILE_FLAG_NO_BUFFERING in a later commit, we need the addresses of user space buffers to be well aligned. The exact requirements vary by OS and file system (typically sectors and/or memory pages). The address alignment size is set to 4096, which is enough for currently known systems: it matches modern sectors and common memory page size. There is no standard governing O_DIRECT's requirements so we might eventually have to reconsider this with more information from the field or future systems.

    Aligning I/O buffers on memory pages is also known to improve regular buffered I/O performance.

    Three classes of I/O buffers for regular data pages are adjusted:
    (1) Heap buffers are now allocated with the new palloc_aligned() or MemoryContextAllocAligned() functions introduced by commit 439f6175.
    (2) Stack buffers now use a new struct PGIOAlignedBlock to respect PG_IO_ALIGN_SIZE, if possible with this compiler.
    (3) The buffer pool is also aligned in shared memory.

    WAL buffers were already aligned on XLOG_BLCKSZ. It's possible for XLOG_BLCKSZ to be configured smaller than PG_IO_ALIGN_SIZE and thus for O_DIRECT WAL writes to fail to be well aligned, but that's a pre-existing condition and will be addressed by a later commit.

    BufFiles are not yet addressed (there's no current plan to use O_DIRECT for those, but they could potentially get some incidental speedup even in plain buffered I/O operations through better alignment).

    If we can't align stack objects suitably using the compiler extensions we know about, we disable the use of O_DIRECT by setting PG_O_DIRECT to 0. This avoids the need to consider systems that have O_DIRECT but can't align stack objects the way we want; such systems could in theory be supported with more work but we don't currently know of any such machines, so it's easier to pretend there is no O_DIRECT support instead. That's an existing and tested class of system.

    Add assertions that all buffers passed into smgrread(), smgrwrite() and smgrextend() are correctly aligned, unless PG_O_DIRECT is 0 (= stack alignment tricks may be unavailable) or the block size has been set too small to allow arrays of buffers to be all aligned.

    Author: Thomas Munro <thomas.munro@gmail.com>
    Author: Andres Freund <andres@anarazel.de>
    Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
    Discussion: https://postgr.es/m/CA+hUKGK1X532hYqJ_MzFWt0n1zt8trz980D79WbjwnT-yYLZpg@mail.gmail.com
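
    A minimal standalone illustration of the alignment requirement (not PostgreSQL code; in the tree this role is played by PG_IO_ALIGN_SIZE and palloc_aligned()):

        #define _POSIX_C_SOURCE 200112L

        #include <stdint.h>
        #include <stdio.h>
        #include <stdlib.h>

        #define IO_ALIGN_SIZE 4096  /* matches modern sectors and pages */

        int main(void)
        {
            void *buf;

            /* Allocate a buffer whose address is a multiple of
             * IO_ALIGN_SIZE, as O_DIRECT typically requires for
             * user-space buffers. */
            if (posix_memalign(&buf, IO_ALIGN_SIZE, 2 * IO_ALIGN_SIZE) != 0)
                return 1;

            printf("aligned: %s\n",
                   ((uintptr_t) buf % IO_ALIGN_SIZE == 0) ? "yes" : "no");
            free(buf);
            return 0;
        }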

* hio: Use ExtendBufferedRelBy() to extend tables more efficiently (Andres Freund, 2023-04-06; 2 files, -144/+222)

    While we already had some form of bulk extension for relations, it was fairly limited. It only amortized the cost of acquiring the extension lock; the relation itself was still extended one-by-one. Bulk extension was also solely triggered by contention, not by the amount of data inserted.

    To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to extend the relation. We try to extend the relation by multiple blocks in two situations:

    1) The caller tells RelationGetBufferForTuple() that it will need multiple pages. For now that's only used by heap_multi_insert(), see commit FIXME.

    2) If there is contention on the extension lock, use the number of waiters for the lock as a multiplier for the number of blocks to extend by. This is similar to what we already did. Previously we additionally multiplied the number of waiters by 20, but with the new relation extension infrastructure I could not see a benefit in doing so.

    Using the freespacemap to provide empty pages can cause significant contention, and adds measurable overhead, even if there is no contention. To reduce that, remember the blocks the relation was extended by in the BulkInsertState, in the extending backend. In case 1) from above, the blocks the extending backend needs are not entered into the FSM, as we know that we will need those blocks.

    One complication with using the FSM to record empty pages is that we need to insert blocks into the FSM when we already hold a buffer content lock. To avoid doing IO while holding a content lock, release the content lock before recording free space. Currently that opens a small window in which another backend could fill the block, if a concurrent VACUUM records the free space. If that happens, we retry, similar to the already existing case when otherBuffer is provided. In the future it might be worth closing the race by preventing VACUUM from recording the space in newly extended pages.

    This change provides very significant wins (3x at 16 clients, on my workstation) for concurrent COPY into a single relation. Even single threaded COPY is measurably faster, primarily due to not dirtying pages while extending, if supported by the operating system (see commit 4d330a61bb1). Even single-row INSERTs benefit, although to a much smaller degree, as the relation extension lock rarely is the primary bottleneck.

    Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
    Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
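
    The choice of how many blocks to extend by, per the two cases above, comes down to something like the following (a hedged sketch, not the committed hio.c code; num_pages stands for the caller's hint):

        uint32      extend_by_pages = num_pages;    /* case 1: caller's hint,
                                                     * e.g. heap_multi_insert() */

        if (!RELATION_IS_LOCAL(relation))
        {
            int         waitcount = RelationExtensionLockWaiterCount(relation);

            /* case 2: use the number of lock waiters as a multiplier;
             * the old additional 20x multiplier is gone */
            extend_by_pages += extend_by_pages * waitcount;
        }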

* heapam: Pass number of required pages to RelationGetBufferForTuple() (Andres Freund, 2023-04-06; 2 files, -5/+61)

    A future commit will use this information to determine how aggressively to extend the relation. In heap_multi_insert() we know accurately how many pages we need once we need to extend the relation, providing an accurate lower bound for how much to extend.

    Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
    Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de

* Refresh cost-based delay params more frequently in autovacuum (Daniel Gustafsson, 2023-04-07; 1 file, -1/+1)

    Allow autovacuum to reload the config file more often so that cost-based delay parameters can take effect while VACUUMing a relation. Previously, autovacuum workers only reloaded the config file once per relation vacuumed, so config changes could not take effect until beginning to vacuum the next table.

    Now, check if a reload is pending roughly once per block, when checking if we need to delay.

    In order for autovacuum workers to safely update their own cost delay and cost limit parameters without impacting performance, we had to rethink when and how these values were accessed.

    Previously, an autovacuum worker's wi_cost_limit was set only at the beginning of vacuuming a table, after reloading the config file. Therefore, at the time that autovac_balance_cost() was called, workers vacuuming tables with no cost-related storage parameters could still have different values for their wi_cost_limit_base and wi_cost_delay.

    Now that the cost parameters can be updated while vacuuming a table, workers will (within some margin of error) have no reason to have different values for cost limit and cost delay (in the absence of cost-related storage parameters). This removes the rationale for keeping cost limit and cost delay in shared memory. Balancing the cost limit requires only the number of active autovacuum workers vacuuming a table with no cost-based storage parameters.

    Author: Melanie Plageman <melanieplageman@gmail.com>
    Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
    Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
    Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
    Reviewed-by: Robert Haas <robertmhaas@gmail.com>
    Discussion: https://www.postgresql.org/message-id/flat/CAAKRu_ZngzqnEODc7LmS1NH04Kt6Y9huSjz5pp7%2BDXhrjDA0gw%40mail.gmail.com
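
    The per-block check amounts to roughly the following (a sketch only; VacuumUpdateCosts() stands in for the step that recomputes the worker's cost delay and limit, and placement details differ in the real code):

        /* Inside the code path that decides whether to nap between blocks. */
        if (ConfigReloadPending && IsAutoVacuumWorkerProcess())
        {
            ConfigReloadPending = false;
            ProcessConfigFile(PGC_SIGHUP);
            VacuumUpdateCosts();    /* refresh cost-based delay parameters */
        }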

* Make vacuum failsafe_active globally visible (Daniel Gustafsson, 2023-04-07; 1 file, -9/+7)

    While vacuuming a table in failsafe mode, VacuumCostActive should not be re-enabled. This currently isn't a problem because vacuum cost parameters are only refreshed in between vacuuming tables and failsafe status is reset for every table.

    In preparation for allowing vacuum cost parameters to be updated more frequently, elevate LVRelState->failsafe_active to a global, VacuumFailsafeActive, which will be checked when determining whether or not to re-enable vacuum cost-related delays.

    Author: Melanie Plageman <melanieplageman@gmail.com>
    Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
    Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
    Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
    Reviewed-by: Robert Haas <robertmhaas@gmail.com>
    Discussion: https://www.postgresql.org/message-id/flat/CAAKRu_ZngzqnEODc7LmS1NH04Kt6Y9huSjz5pp7%2BDXhrjDA0gw%40mail.gmail.com

* hio: Don't pin the VM while holding buffer lock while extending (Andres Freund, 2023-04-06; 1 file, -41/+85)

    Starting with commit 7db0cd2145f, RelationGetBufferForTuple() did a visibilitymap_pin() while holding an exclusive buffer content lock on a newly extended page, when using COPY FREEZE. We elsewhere try hard to avoid doing IO while holding a content lock. And until 14f98e0af99, that happened while holding the relation extension lock.

    Practically, this isn't a huge issue, because COPY FREEZE is restricted to relations created or truncated in the current session, so it's unlikely there's a lot of contention.

    We can't avoid doing IO while holding the content lock by pinning the VM earlier, because we don't know which page it will be on.

    While we could just ignore the issue in this case, a future commit will add bulk relation extension, which needs to enter pages into the FSM while also trying to hold onto a buffer lock.

    To address this issue, use visibilitymap_pin_ok() to see if the relevant buffer is already pinned. If not, release the buffer, pin the VM buffer, and acquire the lock again. This opens up a small window for other backends to insert data onto the page - as the page is not entered into the freespacemap, other backends won't see it normally, but a concurrent vacuum could enter the page, if it started just after the relation is extended. In case the page is used by another backend, retry. This is very similar to how locking "otherBuffer" is already dealt with.

    Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
    Discussion: http://postgr.es/m/20230325025740.wzvchp2kromw4zqz@awork3.anarazel.de
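
    The retry dance described above looks roughly like this (a simplified sketch; the "loop" label is a stand-in for the function's actual retry path):

        if (!visibilitymap_pin_ok(targetBlock, vmbuffer))
        {
            /* Don't do IO while holding the buffer content lock. */
            LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
            visibilitymap_pin(relation, targetBlock, &vmbuffer);
            LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);

            /* A concurrent vacuum may have exposed the page via the FSM
             * while it was unlocked; if another backend used it, retry. */
            if (!PageIsEmpty(BufferGetPage(buffer)))
                goto loop;
        }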

* hio: Relax rules for calling GetVisibilityMapPins() (Andres Freund, 2023-04-06; 1 file, -11/+26)

    GetVisibilityMapPins() insisted on buffer1/buffer2 being in a specific order. This required checks at the callsite. As a subsequent patch will add another callsite, move related logic into GetVisibilityMapPins().

    Discussion: https://postgr.es/m/20230403190030.fk2frxv6faklrseb@awork3.anarazel.de

* Use ExtendBufferedRelTo() in {vm,fsm}_extend() (Andres Freund, 2023-04-05; 1 file, -55/+25)

    This uses ExtendBufferedRelTo(), introduced in 31966b151e6, to extend the visibilitymap and freespacemap to the size needed.

    It also happens to fix a warning introduced in 3d6a98457d8, reported by Tom Lane.

    Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
    Discussion: https://postgr.es/m/2194723.1680736788@sss.pgh.pa.us

* Don't initialize page in {vm,fsm}_extend(), not needed (Andres Freund, 2023-04-05; 1 file, -5/+1)

    The read path needs to be able to initialize pages anyway, as relation extensions are not durable. By avoiding initializing pages, we can, in a future patch, extend the relation by multiple blocks at once.

    Using smgrextend() for {vm,fsm}_extend() is not a good idea in general, as at least one page of the VM/FSM will be read immediately after, always causing a cache miss, requiring us to read content we just wrote.

    Discussion: https://postgr.es/m/20230301223515.pucbj7nb54n4i4nv@awork3.anarazel.de
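
    Initialization now happens lazily on the read side, along these lines (a sketch of the idea, not the exact committed code):

        buf = ReadBufferExtended(rel, VISIBILITYMAP_FORKNUM, blkno,
                                 RBM_ZERO_ON_ERROR, NULL);

        /* Extension is not durable, so the read path copes with
         * uninitialized (all-zero) pages and initializes them on
         * first use. */
        if (PageIsNew(BufferGetPage(buf)))
            PageInit(BufferGetPage(buf), BLCKSZ, 0);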

* Move heaprel struct field next to index rel field. (Peter Geoghegan, 2023-04-03; 1 file, -2/+2)

    Commit 61b313e4 added a heaprel struct member to IndexVacuumInfo, but placed it last. Move the heaprel struct member next to the index struct member to improve the code's readability.

    Author: Peter Geoghegan <pg@bowt.ie>
    Discussion: https://postgr.es/m/CAH2-WznG=TV6S9d3VA=y0vBHbXwnLs9_LLdiML=aNJuHeriwxg@mail.gmail.com

* Revert 11470f544e (Alexander Korotkov, 2023-04-03; 1 file, -100/+9)

    Discussion: https://postgr.es/m/20230323003003.plgaxjqahjgkuxrk%40awork3.anarazel.de

* Disable vacuum's use of a buffer access strategy during failsafe (David Rowley, 2023-04-03; 1 file, -0/+7)

    Traditionally, vacuum always makes use of a buffer access strategy 32 buffers in size. This means that running vacuums tend not to cause too many shared buffers to become dirty; however, this can cause vacuums to run much more slowly than they otherwise could, as WAL flushes will occur more frequently due to having to flush WAL out to the LSN of the dirty page before that page can be written to disk.

    When we are performing failsafe VACUUMs (as added in 1e55e7d17), we really want to make the vacuum work go as quickly as possible, so here we disable the buffer access strategy when entering failsafe mode while vacuuming a relation.

    Per idea and analysis from Andres Freund.

    In passing, also include some changes I had intended for 32fbe0239.

    Author: Melanie Plageman
    Reviewed-by: Justin Pryzby, David Rowley
    Discussion: https://postgr.es/m/20230111182720.ejifsclfwymw2reb%40awork3.anarazel.de

* Add info in WAL records in preparation for logical slot conflict handling (Andres Freund, 2023-04-02; 2 files, -1/+11)

    This commit only implements one prerequisite part for allowing logical decoding. The commit message contains an explanation of the overall design, which later commits will refer back to.

    Overall design:

    1. We want to enable logical decoding on standbys, but replay of WAL from the primary might remove data that is needed by logical decoding, causing error(s) on the standby. To prevent those errors, a new replication conflict scenario needs to be addressed (as much as hot standby does).

    2. Our chosen strategy for dealing with this type of replication slot is to invalidate logical slots for which needed data has been removed.

    3. To do this we need the latestRemovedXid for each change, just as we do for physical replication conflicts, but we also need to know whether any particular change was to data that logical replication might access. That way, during WAL replay, we know when there is a risk of conflict and, if so, if there is a conflict.

    4. We can't rely on the standby's relcache entries for this purpose in any way, because the startup process can't access catalog contents.

    5. Therefore every WAL record that potentially removes data from the index or heap must carry a flag indicating whether or not it is one that might be accessed during logical decoding.

    Why do we need this for logical decoding on standby?

    First, let's forget about logical decoding on standby and recall that on a primary database, any catalog rows that may be needed by a logical decoding replication slot are not removed. This is done thanks to the catalog_xmin associated with the logical replication slot.

    But, with logical decoding on standby, in the following cases:
    - hot_standby_feedback is off
    - hot_standby_feedback is on but there is no physical slot between the primary and the standby. Then, hot_standby_feedback will work, but only while the connection is alive (for example a node restart would break it)

    Then, the primary may delete system catalog rows that could be needed by the logical decoding on the standby (as it does not know about the catalog_xmin on the standby).

    So, it's mandatory to identify those rows and invalidate the slots that may need them if any. Identifying those rows is the purpose of this commit.

    Implementation:

    When a WAL replay on standby indicates that a catalog table tuple is to be deleted by an xid that is greater than a logical slot's catalog_xmin, then that means the slot's catalog_xmin conflicts with the xid, and we need to handle the conflict. While subsequent commits will do the actual conflict handling, this commit adds a new field isCatalogRel in such WAL records (and a new bit set in the xl_heap_visible flags field), that is true for catalog tables, so as to arrange for conflict handling.

    The affected WAL records are the ones that already contain the snapshotConflictHorizon field, namely:
    - gistxlogDelete
    - gistxlogPageReuse
    - xl_hash_vacuum_one_page
    - xl_heap_prune
    - xl_heap_freeze_page
    - xl_heap_visible
    - xl_btree_reuse_page
    - xl_btree_delete
    - spgxlogVacuumRedirect

    Due to this new field being added, xl_hash_vacuum_one_page and gistxlogDelete do now contain the offsets to be deleted as a FLEXIBLE_ARRAY_MEMBER. This is needed to ensure correct alignment. It's not needed on the other structs where isCatalogRel has been added.

    This commit just introduces the WAL format changes mentioned above. Handling the actual conflicts will follow in future commits.

    Bumps XLOG_PAGE_MAGIC as several WAL records are changed.

    Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
    Author: Andres Freund <andres@anarazel.de> (in an older version)
    Author: Amit Khandekar <amitdkhan.pg@gmail.com> (in an older version)
    Reviewed-by: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
    Reviewed-by: Andres Freund <andres@anarazel.de>
    Reviewed-by: Robert Haas <robertmhaas@gmail.com>
    Reviewed-by: Fabrízio de Royes Mello <fabriziomello@gmail.com>
    Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
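
    As an example, one affected record ends up looking approximately like this (a sketch; compare hash_xlog.h for the authoritative definition):

        typedef struct xl_hash_vacuum_one_page
        {
            TransactionId snapshotConflictHorizon;
            uint16      ntuples;
            bool        isCatalogRel;   /* new: lets standby WAL replay
                                         * detect conflicts with logical
                                         * slots */

            /* offsets to delete now trail the struct, for alignment */
            OffsetNumber offsets[FLEXIBLE_ARRAY_MEMBER];
        } xl_hash_vacuum_one_page;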

* Pass down table relation into more index relation functions (Andres Freund, 2023-04-01; 4 files, -5/+11)

    This is done in preparation for logical decoding on standby, which needs to include whether visibility-affecting WAL records are about a (user) catalog table, which is only known for the table, not the indexes.

    It's also nice to be able to pass the heap relation to GlobalVisTestFor() in vacuumRedirectAndPlaceholder().

    Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
    Discussion: https://postgr.es/m/21b700c3-eecf-2e05-a699-f8c78dd31ec7@gmail.com

* Assert only valid flag bits are passed to visibilitymap_set() (Andres Freund, 2023-04-01; 1 file, -1/+1)

    If visibilitymap_set() is called with flags containing a higher bit than VISIBILITYMAP_ALL_FROZEN, the state of neighboring pages is affected. While there was an assertion that *some* valid bits were set, it did not check that *only* valid bits were. Change that.

    Discussion: https://postgr.es/m/20230331043300.gux3s5wzrapqi4oe@awork3.anarazel.de
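
    The tightened assertion is essentially the following (sketch):

        /* Before: only checked that at least one valid bit was set. */
        Assert(flags & VISIBILITYMAP_VALID_BITS);

        /* After: also rejects any bit outside the valid mask, since a
         * stray higher bit would silently affect neighboring pages. */
        Assert((flags & VISIBILITYMAP_VALID_BITS) == flags);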

* hio: Release extension lock before initializing page / pinning VM (Andres Freund, 2023-04-01; 1 file, -7/+7)

    PageInit() while holding the extension lock is unnecessary after 0d1fe9f74e3 started to use RBM_ZERO_AND_LOCK - nobody can look at the new page before we release the page lock. PageInit() zeroes the page, which isn't that cheap, so deferring it until after the extension lock is released seems like a good idea.

    Doing visibilitymap_pin() while holding the extension lock, introduced in 7db0cd2145f2, looks like an accident. Due to the restrictions on HEAP_INSERT_FROZEN it's unlikely to be a performance issue, but it still seems better to move it out. We also are doing the visibilitymap_pin() while holding the buffer lock, which will be fixed in a separate commit.

    Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
    Discussion: http://postgr.es/m/419312fd-9255-078c-c3e3-f0525f911d7f@iki.fi

* Count updates that move row to a new page. (Peter Geoghegan, 2023-03-23; 1 file, -1/+1)

    Add pgstat counter to track row updates that result in the successor version going to a new heap page, leaving behind an original version whose t_ctid points to the new version. The current count is shown by the n_tup_newpage_upd column of each of the pg_stat_*_tables views.

    The new n_tup_newpage_upd column complements the existing n_tup_hot_upd and n_tup_upd columns. Tables that have high n_tup_newpage_upd values (relative to n_tup_upd) are good candidates for tuning heap fillfactor.

    Corey Huinker, with small tweaks by me.

    Author: Corey Huinker <corey.huinker@gmail.com>
    Reviewed-By: Peter Geoghegan <pg@bowt.ie>
    Reviewed-By: Andres Freund <andres@anarazel.de>
    Discussion: https://postgr.es/m/CADkLM=ded21M9iZ36hHm-vj2rE2d=zcKpUQMds__Xm2pxLfHKA@mail.gmail.com

* Allow locking updated tuples in tuple_update() and tuple_delete() (Alexander Korotkov, 2023-03-23; 1 file, -9/+100)

    Currently, in read committed transaction isolation mode (default), we have the following sequence of actions when tuple_update()/tuple_delete() finds the tuple updated by a concurrent transaction.

    1. Attempt to update/delete tuple with tuple_update()/tuple_delete(), which returns TM_Updated.
    2. Lock tuple with tuple_lock().
    3. Re-evaluate plan qual (recheck if we still need to update/delete and calculate the new tuple for update).
    4. Second attempt to update/delete tuple with tuple_update()/tuple_delete(). This attempt should be successful, since the tuple was previously locked.

    This patch eliminates step 2 by taking the lock during the first tuple_update()/tuple_delete() call. The heap table access method saves some effort by checking the updated tuple once instead of twice. Future undo-based table access methods, which will start from the latest row version, can immediately place a lock there.

    The code in nodeModifyTable.c is simplified by removing the nested switch/case.

    Discussion: https://postgr.es/m/CAPpHfdua-YFw3XTprfutzGp28xXLigFtzNbuFY8yPhqeq6X5kg%40mail.gmail.com
    Reviewed-by: Aleksander Alekseev, Pavel Borisov, Vignesh C, Mason Sharp
    Reviewed-by: Andres Freund, Chris Travers

* Ignore BRIN indexes when checking for HOT updates (Tomas Vondra, 2023-03-20; 2 files, -8/+59)

    When determining whether an index update may be skipped by using HOT, we can ignore attributes indexed by block summarizing indexes without references to individual tuples that need to be cleaned up.

    A new type TU_UpdateIndexes provides a signal to the executor to determine which indexes to update - no indexes, all indexes, or only the summarizing indexes; see the sketch after this entry.

    This also removes the rd_indexattr list, and replaces it with the rd_attrsvalid flag. The list was not used anywhere, and a simple flag is sufficient.

    This was originally committed as 5753d4ee32, but then got reverted by e3fcca0d0d because of correctness issues.

    Original patch by Josef Simanek, various fixes and improvements by Tomas Vondra and me.

    Authors: Matthias van de Meent, Josef Simanek, Tomas Vondra
    Reviewed-by: Tomas Vondra, Alvaro Herrera
    Discussion: https://postgr.es/m/05ebcb44-f383-86e3-4f31-0a97a55634cf@enterprisedb.com
    Discussion: https://postgr.es/m/CAFp7QwpMRGcDAQumN7onN9HjrJ3u4X3ZRXdGFT0K5G2JWvnbWg%40mail.gmail.com
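
    The new signal looks approximately like this (a sketch; compare tableam.h for the authoritative definition):

        typedef enum TU_UpdateIndexes
        {
            TU_None,            /* no indexes need updating (HOT update) */
            TU_All,             /* all indexes need updating */
            TU_Summarizing      /* only summarizing indexes (e.g. BRIN) */
        } TU_UpdateIndexes;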

* Remove bms_first_member(). (Tom Lane, 2023-03-02; 1 file, -18/+9)

    This function has been semi-deprecated ever since we invented bms_next_member(). Its habit of scribbling on the input bitmapset isn't great, plus for sufficiently large bitmapsets it would take O(N^2) time to complete a loop. Now we have the additional problem that reducing the input to empty while leaving it still accessible would violate a planned invariant. So let's just get rid of it, after updating the few extant callers to use bms_next_member().

    Patch by me; thanks to Nathan Bossart and Richard Guo for review.

    Discussion: https://postgr.es/m/1159933.1677621588@sss.pgh.pa.us
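
    Callers migrate from the destructive idiom to the non-destructive one like so (process() is a hypothetical consumer of each member):

        int     x;

        /* Old: bms_first_member() removes each member as it goes,
         * scribbling on the input and costing O(N^2) over the loop. */
        while ((x = bms_first_member(set)) >= 0)
            process(x);         /* "set" has been consumed afterwards */

        /* New: bms_next_member() leaves the input bitmapset untouched. */
        x = -1;
        while ((x = bms_next_member(set, x)) >= 0)
            process(x);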

* Update types in smgr API (Peter Eisentraut, 2023-02-27; 1 file, -2/+2)

    Change data buffer to void *, from char *, and add const where appropriate. This makes it match the File API (see also 2d4f1ba6cfc2f0a977f1c30bda9848041343e248) and stdio.

    Discussion: https://www.postgresql.org/message-id/flat/11dda853-bb5b-59ba-a746-e168b1ce4bdb%40enterprisedb.com
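
    For instance, the read and write entry points change along these lines (compare smgr.h):

        /* Before: */
        extern void smgrread(SMgrRelation reln, ForkNumber forknum,
                             BlockNumber blocknum, char *buffer);

        /* After: void * matches the File API and stdio, and const marks
         * buffers that are only read from: */
        extern void smgrread(SMgrRelation reln, ForkNumber forknum,
                             BlockNumber blocknum, void *buffer);
        extern void smgrwrite(SMgrRelation reln, ForkNumber forknum,
                              BlockNumber blocknum, const void *buffer,
                              bool skipFsync);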

* More refactoring of heapgettup() and heapgettup_pagemode() (David Rowley, 2023-02-07; 1 file, -169/+78)

    Here we further simplify the code in heapgettup() and heapgettup_pagemode() to make better use of the helper functions added in the previous recent refactors in this area.

    In passing, remove an unneeded cast added in 8ca6d49f6.

    Author: Melanie Plageman
    Reviewed-by: Andres Freund, David Rowley
    Discussion: https://postgr.es/m/CAAKRu_YSOnhKsDyFcqJsKtBSrd32DP-jjXmv7hL0BPD-z0TGXQ@mail.gmail.com

* Reduce code duplication between heapgettup and heapgettup_pagemode (David Rowley, 2023-02-03; 1 file, -95/+91)

    The code to get the next block number was exactly the same between these two functions, so let's just put it into a helper function and call that from both locations.

    Author: Melanie Plageman
    Reviewed-by: Andres Freund, David Rowley
    Discussion: https://postgr.es/m/CAAKRu_bvkhka0CZQun28KTqhuUh5ZqY=_T8QEqZqOL02rpi2bw@mail.gmail.com

* Add helper functions to simplify heapgettup code (David Rowley, 2023-02-03; 1 file, -35/+83)

    Here we add heapgettup_start_page() and heapgettup_continue_page() to simplify the code in the heapgettup() function.

    Author: Melanie Plageman
    Reviewed-by: David Rowley
    Discussion: https://postgr.es/m/CAAKRu_bvkhka0CZQun28KTqhuUh5ZqY=_T8QEqZqOL02rpi2bw@mail.gmail.com

* Further refactor of heapgettup and heapgettup_pagemode (David Rowley, 2023-02-03; 1 file, -138/+62)

    Backward and forward scans share much of the same page acquisition code. Here we consolidate that code to reduce some duplication.

    Additionally, add a new rs_coffset field to HeapScanDescData to track the offset of the current tuple. The new field fits nicely into the padding between a bool and BlockNumber field and saves having to look at the last returned tuple to figure out which offset we should be looking at for the current tuple.

    Author: Melanie Plageman
    Reviewed-by: David Rowley
    Discussion: https://postgr.es/m/CAAKRu_bvkhka0CZQun28KTqhuUh5ZqY=_T8QEqZqOL02rpi2bw@mail.gmail.com

* Refactor heapam.c adding heapgettup_initial_block function (David Rowley, 2023-02-02; 1 file, -121/+102)

    Here we adjust heapgettup() and heapgettup_pagemode() to move the code that fetches the first block number to scan out into a helper function. This removes some code duplication.

    Author: Melanie Plageman
    Reviewed-by: David Rowley
    Discussion: https://postgr.es/m/CAAKRu_bvkhka0CZQun28KTqhuUh5ZqY=_T8QEqZqOL02rpi2bw@mail.gmail.com

* Remove dead NoMovementScanDirection code (David Rowley, 2023-02-01; 1 file, -65/+2)

    Here we remove some dead code from heapgettup() and heapgettup_pagemode() which was trying to support NoMovementScanDirection scans. This code can never be reached as standard_ExecutorRun() never calls ExecutePlan with NoMovementScanDirection.

    Additionally, plans which were scanning an unordered index would use NoMovementScanDirection rather than ForwardScanDirection. There was no real need for this, so here we adjust things to use ForwardScanDirection for unordered index scans. A comment in pathnodes.h claimed that NoMovementScanDirection was used for PathKey reasons, but if that was true, it no longer is, per code in build_index_paths().

    This does change the non-text format of the EXPLAIN output so that unordered index scans now have a "Forward" scan direction rather than "NoMovement". The text format of EXPLAIN has not changed.

    Author: Melanie Plageman
    Reviewed-by: Tom Lane, David Rowley
    Discussion: https://postgr.es/m/CAAKRu_bvkhka0CZQun28KTqhuUh5ZqY=_T8QEqZqOL02rpi2bw@mail.gmail.com
* Revert "Add eager and lazy freezing strategies to VACUUM."Peter Geoghegan2023-01-252-43/+1
| | | | | | | | | This reverts commit 4d417992613949af35530b4e8e83670c4e67e1b2. Broad concerns about regressions caused by eager freezing strategy have been raised. Whether or not these concerns can be worked through in any time frame is far from certain. Discussion: https://postgr.es/m/20230126004347.gepcmyenk2csxrri@awork3.anarazel.de

* Add eager and lazy freezing strategies to VACUUM. (Peter Geoghegan, 2023-01-25; 2 files, -1/+43)

    Eager freezing strategy avoids large build-ups of all-visible pages. It makes VACUUM trigger page-level freezing whenever doing so will enable the page to become all-frozen in the visibility map. This is useful for tables that experience continual growth, particularly strict append-only tables such as pgbench's history table.

    Eager freezing significantly improves performance stability by spreading out the cost of freezing over time, rather than doing most freezing during aggressive VACUUMs. It complements the insert autovacuum mechanism added by commit b07642db.

    VACUUM determines its freezing strategy based on the value of the new vacuum_freeze_strategy_threshold GUC (or reloption) with logged tables. Tables that exceed the size threshold use the eager freezing strategy. Unlogged tables and temp tables always use eager freezing strategy, since the added cost is negligible there. Non-permanent relations won't incur any extra overhead in WAL written (for the obvious reason), nor in pages dirtied (since any extra freezing will only take place on pages whose PD_ALL_VISIBLE bit needed to be set either way).

    VACUUM uses lazy freezing strategy for logged tables that fall under the GUC size threshold. Page-level freezing triggers based on the criteria established in commit 1de58df4, which added basic page-level freezing.

    Eager freezing is strictly more aggressive than lazy freezing. Settings like vacuum_freeze_min_age still get applied in just the same way in every VACUUM, independent of the strategy in use. The only mechanical difference between eager and lazy freezing strategies is that only the former applies its own additional criteria to trigger freezing pages. Note that even lazy freezing strategy will trigger freezing whenever a page happens to have required that an FPI be written during pruning, provided that the page will thereby become all-frozen in the visibility map afterwards (due to the FPI optimization from commit 1de58df4).

    The vacuum_freeze_strategy_threshold default setting is 4GB. This is a relatively low setting that prioritizes performance stability. It will be reviewed at the end of the Postgres 16 beta period.

    Author: Peter Geoghegan <pg@bowt.ie>
    Reviewed-By: Jeff Davis <pgsql@j-davis.com>
    Reviewed-By: Andres Freund <andres@anarazel.de>
    Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com>
    Discussion: https://postgr.es/m/CAH2-WzkFok_6EAHuK39GaW4FjEFQsY=3J0AAd6FXk93u-Xq3Fg@mail.gmail.com

* Run pgindent on heapam.c (David Rowley, 2023-01-23; 1 file, -8/+9)

    An upcoming patch by Melanie Plageman does some refactoring work in this area. Run pgindent on that file now before making any changes so that it's easier to maintain/evolve each of the individual patches doing the refactor work. Additionally, add a few new required typedefs to the list to make it easier to do future pgindent runs on this file during the refactor work.

    Discussion: https://postgr.es/m/CAAKRu_YSOnhKsDyFcqJsKtBSrd32DP-jjXmv7hL0BPD-z0TGXQ@mail.gmail.com

* Tighten up VACUUM's approach to setting VM bits. (Peter Geoghegan, 2023-01-16; 2 files, -40/+65)

    Tighten up the way that visibilitymap_set() is called: request that both the all-visible and all-frozen bits get set whenever the all-frozen bit is set, regardless of what we think we know about the present state of the all-visible bit. Also make sure that the page level PD_ALL_VISIBLE flag is set in the same code path.

    In practice there doesn't seem to be a concrete scenario in which the previous approach could lead to inconsistencies. It was almost possible in scenarios involving concurrent HOT updates from transactions that abort, but (unlike pruning) freezing can never remove XIDs > VACUUM's OldestXmin, even those from transactions that are known to have aborted. That was protective here.

    These issues have been around since commit a892234f83, which added the all-frozen bit to the VM fork. There is no known live bug here, so no backpatch.

    In passing, add some defensive assertions to catch the issue, and stop reading the existing state of the VM when setting the VM in VACUUM's final heap pass. We already know that affected pages must have had at least one LP_DEAD item before we set it LP_UNUSED, so there is no point in reading the VM when it is set like this.

    Author: Peter Geoghegan <pg@bowt.ie>
    Reviewed-By: Andres Freund <andres@anarazel.de>
    Discussion: https://postgr.es/m/CAH2-WznuNGSzF8v6OsgjaC5aYsb3cZ6HW6MLm30X0d65cmSH6A@mail.gmail.com

* Make lazy_vacuum_heap_rel match lazy_scan_heap. (Peter Geoghegan, 2023-01-11; 1 file, -16/+8)

    Make lazy_vacuum_heap_rel variable names match those from lazy_scan_heap where that makes sense.

    Extracted from a larger patch to deal with issues with how vacuumlazy.c sets pages all-frozen.

    Author: Peter Geoghegan <pg@bowt.ie>
    Discussion: https://postgr.es/m/CAH2-WznuNGSzF8v6OsgjaC5aYsb3cZ6HW6MLm30X0d65cmSH6A@mail.gmail.com

* vacuumlazy.c: Tweak local variable name. (Peter Geoghegan, 2023-01-11; 1 file, -5/+5)

    Make a local variable name consistent with the name from its WAL record.

    Extracted from a larger patch to deal with issues with how vacuumlazy.c sets pages all-frozen.

    Author: Peter Geoghegan <pg@bowt.ie>
    Discussion: https://postgr.es/m/CAH2-WznuNGSzF8v6OsgjaC5aYsb3cZ6HW6MLm30X0d65cmSH6A@mail.gmail.com

* Rename and relocate freeze plan dedup routines. (Peter Geoghegan, 2023-01-11; 1 file, -149/+147)

    Rename the heapam.c freeze plan deduplication routines added by commit 9e540599 to names that follow conventions for functions in heapam.c. Also relocate the functions so that they're next to their caller, which runs during original execution, when FREEZE_PAGE WAL records are built.

    The routines were initially placed next to (and followed the naming conventions of) conceptually related REDO routine code, but that scheme turned out to be kind of jarring when considered in a wider context.

    Author: Peter Geoghegan <pg@bowt.ie>
    Reported-By: Andres Freund <andres@anarazel.de>
    Discussion: https://postgr.es/m/20230109214308.icz26oqvt3k2274c@awork3.anarazel.de

* Improve TransactionIdDidAbort() documentation. (Peter Geoghegan, 2023-01-11; 1 file, -15/+18)

    Document that TransactionIdDidAbort() won't indicate that transactions that were in-progress during a crash have aborted. Tie this to existing discussion of the TransactionIdDidCommit() and TransactionIdDidAbort() protocol that code in heapam_visibility.c (and a few other places) must observe.

    Follow-up to bugfix commit eb5ad4ff.

    Author: Peter Geoghegan <pg@bowt.ie>
    Reviewed-By: Andres Freund <andres@anarazel.de>
    Discussion: https://postgr.es/m/CAH2-Wzn4bEEqgmaUQL3aJ73yM9gAeK-wE4ngi7kjRjLztb+P0w@mail.gmail.com

* Check that xmax didn't commit in freeze check. (Peter Geoghegan, 2023-01-03; 1 file, -4/+10)

    We cannot rely on TransactionIdDidAbort here, since in general it may report transactions that were in-progress at the time of an earlier hard crash as not aborted, effectively behaving as if they were still in progress even after crash recovery completes. Go back to defensively verifying that xmax didn't commit instead.

    Oversight in commit 79d4bf4e.

    Author: Peter Geoghegan <pg@bowt.ie>
    Reported-By: Andres Freund <andres@anarazel.de>
    Discussion: https://postgr.es/m/20230104035636.hy5djyr2as4gbc4q@awork3.anarazel.de

* vacuumlazy.c: Save get_database_name() in vacrel. (Peter Geoghegan, 2023-01-03; 1 file, -8/+8)

    This brings dbname strings in line with namespace and relation name strings.

    Author: Peter Geoghegan <pg@bowt.ie>
    Discussion: https://postgr.es/m/CAH2-WzkQ1TKU-DdNvnGeL870di3+CU1UTo-7nw7xFDpVE-XGjA@mail.gmail.com

* Delay commit status checks until freezing executes. (Peter Geoghegan, 2023-01-03; 1 file, -27/+62)

    pg_xact lookups are relatively expensive. Move the xmin/xmax commit status checks from the point that freeze plans are prepared to the point that they're actually executed. Otherwise we'll repeat many commit status checks whenever multiple successive VACUUM operations scan the same pages and decide against freezing each time, which is a waste of cycles.

    Oversight in commit 1de58df4, which added page-level freezing.

    Author: Peter Geoghegan <pg@bowt.ie>
    Discussion: https://postgr.es/m/CAH2-WzkZpe4K6qMfEt8H4qYJCKc2R7TPvKsBva7jc9w7iGXQSw@mail.gmail.com

* Refine the definition of page-level freezing. (Peter Geoghegan, 2023-01-03; 1 file, -13/+7)

    Improve comments added by commit 1de58df4 which describe the lazy_scan_prune "freeze the page" path. These newly revised comments are based on suggestions from Jeff Davis.

    In passing, remove nearby visibility_cutoff_xid comments left over from commit 6daeeb1f.

    Author: Peter Geoghegan <pg@bowt.ie>
    Reviewed-By: Jeff Davis <pgsql@j-davis.com>
    Discussion: https://postgr.es/m/ebc857107fe3edd422ef8a65191ca4a8da568b9b.camel@j-davis.com

* Update copyright for 2023 (Bruce Momjian, 2023-01-02; 10 files, -10/+10)

    Backpatch-through: 11

* Adjust VACUUM hastup LP_REDIRECT comments. (Peter Geoghegan, 2023-01-02; 1 file, -1/+2)

    The term "truncation" has been ambiguous since commit 10a8d13823 added line pointer array truncation during heap pruning. Clear things up by specifying that we're talking about rel truncation here, to match nearby comments that apply to tuples with storage.

* Avoid special XID snapshotConflictHorizon values. (Peter Geoghegan, 2023-01-02; 1 file, -3/+5)

    Don't allow VACUUM to WAL-log the value FrozenTransactionId as the snapshotConflictHorizon of freezing or visibility map related WAL records.

    The only special XID value that's an allowable snapshotConflictHorizon is InvalidTransactionId, which is interpreted as "record definitely doesn't require a recovery conflict".

    Author: Peter Geoghegan <pg@bowt.ie>
    Discussion: https://postgr.es/m/CAH2-WznuNGSzF8v6OsgjaC5aYsb3cZ6HW6MLm30X0d65cmSH6A@mail.gmail.com

* Push lpp variable closer to usage in heapgetpage() (Peter Eisentraut, 2023-01-02; 1 file, -21/+19)

    Author: Melanie Plageman <melanieplageman@gmail.com>
    Discussion: https://www.postgresql.org/message-id/flat/CAAKRu_YSOnhKsDyFcqJsKtBSrd32DP-jjXmv7hL0BPD-z0TGXQ@mail.gmail.com

* Add page-level freezing to VACUUM. (Peter Geoghegan, 2022-12-28; 2 files, -283/+375)

    Teach VACUUM to decide on whether or not to trigger freezing at the level of whole heap pages. Individual XIDs and MXIDs fields from tuple headers now trigger freezing of whole pages, rather than independently triggering freezing of each individual tuple header field.

    Managing the cost of freezing over time now significantly influences when and how VACUUM freezes. The overall amount of WAL written is the single most important freezing related cost, in general. Freezing each page's tuples together in batch allows VACUUM to take full advantage of the freeze plan WAL deduplication optimization added by commit 9e540599.

    Also teach VACUUM to trigger page-level freezing whenever it detects that heap pruning generated an FPI. We'll have already written a large amount of WAL just to do that much, so it's very likely a good idea to get freezing out of the way for the page early. This only happens in cases where it will directly lead to marking the page all-frozen in the visibility map.

    In most cases "freezing a page" removes all XIDs < OldestXmin, and all MXIDs < OldestMxact. It doesn't quite work that way in certain rare cases involving MultiXacts, though. It is convenient to define "freeze the page" in a way that gives FreezeMultiXactId the leeway to put off the work of processing an individual tuple's xmax whenever it happens to be a MultiXactId that would require an expensive second pass to process aggressively (allocating a new multi is especially worth avoiding here). FreezeMultiXactId is eager when processing is cheap (as it usually is), and lazy in the event of an individual multi that happens to require expensive second pass processing. This avoids regressions related to processing of multis that page-level freezing might otherwise cause.

    Author: Peter Geoghegan <pg@bowt.ie>
    Reviewed-By: Jeff Davis <pgsql@j-davis.com>
    Reviewed-By: Andres Freund <andres@anarazel.de>
    Discussion: https://postgr.es/m/CAH2-WzkFok_6EAHuK39GaW4FjEFQsY=3J0AAd6FXk93u-Xq3Fg@mail.gmail.com

* Remove overzealous MultiXact freeze assertion. (Peter Geoghegan, 2022-12-26; 1 file, -1/+0)

    When VACUUM determines that an existing MultiXact should use a freeze plan that sets xmax to InvalidTransactionId, the original Multi may or may not be before OldestMxact. Remove an incorrect assertion that expected it to always be from before OldestMxact.

    Oversight in commit 4ce3af.

    Author: Peter Geoghegan <pg@bowt.ie>
    Reported-By: Hayato Kuroda <kuroda.hayato@fujitsu.com>
    Discussion: https://postgr.es/m/TYAPR01MB5866B24104FD80B5D7E65C3EF5ED9@TYAPR01MB5866.jpnprd01.prod.outlook.com

* Use scanned_pages to decide when to failsafe check. (Peter Geoghegan, 2022-12-22; 1 file, -6/+3)

    Perform a failsafe check every time VACUUM's first heap scan scans a further FAILSAFE_EVERY_PAGES pages, rather than using an approach based on the number of physical blocks that our current blkno is from the blkno at the time of the previous failsafe check. That way VACUUM will perform a failsafe check every time it has scanned a uniform number of pages, without it mattering when or how VACUUM skipped pages using the visibility map.

    Sami Imseih, with changes to FAILSAFE_EVERY_PAGES comments added by me.

    Author: Sami Imseih <simseih@amazon.com>
    Reviewed-By: Peter Geoghegan <pg@bowt.ie>
    Discussion: https://postgr.es/m/401CE010-4049-4B94-9961-0B610A5D254D%40amazon.com
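
    The check thus becomes uniform in scanned pages, approximately as follows (a sketch; FAILSAFE_EVERY_PAGES corresponds to 4GB worth of blocks):

        #define FAILSAFE_EVERY_PAGES \
            ((BlockNumber) (((uint64) 4 * 1024 * 1024 * 1024) / BLCKSZ))

        /* Once per FAILSAFE_EVERY_PAGES pages actually scanned, no matter
         * how many blocks the visibility map allowed us to skip. */
        if (vacrel->scanned_pages % FAILSAFE_EVERY_PAGES == 0)
            lazy_check_wraparound_failsafe(vacrel);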