summaryrefslogtreecommitdiff
path: root/src/backend/access/transam/xlog.c
Commit message (Collapse)AuthorAgeFilesLines
...
* Remove unstable, unnecessary test; fix typoAlvaro Herrera2021-10-011-1/+1
| | | | | | | | | | | | | Commit ff9f111bce24 added some test code that's unportable and doesn't add meaningful coverage. Remove it rather than try and get it to work everywhere. While at it, fix a typo in a log message added by the aforementioned commit. Backpatch to 14. Discussion: https://postgr.es/m/3000074.1632947632@sss.pgh.pa.us
* Fix WAL replay in presence of an incomplete recordAlvaro Herrera2021-09-291-5/+149
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Physical replication always ships WAL segment files to replicas once they are complete. This is a problem if one WAL record is split across a segment boundary and the primary server crashes before writing down the segment with the next portion of the WAL record: WAL writing after crash recovery would happily resume at the point where the broken record started, overwriting that record ... but any standby or backup may have already received a copy of that segment, and they are not rewinding. This causes standbys to stop following the primary after the latter crashes: LOG: invalid contrecord length 7262 at A8/D9FFFBC8 because the standby is still trying to read the continuation record (contrecord) for the original long WAL record, but it is not there and it will never be. A workaround is to stop the replica, delete the WAL file, and restart it -- at which point a fresh copy is brought over from the primary. But that's pretty labor intensive, and I bet many users would just give up and re-clone the standby instead. A fix for this problem was already attempted in commit 515e3d84a0b5, but it only addressed the case for the scenario of WAL archiving, so streaming replication would still be a problem (as well as other things such as taking a filesystem-level backup while the server is down after having crashed), and it had performance scalability problems too; so it had to be reverted. This commit fixes the problem using an approach suggested by Andres Freund, whereby the initial portion(s) of the split-up WAL record are kept, and a special type of WAL record is written where the contrecord was lost, so that WAL replay in the replica knows to skip the broken parts. With this approach, we can continue to stream/archive segment files as soon as they are complete, and replay of the broken records will proceed across the crash point without a hitch. Because a new type of WAL record is added, users should be careful to upgrade standbys first, primaries later. Otherwise they risk the standby being unable to start if the primary happens to write such a record. A new TAP test that exercises this is added, but the portability of it is yet to be seen. This has been wrong since the introduction of physical replication, so backpatch all the way back. In stable branches, keep the new XLogReaderState members at the end of the struct, to avoid an ABI break. Author: Álvaro Herrera <alvherre@alvh.no-ip.org> Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reviewed-by: Nathan Bossart <bossartn@amazon.com> Discussion: https://postgr.es/m/202108232252.dh7uxf6oxwcy@alvherre.pgsql
* Revert "Avoid creating archive status ".ready" files too early"Alvaro Herrera2021-09-041-209/+11
| | | | | | | | | | This reverts commit 515e3d84a0b5 and equivalent commits in back branches. This solution to the problem has a number of problems, so we'll try again with a different approach. Per note from Andres Freund Discussion: https://postgr.es/m/20210831042949.52eqp5xwbxgrfank@alap3.anarazel.de
* Avoid creating archive status ".ready" files too earlyAlvaro Herrera2021-08-231-11/+209
| | | | | | | | | | | | | | | | | | | | | | | WAL records may span multiple segments, but XLogWrite() does not wait for the entire record to be written out to disk before creating archive status files. Instead, as soon as the last WAL page of the segment is written, the archive status file is created, and the archiver may process it. If PostgreSQL crashes before it is able to write and flush the rest of the record (in the next WAL segment), the wrong version of the first segment file lingers in the archive, which causes operations such as point-in-time restores to fail. To fix this, keep track of records that span across segments and ensure that segments are only marked ready-for-archival once such records have been completely written to disk. This has always been wrong, so backpatch all the way back. Author: Nathan Bossart <bossartn@amazon.com> Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reviewed-by: Ryo Matsumura <matsumura.ryo@fujitsu.com> Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru> Discussion: https://postgr.es/m/CBDDFA01-6E40-46BB-9F98-9340F4379505@amazon.com
* Refresh apply delay on reload of recovery_min_apply_delay at recoveryMichael Paquier2021-08-161-2/+11
| | | | | | | | | | | | | | | | | This commit ensures that the wait interval in the replay delay loop waiting for an amount of time defined by recovery_min_apply_delay is correctly handled on reload, recalculating the delay if this GUC value is updated, based on the timestamp of the commit record being replayed. The previous behavior would be problematic for example with replay still waiting even if the delay got reduced or just cancelled. If the apply delay was increased to a larger value, the wait would have just respected the old value set, finishing earlier. Author: Soumyadeep Chakraborty, Ashwin Agrawal Reviewed-by: Kyotaro Horiguchi, Michael Paquier Discussion: https://postgr.es/m/CAE-ML+93zfr-HLN8OuxF0BjpWJ17O5dv1eMvSE5jsj9jpnAXZA@mail.gmail.com Backpatch-through: 9.6
* pgstat: split reporting/fetching of bgwriter and checkpointer stats.Andres Freund2021-08-041-2/+2
| | | | | | | | | | | | | | | | These have been unrelated since bgwriter and checkpointer were split into two processes in 806a2aee379. As there several pending patches (shared memory stats, extending the set of tracked IO / buffer statistics) that are made a bit more awkward by the grouping, split them. Done separately to make reviewing easier. This does *not* change the contents of pg_stat_bgwriter or move fields out of bgwriter/checkpointer stats that arguably do not belong in either. However pgstat_fetch_global() was renamed and split into pgstat_fetch_stat_checkpointer() and pgstat_fetch_stat_bgwriter(). Author: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/20210405092914.mmxqe7j56lsjfsej@alap3.anarazel.de
* Further simplify a bit of logic in StartupXLOG().Thomas Munro2021-08-031-20/+14
| | | | | | | | Commit 7ff23c6d277d1d90478a51f0dd81414d343f3850 left us with two identical cases. Collapse them. Author: Robert Haas <robertmhaas@gmail.com> Discussion: https://postgr.es/m/CA%2BhUKGJ8NRsqgkZEnsnRc2MFROBV-jCnacbYvtpptK2A9YYp9Q%40mail.gmail.com
* Run checkpointer and bgwriter in crash recovery.Thomas Munro2021-08-021-21/+12
| | | | | | | | | | | | | Start up the checkpointer and bgwriter during crash recovery (except in --single mode), as we do for replication. This wasn't done back in commit cdd46c76 out of caution. Now it seems like a better idea to make the environment as similar as possible in both cases. There may also be some performance advantages. Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Aleksander Alekseev <aleksander@timescale.com> Tested-by: Jakub Wartak <Jakub.Wartak@tomtom.com> Discussion: https://postgr.es/m/CA%2BhUKGJ8NRsqgkZEnsnRc2MFROBV-jCnacbYvtpptK2A9YYp9Q%40mail.gmail.com
* Move InRecovery and standbyState global vars to xlogutils.c.Heikki Linnakangas2021-07-311-16/+0
| | | | | | | | | | They are used in code that runs both during normal operation and during WAL replay, and needs to behave differently during replay. Move them to xlogutils.c, because that's where we have other helper functions used by redo routines. Reviewed-by: Andres Freund Discussion: https://www.postgresql.org/message-id/b3b71061-4919-e882-4857-27e370ab134a%40iki.fi
* Extract code to describe recovery stop reason to a function.Heikki Linnakangas2021-07-311-28/+39
| | | | | | | StartupXLOG() is very long, this makes it a little bit more readable. Reviewed-by: Andres Freund Discussion: https://www.postgresql.org/message-id/b3b71061-4919-e882-4857-27e370ab134a%40iki.fi
* Remove unnecessary 'restoredFromArchive' global variable.Heikki Linnakangas2021-07-311-9/+4
| | | | | | | | It might've been useful for debugging purposes, but meh. There's 'readSource' which does almost the same thing. Reviewed-by: Andres Freund Discussion: https://www.postgresql.org/message-id/b3b71061-4919-e882-4857-27e370ab134a%40iki.fi
* Don't use O_SYNC or similar when opening signal file to fsync it.Heikki Linnakangas2021-07-311-2/+2
| | | | | | | | | No need to use get_sync_bit() when we're calling pg_fsync() on the file. We're not writing to the files, so it doesn't make any difference in practice, but seems less surprising this way. Reviewed-by: Andres Freund Discussion: https://www.postgresql.org/message-id/b3b71061-4919-e882-4857-27e370ab134a%40iki.fi
* Remove unnecessary call to ReadCheckpointRecord().Robert Haas2021-07-301-23/+11
| | | | | | | | | | | | | | | | It should always be the case that the last checkpoint record is still readable, because otherwise, a crash would leave us in a situation from which we can't recover. Therefore the test removed by this patch should always succeed. For it to fail, either there has to be a serious bug in the code someplace, or the user has to be manually modifying pg_wal while crash recovery is running. If it's the first one, we should fix the bug. If it's the second one, they should stop, or anyway they're doing so at their own risk. In neither case does a full checkpoint instead of an end-of-recovery record seem like a clear winner. Furthermore, rarely-taken code paths are particularly vulnerable to bugs, so let's simplify by getting rid of this one. Discussion: http://postgr.es/m/CA+TgmoYmw==TOJ6EzYb_vcjyS09NkzrVKSyBKUUyo1zBEaJASA@mail.gmail.com
* Update obsolete comment that still referred to CheckpointLockHeikki Linnakangas2021-07-301-1/+1
| | | | | | CheckpointLock was removed in commit d18e75664a, and commit ce197e91d0 updated a leftover comment in CreateCheckPoint, but there was another copy of it in CreateRestartPoint still.
* Close yet another race condition in replication slot test codeAlvaro Herrera2021-07-291-1/+1
| | | | | | | | | | | | | | | Buildfarm shows that this test has a further failure mode when a checkpoint starts earlier than expected, so we detect a "checkpoint completed" line that's not the one we want. Change the config to try and prevent this. Per buildfarm While at it, update one comment that was forgotten in commit d18e75664a2f. Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Discussion: https://postgr.es/m/20210729.162038.534808353849568395.horikyota.ntt@gmail.com
* Make XLOG_FPI_FOR_HINT records honor full_page_writes setting.Fujii Masao2021-07-211-11/+21
| | | | | | | | | | | | | | | | Commit 2c03216d83 changed XLOG_FPI_FOR_HINT records so that they always included full-page images even when full_page_writes was disabled. However, in this setting, they don't need to do that because hint bit updates don't need to be protected from torn writes. Therefore, this commit makes XLOG_FPI_FOR_HINT records honor full_page_writes setting. That is, XLOG_FPI_FOR_HINT records may include no full-page images if full_page_writes is disabled, and WAL replay of them does nothing. Reported-by: Zhang Wenjie Author: Kyotaro Horiguchi Reviewed-by: Fujii Masao Discussion: https://postgr.es/m/tencent_60F11973A111EED97A8596FFECC4A91ED405@qq.com
* Advance old-segment horizon properly after slot invalidationAlvaro Herrera2021-07-161-2/+24
| | | | | | | | | | | | | | | | When some slots are invalidated due to the max_slot_wal_keep_size limit, the old segment horizon should move forward to stay within the limit. However, in commit c6550776394e we forgot to call KeepLogSeg again to recompute the horizon after invalidating replication slots. In cases where other slots remained, the limits would be recomputed eventually for other reasons, but if all slots were invalidated, the limits would not move at all afterwards. Repair. Backpatch to 13 where the feature was introduced. Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reported-by: Marcin Krupowicz <mk@071.ovh> Discussion: https://postgr.es/m/17103-004130e8f27782c9@postgresql.org
* Replace explicit PIN entries in pg_depend with an OID range test.Tom Lane2021-07-151-1/+1
| | | | | | | | | | | | | | | | As of v14, pg_depend contains almost 7000 "pin" entries recording the OIDs of built-in objects. This is a fair amount of bloat for every database, and it adds time to pg_depend lookups as well as initdb. We can get rid of all of those entries in favor of an OID range check, i.e. "OIDs below FirstUnpinnedObjectId are pinned". (template1 and the public schema are exceptions. Those exceptions are now wired into IsPinnedObject() instead of initdb's code for filling pg_depend, but it's the same amount of cruft either way.) The contents of pg_shdepend are modified likewise. Discussion: https://postgr.es/m/3737988.1618451008@sss.pgh.pa.us
* Use WaitLatch() instead of pg_usleep() at the end of backupsMichael Paquier2021-07-061-3/+5
| | | | | | | | | | This concerns pg_stop_backup() and BASE_BACKUP, when waiting for the WAL segments required for a backup to be archived. This simplifies a bit the handling of the wait event used in this code path. Author: Bharath Rupireddy Reviewed-by: Michael Paquier, Stephen Frost Discussion: https://postgr.es/m/CALj2ACU4AdPCq6NLfcA-ZGwX7pPCK5FgEj-CAU0xCKzkASSy_A@mail.gmail.com
* Fix incorrect PITR message for transaction ROLLBACK PREPAREDMichael Paquier2021-06-301-1/+1
| | | | | | | | | | | Reaching PITR on such a transaction would cause the generation of a LOG message mentioning a transaction committed, not aborted. Oversight in 4f1b890. Author: Simon Riggs Discussion: https://postgr.es/m/CANbhV-GJ6KijeCgdOrxqMCQ+C8QiK657EMhCy4csjrPcEUFv_Q@mail.gmail.com Backpatch-through: 9.6
* Add support for LZ4 with compression of full-page writes in WALMichael Paquier2021-06-291-1/+1
| | | | | | | | | | | | | | | | | | | | | | The logic is implemented so as there can be a choice in the compression used when building a WAL record, and an extra per-record bit is used to track down if a block is compressed with PGLZ, LZ4 or nothing. wal_compression, the existing parameter, is changed to an enum with support for the following backward-compatible values: - "off", the default, to not use compression. - "pglz" or "on", to compress FPWs with PGLZ. - "lz4", the new mode, to compress FPWs with LZ4. Benchmarking has showed that LZ4 outclasses easily PGLZ. ZSTD would be also an interesting choice, but going just with LZ4 for now makes the patch minimalistic as toast compression is already able to use LZ4, so there is no need to worry about any build-related needs for this implementation. Author: Andrey Borodin, Justin Pryzby Reviewed-by: Dilip Kumar, Michael Paquier Discussion: https://postgr.es/m/3037310D-ECB7-4BF1-AF20-01C10BB33A33@yandex-team.ru
* Skip WAL recycling and preallocation during archive recovery.Noah Misch2021-06-281-8/+57
| | | | | | | | | | | | The previous commit addressed the chief consequences of a race condition between InstallXLogFileSegment() and KeepFileRestoredFromArchive(). Fix three lesser consequences. A spurious durable_rename_excl() LOG message remained possible. KeepFileRestoredFromArchive() wasted the proceeds of WAL recycling and preallocation. Finally, XLogFileInitInternal() could return a descriptor for a file that KeepFileRestoredFromArchive() had already unlinked. That felt like a recipe for future bugs. Discussion: https://postgr.es/m/20210202151416.GB3304930@rfd.leadboat.com
* Don't ERROR on PreallocXlogFiles() race condition.Noah Misch2021-06-281-23/+56
| | | | | | | | | | | | Before a restartpoint finishes PreallocXlogFiles(), a startup process KeepFileRestoredFromArchive() call can unlink the preallocated segment. If a CHECKPOINT sql command had elicited the restartpoint experiencing the race condition, that sql command failed. Moreover, the restartpoint omitted its log_checkpoints message and some inessential resource reclamation. Prevent the ERROR by skipping open() of the segment. Since these consequences are so minor, no back-patch. Discussion: https://postgr.es/m/20210202151416.GB3304930@rfd.leadboat.com
* Remove XLogFileInit() ability to unlink a pre-existing file.Noah Misch2021-06-281-36/+25
| | | | | | | | Only initdb used it. initdb refuses to operate on a non-empty directory and generally does not cope with pre-existing files of other kinds. Hence, use the opportunity to simplify. Discussion: https://postgr.es/m/20210202151416.GB3304930@rfd.leadboat.com
* In XLogFileInit(), fix *use_existent postcondition to suit callers.Noah Misch2021-06-281-7/+6
| | | | | | | | Infrequently, the mismatch caused log_checkpoints messages and TRACE_POSTGRESQL_CHECKPOINT_DONE() to witness an "added" count too high by one. Since that consequence is so minor, no back-patch. Discussion: https://postgr.es/m/20210202151416.GB3304930@rfd.leadboat.com
* Remove XLogFileInit() ability to skip ControlFileLock.Noah Misch2021-06-281-32/+14
| | | | | | Cold paths, initdb and end-of-recovery, used it. Don't optimize them. Discussion: https://postgr.es/m/20210202151416.GB3304930@rfd.leadboat.com
* Fix outdated comment that talked about seek position of WAL file.Heikki Linnakangas2021-06-161-5/+3
| | | | | | | | Since commit c24dcd0cfd, we have been using pg_pread() to read the WAL file, which doesn't change the seek position (unless we fall back to the implementation in src/port/pread.c). Update comment accordingly. Backpatch-through: 12, where we started to use pg_pread()
* Fix corner case failure of new standby to follow new primary.Robert Haas2021-06-091-1/+9
| | | | | | | | | | | | | | | | | | | | | | | | | This only happens if (1) the new standby has no WAL available locally, (2) the new standby is starting from the old timeline, (3) the promotion happened in the WAL segment from which the new standby is starting, (4) the timeline history file for the new timeline is available from the archive but the WAL files for are not (i.e. this is a race), (5) the WAL files for the new timeline are available via streaming, and (6) recovery_target_timeline='latest'. Commit ee994272ca50f70b53074f0febaec97e28f83c4e introduced this logic and was an improvement over the previous code, but it mishandled this case. If recovery_target_timeline='latest' and restore_command is set, validateRecoveryParameters() can change recoveryTargetTLI to be different from receiveTLI. If streaming is then tried afterward, expectedTLEs gets initialized with the history of the wrong timeline. It's supposed to be a list of entries explaining how to get to the target timeline, but in this case it ends up with a list of entries explaining how to get to the new standby's original timeline, which isn't right. Dilip Kumar and Robert Haas, reviewed by Kyotaro Horiguchi. Discussion: http://postgr.es/m/CAFiTN-sE-jr=LB8jQuxeqikd-Ux+jHiXyh4YDiZMPedgQKup0g@mail.gmail.com
* Make standby promotion reset the recovery pause state to 'not paused'.Fujii Masao2021-05-191-0/+8
| | | | | | | | | | | | | | If a promotion is triggered while recovery is paused, the paused state ends and promotion continues. But previously in that case pg_get_wal_replay_pause_state() returned 'paused' wrongly while a promotion was ongoing. This commit changes a standby promotion so that it marks the recovery pause state as 'not paused' when it's triggered, to fix the issue. Author: Fujii Masao Reviewed-by: Dilip Kumar, Kyotaro Horiguchi Discussion: https://postgr.es/m/f706876c-4894-0ba5-6f4d-79803eeea21b@oss.nttdata.com
* Initial pgindent and pgperltidy run for v14.Tom Lane2021-05-121-55/+58
| | | | | | | | Also "make reformat-dat-files". The only change worthy of note is that pgindent messed up the formatting of launcher.c's struct LogicalRepWorkerId, which led me to notice that that struct wasn't used at all anymore, so I just took it out.
* Revert recovery prefetching feature.Thomas Munro2021-05-101-113/+75
| | | | | | | | | | | | | | | | | | This set of commits has some bugs with known fixes, but at this late stage in the release cycle it seems best to revert and resubmit next time, along with some new automated test coverage for this whole area. Commits reverted: dc88460c: Doc: Review for "Optionally prefetch referenced data in recovery." 1d257577: Optionally prefetch referenced data in recovery. f003d9f8: Add circular WAL decoding buffer. 323cbe7c: Remove read_page callback from XLogReader. Remove the new GUC group WAL_RECOVERY recently added by a55a9847, as the corresponding section of config.sgml is now reverted. Discussion: https://postgr.es/m/CAOuzzgrn7iKnFRsB4MHp3UisEQAGgZMbk_ViTN4HV4-Ksq8zCg%40mail.gmail.com
* Optionally prefetch referenced data in recovery.Thomas Munro2021-04-081-9/+41
| | | | | | | | | | | | | | | | | | | | | | | | Introduce a new GUC recovery_prefetch, disabled by default. When enabled, look ahead in the WAL and try to initiate asynchronous reading of referenced data blocks that are not yet cached in our buffer pool. For now, this is done with posix_fadvise(), which has several caveats. Better mechanisms will follow in later work on the I/O subsystem. The GUC maintenance_io_concurrency is used to limit the number of concurrent I/Os we allow ourselves to initiate, based on pessimistic heuristics used to infer that I/Os have begun and completed. The GUC wal_decode_buffer_size is used to limit the maximum distance we are prepared to read ahead in the WAL to find uncached blocks. Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com> (parts) Reviewed-by: Andres Freund <andres@anarazel.de> (parts) Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com> (parts) Tested-by: Tomas Vondra <tomas.vondra@2ndquadrant.com> Tested-by: Jakub Wartak <Jakub.Wartak@tomtom.com> Tested-by: Dmitry Dolgov <9erthalion6@gmail.com> Tested-by: Sait Talha Nisanci <Sait.Nisanci@microsoft.com> Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
* Add circular WAL decoding buffer.Thomas Munro2021-04-081-5/+23
| | | | | | | | | | | | | | | | | | | | | | | | | Teach xlogreader.c to decode its output into a circular buffer, to support optimizations based on looking ahead. * XLogReadRecord() works as before, consuming records one by one, and allowing them to be examined via the traditional XLogRecGetXXX() macros. * An alternative new interface XLogNextRecord() is added that returns pointers to DecodedXLogRecord structs that can be examined directly. * XLogReadAhead() provides a second cursor that lets you see further ahead, as long as data is available and there is enough space in the decoding buffer. This returns DecodedXLogRecord pointers to the caller, but also adds them to a queue of records that will later be consumed by XLogNextRecord()/XLogReadRecord(). The buffer's size is controlled with wal_decode_buffer_size. The buffer could potentially be placed into shared memory, for future projects. Large records that don't fit in the circular buffer are called "oversized" and allocated separately with palloc(). Discussion: https://postgr.es/m/CA+hUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq=AovOddfHpA@mail.gmail.com
* Remove read_page callback from XLogReader.Thomas Munro2021-04-081-69/+57
| | | | | | | | | | | | | | | | | | | | | Previously, the XLogReader module would fetch new input data using a callback function. Redesign the interface so that it tells the caller to insert more data with a special return value instead. This API suits later patches for prefetching, encryption and maybe other future projects that would otherwise require continually extending the callback interface. As incidental cleanup work, move global variables readOff, readLen and readSegNo inside XlogReaderState. Author: Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> Author: Heikki Linnakangas <hlinnaka@iki.fi> (parts of earlier version) Reviewed-by: Antonin Houska <ah@cybertec.at> Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com> Reviewed-by: Takashi Menjo <takashi.menjo@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Thomas Munro <thomas.munro@gmail.com> Discussion: https://postgr.es/m/20190418.210257.43726183.horiguchi.kyotaro%40lab.ntt.co.jp
* Stop archive recovery if WAL generated with wal_level=minimal is found.Fujii Masao2021-04-061-8/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously if hot standby was enabled, archive recovery exited with an error when it found WAL generated with wal_level=minimal. But if hot standby was disabled, it just reported a warning and continued in that case. Which could lead to data loss or errors during normal operation. A warning was emitted, but users could easily miss that and not notice this serious situation until they encountered the actual errors. To improve this situation, this commit changes archive recovery so that it exits with FATAL error when it finds WAL generated with wal_level=minimal whatever the setting of hot standby. This enables users to notice the serious situation soon. The FATAL error is thrown if archive recovery starts from a base backup taken before wal_level is changed to minimal. When archive recovery exits with the error, if users have a base backup taken after setting wal_level to higher than minimal, they can recover the database by starting archive recovery from that newer backup. But note that if such backup doesn't exist, there is no easy way to complete archive recovery, which may make the database server unstartable and users may lose whole database. The commit adds the note about this risk into the document. Even in the case of unstartable database server, previously by just disabling hot standby users could avoid the error during archive recovery, forcibly start up the server and salvage data from it. But note that this commit makes this procedure unavailable at all. Author: Takamichi Osumi Reviewed-by: Laurenz Albe, Kyotaro Horiguchi, David Steele, Fujii Masao Discussion: https://postgr.es/m/OSBPR01MB4888CBE1DA08818FD2D90ED8EDF90@OSBPR01MB4888.jpnprd01.prod.outlook.com
* Code review for server's handling of "tablespace map" files.Tom Lane2021-03-171-55/+68
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While looking at Robert Foggia's report, I noticed a passel of other issues in the same area: * The scheme for backslash-quoting newlines in pathnames is just wrong; it will misbehave if the last ordinary character in a pathname is a backslash. I'm not sure why we're bothering to allow newlines in tablespace paths, but if we're going to do it we should do it without introducing other problems. Hence, backslashes themselves have to be backslashed too. * The author hadn't read the sscanf man page very carefully, because this code would drop any leading whitespace from the path. (I doubt that a tablespace path with leading whitespace could happen in practice; but if we're bothering to allow newlines in the path, it sure seems like leading whitespace is little less implausible.) Using sscanf for the task of finding the first space is overkill anyway. * While I'm not 100% sure what the rationale for escaping both \r and \n is, if the idea is to allow Windows newlines in the file then this code failed, because it'd throw an error if it saw \r followed by \n. * There's no cross-check for an incomplete final line in the map file, which would be a likely apparent symptom of the improper-escaping bug. On the generation end, aside from the escaping issue we have: * If needtblspcmapfile is true then do_pg_start_backup will pass back escaped strings in tablespaceinfo->path values, which no caller wants or is prepared to deal with. I'm not sure if there's a live bug from that, but it looks like there might be (given the dubious assumption that anyone actually has newlines in their tablespace paths). * It's not being very paranoid about the possibility of random stuff in the pg_tblspc directory. IMO we should ignore anything without an OID-like name. The escaping rule change doesn't seem back-patchable: it'll require doubling of backslashes in the tablespace_map file, which is basically a basebackup format change. The odds of that causing trouble are considerably more than the odds of the existing bug causing trouble. The rest of this seems somewhat unlikely to cause problems too, so no back-patch.
* Prevent buffer overrun in read_tablespace_map().Tom Lane2021-03-171-1/+1
| | | | | | | | | | | | | | | | Robert Foggia of Trustwave reported that read_tablespace_map() fails to prevent an overrun of its on-stack input buffer. Since the tablespace map file is presumed trustworthy, this does not seem like an interesting security vulnerability, but still we should fix it just in the name of robustness. While here, document that pg_basebackup's --tablespace-mapping option doesn't work with tar-format output, because it doesn't. To make it work, we'd have to modify the tablespace_map file within the tarball sent by the server, which might be possible but I'm not volunteering. (Less-painful solutions would require changing the basebackup protocol so that the source server could adjust the map. That's not very appetizing either.)
* Add condition variable for recovery resume.Thomas Munro2021-03-121-10/+21
| | | | | | | | | Replace a sleep loop with a CV, to get a fast reaction time when recovery is resumed or the postmaster exits via standard infrastructure. Unfortunately we still need to wake up every second to perform extra polling during the recovery pause loop. Discussion: https://postgr.es/m/CA%2BhUKGK1607VmtrDUHQXrsooU%3Dap4g4R2yaoByWOOA3m8xevUQ%40mail.gmail.com
* Be clear about whether a recovery pause has taken effect.Robert Haas2021-03-111-14/+70
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously, the code and documentation seem to have essentially assumed than a call to pg_wal_replay_pause() would take place immediately, but that's not the case, because we only check for a pause in certain places. This means that a tool that uses this function and then wants to do something else afterward that is dependent on the pause having taken effect doesn't know how long it needs to wait to be sure that no more WAL is going to be replayed. To avoid that, add a new function pg_get_wal_replay_pause_state() which returns either 'not paused', 'paused requested', or 'paused'. After calling pg_wal_replay_pause() the status will immediate change from 'not paused' to 'pause requested'; when the startup process has noticed this, the status will change to 'pause'. For backward compatibility, pg_is_wal_replay_paused() still exists and returns the same thing as before: true if a pause has been requested, whether or not it has taken effect yet; and false if not. The documentation is updated to clarify. To improve the changes that a pause request is quickly confirmed effective, adjust things so that WaitForWALToBecomeAvailable will swiftly reach a call to recoveryPausesHere() when a pause request is made. Dilip Kumar, reviewed by Simon Riggs, Kyotaro Horiguchi, Yugo Nagata, Masahiko Sawada, and Bharath Rupireddy. Discussion: http://postgr.es/m/CAFiTN-vcLLWEm8Zr%3DYK83rgYrT9pbC8VJCfa1kY9vL3AUPfu6g%40mail.gmail.com
* Track total amounts of times spent writing and syncing WAL data to disk.Fujii Masao2021-03-091-1/+53
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit adds new GUC track_wal_io_timing. When this is enabled, the total amounts of time XLogWrite writes and issue_xlog_fsync syncs WAL data to disk are counted in pg_stat_wal. This information would be useful to check how much WAL write and sync affect the performance. Enabling track_wal_io_timing will make the server query the operating system for the current time every time WAL is written or synced, which may cause significant overhead on some platforms. To avoid such additional overhead in the server with track_io_timing enabled, this commit introduces track_wal_io_timing as a separate parameter from track_io_timing. Note that WAL write and sync activity by walreceiver has not been tracked yet. This commit makes the server also track the numbers of times XLogWrite writes and issue_xlog_fsync syncs WAL data to disk, in pg_stat_wal, regardless of the setting of track_wal_io_timing. This counters can be used to calculate the WAL write and sync time per request, for example. Bump PGSTAT_FILE_FORMAT_ID. Bump catalog version. Author: Masahiro Ikeda Reviewed-By: Japin Li, Hayato Kuroda, Masahiko Sawada, David Johnston, Fujii Masao Discussion: https://postgr.es/m/0509ad67b585a5b86a83d445dfa75392@oss.nttdata.com
* Simplify printing of LSNsPeter Eisentraut2021-02-231-65/+45
| | | | | | | | | | Add a macro LSN_FORMAT_ARGS for use in printf-style printing of LSNs. Convert all applicable code to use it. Reviewed-by: Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://www.postgresql.org/message-id/flat/CAExHW5ub5NaTELZ3hJUCE6amuvqAtsSxc7O+uK7y4t9Rrk23cw@mail.gmail.com
* Use errmsg_internal for debug messagesPeter Eisentraut2021-02-171-12/+12
| | | | | | An inconsistent set of debug-level messages was not using errmsg_internal(), thus uselessly exposing the messages to translation work. Fix those.
* Clarify some comments around SharedRecoveryState in xlog.cMichael Paquier2021-02-061-9/+8
| | | | | | | | | SharedRecoveryState has been switched from a boolean to an enum as of commit 4e87c48, but some comments still referred to it as a boolean. Author: Amul Sul Reviewed-by: Dilip Kumar, Kyotaro Horiguchi Discussion: https://postgr.es/m/CAAJ_b97Hf+1SXnm8jySpO+Fhm+-VKFAAce1T_cupUYtnE3Nxig
* Retire pg_standby.Thomas Munro2021-01-291-1/+1
| | | | | | | | | | | | | | pg_standby was useful more than a decade ago, but now it is obsolete. It has been proposed that we retire it many times. Now seems like a good time to finally do it, because "waiting restore commands" are incompatible with a proposed recovery prefetching feature. Discussion: https://postgr.es/m/20201029024412.GP5380%40telsasoft.com Author: Justin Pryzby <pryzby@telsasoft.com> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com>
* Move StartupCLOG() calls to just after we initialize ShmemVariableCache.Robert Haas2021-01-271-7/+10
| | | | | | | | | | | | | Previously, the hot_standby=off code path did this at end of recovery, while the hot_standby=on code path did it at the beginning of recovery. It's better to do this in only one place because (a) it's simpler, (b) StartupCLOG() is trivial so trying to postpone the work isn't useful, and (c) this will make it possible to simplify some other logic. Patch by me, reviewed by Heikki Linnakangas. Discussion: http://postgr.es/m/CA+TgmoZYig9+AQodhF5sRXuKkJ=RgFDugLr3XX_dz_F-p=TwTg@mail.gmail.com
* Remove CheckpointLock.Robert Haas2021-01-251-27/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Up until now, we've held this lock when performing a checkpoint or restartpoint, but commit 076a055acf3c55314de267c62b03191586d79cf6 back in 2004 and commit 7e48b77b1cebb9a43f9fdd6b17128a0ba36132f9 from 2009, taken together, have removed all need for this. In the present code, there's only ever one process entitled to attempt a checkpoint: either the checkpointer, during normal operation, or the postmaster, during single-user operation. So, we don't need the lock. One possible concern in making this change is that it means that a substantial amount of code where HOLD_INTERRUPTS() was previously in effect due to the preceding LWLockAcquire() will now be running without that. This could mean that ProcessInterrupts() gets called in places from which it didn't before. However, this seems unlikely to do very much, because the checkpointer doesn't have any signal mapped to die(), so it's not clear how, for example, ProcDiePending = true could happen in the first place. Similarly with ClientConnectionLost and recovery conflicts. Also, if there are any such problems, we might want to fix them rather than reverting this, since running lots of code with interrupt handling suspended is generally bad. Patch by me, per an inquiry by Amul Sul. Review by Tom Lane and Michael Paquier. Discussion: http://postgr.es/m/CAAJ_b97XnBBfYeSREDJorFsyoD1sHgqnNuCi=02mNQBUMnA=FA@mail.gmail.com
* Pause recovery for insufficient parameter settingsPeter Eisentraut2021-01-181-5/+54
| | | | | | | | | | | | | | | | | | When certain parameters are changed on a physical replication primary, this is communicated to standbys using the XLOG_PARAMETER_CHANGE WAL record. The standby then checks whether its own settings are at least as big as the ones on the primary. If not, the standby shuts down with a fatal error. This patch changes this behavior for hot standbys to pause recovery at that point instead. That allows read traffic on the standby to continue while database administrators figure out next steps. When recovery is unpaused, the server shuts down (as before). The idea is to fix the parameters while recovery is paused and then restart when there is a maintenance window. Reviewed-by: Sergei Kornilov <sk@zsrv.org> Discussion: https://www.postgresql.org/message-id/flat/4ad69a4c-cc9b-0dfe-0352-8b1b0cd36c7b@2ndquadrant.com
* Fix O(N^2) stat() calls when recycling WAL segmentsMichael Paquier2021-01-151-30/+31
| | | | | | | | | | | | | | | | | | The counter tracking the last segment number recycled was getting initialized when recycling one single segment, while it should be used across a full cycle of segments recycled to prevent useless checks related to entries already recycled. This performance issue has been introduced by b2a5545, and it was first implemented in 61b86142. No backpatch is done per the lack of field complaints. Reported-by: Andres Freund, Thomas Munro Author: Michael Paquier Reviewed-By: Andres Freund Discussion: https://postgr.es/m/20170621211016.eln6cxxp3jrv7m4m@alap3.anarazel.de Discussion: https://postgr.es/m/CA+hUKG+DRiF9z1_MU4fWq+RfJMxP7zjoptfcmuCFPeO4JM2iVg@mail.gmail.com
* Use vectored I/O to fill new WAL segments.Thomas Munro2021-01-111-6/+22
| | | | | | | | | | | | | Instead of making many block-sized write() calls to fill a new WAL file with zeroes, make a smaller number of pwritev() calls (or various emulations). The actual number depends on the OS's IOV_MAX, which PG_IOV_MAX currently caps at 32. That means we'll write 256kB per call on typical systems. We may want to tune the number later with more experience. Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CA%2BhUKGJA%2Bu-220VONeoREBXJ9P3S94Y7J%2BkqCnTYmahvZJwM%3Dg%40mail.gmail.com
* Update copyright for 2021Bruce Momjian2021-01-021-1/+1
| | | | Backpatch-through: 9.5