summaryrefslogtreecommitdiff
path: root/src/backend/access/transam/xlog.c
Commit message (Collapse)AuthorAgeFilesLines
...
* Revert "Add key management system" (978f869b99) & later commitsBruce Momjian2020-12-271-21/+0
| | | | | | | | | | The patch needs test cases, reorganization, and cfbot testing. Technically reverts commits 5c31afc49d..e35b2bad1a (exclusive/inclusive) and 08db7c63f3..ccbe34139b. Reported-by: Tom Lane, Michael Paquier Discussion: https://postgr.es/m/E1ktAAG-0002V2-VB@gemulon.postgresql.org
* Add key management systemBruce Momjian2020-12-251-0/+21
| | | | | | | | | | | | | | | This adds a key management system that stores (currently) two data encryption keys of length 128, 192, or 256 bits. The data keys are AES256 encrypted using a key encryption key, and validated via GCM cipher mode. A command to obtain the key encryption key must be specified at initdb time, and will be run at every database server start. New parameters allow a file descriptor open to the terminal to be passed. pg_upgrade support has also been added. Discussion: https://postgr.es/m/CA+fd4k7q5o6Nc_AaX6BcYM9yqTbC6_pnH-6nSD=54Zp6NBQTCQ@mail.gmail.com Discussion: https://postgr.es/m/20201202213814.GG20285@momjian.us Author: Masahiko Sawada, me, Stephen Frost
* Fix typos and grammar in docs and commentsMichael Paquier2020-12-241-1/+1
| | | | | | | | This fixes several areas of the documentation and some comments in matters of style, grammar, or even format. Author: Justin Pryzby Discussion: https://postgr.es/m/20201222041153.GK30237@telsasoft.com
* Revert "Get rid of the dedicated latch for signaling the startup process".Fujii Masao2020-12-171-15/+25
| | | | | | | | | | | | | | | | | Revert ac22929a26, as well as the followup fix 113d3591b8. Because it broke the assumption that the startup process waiting for the recovery conflict on buffer pin should be waken up only by buffer unpin or the timeout enabled in ResolveRecoveryConflictWithBufferPin(). It caused, for example, SIGHUP signal handler or walreceiver process to wake that startup process up unnecessarily frequently. Additionally, add the comments about why that dedicated latch that the reverted patch tried to get rid of should not be removed. Thanks to Kyotaro Horiguchi for the discussion. Author: Fujii Masao Discussion: https://postgr.es/m/d8c0c608-021b-3c73-fffd-3240829ee986@oss.nttdata.com
* Add some checkpoint/restartpoint status to ps displayMichael Paquier2020-12-141-0/+45
| | | | | | | | | | | | | | | | | This is done for end-of-recovery and shutdown checkpoints/restartpoints (end-of-recovery restartpoints don't exist) rather than all types of checkpoints, in cases where it may not be possible to rely on pg_stat_activity to get a status from the startup or checkpointer processes. For example, at the end of a crash recovery, this is useful to know if a checkpoint is running in the startup process, while previously the ps display may only show some information about "recovering" something, that can be confusing while a checkpoint runs. Author: Justin Pryzby Reviewed-by: Nathan Bossart, Kirk Jamison, Fujii Masao, Michael Paquier Discussion: https://postgr.es/m/20200818225238.GP17022@telsasoft.com
* Convert elog(LOG) calls to ereport() where appropriatePeter Eisentraut2020-12-041-32/+68
| | | | | | | | | | | User-visible log messages should go through ereport(), so they are subject to translation. Many remaining elog(LOG) calls are really debugging calls. Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://www.postgresql.org/message-id/flat/92d6f545-5102-65d8-3c87-489f71ea0a37%40enterprisedb.com
* Replace a macro by a functionPeter Eisentraut2020-11-201-10/+11
| | | | | | Using a macro is ugly and not justified here. Discussion: https://www.postgresql.org/message-id/flat/4ad69a4c-cc9b-0dfe-0352-8b1b0cd36c7b@2ndquadrant.com
* Make the standby server promptly handle interrupt signals.Fujii Masao2020-11-161-0/+3
| | | | | | | | | | | | | | | | This commit changes the startup process in the standby server so that it handles the interrupt signals after waiting for wal_retrieve_retry_interval on the latch and resetting it, before entering another wait on the latch. This change causes the standby server to promptly handle interrupt signals. Otherwise, previously, there was the case where the standby needs to wait extra five seconds to shutdown when the shutdown request arrived while the startup process was waiting for wal_retrieve_retry_interval on the latch. Author: Fujii Masao, but implementation idea is from Soumyadeep Chakraborty Reviewed-by: Soumyadeep Chakraborty Discussion: https://postgr.es/m/9d7e6ab0-8a53-ddb9-63cd-289bcb25fe0e@oss.nttdata.com
* Fix and simplify some usages of TimestampDifference().Tom Lane2020-11-101-54/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce TimestampDifferenceMilliseconds() to simplify callers that would rather have the difference in milliseconds, instead of the select()-oriented seconds-and-microseconds format. This gets rid of at least one integer division per call, and it eliminates some apparently-easy-to-mess-up arithmetic. Two of these call sites were in fact wrong: * pg_prewarm's autoprewarm_main() forgot to multiply the seconds by 1000, thus ending up with a delay 1000X shorter than intended. That doesn't quite make it a busy-wait, but close. * postgres_fdw's pgfdw_get_cleanup_result() thought it needed to compute microseconds not milliseconds, thus ending up with a delay 1000X longer than intended. Somebody along the way had noticed this problem but misdiagnosed the cause, and imposed an ad-hoc 60-second limit rather than fixing the units. This was relatively harmless in context, because we don't care that much about exactly how long this delay is; still, it's wrong. There are a few more callers of TimestampDifference() that don't have a direct need for seconds-and-microseconds, but can't use TimestampDifferenceMilliseconds() either because they do need microsecond precision or because they might possibly deal with intervals long enough to overflow 32-bit milliseconds. It might be worth inventing another API to improve that, but that seems outside the scope of this patch; so those callers are untouched here. Given the fact that we are fixing some bugs, and the likelihood that future patches might want to back-patch code that uses this new API, back-patch to all supported branches. Alexey Kondratov and Tom Lane Discussion: https://postgr.es/m/3b1c053a21c07c1ed5e00be3b2b855ef@postgrespro.ru
* Fix segmentation fault that commit ac22929a26 caused.Fujii Masao2020-11-041-1/+2
| | | | | | | | | | | | | | | | | Commit ac22929a26 changed recoveryWakeupLatch so that it's reset to NULL at the end of recovery. This change could cause a segmentation fault in the buildfarm member 'elver'. Previously the latch was reset to NULL after calling ShutdownWalRcv(). But there could be a window between ShutdownWalRcv() and the actual exit of walreceiver. If walreceiver set the latch during that window, the segmentation fault could happen. To fix the issue, this commit changes walreceiver so that it sets the latch only when the latch has not been reset to NULL yet. Author: Fujii Masao Discussion: https://postgr.es/m/5c1f8a85-747c-7bf9-241e-dd467d8a3586@iki.fi
* Get rid of the dedicated latch for signaling the startup process.Fujii Masao2020-11-041-15/+14
| | | | | | | | | | | | | | This commit gets rid of the dedicated latch for signaling the startup process in favor of using its procLatch, since that comports better with possible generic signal handlers using that latch. Commit 1e53fe0e70 changed background processes so that they use standard SIGHUP handler. Like that, this commit also makes the startup process use standard SIGHUP handler to simplify the code. Author: Fujii Masao Reviewed-by: Bharath Rupireddy, Michael Paquier Discussion: https://postgr.es/m/CALj2ACXPorUqePswDtOeM_s82v9RW32E1fYmOPZ5NuE+TWKj_A@mail.gmail.com
* Fix compilation warning in xlog.cMichael Paquier2020-10-061-2/+0
| | | | | | | Oversight in 9d0bd95. Reported-by: Andres Freund Discussion: https://postgr.es/m/20201006023802.qqfi6m5bw5y77zql@alap3.anarazel.de
* Add pg_stat_wal statistics view.Fujii Masao2020-10-021-0/+1
| | | | | | | | | | | | | | | | | | | | This view shows the statistics about WAL activity. Currently it has only two columns: wal_buffers_full and stats_reset. wal_buffers_full column indicates the number of times WAL data was written to the disk because WAL buffers got full. This information is useful when tuning wal_buffers. stats_reset column indicates the time at which these statistics were last reset. pg_stat_wal view is also the basic infrastructure to expose other various statistics about WAL activity later. Bump PGSTAT_FILE_FORMAT_ID due to the change in pgstat format. Bump catalog version. Author: Masahiro Ikeda Reviewed-by: Takayuki Tsunakawa, Kyotaro Horiguchi, Amit Kapila, Fujii Masao Discussion: https://postgr.es/m/188bd3f2d2233cf97753b5ced02bb050@oss.nttdata.com
* Add block information in error context of WAL REDO apply loopMichael Paquier2020-10-021-1/+15
| | | | | | | | | | | | | | | | Providing this information can be useful for example when diagnosing problems related to recovery conflicts or for recovery issues without having to go through the output generated by pg_waldump to get some information about the blocks a WAL record works on. The block information is printed in the same format as pg_waldump. This already existed in xlog.c for debugging purposes with -DWAL_DEBUG, so adding the block information in the callback has required just a small refactoring. Author: Bertrand Drouvot Reviewed-by: Michael Paquier, Masahiko Sawada Discussion: https://postgr.es/m/c31e2cba-efda-762c-f4ad-5c25e5dac3d0@amazon.com
* Defer flushing of SLRU files.Thomas Munro2020-09-251-10/+18
| | | | | | | | | | | | | | | | | | | | | | | | Previously, we called fsync() after writing out individual pg_xact, pg_multixact and pg_commit_ts pages due to cache pressure, leading to regular I/O stalls in user backends and recovery. Collapse requests for the same file into a single system call as part of the next checkpoint, as we already did for relation files, using the infrastructure developed by commit 3eb77eba. This can cause a significant improvement to recovery performance, especially when it's otherwise CPU-bound. Hoist ProcessSyncRequests() up into CheckPointGuts() to make it clearer that it applies to all the SLRU mini-buffer-pools as well as the main buffer pool. Rearrange things so that data collected in CheckpointStats includes SLRU activity. Also remove the Shutdown{CLOG,CommitTS,SUBTRANS,MultiXact}() functions, because they were redundant after the shutdown checkpoint that immediately precedes them. (I'm not sure if they were ever needed, but they aren't now.) Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> (parts) Tested-by: Jakub Wartak <Jakub.Wartak@tomtom.com> Discussion: https://postgr.es/m/CA+hUKGLJ=84YT+NvhkEEDAuUtVHMfQ9i-N7k_o50JmQ6Rpj_OQ@mail.gmail.com
* Report resource usage at the end of recoveryDavid Rowley2020-09-161-2/+7
| | | | | | | | | | Reporting this has been rather useful in some recent recovery speedup work. It also seems like something that will be useful to the average DBA too. Author: David Rowley Reviewed-by: Thomas Munro Discussion: https://postgr.es/m/CAApHDvqYVORiZxq2xPvP6_ndmmsTkvr6jSYv4UTNaFa5i1kd%3DQ%40mail.gmail.com
* Message fixes and style improvementsPeter Eisentraut2020-09-141-1/+1
|
* snapshot scalability: Don't compute global horizons while building snapshots.Andres Freund2020-08-121-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To make GetSnapshotData() more scalable, it cannot not look at at each proc's xmin: While snapshot contents do not need to change whenever a read-only transaction commits or a snapshot is released, a proc's xmin is modified in those cases. The frequency of xmin modifications leads to, particularly on higher core count systems, many cache misses inside GetSnapshotData(), despite the data underlying a snapshot not changing. That is the most significant source of GetSnapshotData() scaling poorly on larger systems. Without accessing xmins, GetSnapshotData() cannot calculate accurate horizons / thresholds as it has so far. But we don't really have to: The horizons don't actually change that much between GetSnapshotData() calls. Nor are the horizons actually used every time a snapshot is built. The trick this commit introduces is to delay computation of accurate horizons until there use and using horizon boundaries to determine whether accurate horizons need to be computed. The use of RecentGlobal[Data]Xmin to decide whether a row version could be removed has been replaces with new GlobalVisTest* functions. These use two thresholds to determine whether a row can be pruned: 1) definitely_needed, indicating that rows deleted by XIDs >= definitely_needed are definitely still visible. 2) maybe_needed, indicating that rows deleted by XIDs < maybe_needed can definitely be removed GetSnapshotData() updates definitely_needed to be the xmin of the computed snapshot. When testing whether a row can be removed (with GlobalVisTestIsRemovableXid()) and the tested XID falls in between the two (i.e. XID >= maybe_needed && XID < definitely_needed) the boundaries can be recomputed to be more accurate. As it is not cheap to compute accurate boundaries, we limit the number of times that happens in short succession. As the boundaries used by GlobalVisTestIsRemovableXid() are never reset (with maybe_needed updated by GetSnapshotData()), it is likely that further test can benefit from an earlier computation of accurate horizons. To avoid regressing performance when old_snapshot_threshold is set (as that requires an accurate horizon to be computed), heap_page_prune_opt() doesn't unconditionally call TransactionIdLimitedForOldSnapshots() anymore. Both the computation of the limited horizon, and the triggering of errors (with SetOldSnapshotThresholdTimestamp()) is now only done when necessary to remove tuples. This commit just removes the accesses to PGXACT->xmin from GetSnapshotData(), but other members of PGXACT residing in the same cache line are accessed. Therefore this in itself does not result in a significant improvement. Subsequent commits will take advantage of the fact that GetSnapshotData() now does not need to access xmins anymore. Note: This contains a workaround in heap_page_prune_opt() to keep the snapshot_too_old tests working. While that workaround is ugly, the tests currently are not meaningful, and it seems best to address them separately. Author: Andres Freund <andres@anarazel.de> Reviewed-By: Robert Haas <robertmhaas@gmail.com> Reviewed-By: Thomas Munro <thomas.munro@gmail.com> Reviewed-By: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
* Track latest completed xid as a FullTransactionId.Andres Freund2020-08-111-2/+2
| | | | | | | | | | | | | | | The reason for doing so is that a subsequent commit will need that to avoid wraparound issues. As the subsequent change is large this was split out for easier review. The reason this is not a perfect straight-forward change is that we do not want track 64bit xids in the procarray or the WAL. Therefore we need to advance lastestCompletedXid in relation to 32 bit xids. The code for that is now centralized in MaintainLatestCompletedXid*. Author: Andres Freund Reviewed-By: Thomas Munro, Robert Haas, David Rowley Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
* Rename VariableCacheData.nextFullXid to nextXid.Andres Freund2020-08-111-24/+24
| | | | | | | | | Including Full in variable names duplicates the type information and leads to overly long names. As FullTransactionId cannot accidentally be casted to TransactionId that does not seem necessary. Author: Andres Freund Discussion: https://postgr.es/m/20200724011143.jccsyvsvymuiqfxu@alap3.anarazel.de
* Remove non-fast promotion.Fujii Masao2020-07-291-37/+11
| | | | | | | | | | | | | | | When fast promotion was supported in 9.3, non-fast promotion became undocumented feature and it's basically not available for ordinary users. However we decided not to remove non-fast promotion at that moment, to leave it for a release or two for debugging purpose or as an emergency method because fast promotion might have some issues, and then to remove it later. Now, several versions were released since that decision and there is no longer reason to keep supporting non-fast promotion. Therefore this commit removes non-fast promotion. Author: Fujii Masao Reviewed-by: Hamid Akhtar, Kyotaro Horiguchi Discussion: https://postgr.es/m/76066434-648f-f567-437b-54853b43398f@oss.nttdata.com
* Rename wal_keep_segments to wal_keep_size.Fujii Masao2020-07-201-12/+18
| | | | | | | | | | | | | | | | | | | | | | | | | max_slot_wal_keep_size that was added in v13 and wal_keep_segments are the GUC parameters to specify how much WAL files to retain for the standby servers. While max_slot_wal_keep_size accepts the number of bytes of WAL files, wal_keep_segments accepts the number of WAL files. This difference of setting units between those similar parameters could be confusing to users. To alleviate this situation, this commit renames wal_keep_segments to wal_keep_size, and make users specify the WAL size in it instead of the number of WAL files. There was also the idea to rename max_slot_wal_keep_size to max_slot_wal_keep_segments, in the discussion. But we have been moving away from measuring in segments, for example, checkpoint_segments was replaced by max_wal_size. So we concluded to rename wal_keep_segments to wal_keep_size. Back-patch to v13 where max_slot_wal_keep_size was added. Author: Fujii Masao Reviewed-by: Álvaro Herrera, Kyotaro Horiguchi, David Steele Discussion: https://postgr.es/m/574b4ea3-e0f9-b175-ead2-ebea7faea855@oss.nttdata.com
* Fix uninitialized value in segno calculationAlvaro Herrera2020-07-131-6/+9
| | | | | | | | | | | | Remove previous hack in KeepLogSeg that added a case to deal with a (badly represented) invalid segment number. This was added for the sake of GetWALAvailability. But it's not needed if in that function we initialize the segment number to be retreated to the currently being written segment, so do that instead. Per valgrind-running buildfarm member skink, and some sparc64 animals. Discussion: https://postgr.es/m/1724648.1594230917@sss.pgh.pa.us
* code: replace most remaining uses of 'master'.Andres Freund2020-07-081-7/+9
| | | | | | Author: Andres Freund Reviewed-By: David Steele Discussion: https://postgr.es/m/20200615182235.x7lch5n6kcjq4aue@alap3.anarazel.de
* code: replace 'master' with 'primary' where appropriate.Andres Freund2020-07-081-28/+28
| | | | | | | | | Also changed "in the primary" to "on the primary", and added a few "the" before "primary". Author: Andres Freund Reviewed-By: David Steele Discussion: https://postgr.es/m/20200615182235.x7lch5n6kcjq4aue@alap3.anarazel.de
* Fix incorrect variable datatype.Fujii Masao2020-07-081-1/+1
| | | | | | | | | | | Since slot_keep_segs indicates the number of WAL segments not LSN, its datatype should not be XLogRecPtr. Back-patch to v13 where this issue was added. Reported-by: Atsushi Torikoshi Author: Atsushi Torikoshi, tweaked by Fujii Masao Discussion: https://postgr.es/m/ebd0d674f3e050222238a960cac5251a@oss.nttdata.com
* Morph pg_replication_slots.min_safe_lsn to safe_wal_sizeAlvaro Herrera2020-07-071-4/+2
| | | | | | | | | | | | | | | The previous definition of the column was almost universally disliked, so provide this updated definition which is more useful for monitoring purposes: a large positive value is good, while zero or a negative value means danger. This should be operationally more convenient. Backpatch to 13, where the new column to pg_replication_slots (and the feature it represents) were added. Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Author: Álvaro Herrera <alvherre@alvh.no-ip.org> Reported-by: Fujii Masao <masao.fujii@oss.nttdata.com> Discussion: https://postgr.es/m/9ddfbf8c-2f67-904d-44ed-cf8bc5916228@oss.nttdata.com
* Remove duplicate check added by commit b2a5545bd6.Amit Kapila2020-06-271-1/+1
| | | | | | | As this doesn't cause any harm so we decided to this clean up in HEAD only. Author: Ádám Balogh Discussion: https://postgr.es/m/VI1PR0702MB36631BD67559461AFDE1FEEE81920@VI1PR0702MB3663.eurprd07.prod.outlook.com
* Adjust max_slot_wal_keep_size behavior per reviewAlvaro Herrera2020-06-241-26/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | In pg_replication_slot, change output from normal/reserved/lost to reserved/extended/unreserved/ lost, which better expresses the possible states particularly near the time where segments are no longer safe but checkpoint has not run yet. Under the new definition, reserved means the slot is consuming WAL that's still under the normal WAL size constraints; extended means it's consuming WAL that's being protected by wal_keep_segments or the slot itself, whose size is below max_slot_wal_keep_size; unreserved means the WAL is no longer safe, but checkpoint has not yet removed those files. Such as slot is in imminent danger, but can still continue for a little while and may catch up to the reserved WAL space. Also, there were some bugs in the calculations used to report the status; fixed those. Backpatch to 13. Reported-by: Fujii Masao <masao.fujii@oss.nttdata.com> Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com> Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org> Discussion: https://postgr.es/m/20200616.120236.1809496990963386593.horikyota.ntt@gmail.com
* Add parens to ConvertToXSegs macroAlvaro Herrera2020-06-241-2/+5
| | | | | | | The current definition is dangerous. No bugs exist in our code at present, but backpatch to 11 nonetheless where it was introduced. Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
* Don't export basebackup.c's sendTablespace().Robert Haas2020-06-171-12/+2
| | | | | | | | | | | | | Commit 72d422a5227ef6f76f412486a395aba9f53bf3f0 made xlog.c call sendTablespace() with the 'sizeonly' argument set to true, which required basebackup.c to export sendTablespace(). However, that's kind of ugly, so instead defer the call to sendTablespace() until basebackup.c regains control. That way, it can still be a static function. Patch by me, reviewed by Amit Kapila and Kyotaro Horiguchi. Discussion: http://postgr.es/m/CA+TgmoYq+59SJ2zBbP891ngWPA9fymOqntqYcweSDYXS2a620A@mail.gmail.com
* Fix locking bugs that could corrupt pg_control.Thomas Munro2020-06-081-0/+8
| | | | | | | | | | | | | | | | | | The redo routines for XLOG_CHECKPOINT_{ONLINE,SHUTDOWN} must acquire ControlFileLock before modifying ControlFile->checkPointCopy, or the checkpointer could write out a control file with a bad checksum. Likewise, XLogReportParameters() must acquire ControlFileLock before modifying ControlFile and calling UpdateControlFile(). Back-patch to all supported releases. Author: Nathan Bossart <bossartn@amazon.com> Author: Fujii Masao <masao.fujii@oss.nttdata.com> Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Thomas Munro <thomas.munro@gmail.com> Discussion: https://postgr.es/m/70BF24D6-DC51-443F-B55A-95735803842A%40amazon.com
* Initial pgindent and pgperltidy run for v13.Tom Lane2020-05-141-9/+14
| | | | | | | | | | | Includes some manual cleanup of places that pgindent messed up, most of which weren't per project style anyway. Notably, it seems some people didn't absorb the style rules of commit c9d297751, because there were a bunch of new occurrences of function calls with a newline just after the left paren, all with faulty expectations about how the rest of the call would get indented.
* Collect built-in LWLock tranche names statically, not dynamically.Tom Lane2020-05-141-4/+1
| | | | | | | | | | | | | | | | | | There is little point in using the LWLockRegisterTranche mechanism for built-in tranche names. It wastes cycles, it creates opportunities for bugs (since failing to register a tranche name is a very hard-to-detect problem), and the lack of any centralized list of names encourages sloppy nonconformity in name choices. Moreover, since we have a centralized list of the tranches anyway in enum BuiltinTrancheIds, we're certainly not buying any flexibility in return for these disadvantages. Hence, nuke all the backend-internal LWLockRegisterTranche calls, and instead provide a const array of the builtin tranche names. (I have in mind to change a bunch of these names shortly, but this patch is just about getting them into one place.) Discussion: https://postgr.es/m/9056.1589419765@sss.pgh.pa.us
* Rework XLogReader callback systemAlvaro Herrera2020-05-081-3/+7
| | | | | | | | | | | | | | | | | | | Code review for 0dc8ead46363, prompted by a bug closed by 91c40548d5f7. XLogReader's system for opening and closing segments had gotten too complicated, with callbacks being passed at both the XLogReaderAllocate level (read_page) as well as at the WALRead level (segment_open). This was confusing and hard to follow, so restructure things so that these callbacks are passed together at XLogReaderAllocate time, and add another callback to the set (segment_close) to make it a coherent whole. Also, ensure XLogReaderState is an argument to all the callbacks, so that they can grab at the ->private data if necessary. Document the whole arrangement more clearly. Author: Álvaro Herrera <alvherre@alvh.no-ip.org> Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Discussion: https://postgr.es/m/20200422175754.GA19858@alvherre.pgsql
* Change the display of WAL usage statistics in Explain.Amit Kapila2020-05-051-2/+2
| | | | | | | | | | | | | | | | In commit 33e05f89c5, we have added the option to display WAL usage statistics in Explain and auto_explain. The display format used two spaces between each field which is inconsistent with Buffer usage statistics which is using one space between each field. Change the format to make WAL usage statistics consistent with Buffer usage statistics. This commit also changed the usage of "full page writes" to "full page images" for WAL usage statistics to make it consistent with other parts of code and docs. Author: Julien Rouhaud, Amit Kapila Reviewed-by: Justin Pryzby, Kyotaro Horiguchi and Amit Kapila Discussion: https://postgr.es/m/CAB-hujrP8ZfUkvL5OYETipQwA=e3n7oqHFU=4ZLxWS_Cza3kQQ@mail.gmail.com
* Fix typoPeter Eisentraut2020-04-261-1/+1
| | | | from 303640199d0
* Fix handling of WAL segments ready to be archived during crash recoveryMichael Paquier2020-04-241-9/+48
| | | | | | | | | | | | | | | | | | | | | | | | | | | | 78ea8b5 has fixed an issue related to the recycling of WAL segments on standbys depending on archive_mode. However, it has introduced a regression with the handling of WAL segments ready to be archived during crash recovery, causing those files to be recycled without getting archived. This commit fixes the regression by tracking in shared memory if a live cluster is either in crash recovery or archive recovery as the handling of WAL segments ready to be archived is different in both cases (those WAL segments should not be removed during crash recovery), and by using this new shared memory state to decide if a segment can be recycled or not. Previously, it was not possible to know if a cluster was in crash recovery or archive recovery as the shared state was able to track only if recovery was happening or not, leading to the problem. A set of TAP tests is added to close the gap here, making sure that WAL segments ready to be archived are correctly handled when a cluster is in archive or crash recovery with archive_mode set to "on" or "always", for both standby and primary. Reported-by: Benoît Lobréau Author: Jehan-Guillaume de Rorthais Reviewed-by: Kyotaro Horiguchi, Fujii Masao, Michael Paquier Discussion: https://postgr.es/m/20200331172229.40ee00dc@firost Backpatch-through: 9.5
* Cosmetic fixups for WAL usage work.Amit Kapila2020-04-131-1/+1
| | | | | | | Reported-by: Justin Pryzby and Euler Taveira Author: Justin Pryzby and Julien Rouhaud Reviewed-by: Amit Kapila Discussion: https://postgr.es/m/CAB-hujrP8ZfUkvL5OYETipQwA=e3n7oqHFU=4ZLxWS_Cza3kQQ@mail.gmail.com
* Rationalize GetWalRcv{Write,Flush}RecPtr().Thomas Munro2020-04-081-10/+10
| | | | | | | | | | | | | | | | GetWalRcvWriteRecPtr() previously reported the latest *flushed* location. Adopt the conventional terminology used elsewhere in the tree by renaming it to GetWalRcvFlushRecPtr(), and likewise for some related variables that used the term "received". Add a new definition of GetWalRcvWriteRecPtr(), which returns the latest *written* value. This will allow later patches to use the value for non-data-integrity purposes, without having to wait for the flush pointer to advance. Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
* Revert 0f5ca02f53Alexander Korotkov2020-04-081-13/+0
| | | | | | | | 0f5ca02f53 introduces 3 new keywords. It appears to be too much for relatively small feature. Given now we past feature freeze, it's already late for discussion of the new syntax. So, revert. Discussion: https://postgr.es/m/28209.1586294824%40sss.pgh.pa.us
* Allow users to limit storage reserved by replication slotsAlvaro Herrera2020-04-071-23/+122
| | | | | | | | | | | | | | | | Replication slots are useful to retain data that may be needed by a replication system. But experience has shown that allowing them to retain excessive data can lead to the primary failing because of running out of space. This new feature allows the user to configure a maximum amount of space to be reserved using the new option max_slot_wal_keep_size. Slots that overrun that space are invalidated at checkpoint time, enabling the storage to be released. Author: Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Jehan-Guillaume de Rorthais <jgdr@dalibo.com> Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org> Discussion: https://postgr.es/m/20170228.122736.123383594.horiguchi.kyotaro@lab.ntt.co.jp
* Implement waiting for given lsn at transaction startAlexander Korotkov2020-04-071-0/+13
| | | | | | | | | | | | | | | | | | | | This commit adds following optional clause to BEGIN and START TRANSACTION commands. WAIT FOR LSN lsn [ TIMEOUT timeout ] New clause pospones transaction start till given lsn is applied on standby. This clause allows user be sure, that changes previously made on primary would be visible on standby. New shared memory struct is used to track awaited lsn per backend. Recovery process wakes up backend once required lsn is applied. Author: Ivan Kartyshov, Anna Akenteva Reviewed-by: Craig Ringer, Thomas Munro, Robert Haas, Kyotaro Horiguchi Reviewed-by: Masahiko Sawada, Ants Aasma, Dmitry Ivanov, Simon Riggs Reviewed-by: Amit Kapila, Alexander Korotkov Discussion: https://postgr.es/m/0240c26c-9f84-30ea-fca9-93ab2df5f305%40postgrespro.ru
* Prevent archive recovery from scanning non-existent WAL files.Fujii Masao2020-04-081-1/+26
| | | | | | | | | | | | | | | | | | | Previously when there were multiple timelines listed in the history file of the recovery target timeline, archive recovery searched all of them, starting from the newest timeline to the oldest one, to find the segment to read. That is, archive recovery had to continuously fail scanning the segment until it reached the timeline that the segment belonged to. These scans for non-existent segment could be harmful on the recovery performance especially when archival area was located on the remote storage and each scan could take a long time. To address the issue, this commit changes archive recovery so that it skips scanning the timeline that the segment to read doesn't belong to. Author: Kyotaro Horiguchi, tweaked a bit by Fujii Masao Reviewed-by: David Steele, Pavel Suderevsky, Grigory Smolkin Discussion: https://postgr.es/m/16159-f5a34a3a04dc67e0@postgresql.org Discussion: https://postgr.es/m/20200129.120222.1476610231001551715.horikyota.ntt@gmail.com
* Revert "Improve handling of parameter differences in physical replication"Peter Eisentraut2020-04-041-64/+10
| | | | | | | | This reverts commit 246f136e76ecd26844840f2b2057e2c87ec9868d. That patch wasn't quite complete enough. Discussion: https://www.postgresql.org/message-id/flat/E1jIpJu-0007Ql-CL%40gemulon.postgresql.org
* Add infrastructure to track WAL usage.Amit Kapila2020-04-041-1/+11
| | | | | | | | | | | | | | | | | | | | | | This allows gathering the WAL generation statistics for each statement execution. The three statistics that we collect are the number of WAL records, the number of full page writes and the amount of WAL bytes generated. This helps the users who have write-intensive workload to see the impact of I/O due to WAL. This further enables us to see approximately what percentage of overall WAL is due to full page writes. In the future, we can extend this functionality to allow us to compute the the exact amount of WAL data due to full page writes. This patch in itself is just an infrastructure to compute WAL usage data. The upcoming patches will expose this data via explain, auto_explain, pg_stat_statements and verbose (auto)vacuum output. Author: Kirill Bychik, Julien Rouhaud Reviewed-by: Dilip Kumar, Fujii Masao and Amit Kapila Discussion: https://postgr.es/m/CAB-hujrP8ZfUkvL5OYETipQwA=e3n7oqHFU=4ZLxWS_Cza3kQQ@mail.gmail.com
* Generate backup manifests for base backups, and validate them.Robert Haas2020-04-031-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A manifest is a JSON document which includes (1) the file name, size, last modification time, and an optional checksum for each file backed up, (2) timelines and LSNs for whatever WAL will need to be replayed to make the backup consistent, and (3) a checksum for the manifest itself. By default, we use CRC-32C when checksumming data files, because we are trying to detect corruption and user error, not foil an adversary. However, pg_basebackup and the server-side BASE_BACKUP command now have options to select a different algorithm, so users wanting a cryptographic hash function can select SHA-224, SHA-256, SHA-384, or SHA-512. Users not wanting file checksums at all can disable them, or disable generating of the backup manifest altogether. Using a cryptographic hash function in place of CRC-32C consumes significantly more CPU cycles, which may slow down backups in some cases. A new tool called pg_validatebackup can validate a backup against the manifest. If no checksums are present, it can still check that the right files exist and that they have the expected sizes. If checksums are present, it can also verify that each file has the expected checksum. Additionally, it calls pg_waldump to verify that the expected WAL files are present and parseable. Only plain format backups can be validated directly, but tar format backups can be validated after extracting them. Robert Haas, with help, ideas, review, and testing from David Steele, Stephen Frost, Andrew Dunstan, Rushabh Lathia, Suraj Kharage, Tushar Ahuja, Rajkumar Raghuwanshi, Mark Dilger, Davinder Singh, Jeevan Chalke, Amit Kapila, Andres Freund, and Noah Misch. Discussion: http://postgr.es/m/CA+TgmoZV8dw1H2bzZ9xkKwdrk8+XYa+DC9H=F7heO2zna5T6qg@mail.gmail.com
* Improve the message logged when recovery is paused.Fujii Masao2020-04-011-8/+17
| | | | | | | | | | | | | | | | | When recovery target is reached and recovery is paused because of recovery_target_action=pause, executing pg_wal_replay_resume() causes the standby to promote, i.e., the recovery to end. So, in this case, the previous message "Execute pg_wal_replay_resume() to continue" logged was confusing because pg_wal_replay_resume() doesn't cause the recovery to continue. This commit improves the message logged when recovery is paused, and the proper message is output based on what (pg_wal_replay_pause or recovery_target_action) causes recovery to be paused. Author: Sergei Kornilov, revised by Fujii Masao Reviewed-by: Robert Haas Discussion: https://postgr.es/m/19168211580382043@myt5-b646bde4b8f3.qloud-c.yandex.net
* Move routine definitions of xlogarchive.c to a new header fileMichael Paquier2020-03-311-0/+1
| | | | | | | | | | | | | | The definitions of the routines defined in xlogarchive.c have been part of xlog_internal.h which is included by several frontend tools, but all those routines are only called by the backend. More cleanup could be done within xlog_internal.h, but that's already a nice cut. This will help a follow-up patch for pg_rewind where handling of restore_command is added for frontends. Author: Alexey Kondratov, Michael Paquier Reviewed-by: Álvaro Herrera, Alexander Korotkov Discussion: https://postgr.es/m/a3acff50-5a0d-9a2c-b3b2-ee36168955c1@postgrespro.ru
* Improve handling of parameter differences in physical replicationPeter Eisentraut2020-03-301-10/+64
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When certain parameters are changed on a physical replication primary, this is communicated to standbys using the XLOG_PARAMETER_CHANGE WAL record. The standby then checks whether its own settings are at least as big as the ones on the primary. If not, the standby shuts down with a fatal error. The correspondence of settings between primary and standby is required because those settings influence certain shared memory sizings that are required for processing WAL records that the primary might send. For example, if the primary sends a prepared transaction, the standby must have had max_prepared_transaction set appropriately or it won't be able to process those WAL records. However, fatally shutting down the standby immediately upon receipt of the parameter change record might be a bit of an overreaction. The resources related to those settings are not required immediately at that point, and might never be required if the activity on the primary does not exhaust all those resources. If we just let the standby roll on with recovery, it will eventually produce an appropriate error when those resources are used. So this patch relaxes this a bit. Upon receipt of XLOG_PARAMETER_CHANGE, we still check the settings but only issue a warning and set a global flag if there is a problem. Then when we actually hit the resource issue and the flag was set, we issue another warning message with relevant information. At that point we pause recovery, so a hot standby remains usable. We also repeat the last warning message once a minute so it is harder to miss or ignore. Reviewed-by: Sergei Kornilov <sk@zsrv.org> Reviewed-by: Masahiko Sawada <masahiko.sawada@2ndquadrant.com> Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/4ad69a4c-cc9b-0dfe-0352-8b1b0cd36c7b@2ndquadrant.com