Commit message | Author | Age | Files | Lines
* Use correct args in get_doc_infofix-test-get-doc-infoJay Doane2021-04-201-1/+1
| | | | | `fabric.get_doc_info/3` requires three arguments, but this line was only using one.
* Merge pull request #3312 from apache/weatherreportJay Doane2021-04-2035-1/+3326
|\ | | | | Import weatherreport
| * Build and escriptize weatherreportweatherreportJay Doane2021-04-204-2/+23
| |
| * Crash if config app fails to startJay Doane2021-04-201-1/+1
| |
| * Add getopt copyright to NOTICEJay Doane2021-04-201-0/+4
| |
| * Update weatherreport documentationJay Doane2021-04-202-3/+16
| |
| * Delete obsolete weatherreport files and documentationJay Doane2021-04-205-149/+0
| |
| * Change header preamble to "derived from riaknostic"Jay Doane2021-04-2011-11/+11
| |
| * Change search check failure from error to warningJay Doane2021-04-201-1/+2
| | | | | | | | | | In CouchDB, search is disabled by default, so a failure to connect to clouseau should only be a warning.
| * Support default IOQ in weatherreportJay Doane2021-04-202-1/+28
| | | | | | | | | | weatherreport previously relied on Cloudant's IOQ implementation. This adds support for the default IOQ so that it works with either.
| * Port custom recon process call checksJay Doane2021-04-201-2/+57
| | | | | | | | Port fork of custom `recon` functions for checking process calls.
| * Merge remote-tracking branch 'weatherreport/riaknostic-squash' into weatherreportJay Doane2021-04-2034-0/+3353
| |\ |/ /
| * Change s/cloudant/couchdb/g for maintenance_modeILYA Khlopotov2021-04-191-1/+1
| | | | | | | | BugzID: 45855
| * Downgrade process call count to noticeMike Wallace2021-04-191-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit downgrades messages about process call counts to notice level. The previous level of warning was inappropriate for many processes as it really just indicated normal operation of the system. Until this check is made a bit smarter, with custom thresholds per MFA, we will log at nothing more severe than notice so that we don't confuse operators into thinking something is wrong. BugzID: 37593
| * Fix description of process_memory checkMike Wallace2021-04-191-1/+1
| |
| * Check mean node statistics over one secondRiccardo Brognara2021-04-191-0/+66
| | | | | | | | | | | | | | | | | | | | | | | | | | Check the absolute statistics obtained by recon:node_stats/4 over a one second period. The values are sampled ten times and the mean is returned. For run_queue and process_count the mean is compared to hard-coded thresholds which determine whether a warning or info message is returned. For all other statistics an info message is always returned. BugzID: 32877
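The sampling logic described in this commit can be sketched roughly as follows. This is an illustrative Python sketch, not the project's Erlang code: the threshold values, the statistic name `bytes_in`, and the shape of the samples are assumptions, since the commit only says the thresholds are hard-coded.

```python
from statistics import mean

# Hypothetical thresholds; the commit does not state the real values.
THRESHOLDS = {"run_queue": 10, "process_count": 100_000}


def check_node_stats(samples):
    """Average each statistic over the samples and assign a message level.

    samples: list of dicts, one per sampling interval (ten in the commit).
    Only statistics with a threshold can produce a warning; all others
    always produce an info message.
    """
    messages = []
    for name in samples[0]:
        avg = mean(s[name] for s in samples)
        limit = THRESHOLDS.get(name)
        level = "warning" if limit is not None and avg > limit else "info"
        messages.append((level, name, avg))
    return messages


# Ten identical samples, standing in for ten readings over one second.
samples = [{"run_queue": 12, "process_count": 500, "bytes_in": 1000}] * 10
levels = {name: level for level, name, _ in check_node_stats(samples)}
```

Averaging over several samples keeps a single momentary spike from triggering a warning on its own.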
| * Check number of pending internal replication jobsRiccardo Brognara2021-04-191-0/+57
| | | | | | | | | | | | | | | | | | | | | | This commit adds a check for the number of pending internal replication jobs on a node. A large number of pending internal replication jobs indicates that internal replication is falling behind. A warning message is returned if the number of jobs exceeds a hard-coded threshold, otherwise an info message is returned. BugzID: 32872
| * Handle conflicted shard mapsMike Wallace2021-04-191-2/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | Custodian will output a message if there are one or more conflicting revisions for a shard map. Previously weatherreport_check_custodian failed to match that message which caused the check to fail. This commit adds new function clauses to report_to_message/1 and format/1 so that the conflict messages are output to the console at the warning log level. BugzID: 34157
| * there is no appRobert Newson2021-04-191-2/+1
| |
| * Don't include node name in diagnostic messagesMike Wallace2021-04-192-17/+14
| | | | | | | | | | | | Unless it is explicitly required for a check, avoid returning the node name in diagnostic messages. The node name is added by the logging code so including it in the messages is not necessary.
| * Remove redundant rpc callsMike Wallace2021-04-198-25/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The initial design of Riaknostic (and hence WeatherReport) was that checks ran in the script and made RPC calls to the local cluster node when needed. Due to a requirement to allow checks to run on multiple cluster nodes without guarantees of SSH connectivity, the design has changed such that each check is now run entirely on the cluster node. This means it is no longer necessary to use weatherreport_node:local_command/3 to interact with the cluster, the calls can just be made directly. This commit removes the redundant calls to weatherreport_node:local_command/3 which makes the intent of the code a bit clearer. BugzID: 34016
| * Remove escriptize from all targetMike Wallace2021-04-191-1/+1
| |
| * Remove unused function exportsPaul J. Davis2021-04-191-3/+1
| |
| * Remove packaged rebarPaul J. Davis2021-04-192-7/+7
| |
| * Remove the getopt dependencyPaul J. Davis2021-04-193-3/+624
| |
| * Remove twig dependencyPaul J. Davis2021-04-197-62/+88
| |
| * Remove rebar.config dependenciesPaul J. Davis2021-04-191-7/+1
| |
| * Diagnostic check for processes by memory usageMike Wallace2021-04-191-0/+57
| | | | | | | | | | | | | | This commit adds a diagnostic check that checks for processes where memory usage exceeds a set threshold (100MB). BugzID: 32875
| * Generalise threshold checks on process attributesMike Wallace2021-04-192-26/+37
| | | | | | | | | | | | | | | | | | | | | | | | This commit generalises the logic that checks for process attributes and moves it to weatherreport_utils. This is groundwork for the following commit which will add a check for processes by memory usage by re-using the generalised code. We also take the opportunity to rename fold_processes to something that is more descriptive and also correct. BugzID: 32875
| * Add diagnostic check for TCP recv and send queuesMike Wallace2021-04-191-0/+89
| | | | | | | | | | | | | | | | | | This commit adds a diagnostic check for the sum of the TCP send and recv queues in the OS kernel. If the total sum of all TCP send or recv queues exceeds a hardcoded threshold then a warning message is returned, otherwise an informational message. BugzID: 32881
| * Add check for whether node can be safely rebuiltMike Wallace2021-04-191-0/+116
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit adds a diagnostic check which determines whether a node can be safely rebuilt. If a node can be taken out of service without leaving any shards on the cluster with less than two live copies then it is considered safe and an info message is returned. If taking the node out of service would leave one or more shards with one live copy then an error message is returned. If zero live copies would be left then a critical message is returned. The bulk of work in this commit was authored in: cloudant/snippet@004537c28ae2c764c29c20e708681da7f542e21c BugzID: 33831
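The decision rule in this commit can be illustrated with a short Python sketch. It is not the Erlang implementation: the `shard_map` shape, the node names, and the function name `rebuild_safety` are hypothetical, and only the mapping from remaining live copies to message level comes from the commit message.

```python
def rebuild_safety(shard_map, live_nodes, node):
    """Return the message level for taking `node` out of service.

    shard_map: {shard_name: [nodes holding a copy]} (assumed shape).
    live_nodes: set of nodes currently up.
    """
    # Live copies each shard would keep once `node` is removed.
    remaining = [
        sum(1 for n in copies if n in live_nodes and n != node)
        for copies in shard_map.values()
    ]
    if not remaining or min(remaining) >= 2:
        return "info"      # every shard keeps at least two live copies
    if min(remaining) == 1:
        return "error"     # some shard would be left with one live copy
    return "critical"      # some shard would be left with no live copy


live = {"n1", "n2", "n3"}
shards = {
    "00000000-7fffffff": ["n1", "n2", "n3"],
    "80000000-ffffffff": ["n1", "n2"],
}
```

With these example inputs, removing `n3` leaves both shards with two live copies, while removing `n1` leaves the second shard with only one.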
| * Try to ensure writes to stdout are flushedMike Wallace2021-04-192-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | This commit adds a promisingly named function to weatherreport_util (flush_stdout/0) which attempts to ensure writes to stdout are flushed. This is called before the script exits in order to avoid an issue where the script output is truncated due to halt being called before stdout is flushed. The implementation of this function is the result of many hours of experimentation and is quite clearly a hack. BugzID: 33697
| * Failed checks are turned into diagnostic messagesMike Wallace2021-04-191-28/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | When failed checks occur (either bad RPC nodes or badrpc responses) they were previously logged directly and not included in the diagnostic messages. This meant that the script would exit with status 0 if there were no other diagnostic messages. This commit turns log messages for failed checks into diagnostic messages (with level crit) so they get processed in the same way as other diagnostic messages, i.e. printed in the correct place and influencing the exit status accordingly. BugzID: 33740
| * Run local checks via RPCMike Wallace2021-04-190-0/+0
| | | | | | | | | | | | | | | | | | | | | | | | This commit changes the run functionality such that when checks are run on the single-node they are called via RPC to the local cluster node. This resolves an issue where the file permissions check would run using the escript user rather than the dbcore user in single-node mode. BugzID: 33731
| * Run local checks via RPCMike Wallace2021-04-191-7/+2
| | | | | | | | | | | | | | | | | | | | | | | | This commit changes the run functionality such that when checks are run on the single-node they are called via RPC to the local cluster node. This resolves an issue where the file permissions check would run using the escript user rather than the dbcore user in single-node mode. BugzID: 33731
| * Improve logging of failed RPC callsMike Wallace2021-04-192-3/+22
| | | | | | | | | | | | | | | | | | Checks called via RPC that exceed the timeout value were being poorly logged. This commit adds the check name to the logging of bad nodes and adds explicit logging of badrpc responses which were previously being silently discarded. BugzID: 33697
| * Make timeout for check-over-RPC configurableMike Wallace2021-04-194-3/+18
| | | | | | | | | | | | | | | | | | | | This commit adds a --timeout option (short name -t) which specifies the timeout value used when running each diagnostic check via RPC (i.e. using the -a option). The default timeout is also changed from 5s to 5m. BugzID: 33697
| * WhitespaceMike Wallace2021-04-191-1/+0
| |
| * Handle cases where first/current call is undefinedMike Wallace2021-04-191-6/+13
| | | | | | | | | | | | | | | | | | | | | | The first or current call of functions returned by recon:show_current_call_counts/0 and recon:show_first_call_counts/0 can be undefined. This commit adds a function clause to fold_processes/5 which handles that case and a corresponding function clause to format/1 for formatting the message returned by the new clause. BugzID: 33696
| * Route rpc checks via local cluster nodeMike Wallace2021-04-192-1/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit routes rpc check calls via the local cluster node instead of making them directly from the escript. When rpc calls were made directly from the escript the escript would connect to cluster nodes without being hidden. This meant rexi_server and rexi_buffer processes would be spawned as well as various other undesirable outcomes. By making these rpc calls via our hidden connection to the local cluster node we avoid these problems. BugzID: 33695
| * Remove weatherreport_node:cluster_command funsMike Wallace2021-04-191-30/+0
| | | | | | | | | | Remove unused cluster_command functions from the weatherreport_node module.
| * Allow checks to be run across the clusterMike Wallace2021-04-1916-97/+245
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a `--all-nodes` option (short name `-a`) which causes the specified checks to be run on all cluster nodes. This is achieved by running weatherreport_check:check via rpc:multicall/5. This requires a number of supporting changes: 1. Log messages to the console now report the node that is the origin of a message, rather than the current node 2. Checks now learn about expert mode from supplied options rather than the application environment. This is because remote nodes will not have the same environment as the escript. 3. Additional logging in checks is converted to additional messages which are returned to the caller. BugzID: 33243
| * Optimize noatime checkMike Wallace2021-04-191-1/+1
| | | | | | | | | | | | | | | | This commit removes the need to sleep for one second when testing for noatime by forcing the atime to a specific date in the past, rather than using the current time and sleeping for 1001 ms. BugzID: 33261
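The technique of forcing the atime into the past, instead of sleeping, can be sketched in Python as below. This is only an illustration of the idea, not the Erlang check itself: the function name, the chosen date, and the use of a temporary file are assumptions.

```python
import os
import tempfile


def atime_updates_on_read(directory=None):
    """Return True if reading a file in `directory` changes its atime."""
    past = 946684800  # 2000-01-01 00:00:00 UTC, an arbitrary date in the past
    with tempfile.NamedTemporaryFile(dir=directory, delete=False) as f:
        f.write(b"x")
        path = f.name
    try:
        mtime = os.stat(path).st_mtime
        # Force the atime to a known past value and keep the mtime as it is.
        os.utime(path, (past, mtime))
        with open(path, "rb") as f:
            f.read()
        # If the atime still equals the forced value, reads did not update it.
        return os.stat(path).st_atime != past
    finally:
        os.remove(path)


updates = atime_updates_on_read()
```

Because the forced atime is far from the current time, any update is visible immediately and no delay is needed to tell the two timestamps apart.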
| * Add check for IOQ active requestsMike Wallace2021-04-191-0/+78
| | | | | | | | | | This commit adds a function that calculates the total number of requests in the IOQ disk queues by folding through the output of ioq:get_disk_queues/1. A warning message is returned if the total number of requests exceeds a hard-coded threshold, otherwise an info message is returned. The printed messages include the raw output of ioq:get_disk_queues/1. BugzID: 32880
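The fold over the disk queues might look like the following Python sketch. The queue shape (name paired with a list of pending requests), the queue names, and the threshold value are all assumptions made for illustration; the real output format of ioq:get_disk_queues/1 is not shown in this log.

```python
from functools import reduce

THRESHOLD = 500  # hypothetical; the commit only says the value is hard-coded


def check_ioq(disk_queues):
    """Sum pending requests across all queues and pick a message level.

    disk_queues: list of (queue_name, pending_requests) pairs (assumed shape).
    The raw queues are returned as well so they can be printed with the message.
    """
    total = reduce(lambda acc, queue: acc + len(queue[1]), disk_queues, 0)
    level = "warning" if total > THRESHOLD else "info"
    return level, total, disk_queues


queues = [("interactive", ["r1", "r2"]), ("compaction", ["r3"])]
level, total, raw = check_ioq(queues)
```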
| * Add check for processes by first/current callMike Wallace2021-04-191-0/+99
| | | | | | | | Add a diagnostic check for the number of processes sharing the same first and current function calls. This is useful from a diagnostic point of view because it gives an idea of the work being carried out on the node. BugzID: 32911
| * Add check for shard safety/livenessMike Wallace2021-04-191-0/+76
| | | | | | | | | | | | | | | | | | This commit adds a function that checks shard safety and liveness by calling custodian:report/0 on the target node and returning log messages at warning, error and critical when the number of safe/live shard copies is 2, 1 or 0 respectively. BugzID: 32914
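The mapping from copy counts to log levels can be written down as a small Python sketch. The report shape, shard names, and function name are hypothetical; only the 2, 1, 0 to warning, error, critical mapping is taken from the commit message.

```python
# Levels from the commit message: 2 copies -> warning, 1 -> error, 0 -> critical.
LEVEL_BY_COPIES = {2: "warning", 1: "error", 0: "critical"}


def shard_messages(report):
    """Turn (shard, copies) pairs into (level, shard, copies) messages.

    report: list of (shard_name, safe_or_live_copies) pairs (assumed shape).
    Shards with three or more copies produce no message.
    """
    return [
        (LEVEL_BY_COPIES[copies], shard, copies)
        for shard, copies in report
        if copies in LEVEL_BY_COPIES
    ]


report = [("shard-a", 3), ("shard-b", 2), ("shard-c", 0)]
messages = shard_messages(report)
```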
| * Add check for search availabilityMike Wallace2021-04-191-0/+57
| | | | | | | | | | | | | | Add a diagnostic check which verifies the local search node is available using `net_adm:ping/1`. BugzID: 32909
| * Fix unused variable warningMike Wallace2021-04-191-1/+1
| |
| * Improve default etc directoryMike Wallace2021-04-191-1/+2
| | | | | | | | | | | | | | | | | | This commit sets the default etc directory location relative to the path of the escript. This will be the correct location when the weatherreport escript is included in the bin directory of a release. BugzID: 32856
| * Include the node name when logging to stdoutMike Wallace2021-04-191-2/+9
| | | | | | | | BugzID: 33172