| Commit message | Author | Age | Files | Lines |
| |
|
| |
|
| |
|
|
|
|
|
| |
`fabric:get_doc_info/3` requires three arguments, but this line was
only passing one.
|
|\
| |
| | |
Import weatherreport
|
| | |
|
| | |
|
| | |
|
| | |
|
| | |
|
| | |
|
| |
| |
| |
| |
| | |
In a stock CouchDB installation, search is disabled by default, so a
failure to connect to clouseau should only be a warning.
|
| |
| |
| |
| |
| | |
weatherreport previously relied on Cloudant's IOQ implementation.
This adds support for the default IOQ so that it works with either.
|
| |
| |
| |
| | |
Port fork of custom `recon` functions for checking process calls.
|
| |\
|/ /
| |
| | |
weatherreport
|
| |
| |
| |
| | |
BugzID: 45855
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This commit downgrades messages about process call counts to
notice level.
The previous level of warning was inappropriate for many
processes as it really just indicated normal operation of the
system.
Until this check is made a bit smarter, with custom thresholds
per MFA, we will log at nothing more severe than notice so that
we don't confuse operators into thinking something is wrong.
BugzID: 37593
|
| | |
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Check the absolute statistics obtained by recon:node_stats/4
over a one second period. The values are sampled ten times and
the mean is returned.
For run_queue and process_count the mean is compared to
hard-coded thresholds which determine whether a warning or
info message is returned. For all other statistics an info
message is always returned.
BugzID: 32877
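
The sampling logic described above can be sketched as follows. This is an illustrative Python sketch, not the Erlang implementation: the `sample` callable stands in for a call to recon:node_stats/4, and the threshold values are hypothetical because the commit message does not state them.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

# Hypothetical thresholds; the real hard-coded values are not given in the log.
THRESHOLDS = {"run_queue": 10, "process_count": 100_000}


def check_node_stats(
    sample: Callable[[], Dict[str, float]], samples: int = 10
) -> List[Tuple[str, str, float]]:
    """Take `samples` readings and return (level, statistic, mean) per statistic."""
    readings = [sample() for _ in range(samples)]
    messages = []
    for name in sorted(readings[0]):
        avg = mean(reading[name] for reading in readings)
        limit = THRESHOLDS.get(name)
        # Only statistics with a threshold can produce a warning;
        # everything else is always reported at info level.
        level = "warning" if limit is not None and avg > limit else "info"
        messages.append((level, name, avg))
    return messages
```

A real check would space the readings across the one-second window; the sketch omits the delay to stay short.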
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This commit adds a check for the number of pending internal
replication jobs on a node. A large number of pending internal
replication jobs indicates that internal replication is falling
behind. A warning message is returned if the number of jobs
exceeds a hard-coded threshold, otherwise an info message is
returned.
BugzID: 32872
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Custodian will output a message if there are one or more
conflicting revisions for a shard map. Previously
weatherreport_check_custodian failed to match that message which
caused the check to fail.
This commit adds new function clauses to report_to_message/1 and
format/1 so that the conflict messages are output to the console
at the warning log level.
BugzID: 34157
|
| | |
|
| |
| |
| |
| |
| |
| | |
Unless it is explicitly required for a check, avoid returning
the node name in diagnostic messages. The node name is added by
the logging code so including it in the messages is not necessary.
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The initial design of Riaknostic (and hence WeatherReport) was
that checks ran in the script and made RPC calls to the local
cluster node when needed. Due to a requirement to allow checks
to run on multiple cluster nodes without guarantees of SSH
connectivity, the design has changed such that each check is
now run entirely on the cluster node.
This means it is no longer necessary to use
weatherreport_node:local_command/3 to interact with the cluster;
the calls can just be made directly.
This commit removes the redundant calls to
weatherreport_node:local_command/3 which makes the intent of the
code a bit clearer.
BugzID: 34016
|
| | |
|
| | |
|
| | |
|
| | |
|
| | |
|
| | |
|
| |
| |
| |
| |
| |
| |
| | |
This commit adds a diagnostic check for processes whose memory
usage exceeds a set threshold (100MB).
BugzID: 32875
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This commit generalises the logic that checks for process
attributes and moves it to weatherreport_util. This is
groundwork for the following commit which will add a check
for processes by memory usage by re-using the generalised code.
We also take the opportunity to rename fold_processes to
something that is more descriptive and also correct.
BugzID: 32875
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This commit adds a diagnostic check for the sum of the TCP send
and recv queues in the OS kernel. If the total sum of all TCP
send or recv queues exceeds a hardcoded threshold then a warning
message is returned, otherwise an informational message.
BugzID: 32881
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This commit adds a diagnostic check which determines whether a
node can be safely rebuilt. If a node can be taken out of service
without leaving any shards on the cluster with fewer than two live
copies then it is considered safe and an info message is returned.
If taking the node out of service would leave one or more shards
with one live copy then an error message is returned. If zero live
copies would be left then a critical message is returned.
The bulk of work in this commit was authored in:
cloudant/snippet@004537c28ae2c764c29c20e708681da7f542e21c
BugzID: 33831
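
The decision rule above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the code from the referenced snippet: the shard map is assumed to be a mapping from shard name to the set of nodes holding a copy, and `safe_to_rebuild` is a hypothetical name.

```python
from typing import Dict, List, Set, Tuple


def safe_to_rebuild(
    shards: Dict[str, Set[str]], live_nodes: Set[str], node: str
) -> Tuple[str, List[str]]:
    """Return (level, affected shards) for taking `node` out of service."""
    remaining = live_nodes - {node}
    worst = 2  # two or more live copies everywhere is considered safe
    affected = []
    for shard, hosts in sorted(shards.items()):
        copies = len(hosts & remaining)
        if copies < 2:
            affected.append(shard)
            worst = min(worst, copies)
    # 2+ copies -> info, 1 copy -> error, 0 copies -> critical
    level = {2: "info", 1: "error", 0: "critical"}[worst]
    return level, affected
```

The reported level is driven by the worst-off shard, so a single shard with zero remaining copies makes the whole result critical.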
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This commit adds a promisingly named function to weatherreport_util
(flush_stdout/0) which attempts to ensure writes to stdout are
flushed. This is called before the script exits in order to avoid
an issue where the script output is truncated due to halt being
called before stdout is flushed.
The implementation of this function is the result of many hours
of experimentation and is quite clearly a hack.
BugzID: 33697
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Failed checks (either bad RPC nodes or badrpc responses) were
previously logged directly and not included in the diagnostic
messages. This meant that the script would exit with status 0 if
there were no other diagnostic messages.
This commit turns log messages for failed checks into diagnostic
messages (with level crit) so they get processed in the same way
as other diagnostic messages, i.e. printed in the correct place
and influencing the exit status accordingly.
BugzID: 33740
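
The conversion described above can be sketched as follows. This is an illustrative Python sketch, not the Erlang implementation. It assumes each RPC result arrives paired with its node name, that a failure is a `("badrpc", reason)` tuple, and that any diagnostic message yields a non-zero exit status; the function names are hypothetical.

```python
from typing import Any, List, Tuple

Message = Tuple[str, str, str]  # (node, level, text)


def to_messages(
    check: str, results: List[Tuple[str, Any]], bad_nodes: List[str]
) -> List[Message]:
    """Fold normal results, badrpc responses and bad nodes into one list."""
    messages: List[Message] = []
    for node, result in results:
        if isinstance(result, tuple) and result and result[0] == "badrpc":
            # A failed call becomes a crit message instead of a log line.
            messages.append((node, "crit", f"{check} failed: {result[1]!r}"))
        else:
            messages.extend((node, level, text) for level, text in result)
    for node in bad_nodes:
        messages.append((node, "crit", f"{check} got no response"))
    return messages


def exit_status(messages: List[Message]) -> int:
    """Assumed rule: exit 0 only when there are no diagnostic messages."""
    return 1 if messages else 0
```

Because failures travel through the same list as ordinary messages, they are printed in order with everything else and can no longer leave the exit status at 0.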
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This commit changes the run functionality such that when checks
are run on a single node they are called via RPC to the local
cluster node.
This resolves an issue where the file permissions check would run
using the escript user rather than the dbcore user in single-node
mode.
BugzID: 33731
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This commit changes the run functionality such that when checks
are run on a single node they are called via RPC to the local
cluster node.
This resolves an issue where the file permissions check would run
using the escript user rather than the dbcore user in single-node
mode.
BugzID: 33731
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Checks called via RPC that exceed the timeout value were being
poorly logged. This commit adds the check name to the logging
of bad nodes and adds explicit logging of badrpc responses
which were previously being silently discarded.
BugzID: 33697
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This commit adds a --timeout option (short name -t) which specifies
the timeout value used when running each diagnostic check via RPC
(i.e. using the -a option).
The default timeout is also changed from 5s to 5m.
BugzID: 33697
|
| | |
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The first or current call of functions returned by
recon:show_current_call_counts/0 and recon:show_first_call_counts/0
can be undefined. This commit adds a function clause to
fold_processes/5 which handles that case and a corresponding
function clause to format/1 for formatting the message returned by
the new clause.
BugzID: 33696
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This commit routes rpc check calls via the local cluster node
instead of making them directly from the escript.
When rpc calls were made directly from the escript, the escript
would connect to cluster nodes without being hidden. This caused
rexi_server and rexi_buffer processes to be spawned, among other
undesirable outcomes. By making these rpc calls via our hidden
connection to the local cluster node we avoid these problems.
BugzID: 33695
|
| |
| |
| |
| |
| | |
Remove unused cluster_command functions from the
weatherreport_node module.
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Add a `--all-nodes` option (short name `-a`) which causes the
specified checks to be run on all cluster nodes. This is achieved
by running weatherreport_check:check via rpc:multicall/5.
This requires a number of supporting changes:
1. Log messages to the console now report the node that is
the origin of a message, rather than the current node.
2. Checks now learn about expert mode from supplied options
rather than the application environment. This is because
remote nodes will not have the same environment as the escript.
3. Additional logging in checks is converted to additional
messages which are returned to the caller.
BugzID: 33243
|
| |
| |
| |
| |
| |
| |
| |
| | |
This commit removes the need to sleep for one second when testing
for noatime by forcing the atime to a specific date in the past,
rather than using the current time and sleeping for 1001 ms.
BugzID: 33261
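
The technique above can be sketched as follows. This is an illustrative Python sketch of the idea, not the Erlang check itself: the function name and the chosen past date are assumptions.

```python
import os
import tempfile

# Arbitrary date in the past (2000-01-01T00:00:00Z), chosen for illustration.
PAST = 946684800


def atime_updated_on_read(directory: str) -> bool:
    """Return True if reading a file in `directory` updates its access time."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(b"x")
        mtime = os.stat(path).st_mtime
        # Force atime into the past so any update is visible immediately,
        # with no need to sleep until the clock ticks over.
        os.utime(path, (PAST, mtime))
        with open(path, "rb") as handle:
            handle.read()
        return os.stat(path).st_atime > PAST
    finally:
        os.remove(path)
```

A `False` result suggests the filesystem is mounted with noatime; a `True` result means access times are still being written.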
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This commit adds a function that calculates the total number of
requests in the IOQ disk queues by folding through the output
of ioq:get_disk_queues/1. A warning message is returned if the
total number of requests exceeds a hard-coded threshold, otherwise
an info message is returned. The printed messages include the raw
output of ioq:get_disk_queues/1.
BugzID: 32880
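
The fold described above can be sketched as follows. This is an illustrative Python sketch, not the Erlang implementation: the shape of the ioq:get_disk_queues/1 output is assumed here to be a mapping from queue name to a list of pending requests, and the threshold value is hypothetical.

```python
from functools import reduce
from typing import Dict, List, Tuple

# Hypothetical threshold; the real hard-coded value is not given in the log.
THRESHOLD = 500


def check_ioq(queues: Dict[str, List[object]]) -> Tuple[str, int, Dict[str, List[object]]]:
    """Sum pending requests over all queues and pick a message level."""
    total = reduce(lambda acc, requests: acc + len(requests), queues.values(), 0)
    level = "warning" if total > THRESHOLD else "info"
    # The raw queue data is returned alongside the total so that the
    # printed message can include it.
    return level, total, queues
```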
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Add a diagnostic check for the number of processes sharing the same
first and current function calls. This is useful from a diagnostic
point of view because it gives an idea of the work being carried
out on the node.
BugzID: 32911
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This commit adds a function that checks shard safety and liveness
by calling custodian:report/0 on the target node and returning
log messages at the warning, error and critical levels when the
number of safe/live shard copies is 2, 1 or 0 respectively.
BugzID: 32914
|
| |
| |
| |
| |
| |
| |
| | |
Add a diagnostic check which verifies the local search node is
available using `net_adm:ping/1`.
BugzID: 32909
|