Commit message log
https://github.com/apache/couchdb-erlfdb/releases/tag/v1.2.2
In an overload scenario, do not let notifiers crash and lose their subscribers;
instead, make them more robust and let them retry on future or transaction
timeouts.
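The retry behavior can be sketched as follows. Python is used purely to illustrate the Erlang logic; the exception classes and the `notify_subscribers` helper are hypothetical stand-ins for erlfdb future and transaction timeout errors, not real APIs.

```python
# Sketch of the retry-instead-of-crash behavior described above.
# FutureTimeout/TransactionTimeout are stand-ins for erlfdb future
# and transaction timeout errors.
class FutureTimeout(Exception): pass
class TransactionTimeout(Exception): pass

def notify_subscribers(fetch_update, deliver, max_retries=5):
    """Keep the notifier alive under overload: retry on timeouts instead of
    crashing and losing the subscriber list."""
    for _attempt in range(max_retries):
        try:
            deliver(fetch_update())
            return True
        except (FutureTimeout, TransactionTimeout):
            continue  # transient overload; try again
    return False  # give up on this update, but the notifier itself survives
```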
Optimize couch_views by using a separate set of acceptors and workers.
Previously, all `max_workers` were spawned on startup and waited to
accept jobs in parallel. In a setup with a large number of pods, and
100 workers per pod, that could generate a lot of conflicts when all
those workers race to accept the same job at the same time.
The improvement is to spawn only a limited number of acceptors (5 by
default), then spawn more after some of them become workers. Also,
when some workers finish or die with an error, check whether more
acceptors can be spawned.
As an example, here is what might happen with `max_acceptors = 5` and
`max_workers = 100` (`A` and `W` are the current counts of acceptors
and workers, respectively):
1. Starting out:
`A = 5, W = 0`
2. After 2 acceptors start running:
`A = 3, W = 2`
Then immediately 2 more acceptors are spawned:
`A = 5, W = 2`
3. After 95 workers have started:
`A = 5, W = 95`
4. Now if 3 acceptors accept, it would look like:
`A = 2, W = 98`
But no more acceptors would be started.
5. If the last 2 acceptors also accept jobs: `A = 0, W = 100`. At this
point no more indexing jobs can be accepted and started until at
least one of the workers finishes and exits.
6. If 1 worker exits:
`A = 0, W = 99`
An acceptor is immediately spawned:
`A = 1, W = 99`
7. If all 99 workers exit, it goes back to:
`A = 5, W = 0`
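The accounting in the steps above can be modeled minimally as follows. Python is used for illustration only; the `Pool` class and its method names are hypothetical, not the couch_views API.

```python
# Illustrative model of the acceptor/worker accounting described above.
class Pool:
    def __init__(self, max_acceptors=5, max_workers=100):
        self.max_acceptors = max_acceptors
        self.max_workers = max_workers
        self.acceptors = 0
        self.workers = 0
        self.spawn_acceptors()

    def spawn_acceptors(self):
        # Spawn acceptors only while the combined total stays under max_workers.
        while (self.acceptors < self.max_acceptors
               and self.acceptors + self.workers < self.max_workers):
            self.acceptors += 1

    def accept(self):
        # An acceptor picks up a job and becomes a worker.
        if self.acceptors == 0:
            return False
        self.acceptors -= 1
        self.workers += 1
        self.spawn_acceptors()
        return True

    def worker_exit(self):
        # A worker finishes (or dies); check if more acceptors can start.
        self.workers -= 1
        self.spawn_acceptors()
```

Running the worked example through this model reproduces each step: 95 accepts leave `A = 5, W = 95`, five more drain the acceptors to `A = 0, W = 100`, and worker exits refill acceptors one at a time.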
As per the ML
[discussion](https://lists.apache.org/thread.html/rb328513fb932e231cf8793f92dd1cc2269044cb73cb43a6662c464a1%40%3Cdev.couchdb.apache.org%3E),
add a `uuid` field to db info results in order to be able to uniquely identify
a particular instance of a database. When a database is deleted and re-created
with the same name, it will return a new `uuid` value.
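A client could use the new field to detect a delete/re-create cycle. A tiny sketch follows; the `get_db_info` callable is a hypothetical stand-in for fetching db info over HTTP.

```python
# Hypothetical client-side check using the new `uuid` field in db info.
def same_db_instance(saved_uuid, get_db_info):
    """Return True if the database is still the same instance the caller
    previously observed, False if it was deleted and re-created."""
    return get_db_info()["uuid"] == saved_uuid
```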
When waiting to accept jobs with scheduling enabled, the timeout is limited
based on the scheduled-time parameter. When the `no_schedule` option is used,
the scheduled-time parameter is always set to 0, so in that case we have to
special-case the limit to return `infinity`.
Later on, when we wait for the watch to fire, the actual timeout can still be
limited by a separate user-specified timeout option; but if the user specifies
`infinity` there and sets `#{no_schedule => true}`, then we should respect that
and never return `{error, not_found}` in response.
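The special-casing can be modeled as below. This is a sketch only: Python's `float("inf")` stands in for Erlang's `infinity` atom, and both function names are illustrative, not the couch_jobs API.

```python
# Model of the timeout-limit logic described above.
INFINITY = float("inf")

def accept_timeout_limit(scheduled_time, no_schedule):
    # With no_schedule, scheduled_time is always 0, so a time-based limit
    # is meaningless; return infinity instead of limiting the wait.
    if no_schedule:
        return INFINITY
    return scheduled_time

def effective_timeout(user_timeout, scheduled_time, no_schedule):
    # The watch wait is still bounded by the user-specified timeout, unless
    # the user passes infinity together with no_schedule => true.
    return min(user_timeout, accept_timeout_limit(scheduled_time, no_schedule))
```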
Use the `no_schedule` option to speed up job dequeuing. This optimization
allows dequeuing jobs more efficiently if these conditions are met:
1) Job IDs start with a random prefix
2) No time-based scheduling is used
Both of those can be true for views: job IDs can be generated such that the
signature comes before the db name part, which is what this commit does.
The way the optimization works is that a random ID is generated in the pending
jobs range, then a key selector is used to pick a job either before or after
it. That reduces each dequeue attempt to just 1 read instead of reading up to
1000 jobs.
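One way to picture the key-selection trick in miniature (a sketch, not the actual couch_jobs code; Python's `bisect` over a sorted list stands in for FDB key selectors over the pending range):

```python
# Sketch: instead of scanning up to 1000 pending jobs, probe a random key in
# the pending range and grab the nearest job at-or-after it (wrapping to the
# one before it at the end of the range).
import bisect
import random

def dequeue_one(pending_ids):
    """pending_ids: sorted list of job IDs that start with random prefixes."""
    if not pending_ids:
        return None
    probe = format(random.getrandbits(32), "08x")  # random point in key space
    i = bisect.bisect_left(pending_ids, probe)
    # One "read": the job just after the probe, else the one just before it.
    return pending_ids.pop(i if i < len(pending_ids) else i - 1)
```

Because IDs are uniformly random, the probe lands anywhere in the pending range with equal probability, so concurrent dequeuers rarely pick the same job.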
Under load, the accept loop can blow up with a timeout error from
`erlfdb:wait(...)` (https://github.com/apache/couchdb-erlfdb/blob/master/src/erlfdb.erl#L255),
so guard against it just like we do for FDB transaction timeout (1031) errors.
Update db handles right away as soon as the db version is checked. This ensures
concurrent requests get access to the current handle as soon as possible, and
may avoid doing extra version checks and re-opens.
Check metadata versions to ensure newer handles are not clobbered. The same
thing is done for removal: `maybe_remove/1` removes a handle only if there
isn't a newer handle already there.
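The version check can be modeled like this. An illustrative sketch only: the cache is a plain dict, `maybe_update` is a hypothetical name, and only the `maybe_remove` semantics are taken from the description above.

```python
# Model of a version-checked handle cache: an update only replaces the cached
# handle if its metadata version is not older, and removal only evicts when
# no newer handle has been stored in the meantime.
cache = {}

def maybe_update(db, handle, version):
    cur = cache.get(db)
    if cur is None or cur[1] <= version:
        cache[db] = (handle, version)
        return True
    return False  # a newer handle is already cached; don't clobber it

def maybe_remove(db, version):
    cur = cache.get(db)
    if cur is not None and cur[1] <= version:
        del cache[db]
        return True
    return False  # a newer handle appeared; leave it in place
```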
Also guard against too many conflicts during overload.
Let users specify the maximum document count for _bulk_docs requests. If the
document count exceeds the maximum, a 413 HTTP error is returned. This also
signals the replicator to try to bisect the _bulk_docs array into smaller
batches.
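The 413-plus-bisect interplay might look like this in miniature. Both functions are illustrative sketches (assuming `max_count >= 1`), not the CouchDB or replicator API.

```python
# Sketch: the server rejects oversized _bulk_docs requests with a 413, and a
# replicator-style client reacts by bisecting the batch and retrying.
def bulk_docs(docs, max_count):
    if len(docs) > max_count:
        return 413, None           # request entity too large
    return 201, list(docs)         # accepted

def replicate(docs, max_count):
    status, saved = bulk_docs(docs, max_count)
    if status != 413:
        return saved
    mid = len(docs) // 2           # bisect and retry each half
    return replicate(docs[:mid], max_count) + replicate(docs[mid:], max_count)
```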
Observed a number of timeouts with the previous default
In a constrained CI environment transactions could retry multiple times so we
cannot rely on precisely counting erlfdb:transactional/2 calls.
Increase the couch_views job timeout by 20 seconds. This sets a larger jitter
when multiple nodes concurrently check and re-enqueue jobs, reducing the
chance of them bumping into each other and conflicting.
If they do conflict in the activity monitor, catch the error and emit an error
log. The longer timeout also gains some robustness under load when jobs whose
workers have suddenly died are re-enqueued.
Fix handling of limit query parameter
Improve log of permanently deleting databases
* Interactive (regular) requests are split into smaller transactions, so
larger updates won't fail with either timeout or transaction-too-large
FDB errors.
* Non-interactive (replicated) requests can now batch their updates in a few
transactions and gain extra performance.
Batch size is configurable:
```
[fabric]
update_docs_batch_size = 5000000
```
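Assuming the limit is an approximate byte size (as the 5000000 default suggests), the batching could be sketched as below; `doc_size` is a hypothetical stand-in for the encoded document size, and none of this is the actual fabric code.

```python
# Illustrative size-based batching of replicated doc updates: accumulate docs
# until adding the next one would push the batch over the configured limit,
# then flush so each transaction stays under it.
def batch_docs(docs, batch_size, doc_size=len):
    batches, cur, cur_size = [], [], 0
    for doc in docs:
        size = doc_size(doc)
        if cur and cur_size + size > batch_size:
            batches.append(cur)          # flush the current transaction batch
            cur, cur_size = [], 0
        cur.append(doc)
        cur_size += size
    if cur:
        batches.append(cur)
    return batches
```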
Sometimes this test fails on Jenkins but doesn't fail locally. The attempted
fix is to simply retry a few times until the number of children in the
supervisor reaches the expected value. Also extend the timeout to 15 seconds.
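The retry shape this fix relies on can be sketched as a generic wait-until helper. Illustrative only, not the actual test_util API.

```python
# Poll a condition until it holds or a timeout elapses, instead of asserting
# immediately; this absorbs CI timing variability.
import time

def wait_until(condition, timeout=15.0, interval=0.05):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return False
```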
Pagination API
Add some longer timeouts and fix a race condition in db cleanup tests
(Thanks to @jdoane for the patch)
Background database deletion
Allow a background job to delete soft-deleted databases according to
specified criteria in order to release space. Once a database is hard-deleted,
its data can't be fetched back.
Co-authored-by: Nick Vatamaniuc <vatamane@apache.org>
Previously we always returned `false` because the result from
`couch_jobs:get_job_state` was expected to be just `Status`, but it is `{ok,
Status}`. That part is now explicit, so we account for every possible job
state and fail on a clause match if we get anything else there.
Moved the `job_state/2` function to `couch_view_jobs` to avoid duplicating the
logic for calculating the job_id and to keep it all in one module.
Tests were updated to explicitly check for each job state.
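The corrected pattern can be modeled as below. A sketch only: tuples stand in for Erlang terms, the explicit failure mirrors the clause-match behavior described above, and the set of state names is an assumption.

```python
# Unpack the {ok, Status} result explicitly and fail loudly on anything
# unexpected, rather than comparing the raw tuple to a bare status value
# (which always compared unequal and silently returned false).
KNOWN_STATES = {"pending", "running", "finished"}

def is_job_finished(get_job_state_result):
    tag, status = get_job_state_result  # raises if the shape is unexpected
    if tag != "ok" or status not in KNOWN_STATES:
        raise ValueError(f"unexpected job state result: {get_job_state_result!r}")
    return status == "finished"
```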
Re-enable ExUnit tests
On CI, creating 100 dbs in a row was too much to do in 5 seconds, so bump the
timeout to 15.
In its current incarnation, the so-called "simple lifecycle" test is
prone to numerous failures in the CI system [1], doubtless because it's
riddled with race conditions. The original author makes many assumptions
about how quickly an (actual, unmocked) FDB instance will respond to a
request.
The primary goal is to stop failing CI builds, while other
considerations include: keeping the run time of the test as low as
possible, keeping the code coverage high, and documenting the known
races.
Specifically:
- Increase the `stale` and `expired` times by a factor of 5 to decrease
sensitivity to poor FDB performance.
- Change default timer from `erlang:system_time/1` to `os:timestamp` on
the assumption that the latter is less prone to warping [2].
- Decrease the period of the cache server reaper by half to increase
accuracy of eviction time.
- Inline and modify the `test_util:wait` code to make the timer
explicit, and emphasize that `timer:delay/1` only works with millisecond
resolution.
- Don't fail the test if it can't get a fresh lookup immediately after
insertion, but let it continue on to the next race, at least to the
point of expiration and deletion, which continue to be asserted.
- Factor `Timeout` and `Interval` to allow declarations near the other
hard-coded parameters.
- Move cache server `Opts` into `setup/0` and eliminate `start_link/0`.
- Double the overall test timeout to 20 seconds.
This has soaked for hundreds of runs on a 5 year old laptop, but the
real test is the CI system.
Should this test continue to fail CI builds, additional improvements
could include mocking the timer and/or FDB layer to eliminate the
variability of an integrated system.
[1] https://ci-couchdb.apache.org/blue/organizations/jenkins/jenkins-cm1%2FPullRequests/detail/PR-2813/10/pipeline
[2] http://erlang.org/doc/apps/erts/time_correction.html#terminology
Add an extra level of error checking to erlfdb:set_option, since it could fail
if we forget to update the erlfdb dependency or if the FDB server version is
too old. That operation can fail with error:badarg, which is exactly how
list_to_integer fails, resulting in a confusing log message.