Commit message | Author | Age | Files | Lines
* add multiple indexer active task test [add_active_tasks_fdb] | Tony Sun | 2020-07-23 | 1 | -5/+29
* use filtermap | Tony Sun | 2020-07-23 | 1 | -6/+22
* formatting | Tony Sun | 2020-07-23 | 2 | -3/+2
* add fabric verification to test | Tony Sun | 2020-07-23 | 1 | -3/+7
* scrub extra info from get_active_job_ids | Tony Sun | 2020-07-23 | 2 | -17/+25
* fix test | Tony Sun | 2020-07-23 | 1 | -0/+127
* encapsulate <<"active_task_info">> to fabric | Tony Sun | 2020-07-23 | 4 | -25/+35
* move active_task_info into util | Tony Sun | 2020-07-23 | 2 | -26/+28
* add spec | Tony Sun | 2020-07-23 | 1 | -0/+2
* revert to using job_data | Tony Sun | 2020-07-22 | 6 | -119/+38
* add get_active_jobs to couch_jobs | Tony Sun | 2020-07-21 | 1 | -0/+7
* add active_tasks for view builds using version stamps | Tony Sun | 2020-07-21 | 8 | -12/+162
  Active Tasks requires TotalChanges and ChangesDone to show the progress of
  long-running tasks. That requires count_changes_since to be implemented,
  which unfortunately is not easily done with FoundationDB. This commit
  replaces TotalChanges with the versionstamp plus the number of docs as a
  progress indicator. This can possibly break existing APIs that rely on
  TotalChanges. ChangesDone still exists, but instead of being derived from
  the current changes seq it is simply a count of how many documents were
  written by the updater process.
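  A minimal sketch of the idea, with illustrative names (the real change
  spans several modules): progress is stored in the indexing job's data,
  with the versionstamp standing in for a total-changes count. Only
  `couch_jobs:update/3` is the real API here.

  ```
  %% Sketch only: field names and the caller are stand-ins. Progress is
  %% kept in the job's data so it can be surfaced via _active_tasks.
  report_progress(JTx, Job, JobData, DocsWritten, VersionStamp) ->
      Data = JobData#{
          <<"active_task_info">> => #{
              <<"changes_done">> => DocsWritten,     %% docs written so far
              <<"total_changes">> => VersionStamp    %% versionstamp stand-in
          }
      },
      {ok, _Job1} = couch_jobs:update(JTx, Job, Data).
  ```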
* Merge pull request #2960 from cloudant/add-max_bulk_get_count | iilyak | 2020-06-23 | 4 | -1/+34
  Add max_bulk_get_count configuration option
* Add max_bulk_get_count configuration option | ILYA Khlopotov | 2020-06-22 | 4 | -1/+34
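  A plausible way to set the new option; the `[couchdb]` section and the
  default value shown here are assumptions, so check default.ini for the
  authoritative location:

  ```
  [couchdb]
  ; maximum number of documents accepted in one _bulk_get request (assumed)
  max_bulk_get_count = 10000
  ```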
* Reserve aegis namespace under ?CLUSTER_CONFIG | Eric Avdey | 2020-06-17 | 1 | -0/+4
* add back r and w options | Tony Sun | 2020-06-12 | 1 | -0/+12
* Bump erlfdb to v1.2.2 | Nick Vatamaniuc | 2020-06-12 | 1 | -1/+1
  https://github.com/apache/couchdb-erlfdb/releases/tag/v1.2.2
* Handle transaction and future timeouts in couch_jobs notifiers | Nick Vatamaniuc | 2020-06-10 | 1 | -1/+10
  In an overload scenario, do not let notifiers crash and lose their
  subscribers; instead, make them more robust and let them retry on future
  or transaction timeouts.
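  A minimal sketch of the guard, with illustrative names
  (`check_subscribers/1` is a stand-in for the notifier's periodic work):
  catch both timeout shapes and keep the loop alive.

  ```
  %% Sketch only. 1031 is FDB's transaction_timed_out error code.
  notifier_loop(St) ->
      St1 = try
          check_subscribers(St)
      catch
          error:{timeout, _} -> St;           %% erlfdb:wait/1 future timeout
          error:{erlfdb_error, 1031} -> St    %% FDB transaction timed out
      end,
      notifier_loop(St1).
  ```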
* Split couch_views acceptors and workers | Nick Vatamaniuc | 2020-06-08 | 5 | -22/+309
  Optimize couch_views by using a separate set of acceptors and workers.
  Previously, all `max_workers` were spawned on startup and were waiting to
  accept jobs in parallel. In a setup with a large number of pods, and 100
  workers per pod, that could lead to a lot of conflicts being generated
  when all those workers race to accept the same job at the same time.

  The improvement is to spawn only a limited number of acceptors (5, by
  default), then spawn more after some of them become workers. Also, when
  some workers finish or die with an error, check whether more acceptors
  could be spawned.

  As an example, here is what might happen with `max_acceptors = 5` and
  `max_workers = 100` (`A` and `W` are the current counts of acceptors and
  workers, respectively):

  1. Starting out: `A = 5, W = 0`
  2. After 2 acceptors start running: `A = 3, W = 2`. Immediately, 2 more
     acceptors are spawned: `A = 5, W = 2`
  3. After 95 workers are started: `A = 5, W = 95`
  4. Now if 3 acceptors accept, it would look like: `A = 2, W = 98`, but no
     more acceptors would be started.
  5. If the last 2 acceptors also accept jobs: `A = 0, W = 100`. At this
     point no more indexing jobs can be accepted and started until at least
     one of the workers finishes and exits.
  6. If 1 worker exits: `A = 0, W = 99`, and an acceptor is immediately
     spawned: `A = 1, W = 99`
  7. If all 99 workers exit, it goes back to: `A = 5, W = 0`
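  A minimal sketch of the top-up rule implied by that walkthrough (names
  are illustrative, `start_acceptor/1` is a stand-in): never exceed
  `max_acceptors` pending acceptors, and never let acceptors plus workers
  exceed `max_workers`.

  ```
  %% Sketch only: called on startup and whenever a worker exits or an
  %% acceptor becomes a worker.
  spawn_acceptors(#{acceptors := A, workers := W,
                    max_acceptors := MaxA, max_workers := MaxW} = St) ->
      ToSpawn = max(0, min(MaxA - A, MaxW - (A + W))),
      lists:foldl(fun(_, StAcc) -> start_acceptor(StAcc) end,
                  St, lists:seq(1, ToSpawn)).
  ```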
* Include database uuid in db info result | Nick Vatamaniuc | 2020-06-04 | 3 | -5/+16
  As per the ML [discussion](https://lists.apache.org/thread.html/rb328513fb932e231cf8793f92dd1cc2269044cb73cb43a6662c464a1%40%3Cdev.couchdb.apache.org%3E),
  add a `uuid` field to db info results in order to be able to uniquely
  identify a particular instance of a database. When a database is deleted
  and re-created with the same name, it will return a new `uuid` value.
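  For illustration, a db info response would then carry the new field
  alongside the existing ones (the field values below are made up, and the
  other fields are elided):

  ```
  GET /mydb
  {"db_name": "mydb", "doc_count": 42, ...,
   "uuid": "6eb8a833c1c84c5ca65459aac5ba2bcb"}
  ```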
* Fix couch_jobs accept timeout when no_schedule option is used | Nick Vatamaniuc | 2020-06-03 | 1 | -8/+11
  When waiting to accept jobs with scheduling in use, the timeout is
  limited based on the scheduled-time parameter. When the no_schedule
  option is used, the scheduled-time parameter is always set to 0, so in
  that case we have to special-case the limit to return `infinity`. Later
  on, when we wait for the watch to fire, the actual timeout can still be
  limited by a separate user-specified timeout option; but if the user
  specifies `infinity` there and sets `#{no_schedule => true}`, then we
  should respect that and never return `{error, not_found}` in response.
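  A minimal sketch of the special case (illustrative names; the derivation
  of the non-infinity limit is an assumption):

  ```
  %% Sketch only: with no_schedule the scheduled time no longer bounds the
  %% accept timeout, so the caller's own timeout option is the only limit.
  limit_timeout(#{no_schedule := true}, _MaxSchedTime) ->
      infinity;
  limit_timeout(_Opts, MaxSchedTime) ->
      max(0, MaxSchedTime - erlang:system_time(second)).
  ```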
* Improve efficiency of couch_jobs:accept/2 for views | Nick Vatamaniuc | 2020-06-02 | 3 | -3/+6
  Use the `no_schedule` option to speed up job dequeuing. This optimization
  allows dequeuing jobs more efficiently if these conditions are met:

  1) Job IDs start with a random prefix
  2) No time-based scheduling is used

  Both of these can hold for views: job IDs can be generated such that the
  signature comes before the db name part, which is what this commit does.

  The way the optimization works is that a random ID is generated in the
  pending jobs range, then a key selector is used to pick the job either
  before or after it. That reduces each dequeue attempt to just 1 read
  instead of reading up to 1000 jobs.
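  A minimal sketch of that key-selector trick (the `erlfdb` and
  `erlfdb_key` calls are real API to the best of my knowledge; the
  surrounding names and range handling are illustrative):

  ```
  %% Sketch only: pick a random key inside the pending range, take the
  %% first pending job at or after it, and fall back to the one before it.
  accept_random(Tx, PendingStart, PendingEnd) ->
      Rand = <<PendingStart/binary, (crypto:strong_rand_bytes(8))/binary>>,
      Sel = erlfdb_key:first_greater_or_equal(Rand),
      case erlfdb:wait(erlfdb:get_key(Tx, Sel)) of
          Key when Key >= PendingStart, Key < PendingEnd ->
              {ok, Key};
          _OutOfRange ->
              Sel2 = erlfdb_key:last_less_than(Rand),
              {ok, erlfdb:wait(erlfdb:get_key(Tx, Sel2))}
      end.
  ```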
* Handle error:{timeout, _} exception in couch_jobs:accept | Nick Vatamaniuc | 2020-06-02 | 1 | -0/+2
  Under load, the accept loop can blow up with a timeout error from
  `erlfdb:wait(...)` (https://github.com/apache/couchdb-erlfdb/blob/master/src/erlfdb.erl#L255),
  so guard against it just like we do for FDB transaction timeout (1031)
  errors.
* Remove on_commit handler from fabric2_fdb | Nick Vatamaniuc | 2020-06-02 | 2 | -47/+26
  Update db handles right away, as soon as the db version is checked. This
  ensures concurrent requests get access to the current handle as soon as
  possible and may avoid doing extra version checks and re-opens.
* Prevent eviction of newer handles from fabric_server cache | Nick Vatamaniuc | 2020-06-02 | 2 | -9/+70
  Check metadata versions to ensure newer handles are not clobbered. The
  same thing is done for removal: `maybe_remove/1` removes a handle only if
  there isn't a newer handle already there.
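  A minimal sketch of the version check (illustrative; the real code
  operates on the fabric2_server cache rather than a plain map):

  ```
  %% Sketch only: insert a handle unless the cache already holds one with
  %% a newer metadata version; removal applies the same comparison.
  maybe_cache(#{name := Name, md_version := Ver} = Db, Cache) ->
      case maps:find(Name, Cache) of
          {ok, #{md_version := OldVer}} when OldVer > Ver ->
              Cache;                    %% keep the newer cached handle
          _ ->
              Cache#{Name => Db}        %% insert or replace with newer
      end.
  ```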
* Guard couch_jobs:accept_loop timing out | Nick Vatamaniuc | 2020-05-29 | 1 | -1/+9
  And also against too many conflicts during overload.
* Protect couch_jobs activity monitor against timeouts as well | Nick Vatamaniuc | 2020-05-29 | 1 | -3/+3
* Fix bad catch statement in couch_jobs activity monitor | Nick Vatamaniuc | 2020-05-29 | 1 | -1/+1
* Fix mango erlfdb error catch clause erlfdb -> erlfdb_error | Nick Vatamaniuc | 2020-05-28 | 2 | -5/+6
* Don't skip over docs in mango indices on erlfdb errors | Nick Vatamaniuc | 2020-05-28 | 2 | -1/+16
* Introduce _bulk_docs max_doc_count limit | Nick Vatamaniuc | 2020-05-27 | 4 | -1/+32
  Let users specify the maximum document count for _bulk_docs requests. If
  the document count exceeds the maximum, the request returns a 413 HTTP
  error. That error also signals the replicator to try to bisect the
  _bulk_docs array into smaller batches.
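  A plausible configuration sketch; the section, the option spelling, and
  the default are assumptions taken from the commit title, so check
  default.ini for the shipped names:

  ```
  [couchdb]
  ; maximum number of documents accepted in one _bulk_docs request (assumed)
  max_doc_count = 10000
  ```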
* Lower the default batch size for update_docs to 2.5MB | Nick Vatamaniuc | 2020-05-27 | 2 | -2/+2
  Observed a number of timeouts with the previous default.
* Remove erlfdb mock from update_docs/2,3 test | Nick Vatamaniuc | 2020-05-22 | 1 | -14/+0
  In a constrained CI environment, transactions could retry multiple times,
  so we cannot rely on precisely counting erlfdb:transactional/2 calls.
* Improve load handling in couch_jobs and couch_views | Nick Vatamaniuc | 2020-05-21 | 2 | -2/+9
  Increase the couch_views job timeout by 20 seconds. This sets a larger
  jitter when multiple nodes concurrently check and re-enqueue jobs, and
  reduces the chance of them bumping into each other and conflicting. If
  they do conflict in the activity monitor, catch the error and emit an
  error log. In exchange for a longer delay before jobs whose workers have
  suddenly died get re-enqueued, we gain some more robustness under load.
* Merge pull request #2896 from cloudant/pagination-api-fix-limit | iilyak | 2020-05-21 | 2 | -19/+93
  Fix handling of limit query parameter
* Fix handling of limit query parameter | ILYA Khlopotov | 2020-05-20 | 2 | -19/+93
* Merge pull request #2897 from apache/improve-db-expiration-log | Peng Hui Jiang | 2020-05-21 | 1 | -2/+2
  Improve log of permanently deleting databases
* Improve log of permanently deleting databases [improve-db-expiration-log] | jiangph | 2020-05-21 | 1 | -2/+2
* Bulk docs transaction batching | Nick Vatamaniuc | 2020-05-20 | 5 | -29/+379
  * Interactive (regular) requests are split into smaller transactions, so
    larger updates won't fail with either timeout or transaction-too-large
    FDB errors.

  * Non-interactive (replicated) requests can now batch their updates into
    a few transactions and gain extra performance.

  Batch size is configurable:

  ```
  [fabric]
  update_docs_batch_size = 5000000
  ```
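  A minimal sketch of size-based batching (illustrative; the real splitting
  happens inside the fabric2 update path, and `erlang:external_size/1` is
  only one plausible size estimate):

  ```
  %% Sketch only: split Docs into batches whose approximate size stays
  %% under MaxSize bytes, so each batch fits in one FDB transaction.
  batch_docs(Docs, MaxSize) ->
      batch_docs(Docs, MaxSize, 0, [], []).

  batch_docs([], _Max, _Size, Cur, Acc) ->
      lists:reverse([lists:reverse(Cur) | Acc]);
  batch_docs([Doc | Rest], Max, Size, Cur, Acc) ->
      DocSize = erlang:external_size(Doc),
      case Cur =/= [] andalso Size + DocSize > Max of
          true ->
              batch_docs(Rest, Max, DocSize, [Doc],
                         [lists:reverse(Cur) | Acc]);
          false ->
              batch_docs(Rest, Max, Size + DocSize, [Doc | Cur], Acc)
      end.
  ```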
* Fix flaky couch_jobs type monitor test | Nick Vatamaniuc | 2020-05-15 | 1 | -2/+36
  Sometimes this test fails on Jenkins but doesn't fail locally. The
  attempted fix is to simply retry a few times until the number of children
  in the supervisor reaches the expected value. Also extend the timeout to
  15 seconds.
* Merge pull request #2870 from cloudant/pagination-api-2 | iilyak | 2020-05-15 | 7 | -54/+1683
  Pagination API
* Add tests for pagination API | ILYA Khlopotov | 2020-05-15 | 1 | -0/+771
* Implement pagination API | ILYA Khlopotov | 2020-05-15 | 6 | -45/+600
* Add tests for legacy API before refactoring | ILYA Khlopotov | 2020-05-15 | 1 | -0/+302
* Move not_implemented check down to allow testing of validation | ILYA Khlopotov | 2020-05-15 | 1 | -5/+6
* Fix variable shadowing | ILYA Khlopotov | 2020-05-15 | 1 | -4/+4
* Fix compiler warning | Jay Doane | 2020-05-14 | 1 | -1/+1
* Fix a few flaky tests in fabric2_db | Nick Vatamaniuc | 2020-05-13 | 4 | -17/+22
  Add some longer timeouts and fix a race condition in db cleanup tests.
  (Thanks to @jdoane for the patch.)
* Merge pull request #2857 from apache/background-db-deletion | Peng Hui Jiang | 2020-05-13 | 4 | -3/+358
  Background database deletion
* background deletion for soft-deleted database | jiangph | 2020-05-13 | 4 | -3/+358
  Allow a background job to delete soft-deleted databases according to
  specified criteria in order to release space. Once a database is
  hard-deleted, its data can't be fetched back.

  Co-authored-by: Nick Vatamaniuc <vatamane@apache.org>
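  A minimal sketch of such a loop (all names below are hypothetical; the
  real module runs under a supervisor and reads its check interval and
  retention period from config):

  ```
  %% Sketch only: periodically hard-delete soft-deleted databases that
  %% are older than the retention period.
  expiration_loop(CheckSec, RetentionSec) ->
      timer:sleep(timer:seconds(CheckSec)),
      lists:foreach(fun({DbName, DeletedAt}) ->
          case age_sec(DeletedAt) > RetentionSec of
              true -> hard_delete(DbName, DeletedAt);   %% hypothetical
              false -> ok
          end
      end, list_soft_deleted()),                        %% hypothetical
      expiration_loop(CheckSec, RetentionSec).
  ```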