Commit message  Author  Age  Files  Lines
* Fix full ring assertion in fabric stream shard replacements  [fix-replacements-ring-assertion-logic]  Nick Vatamaniuc  2019-05-01  1  -2/+7
  In the fabric streams logic, after replacements are processed, the full ring assertion was made only for the waiting and replaced workers. But in cases where some of the workers had already returned results, the assertion would fail, since those ranges would be missing. The fix is to consider not just the waiting and replaced workers, but also the results that were already processed.
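  A minimal sketch of the adjusted check, assuming the shard lists for the three groups are at hand (illustrative, not the actual fabric_streams code; mem3_util:get_ring/1 is the ring helper referenced elsewhere in this log):

      %% Returns true if waiting + replaced + already-completed workers
      %% still cover the full ring; [] from get_ring/1 means no full
      %% covering could be assembled from the given shards.
      ring_covered(WaitingShards, ReplacedShards, CompletedShards) ->
          AllShards = WaitingShards ++ ReplacedShards ++ CompletedShards,
          mem3_util:get_ring(AllShards) =/= [].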
* Use individual rexi kill messages by default  Nick Vatamaniuc  2019-05-01  1  -6/+18
  When performing a rolling upgrade, older nodes won't be able to handle the newer kill_all messages. In busy clusters this could cause couch_server and other message queue backups while the upgrade is in progress. Make the message sending behavior configurable. After all the nodes have been upgraded, operators can toggle `[rexi] use_kill_all = true` to save on some inter-node network traffic. In a future release the configuration check might be removed, defaulting to sending kill_all messages only.
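  A hedged sketch of how such a toggle is typically read and acted on (config:get_boolean/3 is the standard CouchDB config call; the send_* helper names are illustrative, not the actual rexi functions):

      case config:get_boolean("rexi", "use_kill_all", false) of
          true ->
              %% one batched kill_all message per node (new behavior)
              send_kill_all(Node, Refs);
          false ->
              %% one kill message per reference (old, upgrade-safe default)
              [send_kill(Node, Ref) || Ref <- Refs]
      end.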
* Handle database re-creation edge case in internal replicator  Nick Vatamaniuc  2019-05-01  3  -6/+33
  Previously, if a database was deleted and re-created while the internal replication request was pending, the job would have been retried continuously. The mem3:targets_map/2 function would return an empty targets map, and mem3_rep:go would raise a function_clause exception if the database was present but was an older "incarnation" of it (with shards living on different target nodes). Because it was an exception and not an {error, ...} result, the process would exit with an error. Subsequently, mem3_sync would try to handle the process exit and check whether the database was deleted, but it didn't account for the case when the database was re-created, so it would resubmit the job into the queue again. To fix it, we introduce a function to check whether the database shard is part of the current database shard map, and perform that check both before building the targets map and on job retries.
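  A hedged sketch of the kind of membership check described above (the function name is illustrative, not necessarily the one added in the commit; mem3:dbname/1, mem3:shards/1 and mem3:name/1 are existing mem3 calls):

      %% True if the given shard file is still part of the current
      %% shard map of its database; false for shards belonging to an
      %% older "incarnation" of a re-created database.
      shard_in_current_map(ShardName) ->
          DbName = mem3:dbname(ShardName),
          Current = [mem3:name(S) || S <- mem3:shards(DbName)],
          lists:member(ShardName, Current).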
* Increase max number of resharding jobs  Nick Vatamaniuc  2019-04-24  1  -1/+1
  Also make it divisible by the default Q and N.
* Allow restricting resharding parameters  Nick Vatamaniuc  2019-04-23  2  -0/+51
  To avoid inadvertently splitting all the shards in all the ranges due to user error, introduce an option to enforce the presence of the node and range job creation parameters.
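  For illustration, such a restriction could be toggled with configuration along these lines (the section and option names here are an assumption, not taken from the commit itself):

      [reshard]
      require_node_param = true
      require_range_param = true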
* Expose node name via /_node/_local, closes #2005 (#2006)  Joan Touzet  2019-04-15  1  -0/+2
* Merge pull request #2003 from apache/dont-reset-index  Robert Newson  2019-04-12  1  -2/+2
  Don't reset_index if read_header fails
* Don't reset_index if read_header fails  [dont-reset-index]  Robert Newson  2019-04-12  2  -3/+3
  This decision is an old one, I think, and took the view that an error from read_header meant something fatally wrong with this index file. These days, it's far more likely to be something else, such as a process crash at the wrong moment, and therefore wiping the index out is an overreaction.
* Change _security object for new dbs to admin-only by default  Robert Newson  2019-04-12  1  -1/+1
* Fix upgrade clause for mem3_rpc:load_checkpoint/4,5  Nick Vatamaniuc  2019-04-11  1  -1/+5
  When upgrading, the new mem3_rpc:load_checkpoint with a filter hash argument won't be available on older nodes. Filter hashes are not currently used anyway, so to avoid crashes on a mixed cluster, call the older version without the filter hash part when the filter has the default <<>> value.
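  A hedged sketch of what such an upgrade clause can look like (argument names and the load_checkpoint_rpc helper are illustrative of the pattern, not copied from mem3_rpc):

      %% Default (empty) filter hash: fall back to the older 4-arity
      %% call so the RPC still works against non-upgraded nodes.
      load_checkpoint(Node, DbName, SourceNode, SourceUUID, <<>>) ->
          load_checkpoint(Node, DbName, SourceNode, SourceUUID);
      %% Non-default filter hash: use the new call, which is only safe
      %% once every node in the cluster has been upgraded.
      load_checkpoint(Node, DbName, SourceNode, SourceUUID, FilterHash) ->
          load_checkpoint_rpc(Node, DbName, SourceNode, SourceUUID, FilterHash).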
* In the resharding API test pick the first live node  Nick Vatamaniuc  2019-04-10  1  -3/+6
  Previously the first cluster node was picked. However, when running the test against a degraded cluster where that node is down, the test would fail.
* Merge pull request #2001 from cloudant/promote-ibrowse-4.0.1-1  Jay Doane  2019-04-10  1  -1/+1
  Promote ibrowse 4.0.1-1
* Promote ibrowse 4.0.1-1  Jay Doane  2019-04-09  1  -1/+1
* Port copy doc tests into elixir test suite (#2000)  Juanjo Rodriguez  2019-04-09  2  -1/+72
* Port javascript attachment test suite into elixir (#1999)  Juanjo Rodriguez  2019-04-08  8  -6/+1107
* Implement resharding HTTP API  Nick Vatamaniuc  2019-04-03  13  -2/+2971
  This implements the API as defined in RFC #1920. The handlers live in `mem3_reshard_httpd`, and helpers, such as validators, live in the `mem3_reshard_api` module. There are also a number of high-level (HTTP & fabric) API tests that check that shard splitting happens properly, that jobs behave as defined in the RFC, etc. Co-authored-by: Eric Avdey <eiri@eiri.ca>
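  As an illustration of the API this adds, a job-creation request can look roughly like the following (a hedged example; the exact accepted fields are defined by RFC #1920, and the values here are placeholders):

      POST /_reshard/jobs
      {"type": "split", "db": "mydb", "range": "00000000-7fffffff"}

  Job status is then available under /_reshard/jobs and the global resharding state under /_reshard.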
* Resharding supervisor and job manager  Nick Vatamaniuc  2019-04-03  8  -4/+1196
  Most of the resharding logic lives in the mem3 application under the `mem3_reshard_sup` supervisor. `mem3_reshard_sup` has three children:
  1) `mem3_reshard` : The main resharding job manager.
  2) `mem3_reshard_job_sup` : A simple-one-for-one supervisor to keep track of individual resharding jobs.
  3) `mem3_reshard_dbdoc` : Helper gen_server used to update the shard map.
  The `mem3_reshard` gen_server is the central point in the resharding logic. It is a job manager which accepts new jobs, monitors jobs when they run, checkpoints their status as they make progress, and knows how to restore their state when a node reboots. Jobs are represented as instances of the `#job{}` record defined in the `mem3_reshard.hrl` header. There is also a global resharding state represented by a `#state{}` record. The `mem3_reshard` gen_server maintains an ets table of "live" `#job{}` records, and its gen_server state is represented by `#state{}`. When jobs are checkpointed or the user updates the global resharding state, `mem3_reshard` will use the `mem3_reshard_store` module to persist those updates to `_local/...` documents in the shards database. The idea is to allow jobs to persist across node or application restarts.
  After a job is added, if the global state is not `stopped`, the `mem3_reshard` manager will ask `mem3_reshard_job_sup` to spawn a new child. That child will be running in a gen_server defined in the `mem3_reshard_job` module (included in subsequent commits). Each child process will periodically ask the `mem3_reshard` manager to checkpoint when it jumps to a new state. `mem3_reshard` checkpoints and then informs the child to continue its work.
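  A hedged sketch of the supervision layout described above (the restart strategy and exact child specs are assumptions; only the three child names come from the commit message):

      %% mem3_reshard_sup: one supervisor with the three children
      %% described above.
      init([]) ->
          Children = [
              {mem3_reshard_dbdoc,
                  {mem3_reshard_dbdoc, start_link, []},
                  permanent, 5000, worker, [mem3_reshard_dbdoc]},
              {mem3_reshard_job_sup,
                  {mem3_reshard_job_sup, start_link, []},
                  permanent, infinity, supervisor, [mem3_reshard_job_sup]},
              {mem3_reshard,
                  {mem3_reshard, start_link, []},
                  permanent, 5000, worker, [mem3_reshard]}
          ],
          {ok, {{one_for_all, 5, 10}, Children}}.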
* Shard splitting job implementation  Nick Vatamaniuc  2019-04-03  5  -0/+1575
  This is the implementation of the shard splitting job. The `mem3_reshard` manager spawns `mem3_reshard_job` instances via the `mem3_reshard_job_sup` supervisor. Each job is a gen_server process that starts in `mem3_reshard_job:init/1` with a `#job{}` record instance as the argument. Then the job goes through recovery, so it can handle resuming in cases where the job was interrupted previously and was initialized from a checkpointed state. Checkpointing happens in the `mem3_reshard` manager with the help of the `mem3_reshard_store` module (introduced in a previous commit).
  After recovery, processing starts in the `switch_state` function. The states are defined as a sequence of atoms in a list in `mem3_reshard.hrl`. In the `switch_state()` function, the state and history are updated in the `#job{}` record, then the `mem3_reshard` manager is asked to checkpoint the new state. The job process waits for the `mem3_reshard` manager to notify it when checkpointing has finished so it can continue processing the new state. That happens when the `do_state` gen_server cast is received. The `do_state` function has state-matching heads for each state. Usually, if there are long-running tasks to be performed, `do_state` will spawn a few workers and perform all the work there. In the meantime the main job process will simply wait for all the workers to exit. When that happens, it will call `switch_state` to switch to the new state, checkpoint again, and so on.
  Since there are quite a few steps needed to split a shard, some of the helper functions needed are defined in separate modules, such as:
  * mem3_reshard_index : Index discovery and building.
  * mem3_reshard_dbdoc : Shard map updates.
  * couch_db_split : Initial (bulk) data copy (added in a separate commit).
  * mem3_rep : To perform "top-offs" in between some steps.
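  A hedged sketch of the state-switching step described above (record fields and the manager call are simplified assumptions; the real mem3_reshard_job code differs in detail):

      %% Move the job to the next state and ask the manager to
      %% checkpoint; work resumes only when the manager casts
      %% {do_state, State} back after the checkpoint is persisted.
      switch_state(#job{} = Job0, NextState) ->
          Job = Job0#job{split_state = NextState},
          ok = mem3_reshard:checkpoint(Job#job.manager, Job),
          Job.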
* Update internal replicator to handle split shards  Nick Vatamaniuc  2019-04-03  3  -192/+849
  Shard splitting will result in uneven shard copies. Previously the internal replicator knew how to replicate from one shard copy to another, but now it needs to know how to replicate from one source to possibly multiple targets. The main idea is to reuse the same logic and "pick" function as `couch_db_split`. But to avoid the penalty of calling the custom hash function for every document even in cases where there is just a single target, there is a special "1 target" case where the hash function is `undefined`. Another case where the internal replicator is used is to top off replication and to replicate the shard map dbs to and from the current node (used in the shard map update logic). For that reason there are a few helper mem3_util and mem3_rpc functions.
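  A hedged sketch of the single-target shortcut described above (function and variable names are illustrative, not the actual mem3_rep code):

      %% With one target and an `undefined' hash function, skip hashing
      %% entirely; with several targets, let the caller-provided hash
      %% function choose, as couch_db_split does.
      pick_target(_DocId, [Target], undefined) ->
          Target;
      pick_target(DocId, Targets, HashFun) when is_function(HashFun, 2) ->
          HashFun(DocId, Targets).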
* Implement initial shard splitting data copy  Nick Vatamaniuc  2019-04-03  8  -1/+970
  The first step when a new shard splitting job starts is to do a bulk copy of data from the source to the target. Ideally this should happen as fast as possible as it could potentially churn through billions of documents. This logic is implemented in the `couch_db_split` module in the main `couch` application. To understand better what happens in `couch_db_split`, it helps to think of it as a version of `couch_bt_engine_compactor` that lives just above the couch_db_engine (PSE) interface instead of below it.
  The first thing the initial data copy does is create the targets. Targets are created based on the source parameters. So if the source uses a specific PSE engine, the targets will use the same PSE engine. If the source is partitioned, the targets will use the same partitioned hash function as well. An interesting bit with respect to target creation is that targets are not regular couch_db databases but are closer to a couch_file with a couch_db_updater process linked to it. They are linked directly without going through couch_server. This is done in order to avoid the complexity of handling concurrent updates, handling VDUs, interactive vs non-interactive updates, making sure it doesn't compact while copying happens, doesn't update any LRUs, or emit `db_updated` events. Those things are not needed, and handling them would make this more fragile. Another way to think of the targets during the initial bulk data copy is as "hidden" or "write-only" dbs.
  Another notable thing is that `couch_db_split` doesn't know anything about shards and only knows about databases. The input is a source, a map of targets, and a caller-provided "picker" function which knows, for each given document ID, how to pick one of the targets. This will work for both regular dbs as well as partitioned ones. All the logic will be inside the pick function, not embedded in `couch_db_split`.
  One last point is about handling internal replicator _local checkpoint docs. Those documents are transformed when they are copied such that the old source UUID is replaced with the new target's UUID, since each shard will have its own new UUID. That is done to avoid replications rewinding.
  Besides those points, the rest is rather boring and it's just "open documents from the source, pick the target, copy the documents to one of the targets, read more documents from the source, etc." Co-authored-by: Paul J. Davis <davisp@apache.org> Co-authored-by: Eric Avdey <eiri@eiri.ca>
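  A hedged example of what a caller-provided picker can look like (the crc32-based hash is purely illustrative; the real pick function matches the database's own partitioning hash and argument shape):

      %% Given a document id and the list of targets, deterministically
      %% choose one target; couch_db_split itself stays agnostic of how
      %% the choice is made.
      PickFun = fun(DocId, Targets) ->
          Idx = erlang:crc32(DocId) rem length(Targets),
          lists:nth(Idx + 1, Targets)
      end.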
* Uneven shard copy handling in mem3 and fabric  Nick Vatamaniuc  2019-04-03  29  -422/+1719
  The introduction of shard splitting will eliminate the constraint that all document copies are located in shards with the same range boundaries. That assumption was made by default in the mem3 and fabric functions that do shard replacement, worker spawning, unpacking of `_changes` update sequences, and some others. This commit updates those places to handle the case where document copies might be in different shard ranges.
  A good place to start from is the `mem3_util:get_ring()` function. This function returns a full, non-overlapping ring from a set of possibly overlapping shards. This function is used by almost everything else in this commit:
  1) It's used when only a single copy of the data is needed, for example in _all_docs or _changes processing.
  2) It's used when checking if progress is possible after some nodes died. `get_ring()` returning `[]` when it cannot find a full ring is used to indicate that progress is not possible.
  3) It's used during shard replacement. This is perhaps the most complicated case. During replacement, besides just finding a possible covering of the ring from the set of shards, it is also desirable to find one that minimizes the number of workers that have to be replaced. A neat trick used here is to provide `get_ring` with a custom sort function, which prioritizes certain shard copies over others. In the case of replacements it prioritizes shards for which workers have already spawned. In the default case `get_ring()` will prioritize longer ranges over shorter ones, so for example, to cover the interval [00-ff] with either the [00-7f, 80-ff] or the [00-ff] shard ranges, it will pick the single [00-ff] range instead of the [00-7f, 80-ff] pair.
  Co-authored-by: Paul J. Davis <davisp@apache.org>
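  A hedged usage sketch of the helper described above (variable names are illustrative; mem3_util:get_ring/1 and its []-on-failure behavior come from the commit message):

      %% From a set of possibly overlapping shard copies, either get one
      %% full non-overlapping covering of the ring or conclude that no
      %% progress is possible.
      case mem3_util:get_ring(AllShardCopies) of
          [] ->
              {error, no_full_ring};
          Ring ->
              %% e.g. a single 00000000-ffffffff copy is preferred over
              %% a 00000000-7fffffff + 80000000-ffffffff pair
              {ok, Ring}
      end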
* Merge pull request #1983 from cloudant/fix-external-docs-size  Eric Avdey  2019-03-29  2  -7/+7
  More precise calculation of external docs' size
* Reuse pre-calculated external docs' size on compaction  Eric Avdey  2019-03-28  1  -6/+6
* Use couch_ejson_size for calculation of doc's ejson size  Eric Avdey  2019-03-27  1  -1/+1
* Skip running PropEr's own unit tests  Nick Vatamaniuc  2019-03-26  1  -1/+1
  We skip mochiweb, snappy and others already.
* Merge pull request #1991 from cloudant/improve-elixir-test-stability  Jay Doane  2019-03-25  2  -11/+16
  Improve elixir test stability
* Improve elixir test stability  Jay Doane  2019-03-25  2  -11/+16
  • Disable self-admitted flaky "partition _all_docs with timeout" test
  • Double timeout for "GET /dbname/_design_docs" test
* Merge pull request #1981 from cloudant/update/ioq-2.1.1  iilyak  2019-03-18  1  -1/+1
  Update ioq to 2.1.1
* Update ioq to 2.1.1  ILYA Khlopotov  2019-03-18  1  -1/+1
* test: port invalid_docids to Elixir test suite (#1968)  Alessio Biancalana  2019-03-17  2  -1/+86
* Add security item to the RFC template (#1914)  Joan Touzet  2019-03-11  1  -2/+6
* Added more info for a mango sort error (#1970)  garren smith  2019-03-07  3  -5/+65
  The mango sort error can be a bit confusing when using a partitioned database. This gives a little more detail on why mango could not find a specific index when a sort is used.
* Merge pull request #1971 from apache/weak-etag-comparison  Robert Newson  2019-03-06  2  -1/+11
  Ignore weak ETag part
* Ignore weak ETag part  Robert Newson  2019-03-06  2  -1/+11
  Some load balancer configurations (HAProxy with compression enabled is the motivating example) will add W/ to our response ETags if they modify the response before sending it to the client. Per RFC 7232, Section 3.2: "A recipient MUST use the weak comparison function when comparing entity-tags for If-None-Match (Section 2.3.2), since weak entity-tags can be used for cache validation even if there have been changes to the representation data." This change improves our ETag checking toward RFC compliance.
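  A hedged sketch of the weak-tag handling described above (the function name is illustrative, not the actual chttpd code):

      %% Strip the weak-validator prefix from a client-supplied
      %% entity-tag before comparing it with our strong ETag.
      weak_to_strong(<<"W/", ETag/binary>>) -> ETag;
      weak_to_strong(ETag) -> ETag.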
* Add stats in fabric for partition and normal views (#1963)  garren smith  2019-03-06  4  -5/+104
  Add metrics for partition queries: this adds httpd metrics to monitor the number of requests for partition queries. It also adds metrics to record the number of timeouts for partition and global query requests.
* Improve chttpd_socket_buffer_size_test  Nick Vatamaniuc  2019-03-05  1  -94/+58
  Previously this test used to fail on Windows, possibly due to the restarting of the chttpd and mochiweb applications. To avoid restarting those applications, switch to updating the configuration before the test instance of couch is started. Some of the tests were redundant. Switched to testing the default case, the case when the userland buffer is set too small (triggering the HTTP header parsing bug), and that setting recbuf too low also sets the userland buffer as well.
* test: port multiple_rows.js to Elixir test suite (#1958)  Alessio Biancalana  2019-03-05  1  -0/+136
* Jenkins add attachment test (#1953)  garren smith  2019-03-05  2  -6/+488
  * update readme for all tests written
  * Convert Attachment.js to Elixir
  * add w:3
  * add more w:3
  * more w:3
  * mix format
* Warn people to edit both Makefiles. (#1952)  Joan Touzet  2019-02-28  2  -1/+9
* Merge pull request #1955 from apache/optional-proper  Robert Newson  2019-02-28  2  -3/+27
  Make PropEr an optional (test) dependency
* Make PropEr an optional (test) dependency  Robert Newson  2019-02-28  2  -3/+27
* Merge pull request #1951 from apache/fail-make-on-eunit-failure  Russell Branca  2019-02-27  1  -1/+1
  Fail make eunit upon eunit app suite failure
* Fail make eunit upon eunit app suite failure  [fail-make-on-eunit-failure]  Russell Branca  2019-02-27  1  -1/+1
* Merge pull request #1942 from cloudant/update-smoosh-1.0.1  iilyak  2019-02-26  1  -1/+1
  Update smoosh to 1.0.1
* Update smoosh to 1.0.1  ILYA Khlopotov  2019-02-26  1  -1/+1
* Merge pull request #1941 from apache/upgrade-ken-1.0.3  Robert Newson  2019-02-25  1  -1/+1
  upgrade ken to 1.0.3
* upgrade ken to 1.0.3  Tony Sun  2019-02-25  1  -1/+1
* Merge pull request #1938 from cloudant/update-folsom  iilyak  2019-02-25  1  -1/+1
  Update folsom to support newer erlang
* Update folsom to support newer erlang  ILYA Khlopotov  2019-02-25  1  -1/+1
* fixes to elixir tests (#1939)  garren smith  2019-02-25  2  -6/+11