Diffstat (limited to 'doc')
-rw-r--r--  doc/saio/swift/object-server/1.conf            |   2
-rw-r--r--  doc/saio/swift/object-server/2.conf            |   2
-rw-r--r--  doc/saio/swift/object-server/3.conf            |   2
-rw-r--r--  doc/saio/swift/object-server/4.conf            |   2
-rw-r--r--  doc/saio/swift/proxy-server.conf               |   1
-rw-r--r--  doc/source/config/container_server_config.rst  | 224
-rw-r--r--  doc/source/deployment_guide.rst                |  98
-rw-r--r--  doc/source/overview_encryption.rst             |   5
-rw-r--r--  doc/source/overview_expiring_objects.rst       |  20
-rw-r--r--  doc/source/overview_policies.rst               |   3
-rw-r--r--  doc/source/ring_partpower.rst                  |  39
11 files changed, 337 insertions, 61 deletions
diff --git a/doc/saio/swift/object-server/1.conf b/doc/saio/swift/object-server/1.conf
index 7405dddf9..ecd5ff01c 100644
--- a/doc/saio/swift/object-server/1.conf
+++ b/doc/saio/swift/object-server/1.conf
@@ -30,3 +30,5 @@ rsync_module = {replication_ip}::object{replication_port}
[object-updater]
[object-auditor]
+
+[object-relinker]
diff --git a/doc/saio/swift/object-server/2.conf b/doc/saio/swift/object-server/2.conf
index 2b6bc0358..456f7d558 100644
--- a/doc/saio/swift/object-server/2.conf
+++ b/doc/saio/swift/object-server/2.conf
@@ -30,3 +30,5 @@ rsync_module = {replication_ip}::object{replication_port}
[object-updater]
[object-auditor]
+
+[object-relinker]
diff --git a/doc/saio/swift/object-server/3.conf b/doc/saio/swift/object-server/3.conf
index 6ef04ee59..9a0ebbdca 100644
--- a/doc/saio/swift/object-server/3.conf
+++ b/doc/saio/swift/object-server/3.conf
@@ -30,3 +30,5 @@ rsync_module = {replication_ip}::object{replication_port}
[object-updater]
[object-auditor]
+
+[object-relinker]
diff --git a/doc/saio/swift/object-server/4.conf b/doc/saio/swift/object-server/4.conf
index 3fbf83c6f..1c0db1ff5 100644
--- a/doc/saio/swift/object-server/4.conf
+++ b/doc/saio/swift/object-server/4.conf
@@ -30,3 +30,5 @@ rsync_module = {replication_ip}::object{replication_port}
[object-updater]
[object-auditor]
+
+[object-relinker]
diff --git a/doc/saio/swift/proxy-server.conf b/doc/saio/swift/proxy-server.conf
index 57a054087..124b36c62 100644
--- a/doc/saio/swift/proxy-server.conf
+++ b/doc/saio/swift/proxy-server.conf
@@ -92,6 +92,7 @@ use = egg:swift#symlink
use = egg:swift#s3api
s3_acl = yes
check_bucket_owner = yes
+cors_preflight_allow_origin = *
# Example to create root secret: `openssl rand -base64 32`
[filter:keymaster]
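
The ``cors_preflight_allow_origin`` setting added above can be exercised with an
ordinary CORS preflight request. A minimal sketch, assuming a SAIO-style proxy at
``http://127.0.0.1:8080`` and an S3-style path (since the option is set in the
``s3api`` section); host and path are illustrative assumptions::

    # Sketch: send a CORS preflight (OPTIONS) request and show which
    # Access-Control-* headers come back.  Endpoint and path are assumptions.
    import urllib.request

    req = urllib.request.Request(
        'http://127.0.0.1:8080/some-bucket/some-object',
        method='OPTIONS',
        headers={'Origin': 'http://example.com',
                 'Access-Control-Request-Method': 'GET'})
    with urllib.request.urlopen(req) as resp:
        for name, value in resp.getheaders():
            if name.lower().startswith('access-control-'):
                print(name, '=', value)
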
diff --git a/doc/source/config/container_server_config.rst b/doc/source/config/container_server_config.rst
index d5ab30d9e..acbb32310 100644
--- a/doc/source/config/container_server_config.rst
+++ b/doc/source/config/container_server_config.rst
@@ -16,6 +16,7 @@ The following configuration sections are available:
* :ref:`[DEFAULT] <container_server_default_options>`
* `[container-server]`_
* `[container-replicator]`_
+* `[container-sharder]`_
* `[container-updater]`_
* `[container-auditor]`_
@@ -268,6 +269,229 @@ ionice_priority None I/O scheduling priority of
==================== =========================== =============================
*******************
+[container-sharder]
+*******************
+
+The container-sharder re-uses features of the container-replicator and inherits
+the following configuration options defined for the `[container-replicator]`_:
+
+* interval
+* databases_per_second
+* per_diff
+* max_diffs
+* concurrency
+* node_timeout
+* conn_timeout
+* reclaim_age
+* rsync_compress
+* rsync_module
+* recon_cache_path
+
+
+================================= ================= =======================================
+Option                            Default           Description
+--------------------------------- ----------------- ---------------------------------------
+log_name                          container-sharder Label used when logging
+log_facility                      LOG_LOCAL0        Syslog log facility
+log_level                         INFO              Logging level
+log_address                       /dev/log          Logging directory
+
+auto_shard                        false             If the auto_shard option is true then
+                                                    the sharder will automatically select
+                                                    containers to shard, scan for shard
+                                                    ranges, and select shards to shrink.
+                                                    Warning: auto-sharding is still under
+                                                    development and should not be used in
+                                                    production; do not set this option to
+                                                    true in a production cluster.
+
+shard_container_threshold         1000000           When auto-sharding is enabled this
+                                                    defines the object count at which a
+                                                    container with container-sharding
+                                                    enabled will start to shard. This also
+                                                    indirectly determines the initial
+                                                    nominal size of shard containers,
+                                                    which is shard_container_threshold//2,
+                                                    as well as determining the thresholds
+                                                    for shrinking and merging shard
+                                                    containers.
+
+shard_shrink_point                10                When auto-sharding is enabled this
+                                                    defines the object count below which a
+                                                    'donor' shard container will be
+                                                    considered for shrinking into another
+                                                    'acceptor' shard container.
+                                                    shard_shrink_point is a percentage of
+                                                    shard_container_threshold, e.g. the
+                                                    default value of 10 means 10% of the
+                                                    shard_container_threshold.
+
+shard_shrink_merge_point          75                When auto-sharding is enabled this
+                                                    defines the maximum allowed size of an
+                                                    acceptor shard container after having
+                                                    a donor merged into it.
+                                                    shard_shrink_merge_point is a
+                                                    percentage of
+                                                    shard_container_threshold, e.g. the
+                                                    default value of 75 means that the
+                                                    projected sum of a donor object count
+                                                    and acceptor count must be less than
+                                                    75% of shard_container_threshold for
+                                                    the donor to be allowed to merge into
+                                                    the acceptor.
+
+                                                    For example, if
+                                                    shard_container_threshold is 1 million,
+                                                    shard_shrink_point is 5, and
+                                                    shard_shrink_merge_point is 75 then a
+                                                    shard will be considered for shrinking
+                                                    if it has less than or equal to 50
+                                                    thousand objects but will only merge
+                                                    into an acceptor if the combined
+                                                    object count would be less than or
+                                                    equal to 750 thousand objects (this
+                                                    arithmetic is sketched in code after
+                                                    this table).
+
+shard_scanner_batch_size          10                When auto-sharding is enabled this
+                                                    defines the maximum number of shard
+                                                    ranges that will be found each time
+                                                    the sharder daemon visits a sharding
+                                                    container. If necessary the sharder
+                                                    daemon will continue to search for
+                                                    more shard ranges each time it visits
+                                                    the container.
+
+cleave_batch_size                 2                 Defines the number of shard ranges
+                                                    that will be cleaved each time the
+                                                    sharder daemon visits a sharding
+                                                    container.
+
+cleave_row_batch_size             10000             Defines the size of batches of object
+                                                    rows read from a sharding container
+                                                    and merged to a shard container during
+                                                    cleaving.
+
+shard_replication_quorum          auto              Defines the number of successfully
+                                                    replicated shard dbs required when
+                                                    cleaving a previously uncleaved shard
+                                                    range before the sharder will progress
+                                                    to the next shard range. The value
+                                                    should be less than or equal to the
+                                                    container ring replica count. The
+                                                    default of 'auto' causes the container
+                                                    ring quorum value to be used. This
+                                                    option only applies to the
+                                                    container-sharder replication and does
+                                                    not affect the number of shard
+                                                    container replicas that will
+                                                    eventually be replicated by the
+                                                    container-replicator.
+
+existing_shard_replication_quorum auto              Defines the number of successfully
+                                                    replicated shard dbs required when
+                                                    cleaving a shard range that has been
+                                                    previously cleaved on another node
+                                                    before the sharder will progress to
+                                                    the next shard range. The value should
+                                                    be less than or equal to the container
+                                                    ring replica count. The default of
+                                                    'auto' causes the
+                                                    shard_replication_quorum value to be
+                                                    used. This option only applies to the
+                                                    container-sharder replication and does
+                                                    not affect the number of shard
+                                                    container replicas that will
+                                                    eventually be replicated by the
+                                                    container-replicator.
+
+internal_client_conf_path         see description   The sharder uses an internal client to
+                                                    create and make requests to
+                                                    containers. The absolute path to the
+                                                    client config file can be configured.
+                                                    Defaults to
+                                                    /etc/swift/internal-client.conf
+
+request_tries                     3                 The number of times the internal
+                                                    client will retry requests.
+
+recon_candidates_limit            5                 Each time the sharder dumps stats to
+                                                    the recon cache file it includes a
+                                                    list of containers that appear to need
+                                                    sharding but are not yet sharding. By
+                                                    default this list is limited to the
+                                                    top 5 containers, ordered by object
+                                                    count. The limit may be changed by
+                                                    setting recon_candidates_limit to an
+                                                    integer value. A negative value
+                                                    implies no limit.
+
+broker_timeout                    60                Large databases tend to take a while
+                                                    to work with, but we want to make sure
+                                                    we write down our progress. Use a
+                                                    larger-than-normal broker timeout to
+                                                    make us less likely to bomb out on a
+                                                    LockTimeout.
+================================= ================= =======================================
+
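The percentage-based shrink options above translate into absolute object counts
with simple arithmetic; a minimal Python sketch reproducing the worked example
from the table (the helper function is illustrative, not part of Swift)::

    # Sketch: derive absolute shrink/merge thresholds from the sharder's
    # percentage options, matching the example in the table above.
    def shrink_thresholds(shard_container_threshold=1000000,
                          shard_shrink_point=10,
                          shard_shrink_merge_point=75):
        shrink_below = shard_container_threshold * shard_shrink_point // 100
        merge_cap = shard_container_threshold * shard_shrink_merge_point // 100
        return shrink_below, merge_cap

    # Defaults: shrink candidates hold <= 100000 objects; a donor may only
    # merge while the projected acceptor size stays <= 750000 objects.
    print(shrink_thresholds())                      # (100000, 750000)
    # The table's example: threshold of 1 million with shard_shrink_point = 5
    print(shrink_thresholds(shard_shrink_point=5))  # (50000, 750000)
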
+*******************
[container-updater]
*******************
diff --git a/doc/source/deployment_guide.rst b/doc/source/deployment_guide.rst
index 22199ff13..5188bb895 100644
--- a/doc/source/deployment_guide.rst
+++ b/doc/source/deployment_guide.rst
@@ -10,12 +10,9 @@ Detailed descriptions of configuration options can be found in the
Hardware Considerations
-----------------------
-Swift is designed to run on commodity hardware. At Rackspace, our storage
-servers are currently running fairly generic 4U servers with 24 2T SATA
-drives and 8 cores of processing power. RAID on the storage drives is not
-required and not recommended. Swift's disk usage pattern is the worst
-case possible for RAID, and performance degrades very quickly using RAID 5
-or 6.
+Swift is designed to run on commodity hardware. RAID on the storage drives is
+not required and not recommended. Swift's disk usage pattern is the worst case
+possible for RAID, and performance degrades very quickly using RAID 5 or 6.
------------------
Deployment Options
@@ -40,12 +37,12 @@ and network I/O intensive.
The easiest deployment is to install all services on each server. There is
nothing wrong with doing this, as it scales each service out horizontally.
-At Rackspace, we put the Proxy Services on their own servers and all of the
-Storage Services on the same server. This allows us to send 10g networking to
-the proxy and 1g to the storage servers, and keep load balancing to the
-proxies more manageable. Storage Services scale out horizontally as storage
-servers are added, and we can scale overall API throughput by adding more
-Proxies.
+Alternatively, one set of servers may be dedicated to the Proxy Services and a
+different set of servers dedicated to the Storage Services. This allows faster
+networking to be configured to the proxy than the storage servers, and keeps
+load balancing to the proxies more manageable. Storage Services scale out
+horizontally as storage servers are added, and the overall API throughput can
+be scaled by adding more proxies.
If you need more throughput to either Account or Container Services, they may
each be deployed to their own servers. For example you might use faster (but
@@ -303,7 +300,7 @@ You can inspect the resulting combined configuration object using the
General Server Configuration
----------------------------
-Swift uses paste.deploy (http://pythonpaste.org/deploy/) to manage server
+Swift uses paste.deploy (https://pypi.org/project/Paste/) to manage server
configurations. Detailed descriptions of configuration options can be found in
the :doc:`configuration documentation <config/index>`.
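
For reference, the pipeline described by such a configuration can also be
composed programmatically with the PasteDeploy API (the ``paste.deploy``
module). A minimal sketch; it uses PasteDeploy directly rather than Swift's own
helpers, and the config path is an assumption::

    # Sketch: build the WSGI application (middleware pipeline + app) that a
    # paste.deploy config file describes.  Requires PasteDeploy and the
    # packages referenced by the config; the path is illustrative.
    from paste.deploy import loadapp

    app = loadapp('config:/etc/swift/proxy-server.conf')
    print(app)  # the fully composed WSGI application
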
@@ -485,12 +482,44 @@ and authorization middleware <overview_auth>` is highly recommended.
Memcached Considerations
------------------------
-Several of the Services rely on Memcached for caching certain types of
-lookups, such as auth tokens, and container/account existence. Swift does
-not do any caching of actual object data. Memcached should be able to run
-on any servers that have available RAM and CPU. At Rackspace, we run
-Memcached on the proxy servers. The ``memcache_servers`` config option
-in the ``proxy-server.conf`` should contain all memcached servers.
+Several of the Services rely on Memcached for caching certain types of lookups,
+such as auth tokens, and container/account existence. Swift does not do any
+caching of actual object data. Memcached should be able to run on any servers
+that have available RAM and CPU. Typically Memcached is run on the proxy
+servers. The ``memcache_servers`` config option in the ``proxy-server.conf``
+should contain all memcached servers.
+
+*************************
+Shard Range Listing Cache
+*************************
+
+When a container gets :ref:`sharded <sharding_doc>`, the root container will still be
+the primary entry point for many container requests, as it provides the list of shards.
+To take load off the root container, Swift caches the returned list of shards by default.
+
+As the number of shards for a root container grows beyond roughly 3k, the default
+memcached maximum item size of 1MB can be reached.
+
+If you exceed the configured memcached maximum item size you'll see messages like::
+
+ Error setting value in memcached: 127.0.0.1:11211: SERVER_ERROR object too large for cache
+
+When you see these messages, your root containers are getting hammered and
+probably returning 503 responses to clients. Override the default 1MB limit to
+5MB with something like::
+
+ /usr/bin/memcached -I 5000000 ...
+
+Memcached has a ``stats sizes`` command that reports current size usage. As this
+approaches the configured maximum item size, an increase might be in order::
+
+ # telnet <memcache server> 11211
+ > stats sizes
+ STAT 160 2
+ STAT 448 1
+ STAT 576 1
+ END
+
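A rough size estimate can also tell you ahead of time whether the default item
size will be exceeded. A hedged sketch; the per-shard-range figure below is an
assumption derived from the roughly 3k-shards-per-1MB observation above, not a
measured constant::

    # Sketch: estimate the cached shard-range listing size for a root
    # container and compare it with memcached's item size limit (-I).
    BYTES_PER_SHARD_RANGE = 350          # assumption: ~1MB / ~3000 shards
    MEMCACHED_ITEM_LIMIT = 1024 * 1024   # memcached default item size limit

    def listing_fits(shard_count, item_limit=MEMCACHED_ITEM_LIMIT):
        estimated = shard_count * BYTES_PER_SHARD_RANGE
        return estimated, estimated <= item_limit

    print(listing_fits(2000))   # (700000, True)   - fits under the 1MB default
    print(listing_fits(5000))   # (1750000, False) - raise -I, e.g. to 5000000
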
-----------
System Time
@@ -500,9 +529,9 @@ Time may be relative but it is relatively important for Swift! Swift uses
timestamps to determine which is the most recent version of an object.
It is very important for the system time on each server in the cluster to
be synced as closely as possible (more so for the proxy server, but in general
-it is a good idea for all the servers). At Rackspace, we use NTP with a local
-NTP server to ensure that the system times are as close as possible. This
-should also be monitored to ensure that the times do not vary too much.
+it is a good idea for all the servers). Typical deployments use NTP with a
+local NTP server to ensure that the system times are as close as possible.
+This should also be monitored to ensure that the times do not vary too much.
.. _general-service-tuning:
@@ -512,20 +541,23 @@ General Service Tuning
Most services support either a ``workers`` or ``concurrency`` value in the
settings. This allows the services to make effective use of the cores
-available. A good starting point to set the concurrency level for the proxy
+available. A good starting point is to set the concurrency level for the proxy
and storage services to 2 times the number of cores available. If more than
one service is sharing a server, then some experimentation may be needed to
find the best balance.
-At Rackspace, our Proxy servers have dual quad core processors, giving us 8
-cores. Our testing has shown 16 workers to be a pretty good balance when
-saturating a 10g network and gives good CPU utilization.
+For example, one operator reported using the following settings in a production
+Swift cluster:
+
+- Proxy servers have dual quad core processors (i.e. 8 cores); testing has
+ shown 16 workers to be a pretty good balance when saturating a 10g network
+  while giving good CPU utilization.
-Our Storage server processes all run together on the same servers. These servers have
-dual quad core processors, for 8 cores total. We run the Account, Container,
-and Object servers with 8 workers each. Most of the background jobs are run at
-a concurrency of 1, with the exception of the replicators which are run at a
-concurrency of 2.
+- Storage server processes all run together on the same servers. These servers
+ have dual quad core processors, for 8 cores total. The Account, Container,
+ and Object servers are run with 8 workers each. Most of the background jobs
+ are run at a concurrency of 1, with the exception of the replicators which
+ are run at a concurrency of 2.
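
The "2 times the number of cores" starting point above is easy to compute on
the box being tuned; a small sketch (the shared-server adjustment mirrors the
operator report above and is a rule of thumb, not a fixed formula)::

    # Sketch: starting worker counts from the guidance above.
    import multiprocessing

    cores = multiprocessing.cpu_count()
    proxy_workers = 2 * cores   # e.g. 8 cores -> 16 proxy workers, as reported
    # When account, container and object servers share one host, each service
    # typically gets fewer workers (the report above runs 8 each on 8 cores).
    shared_storage_workers = cores
    print(cores, proxy_workers, shared_storage_workers)
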
The ``max_clients`` parameter can be used to adjust the number of client
requests an individual worker accepts for processing. The fewer requests being
@@ -623,8 +655,8 @@ where Swift stores its data to the setting PRUNEPATHS in ``/etc/updatedb.conf``:
General System Tuning
---------------------
-Rackspace currently runs Swift on Ubuntu Server 10.04, and the following
-changes have been found to be useful for our use cases.
+The following changes have been found to be useful when running Swift on Ubuntu
+Server 10.04.
The following settings should be in ``/etc/sysctl.conf``::
diff --git a/doc/source/overview_encryption.rst b/doc/source/overview_encryption.rst
index cc429737e..beab7ba11 100644
--- a/doc/source/overview_encryption.rst
+++ b/doc/source/overview_encryption.rst
@@ -781,8 +781,9 @@ encrypted.
Encryption has no impact on the `container-reconciler` service. The
`container-reconciler` uses an internal client to move objects between
-different policy rings. The destination object has the same URL as the source
-object and the object is moved without re-encryption.
+different policy rings. The reconciler's pipeline *MUST NOT* have encryption
+enabled. The destination object has the same URL as the source object and the
+object is moved without re-encryption.
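
One way to guard against this is to check the reconciler's internal-client
style config for encryption-related filters before deploying it. A hedged
sketch; the path is illustrative and the names checked are the conventional
``keymaster`` and ``encryption`` pipeline entries::

    # Sketch: warn if an internal-client style config lists encryption
    # middleware in its pipeline.  The config path is an assumption.
    from configparser import ConfigParser

    CONF = '/etc/swift/internal-client.conf'
    parser = ConfigParser(interpolation=None)
    parser.read(CONF)
    pipeline = parser.get('pipeline:main', 'pipeline', fallback='').split()
    for name in ('keymaster', 'encryption'):
        if name in pipeline:
            print('WARNING: %r found in the pipeline of %s' % (name, CONF))
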
Considerations for developers
diff --git a/doc/source/overview_expiring_objects.rst b/doc/source/overview_expiring_objects.rst
index 361937d5a..78d8d3e3b 100644
--- a/doc/source/overview_expiring_objects.rst
+++ b/doc/source/overview_expiring_objects.rst
@@ -62,20 +62,20 @@ The expirer daemon will be moving to a new general task-queue based design that
will divide the work across all object servers, as such only expirers defined
in the object-server config will be able to use the new system.
The parameters in both files are identical except for a new option in the
-object-server ``[object-expirer]`` section, ``dequeue_from_legacy_queue``
+object-server ``[object-expirer]`` section, ``dequeue_from_legacy``
which, when set to ``True``, tells the expirer to check the legacy (soon to be
deprecated) queue in addition to using the new task queueing system.
.. note::
The new task-queue system has not been completed yet, so an expirer with
- ``dequeue_from_legacy_queue`` set to ``False`` will currently do nothing.
+ ``dequeue_from_legacy`` set to ``False`` will currently do nothing.
-By default ``dequeue_from_legacy_queue`` will be ``False``, it is necessary to
+By default ``dequeue_from_legacy`` will be ``False``; it must
be set to ``True`` explicitly while migrating from the old expiring queue.
Any expirer using the old config ``/etc/swift/object-expirer.conf`` will not
-use the new general task queue. It'll ignore the ``dequeue_from_legacy_queue``
+use the new general task queue. It'll ignore the ``dequeue_from_legacy``
option and will only check the legacy queue, meaning it'll run as a legacy expirer.
Why is this important? If you are currently running object-expirers on nodes
@@ -91,7 +91,7 @@ However, if your old expirers are running on the object-servers, the most
common topology, then you would add the new section to all object servers, to
deal with the new queue. In order to maintain the same number of expirers checking
the legacy queue, pick the same number of nodes as you previously had and turn
-on ``dequeue_from_legacy_queue`` on those nodes only. Also note on these nodes
+on ``dequeue_from_legacy`` on those nodes only. Also note on these nodes
you'd need to keep the legacy ``process`` and ``processes`` options to maintain
the concurrency level for the legacy queue.
@@ -113,10 +113,10 @@ Here is a quick sample of the ``object-expirer`` section required in the
interval = 300
# If this is true, the expirer also executes tasks in the legacy expirer task queue
- dequeue_from_legacy_queue = false
+ dequeue_from_legacy = false
- # processes can only be used in conjunction with `dequeue_from_legacy_queue`.
- # So this option is ignored if dequeue_from_legacy_queue=false.
+ # processes can only be used in conjunction with `dequeue_from_legacy`.
+ # So this option is ignored if dequeue_from_legacy=false.
# processes is how many parts to divide the legacy work into, one part per
# process that will be doing the work
# processes set to 0 means that a single legacy process will be doing all the work
@@ -124,8 +124,8 @@ Here is a quick sample of the ``object-expirer`` section required in the
# config value
# processes = 0
- # process can only be used in conjunction with `dequeue_from_legacy_queue`.
- # So this option is ignored if dequeue_from_legacy_queue=false.
+ # process can only be used in conjunction with `dequeue_from_legacy`.
+ # So this option is ignored if dequeue_from_legacy=false.
# process is which of the parts a particular legacy process will work on
# process can also be specified on the command line and will override the config
# value
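
To make the ``process``/``processes`` pair above more concrete: the legacy
queue is split into ``processes`` parts and each daemon only works the part
selected by its ``process`` value. A hedged sketch of that kind of split; the
hash-modulo assignment below illustrates the scheme and is not copied from
Swift's implementation::

    # Sketch: divide legacy expirer tasks among `processes` workers; a worker
    # only handles tasks whose hash falls into its `process` slot.
    import hashlib

    def my_tasks(tasks, process, processes):
        if processes <= 0:   # 0 means a single process handles all the work
            return list(tasks)
        picked = []
        for task in tasks:
            slot = int(hashlib.md5(task.encode()).hexdigest(), 16) % processes
            if slot == process:
                picked.append(task)
        return picked

    # Task names here are purely illustrative.
    tasks = ['1507000000-AUTH_test/c/o%d' % i for i in range(10)]
    print(my_tasks(tasks, process=0, processes=3))  # roughly a third of them
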
diff --git a/doc/source/overview_policies.rst b/doc/source/overview_policies.rst
index 94fb72a88..822db5037 100644
--- a/doc/source/overview_policies.rst
+++ b/doc/source/overview_policies.rst
@@ -292,6 +292,9 @@ Each policy section contains the following options:
- Policy names can be changed.
- The name ``Policy-0`` can only be used for the policy with
index ``0``.
+  - To avoid confusion with policy indexes, it is strongly recommended that
+ policy names are not numbers (e.g. '1'). However, for backwards
+ compatibility, names that are numbers are supported.
* ``aliases = <policy_name>[, <policy_name>, ...]`` (optional)
- A comma-separated list of alternative names for the policy.
- The default value is an empty list (i.e. no aliases).
diff --git a/doc/source/ring_partpower.rst b/doc/source/ring_partpower.rst
index 47d09beff..42b94c5c1 100644
--- a/doc/source/ring_partpower.rst
+++ b/doc/source/ring_partpower.rst
@@ -30,16 +30,21 @@ Caveats
Before increasing the partition power, consider the possible drawbacks.
There are a few caveats when increasing the partition power:
-* All hashes.pkl files will become invalid once hard links are created, and the
- replicators will need significantly more time on the first run after finishing
- the partition power increase.
-* Object replicators will skip partitions during the partition power increase.
- Replicators are not aware of hard-links, and would simply copy the content;
- this would result in heavy data movement and the worst case would be that all
- data is stored twice.
+* Almost all diskfiles in the cluster need to be relinked then cleaned up,
+ and all partition directories need to be rehashed. This imposes significant
+ I/O load on object servers, which may impact client requests. Consider using
+ cgroups, ``ionice``, or even just the built-in ``--files-per-second``
+ rate-limiting to reduce client impact.
+* Object replicators and reconstructors will skip affected policies during the
+ partition power increase. Replicators are not aware of hard-links, and would
+ simply copy the content; this would result in heavy data movement and the
+ worst case would be that all data is stored twice.
* Due to the fact that each object will now be hard linked from two locations,
- many more inodes will be used - expect around twice the amount. You need to
- check the free inode count *before* increasing the partition power.
+ many more inodes will be used temporarily - expect around twice the amount.
+ You need to check the free inode count *before* increasing the partition
+ power. Even after the increase is complete and extra hardlinks are cleaned
+ up, expect increased inode usage since there will be twice as many partition
+ and suffix directories.
* Also, object auditors might read each object twice before cleanup removes the
second hard link.
* Due to the new inodes more memory is needed to cache them, and your
@@ -76,13 +81,14 @@ on all object servers in this phase::
which normally happens within 15 seconds after writing a modified ring.
Also, make sure the modified rings are pushed to all nodes running object
services (replicators, reconstructors and reconcilers) - they have to skip
- partitions during relinking.
+ the policy during relinking.
.. note::
The relinking command must run as the same user as the daemon processes
(usually swift). It will create files and directories that must be
manipulable by the daemon processes (server, auditor, replicator, ...).
+ If necessary, the ``--user`` option may be used to drop privileges.
Relinking might take some time; while there is no data copied or actually
moved, the tool still needs to walk the whole file system and create new hard
@@ -131,10 +137,11 @@ is provided to do this. Run the following command on each storage node::
.. note::
- The cleanup must be finished within your object servers reclaim_age period
- (which is by default 1 week). Otherwise objects that have been overwritten
- between step #1 and step #2 and deleted afterwards can't be cleaned up
- anymore.
+ The cleanup must be finished within your object servers ``reclaim_age``
+ period (which is by default 1 week). Otherwise objects that have been
+ overwritten between step #1 and step #2 and deleted afterwards can't be
+ cleaned up anymore. You may want to increase your ``reclaim_age`` before
+ or during relinking.
Afterwards it is required to update the rings one last
time to inform servers that all steps to increase the partition power are done,
@@ -180,9 +187,9 @@ shows the mapping between old and new location::
>>> from swift.common.utils import replace_partition_in_path
>>> old='objects/16003/a38/fa0fcec07328d068e24ccbf2a62f2a38/1467658208.57179.data'
- >>> replace_partition_in_path(old, 14)
+ >>> replace_partition_in_path('', '/sda/' + old, 14)
'objects/16003/a38/fa0fcec07328d068e24ccbf2a62f2a38/1467658208.57179.data'
- >>> replace_partition_in_path(old, 15)
+ >>> replace_partition_in_path('', '/sda/' + old, 15)
'objects/32007/a38/fa0fcec07328d068e24ccbf2a62f2a38/1467658208.57179.data'
Using the original partition power (14) it returned the same path; however