summaryrefslogtreecommitdiff
path: root/doc/administration/pseudonymizer.md
diff options
context:
space:
mode:
authorMarcel Amirault <ravlen@gmail.com>2019-05-05 15:21:25 +0000
committerAchilleas Pipinellis <axil@gitlab.com>2019-05-05 15:21:25 +0000
commitf4a1dcbe2fc31eca55c11be90588465924019972 (patch)
tree0a8cb51ba7f55ebf2235f607ed94e4c8abf39f00 /doc/administration/pseudonymizer.md
parent9b06dffc719ff068ccb690d193e2bce17ef05fcc (diff)
downloadgitlab-ce-f4a1dcbe2fc31eca55c11be90588465924019972.tar.gz
Docs: Merge Misc EE doc/administration files and dirs to CE
Diffstat (limited to 'doc/administration/pseudonymizer.md')
-rw-r--r--doc/administration/pseudonymizer.md103
1 files changed, 103 insertions, 0 deletions
diff --git a/doc/administration/pseudonymizer.md b/doc/administration/pseudonymizer.md
new file mode 100644
index 00000000000..036e1d3fe61
--- /dev/null
+++ b/doc/administration/pseudonymizer.md
@@ -0,0 +1,103 @@
+# Pseudonymizer **[ULTIMATE]**
+
+> [Introduced](https://gitlab.com/gitlab-org/gitlab-ee/merge_requests/5532) in [GitLab Ultimate][ee] 11.1.
+
+As GitLab's database hosts sensitive information, using it unfiltered for analytics
+implies high security requirements. To help alleviate this constraint, the Pseudonymizer
+service is used to export GitLab's data in a pseudonymized way.
+
+CAUTION: **Warning:**
+This process is not impervious. If the source data is available, it's possible for
+a user to correlate data to the pseudonymized version.
+
+The Pseudonymizer currently uses `HMAC(SHA256)` to mutate fields that shouldn't
+be textually exported. This ensures that:
+
+- the end-user of the data source cannot infer/revert the pseudonymized fields
+- the referential integrity is maintained
+
+## Configuration
+
+To configure the pseudonymizer, you need to:
+
+- Provide a manifest file that describes which fields should be included or
+ pseudonymized ([example `manifest.yml` file](https://gitlab.com/gitlab-org/gitlab-ee/tree/master/config/pseudonymizer.yml)).
+ A default manifest is provided with the GitLab installation. Using a relative file path will be resolved from the Rails root.
+ Alternatively, you can use an absolute file path.
+- Use an object storage and specify the connection parameters in the `pseudonymizer.upload.connection` configuration option.
+
+**For Omnibus installations:**
+
+1. Edit `/etc/gitlab/gitlab.rb` and add the following lines by replacing with
+ the values you want:
+
+ ```ruby
+ gitlab_rails['pseudonymizer_manifest'] = 'config/pseudonymizer.yml'
+ gitlab_rails['pseudonymizer_upload_remote_directory'] = 'gitlab-elt' # bucket name
+ gitlab_rails['pseudonymizer_upload_connection'] = {
+ 'provider' => 'AWS',
+ 'region' => 'eu-central-1',
+ 'aws_access_key_id' => 'AWS_ACCESS_KEY_ID',
+ 'aws_secret_access_key' => 'AWS_SECRET_ACCESS_KEY'
+ }
+ ```
+
+ NOTE: **Note:**
+ If you are using AWS IAM profiles, be sure to omit the AWS access key and secret access key/value pairs.
+
+ ```ruby
+ gitlab_rails['pseudonymizer_upload_connection'] = {
+ 'provider' => 'AWS',
+ 'region' => 'eu-central-1',
+ 'use_iam_profile' => true
+ }
+ ```
+
+1. Save the file and [reconfigure GitLab](restart_gitlab.md#omnibus-gitlab-reconfigure)
+ for the changes to take effect.
+
+---
+
+**For installations from source:**
+
+1. Edit `/home/git/gitlab/config/gitlab.yml` and add or amend the following
+ lines:
+
+ ```yaml
+ pseudonymizer:
+ manifest: config/pseudonymizer.yml
+ upload:
+ remote_directory: 'gitlab-elt' # bucket name
+ connection:
+ provider: AWS
+ aws_access_key_id: AWS_ACCESS_KEY_ID
+ aws_secret_access_key: AWS_SECRET_ACCESS_KEY
+ region: eu-central-1
+ ```
+
+1. Save the file and [restart GitLab](restart_gitlab.md#installations-from-source)
+ for the changes to take effect.
+
+## Usage
+
+You can optionally run the pseudonymizer using the following environment variables:
+
+- `PSEUDONYMIZER_OUTPUT_DIR` - where to store the output CSV files (defaults to `/tmp`)
+- `PSEUDONYMIZER_BATCH` - the batch size when querying the DB (defaults to `100000`)
+
+```bash
+## Omnibus
+sudo gitlab-rake gitlab:db:pseudonymizer
+
+## Source
+sudo -u git -H bundle exec rake gitlab:db:pseudonymizer RAILS_ENV=production
+```
+
+This will produce some CSV files that might be very large, so make sure the
+`PSEUDONYMIZER_OUTPUT_DIR` has sufficient space. As a rule of thumb, at least
+10% of the database size is recommended.
+
+After the pseudonymizer has run, the output CSV files should be uploaded to the
+configured object storage and deleted from the local disk.
+
+[ee]: https://about.gitlab.com/pricing/