author | Marcel Amirault <ravlen@gmail.com> | 2019-05-05 15:21:25 +0000 |
---|---|---|
committer | Achilleas Pipinellis <axil@gitlab.com> | 2019-05-05 15:21:25 +0000 |
commit | f4a1dcbe2fc31eca55c11be90588465924019972 (patch) | |
tree | 0a8cb51ba7f55ebf2235f607ed94e4c8abf39f00 /doc/administration/pseudonymizer.md | |
parent | 9b06dffc719ff068ccb690d193e2bce17ef05fcc (diff) | |
Docs: Merge Misc EE doc/administration files and dirs to CE
Diffstat (limited to 'doc/administration/pseudonymizer.md')
-rw-r--r-- | doc/administration/pseudonymizer.md | 103 |
1 files changed, 103 insertions, 0 deletions
diff --git a/doc/administration/pseudonymizer.md b/doc/administration/pseudonymizer.md
new file mode 100644
index 00000000000..036e1d3fe61
--- /dev/null
+++ b/doc/administration/pseudonymizer.md
@@ -0,0 +1,103 @@
+# Pseudonymizer **[ULTIMATE]**
+
+> [Introduced](https://gitlab.com/gitlab-org/gitlab-ee/merge_requests/5532) in [GitLab Ultimate][ee] 11.1.
+
+As GitLab's database hosts sensitive information, using it unfiltered for analytics
+implies high security requirements. To help alleviate this constraint, the Pseudonymizer
+service is used to export GitLab's data in a pseudonymized way.
+
+CAUTION: **Warning:**
+This process is not impervious. If the source data is available, it's possible for
+a user to correlate data to the pseudonymized version.
+
+The Pseudonymizer currently uses `HMAC(SHA256)` to mutate fields that shouldn't
+be textually exported. This ensures that:
+
+- the end-user of the data source cannot infer/revert the pseudonymized fields
+- the referential integrity is maintained
+
+## Configuration
+
+To configure the pseudonymizer, you need to:
+
+- Provide a manifest file that describes which fields should be included or
+  pseudonymized ([example `manifest.yml` file](https://gitlab.com/gitlab-org/gitlab-ee/tree/master/config/pseudonymizer.yml)).
+  A default manifest is provided with the GitLab installation. Using a relative file path will be resolved from the Rails root.
+  Alternatively, you can use an absolute file path.
+- Use an object storage and specify the connection parameters in the `pseudonymizer.upload.connection` configuration option.
+
+**For Omnibus installations:**
+
+1. Edit `/etc/gitlab/gitlab.rb` and add the following lines by replacing with
+   the values you want:
+
+   ```ruby
+   gitlab_rails['pseudonymizer_manifest'] = 'config/pseudonymizer.yml'
+   gitlab_rails['pseudonymizer_upload_remote_directory'] = 'gitlab-elt' # bucket name
+   gitlab_rails['pseudonymizer_upload_connection'] = {
+     'provider' => 'AWS',
+     'region' => 'eu-central-1',
+     'aws_access_key_id' => 'AWS_ACCESS_KEY_ID',
+     'aws_secret_access_key' => 'AWS_SECRET_ACCESS_KEY'
+   }
+   ```
+
+   NOTE: **Note:**
+   If you are using AWS IAM profiles, be sure to omit the AWS access key and secret access key/value pairs.
+
+   ```ruby
+   gitlab_rails['pseudonymizer_upload_connection'] = {
+     'provider' => 'AWS',
+     'region' => 'eu-central-1',
+     'use_iam_profile' => true
+   }
+   ```
+
+1. Save the file and [reconfigure GitLab](restart_gitlab.md#omnibus-gitlab-reconfigure)
+   for the changes to take effect.
+
+---
+
+**For installations from source:**
+
+1. Edit `/home/git/gitlab/config/gitlab.yml` and add or amend the following
+   lines:
+
+   ```yaml
+   pseudonymizer:
+     manifest: config/pseudonymizer.yml
+     upload:
+       remote_directory: 'gitlab-elt' # bucket name
+       connection:
+         provider: AWS
+         aws_access_key_id: AWS_ACCESS_KEY_ID
+         aws_secret_access_key: AWS_SECRET_ACCESS_KEY
+         region: eu-central-1
+   ```
+
+1. Save the file and [restart GitLab](restart_gitlab.md#installations-from-source)
+   for the changes to take effect.
+
+## Usage
+
+You can optionally run the pseudonymizer using the following environment variables:
+
+- `PSEUDONYMIZER_OUTPUT_DIR` - where to store the output CSV files (defaults to `/tmp`)
+- `PSEUDONYMIZER_BATCH` - the batch size when querying the DB (defaults to `100000`)
+
+```bash
+## Omnibus
+sudo gitlab-rake gitlab:db:pseudonymizer
+
+## Source
+sudo -u git -H bundle exec rake gitlab:db:pseudonymizer RAILS_ENV=production
+```
+
+This will produce some CSV files that might be very large, so make sure the
+`PSEUDONYMIZER_OUTPUT_DIR` has sufficient space. As a rule of thumb, at least
+10% of the database size is recommended.
+
+After the pseudonymizer has run, the output CSV files should be uploaded to the
+configured object storage and deleted from the local disk.
+
+[ee]: https://about.gitlab.com/pricing/
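
The added page says that fields are mutated with `HMAC(SHA256)` and that this keeps referential integrity. The following is a minimal Ruby sketch of that idea, not the Pseudonymizer's actual code: the `pseudonymize` helper, the secret key, and the sample rows are all hypothetical and exist only for illustration. It uses the standard library's `OpenSSL::HMAC`.

```ruby
require 'openssl'

# Hypothetical helper (not GitLab's implementation): replace a value with
# its HMAC-SHA256 hex digest, computed under a secret key.
def pseudonymize(value, secret)
  OpenSSL::HMAC.hexdigest(OpenSSL::Digest.new('SHA256'), secret, value.to_s)
end

# Assumption: a secret that is never shipped with the exported data.
secret = 'example-secret-key'

# Made-up rows from two tables that reference the same email address.
users  = [{ id: 1, email: 'alice@example.com' }]
issues = [{ id: 10, author_email: 'alice@example.com' }]

pseudo_users  = users.map  { |u| u.merge(email: pseudonymize(u[:email], secret)) }
pseudo_issues = issues.map { |i| i.merge(author_email: pseudonymize(i[:author_email], secret)) }

# The same input under the same key always yields the same digest, so the
# two exported tables can still be joined on the pseudonymized column.
puts pseudo_issues.first[:author_email] == pseudo_users.first[:email]
```

Because the digest is deterministic for a given key, anyone who holds both the source data and the export can line the two up, which is the limitation the page's warning describes.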