Diffstat (limited to 'doc/administration/pseudonymizer.md')
-rw-r--r-- | doc/administration/pseudonymizer.md | 85 |
1 files changed, 50 insertions, 35 deletions
diff --git a/doc/administration/pseudonymizer.md b/doc/administration/pseudonymizer.md
index da3a2e4b34c..bd6982bea12 100644
--- a/doc/administration/pseudonymizer.md
+++ b/doc/administration/pseudonymizer.md
@@ -6,33 +6,38 @@ info: To determine the technical writer assigned to the Stage/Group associated w
 
 # Pseudonymizer **(ULTIMATE)**
 
-> [Introduced](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/5532) in GitLab 11.1.
+Your GitLab database contains sensitive information. To protect sensitive information
+when you run analytics on your database, you can use the Pseudonymizer service, which:
 
-As the GitLab database hosts sensitive information, using it unfiltered for analytics
-implies high security requirements. To help alleviate this constraint, the Pseudonymizer
-service is used to export GitLab data in a pseudonymized way.
+1. Uses `HMAC(SHA256)` to mutate fields containing sensitive information.
+1. Preserves references (referential integrity) between fields.
+1. Exports your GitLab data, scrubbed of sensitive material.
 
 WARNING:
-This process is not impervious. If the source data is available, it's possible for
-a user to correlate data to the pseudonymized version.
+If the source data is available, users can compare and correlate the scrubbed data
+with the original.
 
-The Pseudonymizer currently uses `HMAC(SHA256)` to mutate fields that shouldn't
-be textually exported. This ensures that:
+To generate a pseudonymized data set:
 
-- the end-user of the data source cannot infer/revert the pseudonymized fields
-- the referential integrity is maintained
+1. [Configure Pseudonymizer](#configure-pseudonymizer) fields and output location.
+1. [Enable Pseudonymizer data collection](#enable-pseudonymizer-data-collection).
+1. Optional. [Generate a data set manually](#generate-data-set-manually).
 
-## Configuration
+## Configure Pseudonymizer
 
-To configure the Pseudonymizer, you need to:
+To use the Pseudonymizer, configure both the fields you want to anonymize, and the location to
+store the scrubbed data:
 
-- Provide a manifest file that describes which fields should be included or
-  pseudonymized ([example `manifest.yml` file](https://gitlab.com/gitlab-org/gitlab/-/tree/master/config/pseudonymizer.yml)).
-  A default manifest is provided with the GitLab installation, using a relative file path that resolves from the Rails root.
-  Alternatively, you can use an absolute file path.
-- Use an object storage and specify the connection parameters in the `pseudonymizer.upload.connection` configuration option.
-
-[Read more about using object storage with GitLab](object_storage.md).
+1. **Create a manifest file**: This file describes the fields to include or pseudonymize.
+   - **Default manifest** - GitLab provides a default manifest in your GitLab installation
+     ([example `manifest.yml` file](https://gitlab.com/gitlab-org/gitlab/-/blob/master/config/pseudonymizer.yml)).
+     To use the example manifest file, use the `config/pseudonymizer.yml` relative path
+     when you configure connection parameters.
+   - **Custom manifest** - To use a custom manifest file, use the absolute path to
+     the file when you configure the connection parameters.
+1. **Configure connection parameters**: In the configuration method appropriate for
+   your version of GitLab, specify the [object storage](object_storage.md)
+   connection parameters (`pseudonymizer.upload.connection`).
 
 **For Omnibus installations:**
 
@@ -50,7 +55,7 @@ To configure the Pseudonymizer, you need to:
    }
    ```
 
-   If you are using AWS IAM profiles, be sure to omit the AWS access key and secret access key/value pairs.
+   If you are using AWS IAM profiles, omit the AWS access key and secret access key/value pairs.
 
    ```ruby
    gitlab_rails['pseudonymizer_upload_connection'] = {
@@ -85,24 +90,34 @@ To configure the Pseudonymizer, you need to:
 
 1. Save the file and [restart GitLab](restart_gitlab.md#installations-from-source) for the changes to take effect.
 
-## Usage
+## Enable Pseudonymizer data collection
+
+To enable data collection:
+
+1. On the top bar, select **Menu > Admin**.
+1. On the left sidebar, select **Settings > Metrics and Profiling**, then expand
+   **Pseudonymizer data collection**.
+1. Select **Enable Pseudonymizer data collection**.
+1. Select **Save changes**.
 
-You can optionally run the Pseudonymizer using the following environment variables:
+## Generate data set manually
 
-- `PSEUDONYMIZER_OUTPUT_DIR` - where to store the output CSV files (defaults to `/tmp`)
-- `PSEUDONYMIZER_BATCH` - the batch size when querying the DB (defaults to `100000`)
+You can also run the Pseudonymizer manually:
 
-```shell
-## Omnibus
-sudo gitlab-rake gitlab:db:pseudonymizer
+1. Set these environment variables:
+   - `PSEUDONYMIZER_OUTPUT_DIR` - Where to store the output CSV files. Defaults to `/tmp`.
+     These commands produce CSV files that can be quite large. Make sure the directory
+     can store a file at least 10% of the size of your database.
+   - `PSEUDONYMIZER_BATCH` - The batch size when querying the database. Defaults to `100000`.
+1. Run the command appropriate for your application:
+   - **Omnibus GitLab**:
+     `sudo gitlab-rake gitlab:db:pseudonymizer`
+   - **Installations from source**:
+     `sudo -u git -H bundle exec rake gitlab:db:pseudonymizer RAILS_ENV=production`
 
-## Source
-sudo -u git -H bundle exec rake gitlab:db:pseudonymizer RAILS_ENV=production
-```
+After you run the command, upload the output CSV files to your configured object
+storage. After the upload completes, delete the output file from the local disk.
 
-This produces some CSV files that might be very large, so make sure the
-`PSEUDONYMIZER_OUTPUT_DIR` has sufficient space. As a rule of thumb, at least
-10% of the database size is recommended.
+## Related topics
 
-After the pseudonymizer has run, the output CSV files should be uploaded to the
-configured object storage and deleted from the local disk.
+- [Using object storage with GitLab](object_storage.md).
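
Note on the `HMAC(SHA256)` claim in the changed text: the sketch below illustrates why a keyed digest can scrub a field while keeping references between tables intact. It is a minimal illustration under stated assumptions, not the Pseudonymizer's actual implementation. The `SECRET` value, the `pseudonymize` helper, and the two sample tables are all hypothetical.

```ruby
require 'openssl'

# Hypothetical secret for illustration only; this sketch makes no claim
# about how GitLab generates or stores its key.
SECRET = 'example-secret'

# Replace a sensitive value with its hex-encoded HMAC-SHA256 digest.
def pseudonymize(value, secret = SECRET)
  OpenSSL::HMAC.hexdigest(OpenSSL::Digest.new('SHA256'), secret, value.to_s)
end

# Two hypothetical tables that refer to the same user by email.
users  = [{ id: 1, email: 'alice@example.com' }]
issues = [{ id: 10, author_email: 'alice@example.com' }]

scrubbed_users  = users.map  { |row| row.merge(email: pseudonymize(row[:email])) }
scrubbed_issues = issues.map { |row| row.merge(author_email: pseudonymize(row[:author_email])) }

# The same input and key always give the same digest, so the scrubbed
# rows can still be joined on the pseudonymized field.
puts scrubbed_users.first[:email] == scrubbed_issues.first[:author_email]
```

The same determinism is what the WARNING in the diff describes: anyone who holds the source data and can recompute or line up the digests can correlate scrubbed rows with the originals.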