summaryrefslogtreecommitdiff
path: root/doc/development/snowplow/index.md
blob: 9b684757fe1cced4e34a087d0925a9687a9b03c5 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
---
stage: Growth
group: Product Intelligence
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
---

# Snowplow

Snowplow is an enterprise-grade marketing and Product Intelligence platform that tracks how users engage with our website and application.

[Snowplow](https://snowplowanalytics.com) consists of several loosely-coupled sub-systems:

- **Trackers** fire Snowplow events. Snowplow has twelve trackers that cover web, mobile, desktop, server, and IoT.
- **Collectors** receive Snowplow events from trackers. We use different event collectors that synchronize events to Amazon S3, Apache Kafka, or Amazon Kinesis.
- **Enrich** cleans raw Snowplow events, enriches them, and puts them into storage. There is a Hadoop-based enrichment process, and a Kinesis-based or Kafka-based process.
- **Storage** stores Snowplow events. We store the Snowplow events in a flat file structure on S3, and in the Redshift and PostgreSQL databases.
- **Data modeling** joins event-level data with other data sets, aggregates them into smaller data sets, and applies business logic. This produces a clean set of tables for data analysis. We use data models for Redshift and Looker.
- **Analytics** are performed on Snowplow events or on aggregate tables.

![snowplow_flow](../img/snowplow_flow.png)

## Enable Snowplow tracking

Tracking can be enabled at:

- The instance level, which enables tracking on both the frontend and backend layers.
- The user level. User tracking can be disabled on a per user basis.
  GitLab respects the [Do Not Track](https://www.eff.org/issues/do-not-track) standard, so any user who has enabled the Do Not Track option in their browser is not tracked at a user level.

Snowplow tracking is enabled on GitLab.com, and we use it for most of our tracking strategy.

To enable Snowplow tracking on a self-managed instance:

1. On the top bar, select **Menu > Admin**, then select **Settings > General**.
   Alternatively, go to `admin/application_settings/general` in your browser.

1. Expand **Snowplow**.

1. Select **Enable Snowplow tracking** and enter your Snowplow configuration information. For example:

   | Name               | Value                         |
   |--------------------|-------------------------------|
   | Collector hostname | `your-snowplow-collector.net` |
   | App ID             | `gitlab`                      |
   | Cookie domain      | `.your-gitlab-instance.com`   |

1. Select **Save changes**.

## Snowplow request flow

The following example shows a basic request/response flow between the following components:

- Snowplow JS / Ruby Trackers on GitLab.com
- [GitLab.com Snowplow Collector](https://gitlab.com/gitlab-com/gl-infra/readiness/-/blob/master/library/snowplow/index.md)
- The GitLab S3 Bucket
- The GitLab Snowflake Data Warehouse
- Sisense:

```mermaid
sequenceDiagram
    participant Snowplow JS (Frontend)
    participant Snowplow Ruby (Backend)
    participant GitLab.com Snowplow Collector
    participant S3 Bucket
    participant Snowflake DW
    participant Sisense Dashboards
    Snowplow JS (Frontend) ->> GitLab.com Snowplow Collector: FE Tracking event
    Snowplow Ruby (Backend) ->> GitLab.com Snowplow Collector: BE Tracking event
    loop Process using Kinesis Stream
      GitLab.com Snowplow Collector ->> GitLab.com Snowplow Collector: Log raw events
      GitLab.com Snowplow Collector ->> GitLab.com Snowplow Collector: Enrich events
      GitLab.com Snowplow Collector ->> GitLab.com Snowplow Collector: Write to disk
    end
    GitLab.com Snowplow Collector ->> S3 Bucket: Kinesis Firehose
    Note over GitLab.com Snowplow Collector, S3 Bucket: Pseudonymization
    S3 Bucket->>Snowflake DW: Import data
    Snowflake DW->>Snowflake DW: Transform data using dbt
    Snowflake DW->>Sisense Dashboards: Data available for querying
```

## Structured event taxonomy

Click events must be consistent. If each feature captures events differently, it can be difficult
to perform analysis.

Each click event provides attributes that describe the event.

| Attribute | Type    | Required | Description |
| --------- | ------- | -------- | ----------- |
| category  | text    | true     | The page or backend section of the application. Unless infeasible, use the Rails page attribute by default in the frontend, and namespace + class name on the backend. |
| action    | text    | true     | The action the user takes, or aspect that's being instrumented. The first word must describe the action or aspect. For example, clicks must be `click`, activations must be `activate`, creations must be `create`. Use underscores to describe what was acted on. For example, activating a form field is `activate_form_input`, an interface action like clicking on a dropdown is `click_dropdown`, a behavior like creating a project record from the backend is `create_project`. |
| label     | text    | false    | The specific element or object to act on. This can be one of the following: the label of the element, for example, a tab labeled 'Create from template' for `create_from_template`; a unique identifier if no text is available, for example, `groups_dropdown_close` for closing the Groups dropdown in the top bar; or the name or title attribute of a record being created. |
| property  | text    | false    | Any additional property of the element, or object being acted on. |
| value     | decimal | false    | Describes a numeric value (decimal) directly related to the event. This could be the value of an input. For example, `10` when clicking `internal` visibility. |

### Examples

| Category* | Label            | Action                | Property** | Value |
|-------------|------------------|-----------------------|----------|:-----:|
| `[root:index]` | `main_navigation`            | `click_navigation_link` | `[link_label]`   | - |
| `[groups:boards:show]` | `toggle_swimlanes` | `click_toggle_button` | - | `[is_active]` |
| `[projects:registry:index]` | `registry_delete` | `click_button` | - | - |
| `[projects:registry:index]` | `registry_delete` | `confirm_deletion` | - | - |
| `[projects:blob:show]` | `congratulate_first_pipeline` | `click_button` | `[human_access]` | - |
| `[projects:clusters:new]` | `chart_options` | `generate_link` | `[chart_link]` | - |
| `[projects:clusters:new]` | `chart_options` | `click_add_label_button` | `[label_id]` | - |

_* If you choose to omit the category you can use the default._<br>
_** Use property for variable strings._

### Reference SQL

#### Last 20 `reply_comment_button` events

```sql
SELECT
  session_id,
  event_id,
  event_label,
  event_action,
  event_property,
  event_value,
  event_category,
  contexts
FROM legacy.snowplow_structured_events_all
WHERE
  event_label = 'reply_comment_button'
  AND event_action = 'click_button'
  -- AND event_category = 'projects:issues:show'
  -- AND event_value = 1
ORDER BY collector_tstamp DESC
LIMIT 20
```

#### Last 100 page view events

```sql
SELECT
  -- page_url,
  -- page_title,
  -- referer_url,
  -- marketing_medium,
  -- marketing_source,
  -- marketing_campaign,
  -- browser_window_width,
  -- device_is_mobile
  *
FROM legacy.snowplow_page_views_30
ORDER BY page_view_start DESC
LIMIT 100
```

#### Top 20 users who fired `reply_comment_button` in the last 30 days

```sql
SELECT
  count(*) as hits,
  se_action,
  se_category,
  gsc_pseudonymized_user_id
FROM legacy.snowplow_gitlab_events_30
WHERE
  se_label = 'reply_comment_button'
  AND gsc_pseudonymized_user_id IS NOT NULL
GROUP BY gsc_pseudonymized_user_id, se_category, se_action
ORDER BY count(*) DESC
LIMIT 20
```

#### Query JSON formatted data

```sql
SELECT
  derived_tstamp,
  contexts:data[0]:data:extra:old_format as CURRENT_FORMAT,
  contexts:data[0]:data:extra:value as UPDATED_FORMAT
FROM legacy.snowplow_structured_events_all
WHERE event_action in ('wiki_format_updated')
ORDER BY derived_tstamp DESC
LIMIT 100
```

### Web-specific parameters

Snowplow JavaScript adds [web-specific parameters](https://docs.snowplowanalytics.com/docs/collecting-data/collecting-from-own-applications/snowplow-tracker-protocol/#Web-specific_parameters) to all web events by default.

## Snowplow monitoring

For different stages in the processing pipeline, there are several tools that monitor Snowplow events tracking:

- [Product Intelligence Grafana dashboard](https://dashboards.gitlab.net/d/product-intelligence-main/product-intelligence-product-intelligence?orgId=1) monitors backend events sent from GitLab.com instance to collectors fleet. This dashboard provides information about:
  - The number of events that successfully reach Snowplow collectors.
  - The number of events that failed to reach Snowplow collectors.
  - The number of backend events that were sent.
- [AWS CloudWatch dashboard](https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#dashboards:name=SnowPlow;start=P3D) monitors the state of the events processing pipeline. The pipeline starts from Snowplow collectors, through to enrichers and pseudonymization, and up to persistence on S3 bucket from which events are imported to Snowflake Data Warehouse. To view this dashboard AWS access is required, follow this [instruction](https://gitlab.com/gitlab-org/growth/product-intelligence/snowplow-pseudonymization#monitoring) if you are interested in getting one.
- [SiSense dashboard](https://app.periscopedata.com/app/gitlab/417669/Snowplow-Summary-Dashboard) provides information about the number of good and bad events imported into the Data Warehouse, in addition to the total number of imported Snowplow events.

For more information, see this [video walk-through](https://www.youtube.com/watch?v=NxPS0aKa_oU).

## Related topics

- [Snowplow data structure](https://docs.snowplowanalytics.com/docs/understanding-your-pipeline/canonical-event/)
- [Our Iglu schema registry](https://gitlab.com/gitlab-org/iglu)
- [List of events used in our codebase (Event Dictionary)](https://metrics.gitlab.com/snowplow/)
- [Product Intelligence Guide](https://about.gitlab.com/handbook/product/product-intelligence-guide/)
- [Service Ping Guide](../service_ping/index.md)
- [Product Intelligence Direction](https://about.gitlab.com/direction/product-intelligence/)
- [Data Analysis Process](https://about.gitlab.com/handbook/business-technology/data-team/#data-analysis-process/)
- [Data for Product Managers](https://about.gitlab.com/handbook/business-technology/data-team/programs/data-for-product-managers/)
- [Data Infrastructure](https://about.gitlab.com/handbook/business-technology/data-team/platform/infrastructure/)