---
description: 'An introduction to reference parsers and reference filters, and a guide to their implementation.'
---

# Reference processing

[GitLab Flavored Markdown](../user/markdown.md) includes the ability to process
references to a range of GitLab domain objects. This is implemented by two
abstractions in the `Banzai` pipeline: `ReferenceFilter` and `ReferenceParser`.
This page explains what these are, how they are used, and how you would
implement a new filter/parser pair.

NOTE: **Note:**
Each `ReferenceFilter` must have a corresponding `ReferenceParser`.

Reference parsers can be shared between filters: if two filters find
and link the same type of object (as specified by the `data-reference-type`
attribute), then we only need one reference parser for that type of domain
object.

## Banzai pipeline

The `Banzai` pipeline returns the `result` Hash once the content has passed through every filter in the pipeline.

The `result` Hash is passed to each filter for modification. This is where filters store information extracted from the content.
It contains:

- An `:output` key with the DocumentFragment or String HTML markup based on the output of the last filter in the pipeline.
- A `:reference_filter_nodes` key with the list of DocumentFragment `nodes` that are ready for processing, updated by each filter in the pipeline.
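
As a rough illustration of how a shared `result` Hash can be threaded through a sequence of filters, consider the minimal sketch below. It is not the `Banzai` implementation: the lambdas, `run_pipeline`, and the `:seen_by` key are hypothetical, and only the `:output` key mirrors the description above.

```ruby
# Hypothetical stand-ins for pipeline filters: each receives the current
# document and the shared result Hash, and returns the new document.
UPCASE_FILTER = lambda do |doc, result|
  result[:seen_by] << :upcase
  doc.upcase
end

EXCLAIM_FILTER = lambda do |doc, result|
  result[:seen_by] << :exclaim
  "#{doc}!"
end

# Run the filters in order, passing the same result Hash to every one of them.
# :output always holds the output of the last filter that ran.
def run_pipeline(input, filters)
  result = { output: input, seen_by: [] }
  filters.each do |filter|
    result[:output] = filter.call(result[:output], result)
  end
  result
end

result = run_pipeline('hello', [UPCASE_FILTER, EXCLAIM_FILTER])
puts result[:output] # prints HELLO!
```

Because every filter writes into the same Hash, a later filter can reuse whatever an earlier filter stored there.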

## Reference filters

The first way that references are handled is by reference filters. These are
the tools that identify short-code and URI references from markup documents and
transform them into structured links to the resources they represent.

For example, the class
[`Banzai::Filter::IssueReferenceFilter`](https://gitlab.com/gitlab-org/gitlab/blob/master/lib/banzai/filter/issue_reference_filter.rb)
is responsible for handling references to issues, such as
`gitlab-org/gitlab#123` and `https://gitlab.com/gitlab-org/gitlab/-/issues/200048`.

All reference filters are instances of [`HTML::Pipeline::Filter`](https://www.rubydoc.info/github/jch/html-pipeline/v1.11.0/HTML/Pipeline/Filter),
and inherit (often indirectly) from [`Banzai::Filter::ReferenceFilter`](https://gitlab.com/gitlab-org/gitlab/blob/master/lib/banzai/filter/reference_filter.rb).

`HTML::Pipeline::Filter` has a simple interface consisting of `#call`, a void
method that mutates the current document. `ReferenceFilter` provides methods
that make defining suitable `#call` methods easier. Most reference filters,
however, do not inherit from either of these classes directly, but from
[`AbstractReferenceFilter`](https://gitlab.com/gitlab-org/gitlab/blob/master/lib/banzai/filter/abstract_reference_filter.rb),
which provides a higher-level interface.

Subclasses of `AbstractReferenceFilter` generally do not override `#call`; instead,
a minimum implementation of `AbstractReferenceFilter` should define:

- `.reference_type`: The type of domain object.

  This is usually a keyword, and is used to set the `data-reference-type` attribute
  on the generated link, and is an important part of the interaction with the
  corresponding `ReferenceParser` (see below).

- `.object_class`: a reference to the class of the objects a filter refers to.

  This is used to:

  - Find the regular expressions used to find references. The class should
    include [`Referable`](https://gitlab.com/gitlab-org/gitlab/blob/master/app/models/concerns/referable.rb)
    and thus define two regular expressions: `.link_reference_pattern` and
    `.reference_pattern`, both of which should contain a named capture group
    whose name is the value of `ReferenceFilter.object_sym`.
  - Compute the `.object_name`.
  - Compute the `.object_sym` (the group name in the reference patterns).

- `.parse_symbol(string)`: parse the text value to an object identifier (`#to_i` by default).
- `#record_identifier(record)`: the inverse of `.parse_symbol`, that is, transform a domain object to an identifier (`#id` by default).
- `#url_for_object(object, parent_object)`: generate the URL for a domain object.
- `#find_object(parent_object, id)`: given the parent (usually a [`Project`](https://gitlab.com/gitlab-org/gitlab/blob/master/app/models/project.rb))
  and an identifier, find the object. For example, in a reference filter for
  merge requests, this might be `project.merge_requests.where(iid: iid)`.
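
The shape of such a minimal subclass might look like the sketch below. To keep it runnable outside GitLab, `AbstractReferenceFilterStub`, `Apple`, `Orchard`, `AppleReferenceFilter`, and the `example.com` URL are all hypothetical stand-ins. Only the method names come from the list above, and the stub reproduces just the two documented defaults (`#to_i` and `#id`).

```ruby
# Hypothetical stand-in for AbstractReferenceFilter. It only provides the
# two defaults described above, so the sketch runs without GitLab loaded.
class AbstractReferenceFilterStub
  def self.parse_symbol(string)
    string.to_i
  end

  def record_identifier(record)
    record.id
  end
end

# Hypothetical domain object and its parent.
Apple = Struct.new(:id, :name)
Orchard = Struct.new(:path, :apples)

class AppleReferenceFilter < AbstractReferenceFilterStub
  # Used for the data-reference-type attribute on generated links.
  def self.reference_type
    :apple
  end

  # The class of the objects this filter refers to.
  def self.object_class
    Apple
  end

  def url_for_object(apple, orchard)
    "https://example.com/#{orchard.path}/-/apples/#{apple.id}"
  end

  # Naive lookup: one search per reference.
  def find_object(orchard, id)
    orchard.apples.find { |apple| apple.id == id }
  end
end

orchard = Orchard.new('fruit/orchard', [Apple.new(1, 'Granny Smith')])
filter = AppleReferenceFilter.new
apple = filter.find_object(orchard, AppleReferenceFilter.parse_symbol('1'))
puts filter.url_for_object(apple, orchard)
```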

### Add a new reference prefix and filter

For reference filters for new objects, use a prefix format following the pattern
`^<object_type>#`, because:

1. Varied single-character prefixes are hard for users to track. Especially for
   lower-use object types, this can diminish value for the feature.
1. Suitable single-character prefixes are limited.
1. Following a consistent pattern allows users to infer the existence of new features.

To add a reference prefix for a new object `apple`, which has both a name and ID,
format the reference as:

- `^apple#123` for identification by ID.
- `^apple#"Granny Smith"` for identification by name.
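
One way to recognize both forms is a single regular expression with named capture groups. The sketch below is illustrative only: `APPLE_REFERENCE_PATTERN`, `parse_apple_reference`, and the group names `apple_id` and `apple_name` are hypothetical, and a real pattern would need to follow the `Referable` conventions described earlier.

```ruby
# Hypothetical pattern for the `^apple#` prefix described above.
# It matches either a numeric ID or a double-quoted name.
APPLE_REFERENCE_PATTERN = /
  \^apple\#                  # literal prefix
  (?:
    (?<apple_id>\d+)         # numeric ID form
    |
    "(?<apple_name>[^"]+)"   # quoted name form
  )
/x

# Return { id: Integer } or { name: String }, or nil when nothing matches.
def parse_apple_reference(text)
  match = APPLE_REFERENCE_PATTERN.match(text)
  return nil unless match

  if match[:apple_id]
    { id: match[:apple_id].to_i }
  else
    { name: match[:apple_name] }
  end
end

p parse_apple_reference('See ^apple#123')
p parse_apple_reference('See ^apple#"Granny Smith"')
```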

### Performance

#### Find object optimization

The default implementation is not very efficient, because we need to call
`#find_object` for each reference, which may require issuing a database query
every time. For this reason, most reference filter implementations instead use
an optimization included in `AbstractReferenceFilter`:

> `AbstractReferenceFilter` provides a lazily initialized value
> `#records_per_parent`, which is a mapping from parent object to a collection
> of domain objects.

To use this mechanism, the reference filter must implement the
method: `#parent_records(parent, set_of_identifiers)`, which must return an
enumerable of domain objects.

This allows such classes to define `#find_object` (as
[`IssuableReferenceFilter`](https://gitlab.com/gitlab-org/gitlab/blob/master/lib/banzai/filter/issuable_reference_filter.rb)
does) as:

```ruby
def find_object(parent, iid)
  records_per_parent[parent][iid]
end
```

This makes the number of queries linear in the number of projects. We only need
to implement the `parent_records` method when we call `records_per_parent` in
our reference filter.
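
The batching idea can be sketched in isolation as follows. This is not the `AbstractReferenceFilter` code: `FakeDatabase`, `BatchedLookup`, and the `Issue` struct are hypothetical, and the "database" is an in-memory array that counts how many times it is queried.

```ruby
require 'set'

# Hypothetical record and in-memory "database" that counts queries.
Issue = Struct.new(:project, :iid)

class FakeDatabase
  attr_reader :query_count

  def initialize(issues)
    @issues = issues
    @query_count = 0
  end

  # One "query" returns every requested record for a single parent.
  def issues_for(project, iids)
    @query_count += 1
    @issues.select { |issue| issue.project == project && iids.include?(issue.iid) }
  end
end

class BatchedLookup
  # references: array of [parent, iid] pairs found in the document.
  def initialize(database, references)
    @database = database
    @references = references
  end

  def parent_records(parent, iids)
    @database.issues_for(parent, iids)
  end

  # Lazily build { parent => { iid => record } } with one query per parent.
  def records_per_parent
    @records_per_parent ||= begin
      ids_per_parent = Hash.new { |hash, key| hash[key] = Set.new }
      @references.each { |parent, iid| ids_per_parent[parent] << iid }

      ids_per_parent.to_h do |parent, iids|
        [parent, parent_records(parent, iids).to_h { |record| [record.iid, record] }]
      end
    end
  end

  def find_object(parent, iid)
    records_per_parent[parent][iid]
  end
end
```

With three references spread over two parents, every `find_object` call is served from the memoized mapping, so the fake database is queried twice rather than three times.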

#### Filtering nodes optimization

Each `ReferenceFilter` would iterate over all `<a>` and `text()` nodes in a document.

Not all nodes are processed: the document is filtered so that only the nodes we want to process remain.
We skip:

- Link tags already processed by a previous filter (if they have a `gfm` class).
- Nodes with an ancestor node that we want to ignore (`ignore_ancestor_query`).
- Empty lines.
- Link tags with an empty `href` attribute.

To avoid repeating this filtering for each `ReferenceFilter`, we do it only once and store the result in the result Hash of the pipeline as `result[:reference_filter_nodes]`.

The pipeline `result` is passed to each filter for modification, so every time a `ReferenceFilter` replaces a text node or link tag, the filtered list (`reference_filter_nodes`) is updated for the next filter to use.
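
The "filter once, then reuse" pattern can be sketched with plain Ruby objects. The real filters operate on Nokogiri nodes. Here `Node`, `processable?`, and the `reference_filter_nodes` helper are hypothetical simplifications, and the ancestor check (`ignore_ancestor_query`) is omitted.

```ruby
# Hypothetical, simplified node. Real filters work on Nokogiri nodes.
Node = Struct.new(:name, :text, :css_class, :href, keyword_init: true)

# Decide whether a node is worth handing to reference filters.
def processable?(node)
  return false if node.text.to_s.strip.empty? # empty line

  if node.name == 'a'
    return false if node.css_class.to_s.split.include?('gfm') # already processed
    return false if node.href.to_s.empty?                     # empty href
  end

  true
end

# Compute the filtered list only if no earlier filter has stored it in the
# shared result Hash; otherwise reuse the stored list.
def reference_filter_nodes(result, all_nodes)
  result[:reference_filter_nodes] ||= all_nodes.select { |node| processable?(node) }
end
```

The first filter to call `reference_filter_nodes` pays for the scan. Later filters receive the same array from the `result` Hash.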

## Reference parsers

In a number of cases, as a performance optimization, we render Markdown to HTML
once, cache the result, and then present it to users from the cached value. For
example, this happens for notes, issue descriptions, and merge request
descriptions. A consequence of this is that a rendered document might refer to
a resource that some subsequent readers should not be able to see.

For example, you might create an issue, and refer to a confidential issue `#1234`,
which you have access to. This is rendered in the cached HTML as a link to
that confidential issue, with data attributes containing its ID, the ID of the
project and other confidential data. A later reader who has access to your issue
might not have permission to read issue `#1234`, and so we need to redact
these sensitive pieces of data. This is what `ReferenceParser` classes do.

A reference parser is linked to the object that it handles through the
`data-reference-type` attribute that the reference filter sets on the link.
The
[`ReferenceRedactor`](https://gitlab.com/gitlab-org/gitlab/blob/master/lib/banzai/reference_redactor.rb)
uses this attribute to compute which nodes should be visible to users:

```ruby
def nodes_visible_to_user(nodes)
  per_type = Hash.new { |h, k| h[k] = [] }
  visible = Set.new

  nodes.each do |node|
    per_type[node.attr('data-reference-type')] << node
  end

  per_type.each do |type, nodes|
    parser = Banzai::ReferenceParser[type].new(context)

    visible.merge(parser.nodes_visible_to_user(user, nodes))
  end

  visible
end
```

The key part here is `Banzai::ReferenceParser[type]`, which is used to look up
the correct reference parser for each type of domain object. This requires that
each reference parser must:

- Be placed in the `Banzai::ReferenceParser` namespace.
- Implement the `#nodes_visible_to_user(user, nodes)` method.
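
A lookup of this kind can be built from a naming convention and `const_get`. The sketch below is a hypothetical illustration, not the `Banzai::ReferenceParser` source: `ParserRegistry` and its two parser classes are invented for the example, and the camelization is done by hand so that the code runs without Active Support.

```ruby
module ParserRegistry
  # Hypothetical parsers. Each answers #nodes_visible_to_user(user, nodes).
  class IssueParser
    def nodes_visible_to_user(_user, nodes)
      nodes
    end
  end

  class MergeRequestParser
    def nodes_visible_to_user(_user, _nodes)
      []
    end
  end

  # Map a data-reference-type value such as "merge_request" to
  # MergeRequestParser by camelizing the name and looking up the constant
  # inside this namespace only.
  def self.[](type)
    class_name = type.to_s.split('_').map(&:capitalize).join + 'Parser'
    const_get(class_name, false)
  end
end

p ParserRegistry['issue']
p ParserRegistry[:merge_request]
```

In this sketch, asking for a type that has no matching class raises a `NameError`, which shows why a missing parser surfaces as an exception rather than as a silent skip.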

In practice, all reference parsers inherit from [`BaseParser`](https://gitlab.com/gitlab-org/gitlab/blob/master/lib/banzai/reference_parser/base_parser.rb), and are implemented by defining:

- `.reference_type`, which should equal `ReferenceFilter.reference_type`.
- One or more of:
  - `#nodes_visible_to_user(user, nodes)` for the finest-grained control.
  - `#can_read_reference?`, needed if `nodes_visible_to_user` is not overridden.
  - `#references_relation`, an Active Record relation for objects by ID.
  - `#nodes_user_can_reference(user, nodes)` to filter nodes directly.
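
The relationship between `#can_read_reference?` and `#nodes_visible_to_user` might be pictured as in the sketch below. Everything here is hypothetical: `BaseParserStub` is a stand-in rather than the real `BaseParser`, the two-argument `can_read_reference?(user, node)` signature is a simplification chosen for the example, and `LinkNode` and `Reader` are plain structs.

```ruby
# Hypothetical link node carrying data set by a reference filter.
LinkNode = Struct.new(:reference_type, :issue_id)
Reader = Struct.new(:name, :readable_issue_ids)

# Hypothetical stand-in for BaseParser: by default a node is visible
# when #can_read_reference? returns true for it.
class BaseParserStub
  def nodes_visible_to_user(user, nodes)
    nodes.select { |node| can_read_reference?(user, node) }
  end
end

class IssueParserSketch < BaseParserStub
  # Should match the reference_type of the corresponding filter.
  def self.reference_type
    :issue
  end

  def can_read_reference?(user, node)
    user.readable_issue_ids.include?(node.issue_id)
  end
end

nodes = [LinkNode.new('issue', 1), LinkNode.new('issue', 1234)]
reader = Reader.new('reader', [1])
visible = IssueParserSketch.new.nodes_visible_to_user(reader, nodes)
p visible.map(&:issue_id)
```

Nodes that are not returned, such as the link to the confidential issue `1234` here, are the ones a redactor would then strip of sensitive data.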

NOTE: **Note:**
Failing to implement this class for each reference type causes the
application to raise exceptions during Markdown processing.