1 files changed, 157 insertions, 0 deletions
diff --git a/doc/development/reference_processing.md b/doc/development/reference_processing.md
new file mode 100644
index 00000000000..c6c629f3314
--- /dev/null
+++ b/doc/development/reference_processing.md
@@ -0,0 +1,157 @@
+---
+description: 'An introduction to reference parsers and reference filters, and a guide to their implementation.'
+---
+
+# Reference processing
+
+[GitLab Flavored Markdown](../user/markdown.md) includes the ability to process
+references to a range of GitLab domain objects. This is implemented by two
+abstractions in the `Banzai` pipeline: `ReferenceFilter` and `ReferenceParser`.
+This page explains what these are, how they are used, and how you would
+implement a new filter/parser pair.
+
+NOTE: **Note:**
+Each `ReferenceFilter` must have a corresponding `ReferenceParser`.
+
+It is possible to share reference parsers between filters - if two filters find
+and link the same type of objects (as specified by the `data-reference-type`
+attribute), then we only need one reference parser for that type of domain
+object.
+
+## Reference filters
+
+The first way that references are handled is by reference filters. These are
+the tools that identify short-code and URI references from markup documents and
+transform them into structured links to the resources they represent.
+
+For example, the class
+[`Banzai::Filter::IssueReferenceFilter`](https://gitlab.com/gitlab-org/gitlab/blob/master/lib/banzai/filter/issue_reference_filter.rb)
+is responsible for handling references to issues, such as
+`gitlab-org/gitlab#123` and `https://gitlab.com/gitlab-org/gitlab/issues/200048`.
+
+All reference filters are instances of [`HTML::Pipeline::Filter`](https://www.rubydoc.info/github/jch/html-pipeline/v1.11.0/HTML/Pipeline/Filter),
+and inherit (often indirectly) from [`Banzai::Filter::ReferenceFilter`](https://gitlab.com/gitlab-org/gitlab/blob/master/lib/banzai/filter/reference_filter.rb).
+
+`HTML::Pipeline::Filter` has a simple interface consisting of `#call`, a void
+method that mutates the current document. `ReferenceFilter` provides methods
+that make defining suitable `#call` methods easier. Most reference filters
+however do not inherit from either of these classes directly, but from
+[`AbstractReferenceFilter`](https://gitlab.com/gitlab-org/gitlab/blob/master/lib/banzai/filter/abstract_reference_filter.rb),
+which provides a higher-level interface.
+
+Subclasses of `AbstractReferenceFilter` generally do not override `#call`; instead,
+a minimum implementation of `AbstractReferenceFilter` should define:
+
+- `.reference_type`: The type of domain object.
+  
+  This is usually a keyword, and is used to set the `data-reference-type` attribute
+  on the generated link, and is an important part of the interaction with the
+  corresponding `ReferenceParser` (see below).
+
+- `.object_class`: a reference to the class of the objects a filter refers to.
+
+  This is used to:
+
+  - Find the regular expressions used to find references. The class should
+    include [`Referable`](https://gitlab.com/gitlab-org/gitlab/blob/master/app/models/concerns/referable.rb)
+    and thus define two regular expressions: `.link_reference_pattern` and
+    `.reference_pattern`, both of which should contain a named capture group
+    named the value of `ReferenceFilter.object_sym`.
+  - Compute the `.object_name`.
+  - Compute the `.object_sym` (the group name in the reference patterns).
+
+- `.parse_symbol(string)`: parse the text value to an object identifier (`#to_i` by default).
+- `#record_identifier(record)`: the inverse of `.parse_symbol`, that is, transform a domain object to an identifier (`#id` by default).
+- `#url_for_object(object, parent_object)`: generate the URL for a domain object.
+- `#find_object(parent_object, id)`: given the parent (usually a [`Project`](https://gitlab.com/gitlab-org/gitlab/blob/master/app/models/project.rb))
+ and an identifier, find the object. For example, this in a reference filter for
+ merge requests, this might be `project.merge_requests.where(iid: iid)`.
+
+### Performance
+
+This default implementation is not very efficient, because we need to call
+`#find_object` for each reference, which may require issuing a DB query every
+time. For this reason, most reference filter implementations will instead use an
+optimization included in `AbstractReferenceFilter`:
+
+> `AbstractReferenceFilter` provides a lazily initialized value
+> `#records_per_parent`, which is a mapping from parent object to a collection
+> of domain objects.
+
+To use this mechanism, the reference filter must implement the
+method: `#parent_records(parent, set_of_identifiers)`, which must return an
+enumerable of domain objects.
+
+This allows such classes to define `#find_object` (as
+[`IssuableReferenceFilter`](https://gitlab.com/gitlab-org/gitlab/blob/master/lib/banzai/filter/issuable_reference_filter.rb)
+does) as:
+
+```ruby
+def find_object(parent, iid)
+  records_per_parent[parent][iid]
+end
+```
+
+This makes the number of queries linear in the number of projects. We only need
+to implement `parent_records` method when we call `records_per_parent` in our
+reference filter.
+
+## Reference parsers
+
+In a number of cases, as a performance optimization, we render Markdown to HTML
+once, cache the result and then present it to users from the cached value. For
+example this happens for notes, issue descriptions, and merge request
+descriptions. A consequence of this is that a rendered document might refer to
+a resource that some subsequent readers should not be able to see.
+
+For example, you might create an issue, and refer to a confidential issue `#1234`,
+which you have access to. This is rendered in the cached HTML as a link to
+that confidential issue, with data attributes containing its ID, the ID of the
+project and other confidential data. A later reader, who has access to your issue
+might not have permission to read issue `#1234`, and so we need to redact
+these sensitive pieces of data. This is what `ReferenceParser` classes do.
+
+A reference parser is linked to the object that it handles by the link
+advertising this relationship in the `data-reference-type` attribute (set by the
+reference filter). This is used by the
+[`ReferenceRedactor`](https://gitlab.com/gitlab-org/gitlab/blob/master/lib/banzai/reference_redactor.rb)
+to compute which nodes should be visible to users:
+
+```ruby
+def nodes_visible_to_user(nodes)
+  per_type = Hash.new { |h, k| h[k] = [] }
+  visible = Set.new
+
+  nodes.each do |node|
+    per_type[node.attr('data-reference-type')] << node
+  end
+
+  per_type.each do |type, nodes|
+    parser = Banzai::ReferenceParser[type].new(context)
+
+    visible.merge(parser.nodes_visible_to_user(user, nodes))
+  end
+
+  visible
+end
+```
+
+The key part here is `Banzai::ReferenceParser[type]`, which is used to look up
+the correct reference parser for each type of domain object. This requires that
+each reference parser must:
+
+- Be placed in the `Banzai::ReferenceParser` namespace.
+- Implement the `.nodes_visible_to_user(user, nodes)` method.
+
+In practice, all reference parsers inherit from [`BaseParser`](https://gitlab.com/gitlab-org/gitlab/blob/master/lib/banzai/reference_parser/base_parser.rb), and are implemented by defining:
+
+- `.reference_type`, which should equal `ReferenceFilter.reference_type`.
+- And by implementing one or more of:
+  - `#nodes_visible_to_user(user, nodes)` for finest grain control.
+  - `#can_read_reference?` needed if `nodes_visible_to_user` is not overridden.
+  - `#references_relation` an active record relation for objects by ID.
+  - `#nodes_user_can_reference(user, nodes)` to filter nodes directly.
+
+NOTE: **Note:**
+A failure to implement this class for each reference type means that the
+application will raise exceptions during Markdown processing.