Diffstat (limited to 'doc/development/reference_processing.md')
-rw-r--r-- | doc/development/reference_processing.md | 157 |
1 files changed, 157 insertions, 0 deletions
diff --git a/doc/development/reference_processing.md b/doc/development/reference_processing.md
new file mode 100644
index 00000000000..c6c629f3314
--- /dev/null
+++ b/doc/development/reference_processing.md
@@ -0,0 +1,157 @@
+---
+description: 'An introduction to reference parsers and reference filters, and a guide to their implementation.'
+---
+
+# Reference processing
+
+[GitLab Flavored Markdown](../user/markdown.md) includes the ability to process
+references to a range of GitLab domain objects. This is implemented by two
+abstractions in the `Banzai` pipeline: `ReferenceFilter` and `ReferenceParser`.
+This page explains what these are, how they are used, and how you would
+implement a new filter/parser pair.
+
+NOTE: **Note:**
+Each `ReferenceFilter` must have a corresponding `ReferenceParser`.
+
+It is possible to share reference parsers between filters - if two filters find
+and link the same type of objects (as specified by the `data-reference-type`
+attribute), then we only need one reference parser for that type of domain
+object.
+
+## Reference filters
+
+The first way that references are handled is by reference filters. These are
+the tools that identify short-code and URI references from markup documents and
+transform them into structured links to the resources they represent.
+
+For example, the class
+[`Banzai::Filter::IssueReferenceFilter`](https://gitlab.com/gitlab-org/gitlab/blob/master/lib/banzai/filter/issue_reference_filter.rb)
+is responsible for handling references to issues, such as
+`gitlab-org/gitlab#123` and `https://gitlab.com/gitlab-org/gitlab/issues/200048`.
+
+All reference filters are instances of [`HTML::Pipeline::Filter`](https://www.rubydoc.info/github/jch/html-pipeline/v1.11.0/HTML/Pipeline/Filter),
+and inherit (often indirectly) from [`Banzai::Filter::ReferenceFilter`](https://gitlab.com/gitlab-org/gitlab/blob/master/lib/banzai/filter/reference_filter.rb).
+
+`HTML::Pipeline::Filter` has a simple interface consisting of `#call`, a void
+method that mutates the current document. `ReferenceFilter` provides methods
+that make defining suitable `#call` methods easier. Most reference filters,
+however, do not inherit from either of these classes directly, but from
+[`AbstractReferenceFilter`](https://gitlab.com/gitlab-org/gitlab/blob/master/lib/banzai/filter/abstract_reference_filter.rb),
+which provides a higher-level interface.
+
+Subclasses of `AbstractReferenceFilter` generally do not override `#call`; instead,
+a minimal implementation of `AbstractReferenceFilter` should define:
+
+- `.reference_type`: The type of domain object.
+
+  This is usually a keyword. It is used to set the `data-reference-type` attribute
+  on the generated link, and is an important part of the interaction with the
+  corresponding `ReferenceParser` (see below).
+
+- `.object_class`: a reference to the class of the objects a filter refers to.
+
+  This is used to:
+
+  - Look up the regular expressions used to find references. The class should
+    include [`Referable`](https://gitlab.com/gitlab-org/gitlab/blob/master/app/models/concerns/referable.rb)
+    and thus define two regular expressions: `.link_reference_pattern` and
+    `.reference_pattern`, both of which should contain a named capture group
+    named after the value of `ReferenceFilter.object_sym`.
+  - Compute the `.object_name`.
+  - Compute the `.object_sym` (the group name in the reference patterns).
+
+- `.parse_symbol(string)`: parse the text value to an object identifier (`#to_i` by default).
+- `#record_identifier(record)`: the inverse of `.parse_symbol`, that is, transform a domain object to an identifier (`#id` by default).
+- `#url_for_object(object, parent_object)`: generate the URL for a domain object.
+- `#find_object(parent_object, id)`: given the parent (usually a [`Project`](https://gitlab.com/gitlab-org/gitlab/blob/master/app/models/project.rb))
+  and an identifier, find the object.
+  For example, in a reference filter for
+  merge requests, this might be `project.merge_requests.find_by(iid: iid)`.
+
+### Performance
+
+This default implementation is not very efficient, because we need to call
+`#find_object` for each reference, which may require issuing a DB query every
+time. For this reason, most reference filter implementations will instead use an
+optimization included in `AbstractReferenceFilter`:
+
+> `AbstractReferenceFilter` provides a lazily initialized value
+> `#records_per_parent`, which is a mapping from parent object to a collection
+> of domain objects.
+
+To use this mechanism, the reference filter must implement the
+method `#parent_records(parent, set_of_identifiers)`, which must return an
+enumerable of domain objects.
+
+This allows such classes to define `#find_object` (as
+[`IssuableReferenceFilter`](https://gitlab.com/gitlab-org/gitlab/blob/master/lib/banzai/filter/issuable_reference_filter.rb)
+does) as:
+
+```ruby
+def find_object(parent, iid)
+  records_per_parent[parent][iid]
+end
+```
+
+This makes the number of queries linear in the number of projects. We only need
+to implement the `parent_records` method when we call `records_per_parent` in our
+reference filter.
+
+## Reference parsers
+
+In a number of cases, as a performance optimization, we render Markdown to HTML
+once, cache the result and then present it to users from the cached value. For
+example, this happens for notes, issue descriptions, and merge request
+descriptions. A consequence of this is that a rendered document might refer to
+a resource that some subsequent readers should not be able to see.
+
+For example, you might create an issue, and refer to a confidential issue `#1234`,
+which you have access to. This is rendered in the cached HTML as a link to
+that confidential issue, with data attributes containing its ID, the ID of the
+project and other confidential data.
+A later reader who has access to your issue
+might not have permission to read issue `#1234`, and so we need to redact
+these sensitive pieces of data. This is what `ReferenceParser` classes do.
+
+A reference parser is linked to the object that it handles by the link
+advertising this relationship in the `data-reference-type` attribute (set by the
+reference filter). This is used by the
+[`ReferenceRedactor`](https://gitlab.com/gitlab-org/gitlab/blob/master/lib/banzai/reference_redactor.rb)
+to compute which nodes should be visible to users:
+
+```ruby
+def nodes_visible_to_user(nodes)
+  per_type = Hash.new { |h, k| h[k] = [] }
+  visible = Set.new
+
+  nodes.each do |node|
+    per_type[node.attr('data-reference-type')] << node
+  end
+
+  per_type.each do |type, nodes|
+    parser = Banzai::ReferenceParser[type].new(context)
+
+    visible.merge(parser.nodes_visible_to_user(user, nodes))
+  end
+
+  visible
+end
+```
+
+The key part here is `Banzai::ReferenceParser[type]`, which is used to look up
+the correct reference parser for each type of domain object. This requires that
+each reference parser must:
+
+- Be placed in the `Banzai::ReferenceParser` namespace.
+- Implement the `#nodes_visible_to_user(user, nodes)` method.
+
+In practice, all reference parsers inherit from [`BaseParser`](https://gitlab.com/gitlab-org/gitlab/blob/master/lib/banzai/reference_parser/base_parser.rb), and are implemented by defining:
+
+- `.reference_type`, which should equal `ReferenceFilter.reference_type`.
+- And by implementing one or more of:
+  - `#nodes_visible_to_user(user, nodes)` for the finest-grained control.
+  - `#can_read_reference?`, needed if `nodes_visible_to_user` is not overridden.
+  - `#references_relation`, an Active Record relation for looking up objects by ID.
+  - `#nodes_user_can_reference(user, nodes)` to filter nodes directly.
+
+NOTE: **Note:**
+Failing to implement a parser class for each reference type means that the
+application will raise exceptions during Markdown processing.
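
The per-type parser lookup described in the patch above can be illustrated outside of GitLab with a small, self-contained sketch. Everything here is hypothetical and only mirrors the shape of the real classes: `Node` stands in for a rendered HTML node, `SketchParsers` for the `Banzai::ReferenceParser` namespace, and the permission rule in `IssueParser` is invented for the example. It is not the actual `BaseParser` implementation.

```ruby
require 'set'

# Hypothetical stand-in for a rendered link node (the real code uses HTML nodes).
Node = Struct.new(:reference_type, :record_id) do
  def attr(name)
    name == 'data-reference-type' ? reference_type : nil
  end
end

# Hypothetical namespace playing the role of Banzai::ReferenceParser.
module SketchParsers
  REGISTRY = {}

  # Look up the parser class for a reference type; raises KeyError if none exists.
  def self.[](type)
    REGISTRY.fetch(type.to_s)
  end

  class BaseParser
    # Assumed registration mechanism: setting the type registers the subclass.
    def self.reference_type=(type)
      SketchParsers::REGISTRY[type.to_s] = self
    end

    # Default behaviour: keep only the nodes the user may read.
    def nodes_visible_to_user(user, nodes)
      nodes.select { |node| can_read_reference?(user, node) }
    end
  end

  class IssueParser < BaseParser
    self.reference_type = :issue

    CONFIDENTIAL_IDS = [1234].freeze # invented data for the example

    # Invented rule: only a :maintainer may read confidential issues.
    def can_read_reference?(user, node)
      !CONFIDENTIAL_IDS.include?(node.record_id) || user == :maintainer
    end
  end
end

# Group nodes by reference type, then ask each type's parser what is visible.
def visible_nodes(user, nodes)
  per_type = nodes.group_by { |node| node.attr('data-reference-type') }

  per_type.each_with_object(Set.new) do |(type, typed_nodes), visible|
    parser = SketchParsers[type].new
    visible.merge(parser.nodes_visible_to_user(user, typed_nodes))
  end
end
```

In this sketch, a node whose `reference_type` has no registered parser makes `SketchParsers[type]` raise a `KeyError`, which loosely parallels the note above that a missing parser class leads to exceptions during Markdown processing.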