diff --git a/lib/stdlib/doc/src/uri_string_usage.xml b/lib/stdlib/doc/src/uri_string_usage.xml
new file mode 100644
index 0000000000..72851096b7
--- /dev/null
+++ b/lib/stdlib/doc/src/uri_string_usage.xml
@@ -0,0 +1,370 @@
+<?xml version="1.0" encoding="utf-8" ?>
+<!DOCTYPE chapter SYSTEM "chapter.dtd">
+
+<chapter>
+ <header>
+ <copyright>
+ <year>2020</year>
+ <year>2020</year>
+ <holder>Ericsson AB. All Rights Reserved.</holder>
+ </copyright>
+ <legalnotice>
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+
+ </legalnotice>
+
+ <title>Uniform Resource Identifiers</title>
+ <prepared>Péter Dimitrov</prepared>
+ <responsible></responsible>
+ <docno></docno>
+ <approved></approved>
+ <checked></checked>
+ <date>2020-09-30</date>
+ <rev>PA1</rev>
+ <file>uri_string_usage.xml</file>
+ </header>
+ <section>
+ <title>Basics</title>
+ <p>At the time of writing this document, in October 2020, there are
+ two major standards concerning Uniform Resource Identifiers and
+ Uniform Resource Locators:</p>
+ <list type="bulleted">
+ <item><p>
+ <url href="https://www.ietf.org/rfc/rfc3986.txt">RFC 3986 - Uniform Resource
+ Identifier (URI): Generic Syntax</url></p></item>
+ <item><p>
+ <url href="https://url.spec.whatwg.org/">WHATWG URL - Living standard</url>
+ </p></item>
+ </list>
+ <p>
+ The former is a classical standard with a proper formal syntax, using the
+ so-called <url href="https://www.ietf.org/rfc/rfc2234.txt">Augmented Backus-Naur Form
+ (ABNF)</url> for describing
+ the grammar, while the latter is a living document describing the current practice,
+ that is, how a majority of Web browsers work with URIs. WHATWG URL is Web focused
+ and it has no formal grammar but a plain English description of the algorithms
+ that should be followed.</p>
+ <p>What is the difference between them, if any? They provide an overlapping
+ definition for resource identifiers and they are not compatible.
+ The <seeerl marker="stdlib:uri_string"><c>uri_string</c></seeerl> module implements
+ <url href="https://www.ietf.org/rfc/rfc3986.txt">RFC 3986</url> and the term URI will
+ be used throughout this document. A URI is an identifier, a string of characters
+ that identifies a particular resource.</p>
+ <p>
+ For a more complete problem
+ statement regarding URIs, see the
+ <url href="https://tools.ietf.org/html/draft-ruby-url-problem-01">URL Problem
+ Statement and Directions</url>.</p>
+ </section>
+
+ <section>
+ <title>What is a URI?</title>
+ <p>Let's start with what it is not. It is not the text that you type in the address
+ bar in your Web browser. Web browsers apply all kinds of heuristics to convert the
+ input into a valid URI that could be sent over the network.</p>
+ <p>A URI is an identifier consisting of a sequence of characters matching the syntax
+ rule named <c>URI</c> in
+ <url href="https://www.ietf.org/rfc/rfc3986.txt">RFC 3986</url>.
+ </p>
+ <p>It is crucial to clarify that a <i>character</i> is a symbol that is displayed on
+ a terminal or written to paper and should not be confused with its internal
+ representation.</p>
+ <p>More specifically, a URI is a sequence of characters from a
+ subset of the US ASCII character set. The generic URI syntax consists of a
+ hierarchical sequence of components referred to as the scheme, authority,
+ path, query, and fragment. There is a formal description for
+ each of these components in
+ <url href="https://www.ietf.org/rfc/rfc2234.txt">ABNF</url> notation in
+ <url href="https://www.ietf.org/rfc/rfc3986.txt">RFC 3986</url>:</p>
+ <pre>
+ URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
+ hier-part = "//" authority path-abempty
+ / path-absolute
+ / path-rootless
+ / path-empty
+ scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
+ authority = [ userinfo "@" ] host [ ":" port ]
+ userinfo = *( unreserved / pct-encoded / sub-delims / ":" )
+
+ reserved = gen-delims / sub-delims
+ gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
+ sub-delims = "!" / "$" / "&amp;" / "'" / "(" / ")"
+ / "*" / "+" / "," / ";" / "="
+
+ unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
+ </pre>
+ </section>
+
+ <section>
+ <title>The uri_string module</title>
+ <p>As producing and consuming standard URIs can get quite complex, Erlang/OTP
+ provides
+ a module, <seeerl marker="stdlib:uri_string"><c>uri_string</c></seeerl>, to handle all the most difficult operations such as parsing,
+ recomposing, normalizing and resolving URIs against a base URI.
+ </p>
+ <p>The API functions in <seeerl marker="stdlib:uri_string"><c>uri_string</c></seeerl>
+ work on two basic data types
+ <seetype marker="uri_string#uri_string"><c>uri_string()</c></seetype> and
+ <seetype marker="uri_string#uri_map"><c>uri_map()</c></seetype>.
+ <seetype marker="uri_string#uri_string"><c>uri_string()</c></seetype> represents a
+ standard URI, while
+ <seetype marker="uri_string#uri_map"><c>uri_map()</c></seetype> is a wider datatype,
+ that can represent URI components using
+ <seeguide marker="unicode_usage#what-unicode-is">Unicode</seeguide> characters.
+ <seetype marker="uri_string#uri_map"><c>uri_map()</c></seetype>
+ is a convenient choice for enabling
+ operations such as producing standards-compliant URIs out of components that have
+ special or <seeguide marker="unicode_usage#what-unicode-is">Unicode</seeguide>
+ characters. This is easier to explain with an example.
+ </p>
+ <p>Let's say that we would like to create the following URI and send it over the
+ network: <c>http://cities/örebro?foo bar</c>. This is not a valid URI as it contains
+ characters that are not allowed in a URI such as "ö" and the space. We can verify
+ this by parsing the URI:
+ </p>
+ <pre>
+ 1> uri_string:parse("http://cities/örebro?foo bar").
+ {error,invalid_uri,":"}
+ </pre>
+ <p>The URI parser tries all possible combinations to interpret the input and fails
+ at the last attempt when it encounters the colon character <c>":"</c>. Note that
+ the initial fault occurs when the parser attempts to interpret the character
+ <c>"ö"</c> and after a failure back-tracks to the point where it has another
+ possible parsing alternative.</p>
+ <p>The proper way to solve this problem is to use
+ <seemfa marker="uri_string#recompose/1"><c>uri_string:recompose/1</c></seemfa>
+ with a <seetype marker="uri_string#uri_map"><c>uri_map()</c></seetype> as input:</p>
+ <pre>
+ 2> uri_string:recompose(#{scheme => "http", host => "cities", path => "/örebro",
+ query => "foo bar"}).
+ "http://cities/%C3%B6rebro?foo%20bar"
+ </pre>
+ <p>The result is a valid URI where all the special characters are encoded as defined
+ by the standard. Applying
+ <seemfa marker="uri_string#parse/1"><c>uri_string:parse/1</c></seemfa> and
+ <seemfa marker="uri_string#percent_decode/1"><c>uri_string:percent_decode/1</c></seemfa>
+ on the URI returns the original input:
+ </p>
+ <pre>
+ 3> uri_string:percent_decode(uri_string:parse("http://cities/%C3%B6rebro?foo%20bar")).
+ #{host => "cities",path => "/örebro",query => "foo bar",
+ scheme => "http"}
+ </pre>
+ <p>This symmetric property is heavily used in our property test suite.
+ </p>
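+ <p>The round trip described above can be sketched as a small check. This is only an
+ illustration: the module <c>uri_roundtrip</c> and the function <c>roundtrip/1</c> are
+ hypothetical names, not part of <c>uri_string</c>, and the sketch assumes an
+ Erlang/OTP release that provides <c>uri_string:percent_decode/1</c>.</p>

```erlang
-module(uri_roundtrip).
-export([roundtrip/1]).

%% Hypothetical helper, not part of uri_string.
%% Recomposes a uri_map() into a URI string, then parses and percent-decodes
%% the result. Returns true when the components come back unchanged.
roundtrip(URIMap) ->
    case uri_string:recompose(URIMap) of
        {error, _Reason, _Term} ->
            false;
        URI ->
            case uri_string:parse(URI) of
                {error, _, _} -> false;
                Parsed -> uri_string:percent_decode(Parsed) =:= URIMap
            end
    end.
```

+ <p>With the map from the example above, written with <c>"/\x{F6}rebro"</c> as the path
+ to stay independent of the source file encoding, the check is expected to succeed.</p>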
+ </section>
+
+ <section>
+ <title>Percent-encoding</title>
+ <p>As you have seen in the previous section, a standard URI can only contain a strict
+ subset of the US ASCII character set; moreover, the allowed set of characters is not
+ the same in the different URI components. Percent-encoding is a mechanism to
+ represent a data octet in a component when that octet's corresponding character
+ is outside of
+ the allowed set or is being used as a delimiter. This is what you see when <c>"ö"</c>
+ is encoded as <c>%C3%B6</c> and <c>space</c> as <c>%20</c>.
+ Most of the API functions are
+ expecting UTF-8 encoding when handling percent-encoded triplets. The UTF-8 encoding
+ of the <seeguide marker="unicode_usage#what-unicode-is">Unicode</seeguide>
+ character <c>"ö"</c> is two octets: <c>0xC3 0xB6</c>.
+ The character <c>space</c> is in the first 128 characters of
+ <seeguide marker="unicode_usage#what-unicode-is">Unicode</seeguide> and it is encoded
+ using a single octet <c>0x20</c>.</p>
+ <note><p><seeguide marker="unicode_usage#what-unicode-is">Unicode</seeguide>
+ is backward compatible with ASCII: the UTF-8 encoding of the first 128
+ characters is the same binary value as in ASCII.
+ </p></note>
+ <p><marker id="percent_encoding"></marker>
+ It is a major source of confusion exactly which characters will be
+ percent-encoded. In order to make it easier to answer this question the library
+ provides a utility function,
+ <seemfa marker="uri_string#allowed_characters/0"><c>uri_string:allowed_characters/0
+ </c></seemfa>,
+ that lists the allowed set of characters in each major
+ URI component, and also in the most important standard character sets.
+ </p>
+ <pre>
+ 1> uri_string:allowed_characters().
+ <![CDATA[[{scheme,
+ "+-.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"},
+ {userinfo,
+ "!$%&'()*+,-.0123456789:;=ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz~"},
+ {host,
+ "!$&'()*+,-.0123456789:;=ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz~"},
+ {ipv4,".0123456789"},
+ {ipv6,".0123456789:ABCDEFabcdef"},
+ {regname,
+ "!$%&'()*+,-.0123456789;=ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz~"},
+ {path,
+ "!$%&'()*+,-./0123456789:;=@ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz~"},
+ {query,
+ "!$%&'()*+,-./0123456789:;=?@ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz~"},
+ {fragment,
+ "!$%&'()*+,-./0123456789:;=?@ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz~"},
+ {reserved,"!#$&'()*+,/:;=?@[]"},
+ {unreserved,
+ "-.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz~"}] ]]>
+ </pre>
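+ <p>The returned list can also be queried programmatically. The following is a minimal
+ sketch; the module <c>uri_allowed</c> and the function <c>allowed_in/2</c> are
+ hypothetical names introduced for illustration only.</p>

```erlang
-module(uri_allowed).
-export([allowed_in/2]).

%% Hypothetical helper, not part of uri_string.
%% Returns true when Char may appear unencoded in Component, according to
%% the list returned by uri_string:allowed_characters/0. Unknown component
%% names yield false.
allowed_in(Component, Char) ->
    Allowed = proplists:get_value(Component, uri_string:allowed_characters(), []),
    lists:member(Char, Allowed).
```

+ <p>Based on the list above, <c>allowed_in(host, $#)</c> should return <c>false</c>
+ and <c>allowed_in(path, $/)</c> should return <c>true</c>.</p>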
+ <p>If a URI component has a character that is not allowed, it will be
+ percent-encoded when the URI is produced:
+ </p>
+ <pre>
+ 2> uri_string:recompose(#{scheme => "https", host => "local#host", path => ""}).
+ "https://local%23host"
+ </pre>
+ <p>Consuming a URI containing percent-encoded triplets can take many steps. The
+ following example shows how to handle an input URI that is not normalized and
+ contains multiple percent-encoded triplets.
+ First, the input <seetype marker="uri_string#uri_string"><c>uri_string()</c></seetype>
+ is to be parsed into a <seetype marker="uri_string#uri_map"><c>uri_map()</c></seetype>.
+ The parsing only splits the URI into its components without doing any decoding:
+ </p>
+ <pre>
+ 3> uri_string:parse("http://%6C%6Fcal%23host/%F6re%26bro%20").
+ #{host => "%6C%6Fcal%23host",path => "/%F6re%26bro%20",
+ scheme => "http"}
+ </pre>
+ <p>The input is a valid URI but how can you decode those
+ percent-encoded octets? You can try to normalize the input with
+ <seemfa marker="uri_string#normalize/1"><c>uri_string:normalize/1</c></seemfa>. The
+ normalize operation decodes those
+ percent-encoded triplets that correspond to a character in the unreserved set.
+ Normalization is a safe, idempotent operation that converts a URI into its
+ canonical form:</p>
+ <pre>
+ 4> uri_string:normalize("http://%6C%6Fcal%23host/%F6re%26bro%20").
+ "http://local%23host/%F6re%26bro%20"
+ 5> uri_string:normalize("http://%6C%6Fcal%23host/%F6re%26bro%20", [return_map]).
+ #{host => "local%23host",path => "/%F6re%26bro%20",
+ scheme => "http"}
+ </pre>
+ <p>There are still a few percent-encoded triplets left in the output. At this point,
+ when the URI is already parsed, it is safe to apply application specific decoding on
+ the remaining character triplets. Erlang/OTP provides a function,
+ <seemfa marker="uri_string#percent_decode/1"><c>uri_string:percent_decode/1</c></seemfa>
+ for raw percent decoding
+ that you can use on the host and path components, or on the whole map:
+ </p>
+ <pre>
+ 6> uri_string:percent_decode("local%23host").
+ "local#host"
+ 7> uri_string:percent_decode("/%F6re%26bro%20").
+ <![CDATA[{error,invalid_utf8,<<"/öre&bro ">>}]]>
+ 8> uri_string:percent_decode(#{host => "local%23host",path => "/%F6re%26bro%20",
+ scheme => "http"}).
+ <![CDATA[{error,{invalid,{path,{invalid_utf8,<<"/öre&bro ">>}}}}]]>
+ </pre>
+ <p>The <c>host</c> was successfully decoded but the path contains at least one
+ character with
+ non-UTF-8 encoding. In order to be able to decode this, you have to make assumptions
+ about the encoding used in these triplets. The most obvious choice is
+ <i>latin-1</i>, so you can try
+ <seemfa marker="uri_string#transcode/2"><c>uri_string:transcode/2</c></seemfa>, to
+ transcode the path to UTF-8 and run the percent-decode operation on the
+ transcoded string:
+ </p>
+ <pre>
+ 9> uri_string:transcode("/%F6re%26bro%20", [{in_encoding, latin1}]).
+ "/%C3%B6re%26bro%20"
+ 10> uri_string:percent_decode("/%C3%B6re%26bro%20").
+ <![CDATA["/öre&bro "]]>
+ </pre>
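+ <p>The two attempts above can be combined into one helper. This is a sketch only: the
+ module <c>uri_decode</c> and the function <c>decode_component/1</c> are hypothetical
+ names, and falling back to latin-1 is an assumption about the input that an
+ application has to make for itself.</p>

```erlang
-module(uri_decode).
-export([decode_component/1]).

%% Hypothetical helper, not part of uri_string.
%% Tries raw percent-decoding first. If the decoded octets are not valid
%% UTF-8, assumes the triplets are latin-1 encoded (an assumption),
%% transcodes the component to UTF-8 and decodes again.
decode_component(Component) ->
    case uri_string:percent_decode(Component) of
        {error, invalid_utf8, _} ->
            case uri_string:transcode(Component, [{in_encoding, latin1}]) of
                {error, _, _} = Error -> Error;
                Transcoded -> uri_string:percent_decode(Transcoded)
            end;
        Result ->
            Result
    end.
```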
+ <p>It is important to emphasize that it is not safe to apply
+ <seemfa marker="uri_string#percent_decode/1"><c>uri_string:percent_decode/1</c></seemfa>
+ directly on an input URI:
+ </p>
+ <pre>
+ 11> uri_string:percent_decode("http://%6C%6Fcal%23host/%C3%B6re%26bro%20").
+ <![CDATA["http://local#host/öre&bro "
+ 12> uri_string:parse("http://local#host/öre&bro ").]]>
+ {error,invalid_uri,":"}
+ </pre>
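+ <p>A safer order is to split the URI into components first and decode afterwards. The
+ following sketch assumes that UTF-8 is used in the percent-encoded triplets; the
+ module <c>uri_consume</c> and the function <c>consume/1</c> are hypothetical names.</p>

```erlang
-module(uri_consume).
-export([consume/1]).

%% Hypothetical helper, not part of uri_string.
%% Normalizes the input into a uri_map() first, so that the component
%% boundaries are fixed, and only then percent-decodes the components.
%% Errors from either step are returned unchanged.
consume(URI) ->
    case uri_string:normalize(URI, [return_map]) of
        {error, _Reason, _Term} = Error -> Error;
        URIMap -> uri_string:percent_decode(URIMap)
    end.
```

+ <p>The decoded components are for use inside the application; they are not meant to
+ be concatenated back into a URI by hand.</p>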
+ <note><p>Percent-encoding is implemented in
+ <seemfa marker="uri_string#recompose/1"><c>uri_string:recompose/1</c></seemfa>
+ and it happens when converting a
+ <seetype marker="uri_string#uri_map"><c>uri_map()</c></seetype>
+ into a <seetype marker="uri_string#uri_string"><c>uri_string()</c></seetype>.
+ There is no raw percent-encoding function, as percent-encoding
+ shall be applied on the component level using different sets of allowed characters.
+ Applying percent-encoding directly on an input URI would not be safe, just as in
+ the case of
+ <seemfa marker="uri_string#percent_decode/1"><c>uri_string:percent_decode/1</c></seemfa>,
+ because the output could be an invalid URI.
+ </p>
+ </note>
+ </section>
+
+ <section>
+ <title>Normalization</title>
+ <p>Normalization is the operation of converting the input URI into a <i>canonical</i>
+ form and keeping the reference to the same underlying resource. The most common
+ application of normalization is determining whether two URIs are equivalent
+ without accessing their referenced resources.</p>
+ <p>Normalization has 6 distinct steps. First the input URI is parsed into an
+ intermediate form that can handle
+ <seeguide marker="unicode_usage#what-unicode-is">Unicode</seeguide> characters.
+ This datatype is the
+ <seetype marker="uri_string#uri_map"><c>uri_map()</c></seetype>, that can hold the
+ components of the URI in map elements of type
+ <seetype marker="unicode#chardata"><c>unicode:chardata()</c></seetype>.
+ After having the intermediate form, a sequence of
+ normalization algorithms is applied to the individual URI components:</p>
+ <taglist>
+ <tag>Case normalization</tag>
+ <item>
+ <p>Converts the <c>scheme</c> and <c>host</c> components
+ to lower case as they are not case sensitive.</p>
+ </item>
+ <tag>Percent-encoding normalization</tag>
+ <item>
+ <p>Decodes percent-encoded triplets that
+ correspond to characters in the unreserved set.</p>
+ </item>
+ <tag>Scheme-based normalization</tag>
+ <item>
+ <p>Applies rules for the schemes http, https,
+ ftp, ssh, sftp and tftp.</p>
+ </item>
+ <tag>Path segment normalization</tag>
+ <item>
+ <p>Converts the path into a canonical form.</p>
+ </item>
+ </taglist>
+ <p>After these steps, the intermediate data structure, a
+ <seetype marker="uri_string#uri_map"><c>uri_map()</c></seetype>,
+ is fully normalized. The last step is applying
+ <seemfa marker="uri_string#recompose/1"><c>uri_string:recompose/1</c></seemfa>
+ that converts the intermediate structure into a valid canonical URI string.</p>
+ <p>Notice the order: the
+ <seemfa marker="uri_string#normalize/2"><c>uri_string:normalize(URIMap, [return_map])</c></seemfa> call that we
+ used many times in this user guide is a shortcut in the normalization process. It
+ returns the intermediate data structure, allowing us to inspect it and apply
+ further decoding on the remaining percent-encoded triplets.</p>
+ <pre>
+ 13> uri_string:normalize("hTTp://LocalHost:80/%c3%B6rebro/a/../b").
+ "http://localhost/%C3%B6rebro/b"
+ 14> uri_string:normalize("hTTp://LocalHost:80/%c3%B6rebro/a/../b", [return_map]).
+ #{host => "localhost",path => "/%C3%B6rebro/b",
+ scheme => "http"}
+ </pre>
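+ <p>The equivalence check mentioned at the beginning of this section can be sketched as
+ follows. The module <c>uri_equiv</c> and the function <c>equivalent/2</c> are
+ hypothetical names, and the sketch assumes that both inputs are of the same type
+ (both lists or both binaries), as the normalized forms are compared directly.</p>

```erlang
-module(uri_equiv).
-export([equivalent/2]).

%% Hypothetical helper, not part of uri_string.
%% Two URIs are treated as equivalent when their normalized forms are
%% identical. An input that cannot be normalized is never equivalent to
%% anything.
equivalent(A, B) ->
    case {uri_string:normalize(A), uri_string:normalize(B)} of
        {{error, _, _}, _} -> false;
        {_, {error, _, _}} -> false;
        {NormA, NormB} -> NormA =:= NormB
    end.
```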
+ </section>
+
+ <section>
+ <title>Special considerations</title>
+ <p>The current URI implementation provides support for producing and consuming
+ standard URIs. The API is not meant to be directly exposed in a Web
+ browser's address bar where users can basically enter free text. Application
+ designers shall implement proper heuristics to map the input into a parsable URI.</p>
+ </section>
+
+</chapter>