1 files changed, 370 insertions, 0 deletions
diff --git a/lib/stdlib/doc/src/uri_string_usage.xml b/lib/stdlib/doc/src/uri_string_usage.xml
new file mode 100644
index 0000000000..72851096b7
--- /dev/null
+++ b/lib/stdlib/doc/src/uri_string_usage.xml
@@ -0,0 +1,370 @@
+<?xml version="1.0" encoding="utf-8" ?>
+<!DOCTYPE chapter SYSTEM "chapter.dtd">
+
+<chapter>
+  <header>
+    <copyright>
+      <year>2020</year>
+      <year>2020</year>
+      <holder>Ericsson AB. All Rights Reserved.</holder>
+    </copyright>
+    <legalnotice>
+      Licensed under the Apache License, Version 2.0 (the "License");
+      you may not use this file except in compliance with the License.
+      You may obtain a copy of the License at
+
+          http://www.apache.org/licenses/LICENSE-2.0
+
+      Unless required by applicable law or agreed to in writing, software
+      distributed under the License is distributed on an "AS IS" BASIS,
+      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+      See the License for the specific language governing permissions and
+      limitations under the License.
+
+    </legalnotice>
+
+    <title>Uniform Resource Identifiers</title>
+    <prepared>Péter Dimitrov</prepared>
+    <responsible></responsible>
+    <docno></docno>
+    <approved></approved>
+    <checked></checked>
+    <date>2020-09-30</date>
+    <rev>PA1</rev>
+    <file>uri_string_usage.xml</file>
+  </header>
+  <section>
+    <title>Basics</title>
+    <p>At the time of writing this document, in October 2020, there are
+    two major standards concerning Uniform Resource Identifiers and
+    Uniform Resource Locators:</p>
+    <list type="bulleted">
+      <item><p>
+	<url href="https://www.ietf.org/rfc/rfc3986.txt">RFC 3986 - Uniform Resource
+      Identifier (URI): Generic Syntax</url></p></item>
+      <item><p>
+	<url href="https://url.spec.whatwg.org/">WHAT WG URL - Living standard</url>
+      </p></item>
+    </list>
+    <p>
+    The former is a classical standard with a proper formal syntax, using the
+    so-called <url href="https://www.ietf.org/rfc/rfc2234.txt">Augmented Backus-Naur Form
+    (ABNF)</url> for describing
+    the grammar, while the latter is a living document describing the current practice,
+    that is, how a majority of Web browsers work with URIs. WHAT WG URL is Web focused
+    and it has no formal grammar but a plain English description of the algorithms
+    that should be followed.</p>
+    <p>What is the difference between them, if any? Their definitions of
+    resource identifiers overlap, but the two standards are not compatible.
+    The <seeerl marker="stdlib:uri_string"><c>uri_string</c></seeerl> module implements
+    <url href="https://www.ietf.org/rfc/rfc3986.txt">RFC 3986</url> and the term URI will
+    be used throughout this document. A URI is an identifier, a string of characters
+    that identifies a particular resource.</p>
+    <p>
+    For a more complete problem
+    statement regarding URIs, see the
+    <url href="https://tools.ietf.org/html/draft-ruby-url-problem-01">URL Problem
+    Statement and Directions</url>.</p>
+  </section>
+
+  <section>
+    <title>What is a URI?</title>
+    <p>Let's start with what it is not. It is not the text that you type in the address
+    bar in your Web browser. Web browsers apply all sorts of heuristics to convert the
+    input into a valid URI that could be sent over the network.</p>
+    <p>A URI is an identifier consisting of a sequence of characters matching the syntax
+    rule named <c>URI</c> in
+    <url href="https://www.ietf.org/rfc/rfc3986.txt">RFC 3986</url>.
+    </p>
+    <p>It is crucial to clarify that a <i>character</i> is a symbol that is displayed on
+    a terminal or written to paper and should not be confused with its internal
+    representation.</p>
+    <p>More specifically, a URI is a sequence of characters from a
+    subset of the US ASCII character set. The generic URI syntax consists of a
+    hierarchical sequence of components referred to as the scheme, authority,
+    path, query, and fragment. There is a formal description for
+    each of these components in
+    <url href="https://www.ietf.org/rfc/rfc2234.txt">ABNF</url> notation in
+    <url href="https://www.ietf.org/rfc/rfc3986.txt">RFC 3986</url>:</p>
+    <pre>
+    URI         = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
+    hier-part   = "//" authority path-abempty
+                   / path-absolute
+                   / path-rootless
+                   / path-empty
+    scheme      = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
+    authority   = [ userinfo "@" ] host [ ":" port ]
+    userinfo    = *( unreserved / pct-encoded / sub-delims / ":" )
+
+    reserved    = gen-delims / sub-delims
+    gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"
+    sub-delims  = "!" / "$" / "&amp;" / "'" / "(" / ")"
+                / "*" / "+" / "," / ";" / "="
+
+    unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
+    </pre>
+  </section>
+
+  <section>
+    <title>The uri_string module</title>
+    <p>As producing and consuming standard URIs can get quite complex, Erlang/OTP
+    provides a module,
+    <seeerl marker="stdlib:uri_string"><c>uri_string</c></seeerl>, to handle the
+    most difficult operations such as parsing,
+    recomposing, normalizing and resolving URIs against a base URI.
+    </p>
+    <p>The API functions in <seeerl marker="stdlib:uri_string"><c>uri_string</c></seeerl>
+    work on two basic data types:
+    <seetype marker="uri_string#uri_string"><c>uri_string()</c></seetype> and
+    <seetype marker="uri_string#uri_map"><c>uri_map()</c></seetype>.
+    <seetype marker="uri_string#uri_string"><c>uri_string()</c></seetype> represents a
+    standard URI, while
+    <seetype marker="uri_string#uri_map"><c>uri_map()</c></seetype> is a wider datatype,
+    that can represent URI components using
+    <seeguide marker="unicode_usage#what-unicode-is">Unicode</seeguide> characters.
+    <seetype marker="uri_string#uri_map"><c>uri_map()</c></seetype>
+    is a convenient choice for enabling
+    operations such as producing standard compliant URIs out of components that have
+    special or <seeguide marker="unicode_usage#what-unicode-is">Unicode</seeguide>
+    characters. It is easier to explain this by an example.
+    </p>
+    <p>Let's say that we would like to create the following URI and send it over the
+    network: <c>http://cities/örebro?foo bar</c>. This is not a valid URI as it contains
+    characters that are not allowed in a URI such as "ö" and the space. We can verify
+    this by parsing the URI:
+  </p>
+  <pre>
+  1> uri_string:parse("http://cities/örebro?foo bar").
+  {error,invalid_uri,":"}
+  </pre>
+  <p>The URI parser tries all possible combinations to interpret the input and fails
+  at the last attempt when it encounters the colon character <c>":"</c>. Note that
+  the initial fault occurs when the parser attempts to interpret the character
+  <c>"ö"</c>; after that failure it backtracks to the point where it has another
+  possible parsing alternative.</p>
+  <p>The proper way to solve this problem is to use
+  <seemfa marker="uri_string#recompose/1"><c>uri_string:recompose/1</c></seemfa>
+  with a <seetype marker="uri_string#uri_map"><c>uri_map()</c></seetype> as input:</p>
+  <pre>
+  2> uri_string:recompose(#{scheme => "http", host => "cities", path => "/örebro",
+  query => "foo bar"}).
+  "http://cities/%C3%B6rebro?foo%20bar"
+  </pre>
+  <p>The result is a valid URI where all the special characters are encoded as defined
+  by the standard. Applying
+  <seemfa marker="uri_string#parse/1"><c>uri_string:parse/1</c></seemfa> and
+  <seemfa marker="uri_string#percent_decode/1"><c>uri_string:percent_decode/1</c></seemfa>
+  on the URI returns the original input:
+  </p>
+  <pre>
+  3> uri_string:percent_decode(uri_string:parse("http://cities/%C3%B6rebro?foo%20bar")).
+  #{host => "cities",path => "/örebro",query => "foo bar",
+  scheme => "http"}
+  </pre>
+  <p>This symmetric property is heavily used in our property test suite.
+  </p>
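+  <p>As a minimal sketch of that round-trip check, the shell expression below
+  recomposes the example map, parses and decodes the result, and then compares
+  the outcome with the starting map. The variable names <c>Map</c> and <c>URI</c>
+  are only illustrative; based on the outputs shown above, the comparison is
+  expected to hold for this input:</p>
+  <pre>
+  %% Illustrative only: Map and URI are made-up variable names.
+  Map = #{scheme => "http", host => "cities", path => "/örebro", query => "foo bar"},
+  URI = uri_string:recompose(Map),
+  Map =:= uri_string:percent_decode(uri_string:parse(URI)).
+  </pre>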
+  </section>
+
+  <section>
+    <title>Percent-encoding</title>
+    <p>As you have seen in the previous section, a standard URI can only contain a strict
+    subset of the US ASCII character set. Moreover, the allowed set of characters is not
+    the same in the different URI components. Percent-encoding is a mechanism to
+    represent a data octet in a component when that octet's corresponding character
+    is outside of
+    the allowed set or is being used as a delimiter. This is what you see when <c>"ö"</c>
+    is encoded as <c>%C3%B6</c> and <c>space</c> as <c>%20</c>.
+    Most of the API functions
+    expect UTF-8 encoding when handling percent-encoded triplets. The UTF-8 encoding
+    of the <seeguide marker="unicode_usage#what-unicode-is">Unicode</seeguide>
+    character <c>"ö"</c> is two octets: <c>0xC3 0xB6</c>.
+    The character <c>space</c> is in the first 128 characters of
+    <seeguide marker="unicode_usage#what-unicode-is">Unicode</seeguide> and it is encoded
+    using a single octet <c>0x20</c>.</p>
+    <note><p><seeguide marker="unicode_usage#what-unicode-is">Unicode</seeguide>
+    is backward compatible with ASCII: the encoding of the first 128
+    characters is the same binary value as in ASCII.
+    </p></note>
+    <p><marker id="percent_encoding"></marker>
+    Exactly which characters will be percent-encoded is a major source of
+    confusion. To make this question easier to answer, the library
+    provides a utility function,
+    <seemfa marker="uri_string#allowed_characters/0"><c>uri_string:allowed_characters/0</c></seemfa>,
+    that lists the allowed set of characters in each major
+    URI component, and also in the most important standard character sets.
+    </p>
+    <pre>
+    1> uri_string:allowed_characters().
+    <![CDATA[[{scheme,
+     "+-.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"},
+    {userinfo,
+     "!$%&'()*+,-.0123456789:;=ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz~"},
+    {host,
+     "!$&'()*+,-.0123456789:;=ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz~"},
+    {ipv4,".0123456789"},
+    {ipv6,".0123456789:ABCDEFabcdef"},
+    {regname,
+     "!$%&'()*+,-.0123456789;=ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz~"},
+    {path,
+     "!$%&'()*+,-./0123456789:;=@ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz~"},
+    {query,
+     "!$%&'()*+,-./0123456789:;=?@ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz~"},
+    {fragment,
+     "!$%&'()*+,-./0123456789:;=?@ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz~"},
+    {reserved,"!#$&'()*+,/:;=?@[]"},
+    {unreserved,
+     "-.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz~"}] ]]>
+    </pre>
+    <p>If a URI component has a character that is not allowed, it will be
+    percent-encoded when the URI is produced:
+    </p>
+    <pre>
+    2> uri_string:recompose(#{scheme => "https", host => "local#host", path => ""}).
+    "https://local%23host"
+    </pre>
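+    <p>The same check can be sketched in code. The helper below is hypothetical
+    (it is not part of <c>uri_string</c>) and simply looks up a component in the
+    list returned by <c>uri_string:allowed_characters/0</c>:</p>
+    <pre>
+    %% Hypothetical helper, not part of the uri_string API.
+    %% Returns true if Char is in the allowed set of the given component.
+    is_allowed(Char, Component) ->
+        Allowed = proplists:get_value(Component, uri_string:allowed_characters(), []),
+        lists:member(Char, Allowed).
+    </pre>
+    <p>With the list shown above, <c>is_allowed($#, host)</c> is <c>false</c>,
+    which is why <c>"#"</c> ends up as <c>%23</c> in the recomposed host.</p>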
+    <p>Consuming a URI containing percent-encoded triplets can take many steps. The
+    following example shows how to handle an input URI that is not normalized and
+    contains multiple percent-encoded triplets.
+    First, the input <seetype marker="uri_string#uri_string"><c>uri_string()</c></seetype>
+    is to be parsed into a <seetype marker="uri_string#uri_map"><c>uri_map()</c></seetype>.
+    The parsing only splits the URI into its components without doing any decoding:
+    </p>
+    <pre>
+    3> uri_string:parse("http://%6C%6Fcal%23host/%F6re%26bro%20").
+    #{host => "%6C%6Fcal%23host",path => "/%F6re%26bro%20",
+      scheme => "http"}
+    </pre>
+    <p>The input is a valid URI but how can you decode those
+    percent-encoded octets? You can try to normalize the input with
+    <seemfa marker="uri_string#normalize/1"><c>uri_string:normalize/1</c></seemfa>. The
+    normalize operation decodes those
+    percent-encoded triplets that correspond to a character in the unreserved set.
+    Normalization is a safe, idempotent operation that converts a URI into its
+    canonical form:</p>
+    <pre>
+    4> uri_string:normalize("http://%6C%6Fcal%23host/%F6re%26bro%20").
+    "http://local%23host/%F6re%26bro%20"
+    5> uri_string:normalize("http://%6C%6Fcal%23host/%F6re%26bro%20", [return_map]).
+    #{host => "local%23host",path => "/%F6re%26bro%20",
+      scheme => "http"}
+    </pre>
+    <p>There are still a few percent-encoded triplets left in the output. At this point,
+    when the URI is already parsed, it is safe to apply application specific decoding on
+    the remaining character triplets. Erlang/OTP provides a function,
+    <seemfa marker="uri_string#percent_decode/1"><c>uri_string:percent_decode/1</c></seemfa>
+    for raw percent decoding
+    that you can use on the host and path components, or on the whole map:
+    </p>
+    <pre>
+    6> uri_string:percent_decode("local%23host").
+    "local#host"
+    7> uri_string:percent_decode("/%F6re%26bro%20").
+    <![CDATA[{error,invalid_utf8,<<"/öre&bro ">>}]]>
+    8> uri_string:percent_decode(#{host => "local%23host",path => "/%F6re%26bro%20",
+    scheme => "http"}).
+    <![CDATA[{error,{invalid,{path,{invalid_utf8,<<"/öre&bro ">>}}}}]]>
+    </pre>
+    <p>The <c>host</c> was successfully decoded but the path contains at least one
+    percent-encoded octet that is not valid UTF-8.
+    In order to be able to decode this, you have to make assumptions
+    about the encoding used in these triplets. The most obvious choice is
+    <i>latin-1</i>, so you can try
+    <seemfa marker="uri_string#transcode/2"><c>uri_string:transcode/2</c></seemfa>, to
+    transcode the path to UTF-8 and run the percent-decode operation on the
+    transcoded string:
+    </p>
+    <pre>
+    9> uri_string:transcode("/%F6re%26bro%20", [{in_encoding, latin1}]).
+    "/%C3%B6re%26bro%20"
+    10> uri_string:percent_decode("/%C3%B6re%26bro%20").
+    <![CDATA["/öre&bro "]]>
+    </pre>
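+    <p>The steps above can be combined into a small helper. This is only a sketch:
+    the function name is made up, the latin-1 assumption is hard-coded, and
+    errors are not handled:</p>
+    <pre>
+    %% Hypothetical helper: decode the path of URI, assuming that its
+    %% percent-encoded triplets carry latin-1 octets. No error handling.
+    decode_latin1_path(URI) ->
+        #{path := Path} = uri_string:normalize(URI, [return_map]),
+        Utf8Path = uri_string:transcode(Path, [{in_encoding, latin1}]),
+        uri_string:percent_decode(Utf8Path).
+    </pre>
+    <p>Following the shell session above, calling it with
+    <c>"http://%6C%6Fcal%23host/%F6re%26bro%20"</c> should give the same decoded
+    path as step 10.</p>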
+    <p>It is important to emphasize that it is not safe to apply
+    <seemfa marker="uri_string#percent_decode/1"><c>uri_string:percent_decode/1</c></seemfa>
+    directly on an input URI:
+    </p>
+    <pre>
+    11> uri_string:percent_decode("http://%6C%6Fcal%23host/%C3%B6re%26bro%20").
+    <![CDATA["http://local#host/öre&bro "
+    12> uri_string:parse("http://local#host/öre&bro ").]]>
+    {error,invalid_uri,":"}
+    </pre>
+    <note><p>Percent-encoding is implemented in
+    <seemfa marker="uri_string#recompose/1"><c>uri_string:recompose/1</c></seemfa>
+    and it happens when converting a
+    <seetype marker="uri_string#uri_map"><c>uri_map()</c></seetype>
+    into a <seetype marker="uri_string#uri_string"><c>uri_string()</c></seetype>.
+    There is no equivalent raw percent-encoding function, as percent-encoding
+    shall be applied on the component level using different sets of allowed characters.
+    Applying percent-encoding directly on an input URI would not be safe, just as in
+    the case of
+    <seemfa marker="uri_string#percent_decode/1"><c>uri_string:percent_decode/1</c></seemfa>:
+    the output could be an invalid URI.
+    </p>
+    </note>
+  </section>
+
+  <section>
+    <title>Normalization</title>
+    <p>Normalization is the operation of converting the input URI into a <i>canonical</i>
+    form and keeping the reference to the same underlying resource. The most common
+    application of normalization is determining whether two URIs are equivalent
+    without accessing their referenced resources.</p>
+    <p>Normalization has 6 distinct steps. First the input URI is parsed into an
+    intermediate form that can handle
+    <seeguide marker="unicode_usage#what-unicode-is">Unicode</seeguide> characters.
+    This datatype is the
+    <seetype marker="uri_string#uri_map"><c>uri_map()</c></seetype>, that can hold the
+    components of the URI in map elements of type
+    <seetype marker="unicode#chardata"><c>unicode:chardata()</c></seetype>.
+    After having the intermediate form, a sequence of
+    normalization algorithms is applied to the individual URI components:</p>
+    <taglist>
+      <tag>Case normalization</tag>
+      <item>
+	<p>Converts the <c>scheme</c> and <c>host</c> components
+	to lower case as they are not case sensitive.</p>
+      </item>
+      <tag>Percent-encoding normalization</tag>
+      <item>
+	<p>Decodes percent-encoded triplets that
+	correspond to characters in the unreserved set.</p>
+      </item>
+      <tag>Scheme-based normalization</tag>
+      <item>
+	<p>Applies scheme-specific rules (for example, removing the default
+	port) for the schemes http, https, ftp, ssh, sftp and tftp.</p>
+      </item>
+      <tag>Path segment normalization</tag>
+      <item>
+	<p>Converts the path into a canonical form.</p>
+      </item>
+    </taglist>
+    <p>After these steps, the intermediate data structure, an
+    <seetype marker="uri_string#uri_map"><c>uri_map()</c></seetype>,
+    is fully normalized. The last step is applying
+    <seemfa marker="uri_string#recompose/1"><c>uri_string:recompose/1</c></seemfa>
+    that converts the intermediate structure into a valid canonical URI string.</p>
+    <p>Notice the order:
+    <seemfa marker="uri_string#normalize/2"><c>uri_string:normalize(URIMap, [return_map])</c></seemfa>, which we
+    used in this user guide, is a shortcut in the normalization process.
+    It returns the intermediate data structure, allowing us to inspect it and apply
+    further decoding on the remaining percent-encoded triplets.</p>
+    <pre>
+    13> uri_string:normalize("hTTp://LocalHost:80/%c3%B6rebro/a/../b").
+    "http://localhost/%C3%B6rebro/b"
+    14> uri_string:normalize("hTTp://LocalHost:80/%c3%B6rebro/a/../b", [return_map]).
+    #{host => "localhost",path => "/%C3%B6rebro/b",
+      scheme => "http"}
+    </pre>
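+    <p>The equivalence check mentioned at the beginning of this section can be
+    sketched by comparing normalized forms. The helper name is made up, and the
+    sketch assumes that both inputs are valid URIs (an error tuple would simply be
+    compared like any other term):</p>
+    <pre>
+    %% Hypothetical helper: two URIs are treated as equivalent when their
+    %% normalized forms are identical.
+    equivalent(URI1, URI2) ->
+        uri_string:normalize(URI1) =:= uri_string:normalize(URI2).
+    </pre>
+    <p>For example, <c>"hTTp://LocalHost:80/%c3%B6rebro/a/../b"</c> and
+    <c>"http://localhost/%C3%B6rebro/b"</c> are expected to compare as equivalent,
+    since the second one is the normalized form of the first.</p>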
+  </section>
+
+ <section>
+   <title>Special considerations</title>
+   <p>The current URI implementation provides support for producing and consuming
+   standard URIs. The API is not meant to be directly exposed in a Web
+   browser's address bar where users can basically enter free text. Application
+   designers shall implement proper heuristics to map the input into a parsable URI.</p>
+ </section>
+
+</chapter>