diff options
Diffstat (limited to 'lib/stdlib/doc/src/uri_string_usage.xml')
-rw-r--r-- | lib/stdlib/doc/src/uri_string_usage.xml | 370 |
1 files changed, 370 insertions, 0 deletions
diff --git a/lib/stdlib/doc/src/uri_string_usage.xml b/lib/stdlib/doc/src/uri_string_usage.xml new file mode 100644 index 0000000000..72851096b7 --- /dev/null +++ b/lib/stdlib/doc/src/uri_string_usage.xml @@ -0,0 +1,370 @@ +<?xml version="1.0" encoding="utf-8" ?> +<!DOCTYPE chapter SYSTEM "chapter.dtd"> + +<chapter> + <header> + <copyright> + <year>2020</year> + <year>2020</year> + <holder>Ericsson AB. All Rights Reserved.</holder> + </copyright> + <legalnotice> + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. + + </legalnotice> + + <title>Uniform Resource Identifiers</title> + <prepared>Péter Dimitrov</prepared> + <responsible></responsible> + <docno></docno> + <approved></approved> + <checked></checked> + <date>2020-09-30</date> + <rev>PA1</rev> + <file>uri_string_usage.xml</file> + </header> + <section> + <title>Basics</title> + <p>At the time of writing this document, in October 2020, there are + two major standards concerning Universal Resource Identifiers and + Universal Resource Locators:</p> + <list type="bulleted"> + <item><p> + <url href="https://www.ietf.org/rfc/rfc3986.txt">RFC 3986 - Uniform Resource + Identifier (URI): Generic Syntax</url></p></item> + <item><p> + <url href="https://url.spec.whatwg.org/">WHAT WG URL - Living standard</url> + </p></item> + </list> + <p> + The former is a classical standard with a proper formal syntax, using the so + called <url href="https://www.ietf.org/rfc/rfc2234.txt">Augmented Backus-Naur Form + (ABNF)</url> for describing + the grammar, while the latter is a living document describing the current pratice, + that is, how a majority of Web browsers work with URIs. WHAT WG URL is Web focused + and it has no formal grammar but a plain english description of the algorithms + that should be followed.</p> + <p>What is the difference between them, if any? They provide an overlapping + definition for resource identifiers and they are not compatible. + The <seeerl marker="stdlib:uri_string"><c>uri_string</c></seeerl> module implements + <url href="https://www.ietf.org/rfc/rfc3986.txt">RFC 3986</url> and the term URI will + be used throughout this document. A URI is an identifier, a string of characters + that identifies a particular resource.</p> + <p> + For a more complete problem + statement regarding the URIs check the + <url href="https://tools.ietf.org/html/draft-ruby-url-problem-01">URL Problem + Statement and Directions</url>.</p> + </section> + + <section> + <title>What is a URI?</title> + <p>Let's start with what it is not. It is not the text that you type in the address + bar in your Web browser. Web browsers do all possible heuristics to convert the + input into a valid URI that could be sent over the network.</p> + <p>A URI is an identifier consisting of a sequence of characters matching the syntax + rule named <c>URI</c> in + <url href="https://www.ietf.org/rfc/rfc3986.txt">RFC 3986</url>. + </p> + <p>It is crucial to clarify that a <i>character</i> is a symbol that is displayed on + a terminal or written to paper and should not be confused with its internal + representation.</p> + <p>A URI more specifically, is a sequence of characters from a + subset of the US ASCII character set. The generic URI syntax consists of a + hierarchical sequence of components referred to as the scheme, authority, + path, query, and fragment. There is a formal description for + each of these components in + <url href="https://www.ietf.org/rfc/rfc2234.txt">ABNF</url> notation in + <url href="https://www.ietf.org/rfc/rfc3986.txt">RFC 3986</url>:</p> + <pre> + URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ] + hier-part = "//" authority path-abempty + / path-absolute + / path-rootless + / path-empty + scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) + authority = [ userinfo "@" ] host [ ":" port ] + userinfo = *( unreserved / pct-encoded / sub-delims / ":" ) + + reserved = gen-delims / sub-delims + gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@" + sub-delims = "!" / "$" / "&" / "'" / "(" / ")" + / "*" / "+" / "," / ";" / "=" + + unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" + </pre> + </section> + + <section> + <title>The uri_string module</title> + <p>As producing and consuming standard URIs can get quite complex, Erlang/OTP + provides + a module, <seeerl marker="stdlib:uri_string"><c>uri_string</c></seeerl>, to handle all the most difficult operations such as parsing, + recomposing, normalizing and resolving URIs against a base URI. + </p> + <p>The API functions in <seeerl marker="stdlib:uri_string"><c>uri_string</c></seeerl> + work on two basic data types + <seetype marker="uri_string#uri_string"><c>uri_string()</c></seetype> and + <seetype marker="uri_string#uri_map"><c>uri_map()</c></seetype>. + <seetype marker="uri_string#uri_string"><c>uri_string()</c></seetype> represents a + standard URI, while + <seetype marker="uri_string#uri_map"><c>uri_map()</c></seetype> is a wider datatype, + that can represent URI components using + <seeguide marker="unicode_usage#what-unicode-is">Unicode</seeguide> characters. + <seetype marker="uri_string#uri_map"><c>uri_map()</c></seetype> + is a convenient choice for enabling + operations such as producing standard compliant URIs out of components that have + special or <seeguide marker="unicode_usage#what-unicode-is">Unicode</seeguide> + characters. It is easier to explain this by an example. + </p> + <p>Let's say that we would like to create the following URI and send it over the + network: <c>http://cities/örebro?foo bar</c>. This is not a valid URI as it contains + characters that are not allowed in a URI such as "ö" and the space. We can verify + this by parsing the URI: + </p> + <pre> + 1> uri_string:parse("http://cities/örebro?foo bar"). + {error,invalid_uri,":"} + </pre> + <p>The URI parser tries all possible combinations to interpret the input and fails + at the last attempt when it encounters the colon character <c>":"</c>. Note, that + the inital fault occurs when the parser attempts to interpret the character + <c>"ö"</c> and after a failure back-tracks to the point where it has another + possible parsing alternative.</p> + <p>The proper way to solve this problem is to use + <seemfa marker="uri_string#recompose/1"><c>uri_string:recompose/1</c></seemfa> + with a <seetype marker="uri_string#uri_map"><c>uri_map()</c></seetype> as input:</p> + <pre> + 2> uri_string:recompose(#{scheme => "http", host => "cities", path => "/örebro", + query => "foo bar"}). + "http://cities/%C3%B6rebro?foo%20bar" + </pre> + <p>The result is a valid URI where all the special characters are encoded as defined + by the standard. Applying + <seemfa marker="uri_string#parse/1"><c>uri_string:parse/1</c></seemfa> and + <seemfa marker="uri_string#percent_decode/1"><c>uri_string:percent_decode/1</c></seemfa> + on the URI returns the original input: + </p> + <pre> + 3> uri_string:percent_decode(uri_string:parse("http://cities/%C3%B6rebro?foo%20bar")). + #{host => "cities",path => "/örebro",query => "foo bar", + scheme => "http"} + </pre> + <p>This symmetric property is heavily used in our property test suite. + </p> + </section> + + <section> + <title>Percent-encoding</title> + <p>As you have seen in the previous chapter, a standard URI can only contain a strict + subset of the US ASCII character set, moreover the allowed set of characters is not + the same in the different URI components. Percent-encoding is a mechanism to + represent a data octet in a component when that octet's corresponding character + is outside of + the allowed set or is being used as a delimiter. This is what you see when <c>"ö"</c> + is encoded as <c>%C3%B6</c> and <c>space</c> as <c>%20</c>. + Most of the API functions are + expecting UTF-8 encoding when handling percent-encoded triplets. The UTF-8 encoding + of the <seeguide marker="unicode_usage#what-unicode-is">Unicode</seeguide> + character <c>"ö"</c> is two octets: <c>OxC3 0xB6</c>. + The character <c>space</c> is in the first 128 characters of + <seeguide marker="unicode_usage#what-unicode-is">Unicode</seeguide> and it is encoded + using a single octet <c>0x20</c>.</p> + <note><p><seeguide marker="unicode_usage#what-unicode-is">Unicode</seeguide> + is backward compatible with ASCII, the encoding of the first 128 + characters is the same binary value as in ASCII. + </p></note> + <p><marker id="percent_encoding"></marker> + It is a major source of confusion exactly which characters will be + percent-encoded. In order to make it easier to answer this question the library + provides a utility function, + <seemfa marker="uri_string#allowed_characters/0"><c>uri_string:allowed_characters/0 + </c></seemfa>, + that lists the allowed set of characters in each major + URI component, and also in the most important standard character sets. + </p> + <pre> + 1> uri_string:allowed_characters(). + <![CDATA[{scheme, + "+-.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"}, + {userinfo, + "!$%&'()*+,-.0123456789:;=ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz~"}, + {host, + "!$&'()*+,-.0123456789:;=ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz~"}, + {ipv4,".0123456789"}, + {ipv6,".0123456789:ABCDEFabcdef"}, + {regname, + "!$%&'()*+,-.0123456789;=ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz~"}, + {path, + "!$%&'()*+,-./0123456789:;=@ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz~"}, + {query, + "!$%&'()*+,-./0123456789:;=?@ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz~"}, + {fragment, + "!$%&'()*+,-./0123456789:;=?@ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz~"}, + {reserved,"!#$&'()*+,/:;=?@[]"}, + {unreserved, + "-.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz~"}] ]]> + </pre> + <p>If a URI component has a character that is not allowed, it will be + percent-encoded when the URI is produced: + </p> + <pre> + 2> uri_string:recompose(#{scheme => "https", host => "local#host", path => ""}). + "https://local%23host" + </pre> + <p>Consuming a URI containing percent-encoded triplets can take many steps. The + following example shows how to handle an input URI that is not normalized and + contains multiple percent-encoded triplets. + First, the input <seetype marker="uri_string#uri_string"><c>uri_string()</c></seetype> + is to be parsed into a <seetype marker="uri_string#uri_map"><c>uri_map()</c></seetype>. + The parsing only splits the URI into its components without doing any decoding: + </p> + <pre> + 3> uri_string:parse("http://%6C%6Fcal%23host/%F6re%26bro%20"). + #{host => "%6C%6Fcal%23host",path => "/%F6re%26bro%20", + scheme => "http"}} + </pre> + <p>The input is a valid URI but how can you decode those + percent-encoded octets? You can try to normalize the input with + <seemfa marker="uri_string#normalize/1"><c>uri_string:normalize/1</c></seemfa>. The + normalize operation decodes those + percent-encoded triplets that correspond to a character in the unreserved set. + Normalization is a safe, idempotent operation that converts a URI into its + canonical form:</p> + <pre> + 4> uri_string:normalize("http://%6C%6Fcal%23host/%F6re%26bro%20"). + "http://local%23host/%F6re%26bro%20" + 5> uri_string:normalize("http://%6C%6Fcal%23host/%F6re%26bro%20", [return_map]). + #{host => "local%23host",path => "/%F6re%26bro%20", + scheme => "http"} + </pre> + <p>There are still a few percent-encoded triplets left in the output. At this point, + when the URI is already parsed, it is safe to apply application specific decoding on + the remaining character triplets. Erlang/OTP provides a function, + <seemfa marker="uri_string#percent_decode/1"><c>uri_string:percent_decode/1</c></seemfa> + for raw percent decoding + that you can use on the host and path components, or on the whole map: + </p> + <pre> + 6> uri_string:percent_decode("local%23host"). + "local#host" + 7> uri_string:percent_decode("/%F6re%26bro%20"). + <![CDATA[{error,invalid_utf8,<<"/öre&bro ">>}]]> + 8> uri_string:percent_decode(#{host => "local%23host",path => "/%F6re%26bro%20", + scheme => "http"}). + <![CDATA[{error,{invalid,{path,{invalid_utf8,<<"/öre&bro ">>}}}}]]> + </pre> + <p>The <c>host</c> was successfully decoded but the path contains at least one + character with + non-UTF-8 encoding. In order to be able to decode this, you have to make assumptions + about the encoding used in these triplets. The most obvious choice is + <i>latin-1</i>, so you can try + <seemfa marker="uri_string#transcode/2"><c>uri_string:transcode/2</c></seemfa>, to + transcode the path to UTF-8 and run the percent-decode operation on the + transcoded string: + </p> + <pre> + 9> uri_string:transcode("/%F6re%26bro%20", [{in_encoding, latin1}]). + "/%C3%B6re%26bro%20" + 10> uri_string:percent_decode("/%C3%B6re%26bro%20"). + <![CDATA["/öre&bro "]]> + </pre> + <p>It is important to emphasize that it is not safe to apply + <seemfa marker="uri_string#percent_decode/1"><c>uri_string:percent_decode/1</c></seemfa> + directly on an input URI: + </p> + <pre> + 11> uri_string:percent_decode("http://%6C%6Fcal%23host/%C3%B6re%26bro%20"). + <![CDATA["http://local#host/öre&bro " + 12> uri_string:parse("http://local#host/öre&bro ").]]> + {error,invalid_uri,":"} + </pre> + <note><p>Percent-encoding is implemented in + <seemfa marker="uri_string#recompose/1"><c>uri_string:recompose/1</c></seemfa> + and it happens when converting a + <seetype marker="uri_string#uri_map"><c>uri_map()</c></seetype> + into a <seetype marker="uri_string#uri_string"><c>uri_string()</c></seetype>. + There is no equivalent to a raw percent-encoding function as percent-encoding + shall be applied on the component level using different sets of allowed characters. + Applying percent-encoding directly on an input URI would not be safe just as in + the case of + <seemfa marker="uri_string#percent_decode/1"><c>uri_string:percent_decode/1</c></seemfa>, + the output could be an invalid URI. + </p> + </note> + </section> + + <section> + <title>Normalization</title> + <p>Normalization is the operation of converting the input URI into a <i>canonical</i> + form and keeping the reference to the same underlying resource. The most common + application of normalization is determining whether two URIs are equivalent + without accessing their referenced resources.</p> + <p>Normalization has 6 distinct steps. First the input URI is parsed into an + intermediate form that can handle + <seeguide marker="unicode_usage#what-unicode-is">Unicode</seeguide> characters. + This datatype is the + <seetype marker="uri_string#uri_map"><c>uri_map()</c></seetype>, that can hold the + components of the URI in map elements of type + <seetype marker="unicode#chardata"><c>unicode:chardata()</c></seetype>. + After having the intermediate form, a sequence of + normalization algorithms are applied to the individual URI components:</p> + <taglist> + <tag>Case normalization</tag> + <item> + <p>Converts the <c>scheme</c> and <c>host</c> components + to lower case as they are not case sensitive.</p> + </item> + <tag>Percent-encoding normalization</tag> + <item> + <p>Decodes percent-encoded triplets that + correspond to characters in the unreserved set.</p> + </item> + <tag>Scheme-based normalization</tag> + <item> + <p>Applying rules for the schemes http, https, + ftp, ssh, sftp and tftp.</p> + </item> + <tag>Path segment normalization</tag> + <item> + <p>Converts the path into a canonical form.</p> + </item> + </taglist> + <p>After these steps, the intermediate data structure, an + <seetype marker="uri_string#uri_map"><c>uri_map()</c></seetype>, + is fully normalized. The last step is applying + <seemfa marker="uri_string#recompose/1"><c>uri_string:recompose/1</c></seemfa> + that converts the intermediate structure into a valid canonical URI string.</p> + <p>Notice the order, the + <seemfa marker="uri_string#normalize/2"><c>uri_string:normalize(URIMap, [return_map])</c></seemfa> that we + used many times in this user guide is a shortcut in the normalization process + returning the intermediate datastructure, and allowing us to inspect and apply + further decoding on the remaining percent-encoded triplets.</p> + <pre> + 13> uri_string:normalize("hTTp://LocalHost:80/%c3%B6rebro/a/../b"). + "http://localhost/%C3%B6rebro/b" + 14> uri_string:normalize("hTTp://LocalHost:80/%c3%B6rebro/a/../b", [return_map]). + #{host => "localhost",path => "/%C3%B6rebro/b", + scheme => "http"} + </pre> + </section> + + <section> + <title>Special considerations</title> + <p>The current URI implementation provides support for producing and consuming + standard URIs. The API is not meant to be directly exposed in a Web + browser's address bar where users can basically enter free text. Application + designers shall implement proper heuristics to map the input into a parsable URI.</p> + </section> + +</chapter> |