author | Alexander Larsson <alexl@src.gnome.org> | 2007-09-13 09:06:02 +0000 |
---|---|---|
committer | Alexander Larsson <alexl@src.gnome.org> | 2007-09-13 09:06:02 +0000 |
commit | af9983daf921cb57927b4fa8551f53b2f69ad83d (patch) | |
tree | ba8e8c70385a0abc243f12b9d79195c27abe3fc5 /txt | |
parent | 7b829c191ed354a0d6cd1b7022d8389f3554d71f (diff) | |
download | gvfs-af9983daf921cb57927b4fa8551f53b2f69ad83d.tar.gz |
New txt/ files
Original git commit by Alexander Larsson <alex@greebo.(none)> at 1160058629 +0200
svn path=/trunk/; revision=96
Diffstat (limited to 'txt')
-rw-r--r-- | txt/gvfs_dbus.txt | 65 |
-rw-r--r-- | txt/rfc3986.txt | 3419 |
-rw-r--r-- | txt/rfc3987.txt | 2579 |
3 files changed, 6063 insertions, 0 deletions
diff --git a/txt/gvfs_dbus.txt b/txt/gvfs_dbus.txt
new file mode 100644
index 00000000..86ac19ac
--- /dev/null
+++ b/txt/gvfs_dbus.txt
@@ -0,0 +1,65 @@
+how to chain to simple stuff
+
+how to parse uris (i.e. map to mounts)
+
+what connections do we have:
+shared dbus connection
+connection to main daemom
+connection to each mount daemon
+
+"fast ops" (uri->gfile) vs blocking ops (read, open etc) and how to avoid slow blocking fast
+
+
+each thread has, on demand:
+connection to main daemon
+connection to some mount daemons
+
+global state:
+cache of previously used mountpoints
+
+
+how to mount
+
+how to store/restore permanent mounts with the session => store as drives (mountpoints), not volumes!
+
+Don't always want to log in to all mounts on login? (mounpoints!)
+
+computer:// handled in main daemon?
+
+No volume monitor in public API, only computer:// ?
+Problems:
+* mounted (desktop/computer:, trash dir)
+* unmounted/pre_unmount (desktop/computer:, close windows on unmounted volumes, trash dir)
+* map path to volume (close windows on unmounted volumes, check for readonly mount, get volume name)
+* get all drives/volumes (detecting where to show eject, mount, unmount menu items,
+  tree view, places sidebar, display volume icon in pathbar)
+* eject/unmount ops
+* needs eject
+
+unmounted URI => return a mountpoint object?
+
+GMountOperation, async mount operation object
+signals => passwd, question, keyring?
+
+GFile mountpoint => GMountOperation
+
+What process calls gnome-keyring?
+
+
+
+--------------------
+
+GFile creation => decompose URI, no i/o
+
+on i/o:
+ * figure out mountpoint (for now, always toplevel uri location)
+ * if we have a local dbus connection to that, use it, otherwise:
+   + create (if needed) local session dbus connection
+   + ask for mount daemon for new session
+     - If not existing, error on i/o, return mountpoint type on get_info
+   + set up new local connection with the mount daemon
+ * send dbus message
+ * recieve answer, if has magic flag, followed by fd sendmsg() (created by socketpair())
+
+
+
diff --git a/txt/rfc3986.txt b/txt/rfc3986.txt
new file mode 100644
index 00000000..c56ed4eb
--- /dev/null
+++ b/txt/rfc3986.txt
@@ -0,0 +1,3419 @@
+
+
+
+
+
+
+Network Working Group                                     T. Berners-Lee
+Request for Comments: 3986                                       W3C/MIT
+STD: 66                                                      R. Fielding
+Updates: 1738                                               Day Software
+Obsoletes: 2732, 2396, 1808                                  L. Masinter
+Category: Standards Track                                  Adobe Systems
+                                                            January 2005
+
+
+           Uniform Resource Identifier (URI): Generic Syntax
+
+Status of This Memo
+
+   This document specifies an Internet standards track protocol for the
+   Internet community, and requests discussion and suggestions for
+   improvements.  Please refer to the current edition of the "Internet
+   Official Protocol Standards" (STD 1) for the standardization state
+   and status of this protocol.  Distribution of this memo is unlimited.
+
+Copyright Notice
+
+   Copyright (C) The Internet Society (2005).
+
+Abstract
+
+   A Uniform Resource Identifier (URI) is a compact sequence of
+   characters that identifies an abstract or physical resource.  This
+   specification defines the generic URI syntax and a process for
+   resolving URI references that might be in relative form, along with
+   guidelines and security considerations for the use of URIs on the
+   Internet.
The URI syntax defines a grammar that is a superset of all + valid URIs, allowing an implementation to parse the common components + of a URI reference without knowing the scheme-specific requirements + of every possible identifier. This specification does not define a + generative grammar for URIs; that task is performed by the individual + specifications of each URI scheme. + + + + + + + + + + + + + + + +Berners-Lee, et al. Standards Track [Page 1] + +RFC 3986 URI Generic Syntax January 2005 + + +Table of Contents + + 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 + 1.1. Overview of URIs . . . . . . . . . . . . . . . . . . . . 4 + 1.1.1. Generic Syntax . . . . . . . . . . . . . . . . . 6 + 1.1.2. Examples . . . . . . . . . . . . . . . . . . . . 7 + 1.1.3. URI, URL, and URN . . . . . . . . . . . . . . . 7 + 1.2. Design Considerations . . . . . . . . . . . . . . . . . 8 + 1.2.1. Transcription . . . . . . . . . . . . . . . . . 8 + 1.2.2. Separating Identification from Interaction . . . 9 + 1.2.3. Hierarchical Identifiers . . . . . . . . . . . . 10 + 1.3. Syntax Notation . . . . . . . . . . . . . . . . . . . . 11 + 2. Characters . . . . . . . . . . . . . . . . . . . . . . . . . . 11 + 2.1. Percent-Encoding . . . . . . . . . . . . . . . . . . . . 12 + 2.2. Reserved Characters . . . . . . . . . . . . . . . . . . 12 + 2.3. Unreserved Characters . . . . . . . . . . . . . . . . . 13 + 2.4. When to Encode or Decode . . . . . . . . . . . . . . . . 14 + 2.5. Identifying Data . . . . . . . . . . . . . . . . . . . . 14 + 3. Syntax Components . . . . . . . . . . . . . . . . . . . . . . 16 + 3.1. Scheme . . . . . . . . . . . . . . . . . . . . . . . . . 17 + 3.2. Authority . . . . . . . . . . . . . . . . . . . . . . . 17 + 3.2.1. User Information . . . . . . . . . . . . . . . . 18 + 3.2.2. Host . . . . . . . . . . . . . . . . . . . . . . 18 + 3.2.3. Port . . . . . . . . . . . . . . . . . . . . . . 22 + 3.3. Path . . . . . . . . . . . . . . . . . . . . . . . . 
. . 22 + 3.4. Query . . . . . . . . . . . . . . . . . . . . . . . . . 23 + 3.5. Fragment . . . . . . . . . . . . . . . . . . . . . . . . 24 + 4. Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 + 4.1. URI Reference . . . . . . . . . . . . . . . . . . . . . 25 + 4.2. Relative Reference . . . . . . . . . . . . . . . . . . . 26 + 4.3. Absolute URI . . . . . . . . . . . . . . . . . . . . . . 27 + 4.4. Same-Document Reference . . . . . . . . . . . . . . . . 27 + 4.5. Suffix Reference . . . . . . . . . . . . . . . . . . . . 27 + 5. Reference Resolution . . . . . . . . . . . . . . . . . . . . . 28 + 5.1. Establishing a Base URI . . . . . . . . . . . . . . . . 28 + 5.1.1. Base URI Embedded in Content . . . . . . . . . . 29 + 5.1.2. Base URI from the Encapsulating Entity . . . . . 29 + 5.1.3. Base URI from the Retrieval URI . . . . . . . . 30 + 5.1.4. Default Base URI . . . . . . . . . . . . . . . . 30 + 5.2. Relative Resolution . . . . . . . . . . . . . . . . . . 30 + 5.2.1. Pre-parse the Base URI . . . . . . . . . . . . . 31 + 5.2.2. Transform References . . . . . . . . . . . . . . 31 + 5.2.3. Merge Paths . . . . . . . . . . . . . . . . . . 32 + 5.2.4. Remove Dot Segments . . . . . . . . . . . . . . 33 + 5.3. Component Recomposition . . . . . . . . . . . . . . . . 35 + 5.4. Reference Resolution Examples . . . . . . . . . . . . . 35 + 5.4.1. Normal Examples . . . . . . . . . . . . . . . . 36 + 5.4.2. Abnormal Examples . . . . . . . . . . . . . . . 36 + + + +Berners-Lee, et al. Standards Track [Page 2] + +RFC 3986 URI Generic Syntax January 2005 + + + 6. Normalization and Comparison . . . . . . . . . . . . . . . . . 38 + 6.1. Equivalence . . . . . . . . . . . . . . . . . . . . . . 38 + 6.2. Comparison Ladder . . . . . . . . . . . . . . . . . . . 39 + 6.2.1. Simple String Comparison . . . . . . . . . . . . 39 + 6.2.2. Syntax-Based Normalization . . . . . . . . . . . 40 + 6.2.3. Scheme-Based Normalization . . . . . . . . . . . 41 + 6.2.4. 
Protocol-Based Normalization . . . . . . . . . . 42 + 7. Security Considerations . . . . . . . . . . . . . . . . . . . 43 + 7.1. Reliability and Consistency . . . . . . . . . . . . . . 43 + 7.2. Malicious Construction . . . . . . . . . . . . . . . . . 43 + 7.3. Back-End Transcoding . . . . . . . . . . . . . . . . . . 44 + 7.4. Rare IP Address Formats . . . . . . . . . . . . . . . . 45 + 7.5. Sensitive Information . . . . . . . . . . . . . . . . . 45 + 7.6. Semantic Attacks . . . . . . . . . . . . . . . . . . . . 45 + 8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 46 + 9. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 46 + 10. References . . . . . . . . . . . . . . . . . . . . . . . . . . 46 + 10.1. Normative References . . . . . . . . . . . . . . . . . . 46 + 10.2. Informative References . . . . . . . . . . . . . . . . . 47 + A. Collected ABNF for URI . . . . . . . . . . . . . . . . . . . . 49 + B. Parsing a URI Reference with a Regular Expression . . . . . . 50 + C. Delimiting a URI in Context . . . . . . . . . . . . . . . . . 51 + D. Changes from RFC 2396 . . . . . . . . . . . . . . . . . . . . 53 + D.1. Additions . . . . . . . . . . . . . . . . . . . . . . . 53 + D.2. Modifications . . . . . . . . . . . . . . . . . . . . . 53 + Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 + Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 60 + Full Copyright Statement . . . . . . . . . . . . . . . . . . . . . 61 + + + + + + + + + + + + + + + + + + + + + + + +Berners-Lee, et al. Standards Track [Page 3] + +RFC 3986 URI Generic Syntax January 2005 + + +1. Introduction + + A Uniform Resource Identifier (URI) provides a simple and extensible + means for identifying a resource. 
This specification of URI syntax + and semantics is derived from concepts introduced by the World Wide + Web global information initiative, whose use of these identifiers + dates from 1990 and is described in "Universal Resource Identifiers + in WWW" [RFC1630]. The syntax is designed to meet the + recommendations laid out in "Functional Recommendations for Internet + Resource Locators" [RFC1736] and "Functional Requirements for Uniform + Resource Names" [RFC1737]. + + This document obsoletes [RFC2396], which merged "Uniform Resource + Locators" [RFC1738] and "Relative Uniform Resource Locators" + [RFC1808] in order to define a single, generic syntax for all URIs. + It obsoletes [RFC2732], which introduced syntax for an IPv6 address. + It excludes portions of RFC 1738 that defined the specific syntax of + individual URI schemes; those portions will be updated as separate + documents. The process for registration of new URI schemes is + defined separately by [BCP35]. Advice for designers of new URI + schemes can be found in [RFC2718]. All significant changes from RFC + 2396 are noted in Appendix D. + + This specification uses the terms "character" and "coded character + set" in accordance with the definitions provided in [BCP19], and + "character encoding" in place of what [BCP19] refers to as a + "charset". + +1.1. Overview of URIs + + URIs are characterized as follows: + + Uniform + + Uniformity provides several benefits. It allows different types + of resource identifiers to be used in the same context, even when + the mechanisms used to access those resources may differ. It + allows uniform semantic interpretation of common syntactic + conventions across different types of resource identifiers. It + allows introduction of new types of resource identifiers without + interfering with the way that existing identifiers are used. 
It + allows the identifiers to be reused in many different contexts, + thus permitting new applications or protocols to leverage a pre- + existing, large, and widely used set of resource identifiers. + + + + + + + +Berners-Lee, et al. Standards Track [Page 4] + +RFC 3986 URI Generic Syntax January 2005 + + + Resource + + This specification does not limit the scope of what might be a + resource; rather, the term "resource" is used in a general sense + for whatever might be identified by a URI. Familiar examples + include an electronic document, an image, a source of information + with a consistent purpose (e.g., "today's weather report for Los + Angeles"), a service (e.g., an HTTP-to-SMS gateway), and a + collection of other resources. A resource is not necessarily + accessible via the Internet; e.g., human beings, corporations, and + bound books in a library can also be resources. Likewise, + abstract concepts can be resources, such as the operators and + operands of a mathematical equation, the types of a relationship + (e.g., "parent" or "employee"), or numeric values (e.g., zero, + one, and infinity). + + Identifier + + An identifier embodies the information required to distinguish + what is being identified from all other things within its scope of + identification. Our use of the terms "identify" and "identifying" + refer to this purpose of distinguishing one resource from all + other resources, regardless of how that purpose is accomplished + (e.g., by name, address, or context). These terms should not be + mistaken as an assumption that an identifier defines or embodies + the identity of what is referenced, though that may be the case + for some identifiers. Nor should it be assumed that a system + using URIs will access the resource identified: in many cases, + URIs are used to denote resources without any intention that they + be accessed. 
Likewise, the "one" resource identified might not be + singular in nature (e.g., a resource might be a named set or a + mapping that varies over time). + + A URI is an identifier consisting of a sequence of characters + matching the syntax rule named <URI> in Section 3. It enables + uniform identification of resources via a separately defined + extensible set of naming schemes (Section 3.1). How that + identification is accomplished, assigned, or enabled is delegated to + each scheme specification. + + This specification does not place any limits on the nature of a + resource, the reasons why an application might seek to refer to a + resource, or the kinds of systems that might use URIs for the sake of + identifying resources. This specification does not require that a + URI persists in identifying the same resource over time, though that + is a common goal of all URI schemes. Nevertheless, nothing in this + + + + + +Berners-Lee, et al. Standards Track [Page 5] + +RFC 3986 URI Generic Syntax January 2005 + + + specification prevents an application from limiting itself to + particular types of resources, or to a subset of URIs that maintains + characteristics desired by that application. + + URIs have a global scope and are interpreted consistently regardless + of context, though the result of that interpretation may be in + relation to the end-user's context. For example, "http://localhost/" + has the same interpretation for every user of that reference, even + though the network interface corresponding to "localhost" may be + different for each end-user: interpretation is independent of access. + However, an action made on the basis of that reference will take + place in relation to the end-user's context, which implies that an + action intended to refer to a globally unique thing must use a URI + that distinguishes that resource from all other things. 
URIs that + identify in relation to the end-user's local context should only be + used when the context itself is a defining aspect of the resource, + such as when an on-line help manual refers to a file on the end- + user's file system (e.g., "file:///etc/hosts"). + +1.1.1. Generic Syntax + + Each URI begins with a scheme name, as defined in Section 3.1, that + refers to a specification for assigning identifiers within that + scheme. As such, the URI syntax is a federated and extensible naming + system wherein each scheme's specification may further restrict the + syntax and semantics of identifiers using that scheme. + + This specification defines those elements of the URI syntax that are + required of all URI schemes or are common to many URI schemes. It + thus defines the syntax and semantics needed to implement a scheme- + independent parsing mechanism for URI references, by which the + scheme-dependent handling of a URI can be postponed until the + scheme-dependent semantics are needed. Likewise, protocols and data + formats that make use of URI references can refer to this + specification as a definition for the range of syntax allowed for all + URIs, including those schemes that have yet to be defined. This + decouples the evolution of identification schemes from the evolution + of protocols, data formats, and implementations that make use of + URIs. + + A parser of the generic URI syntax can parse any URI reference into + its major components. Once the scheme is determined, further + scheme-specific parsing can be performed on the components. In other + words, the URI generic syntax is a superset of the syntax of all URI + schemes. + + + + + + +Berners-Lee, et al. Standards Track [Page 6] + +RFC 3986 URI Generic Syntax January 2005 + + +1.1.2. 
Examples + + The following example URIs illustrate several URI schemes and + variations in their common syntax components: + + ftp://ftp.is.co.za/rfc/rfc1808.txt + + http://www.ietf.org/rfc/rfc2396.txt + + ldap://[2001:db8::7]/c=GB?objectClass?one + + mailto:John.Doe@example.com + + news:comp.infosystems.www.servers.unix + + tel:+1-816-555-1212 + + telnet://192.0.2.16:80/ + + urn:oasis:names:specification:docbook:dtd:xml:4.1.2 + + +1.1.3. URI, URL, and URN + + A URI can be further classified as a locator, a name, or both. The + term "Uniform Resource Locator" (URL) refers to the subset of URIs + that, in addition to identifying a resource, provide a means of + locating the resource by describing its primary access mechanism + (e.g., its network "location"). The term "Uniform Resource Name" + (URN) has been used historically to refer to both URIs under the + "urn" scheme [RFC2141], which are required to remain globally unique + and persistent even when the resource ceases to exist or becomes + unavailable, and to any other URI with the properties of a name. + + An individual scheme does not have to be classified as being just one + of "name" or "locator". Instances of URIs from any given scheme may + have the characteristics of names or locators or both, often + depending on the persistence and care in the assignment of + identifiers by the naming authority, rather than on any quality of + the scheme. Future specifications and related documentation should + use the general term "URI" rather than the more restrictive terms + "URL" and "URN" [RFC3305]. + + + + + + + + + +Berners-Lee, et al. Standards Track [Page 7] + +RFC 3986 URI Generic Syntax January 2005 + + +1.2. Design Considerations + +1.2.1. Transcription + + The URI syntax has been designed with global transcription as one of + its main considerations. A URI is a sequence of characters from a + very limited set: the letters of the basic Latin alphabet, digits, + and a few special characters. 
A URI may be represented in a variety + of ways; e.g., ink on paper, pixels on a screen, or a sequence of + character encoding octets. The interpretation of a URI depends only + on the characters used and not on how those characters are + represented in a network protocol. + + The goal of transcription can be described by a simple scenario. + Imagine two colleagues, Sam and Kim, sitting in a pub at an + international conference and exchanging research ideas. Sam asks Kim + for a location to get more information, so Kim writes the URI for the + research site on a napkin. Upon returning home, Sam takes out the + napkin and types the URI into a computer, which then retrieves the + information to which Kim referred. + + There are several design considerations revealed by the scenario: + + o A URI is a sequence of characters that is not always represented + as a sequence of octets. + + o A URI might be transcribed from a non-network source and thus + should consist of characters that are most likely able to be + entered into a computer, within the constraints imposed by + keyboards (and related input devices) across languages and + locales. + + o A URI often has to be remembered by people, and it is easier for + people to remember a URI when it consists of meaningful or + familiar components. + + These design considerations are not always in alignment. For + example, it is often the case that the most meaningful name for a URI + component would require characters that cannot be typed into some + systems. The ability to transcribe a resource identifier from one + medium to another has been considered more important than having a + URI consist of the most meaningful of components. + + In local or regional contexts and with improving technology, users + might benefit from being able to use a wider range of characters; + such use is not defined by this specification. 
Percent-encoded + octets (Section 2.1) may be used within a URI to represent characters + outside the range of the US-ASCII coded character set if this + + + +Berners-Lee, et al. Standards Track [Page 8] + +RFC 3986 URI Generic Syntax January 2005 + + + representation is allowed by the scheme or by the protocol element in + which the URI is referenced. Such a definition should specify the + character encoding used to map those characters to octets prior to + being percent-encoded for the URI. + +1.2.2. Separating Identification from Interaction + + A common misunderstanding of URIs is that they are only used to refer + to accessible resources. The URI itself only provides + identification; access to the resource is neither guaranteed nor + implied by the presence of a URI. Instead, any operation associated + with a URI reference is defined by the protocol element, data format + attribute, or natural language text in which it appears. + + Given a URI, a system may attempt to perform a variety of operations + on the resource, as might be characterized by words such as "access", + "update", "replace", or "find attributes". Such operations are + defined by the protocols that make use of URIs, not by this + specification. However, we do use a few general terms for describing + common operations on URIs. URI "resolution" is the process of + determining an access mechanism and the appropriate parameters + necessary to dereference a URI; this resolution may require several + iterations. To use that access mechanism to perform an action on the + URI's resource is to "dereference" the URI. + + When URIs are used within information retrieval systems to identify + sources of information, the most common form of URI dereference is + "retrieval": making use of a URI in order to retrieve a + representation of its associated resource. 
A "representation" is a + sequence of octets, along with representation metadata describing + those octets, that constitutes a record of the state of the resource + at the time when the representation is generated. Retrieval is + achieved by a process that might include using the URI as a cache key + to check for a locally cached representation, resolution of the URI + to determine an appropriate access mechanism (if any), and + dereference of the URI for the sake of applying a retrieval + operation. Depending on the protocols used to perform the retrieval, + additional information might be supplied about the resource (resource + metadata) and its relation to other resources. + + URI references in information retrieval systems are designed to be + late-binding: the result of an access is generally determined when it + is accessed and may vary over time or due to other aspects of the + interaction. These references are created in order to be used in the + future: what is being identified is not some specific result that was + obtained in the past, but rather some characteristic that is expected + to be true for future results. In such cases, the resource referred + to by the URI is actually a sameness of characteristics as observed + + + +Berners-Lee, et al. Standards Track [Page 9] + +RFC 3986 URI Generic Syntax January 2005 + + + over time, perhaps elucidated by additional comments or assertions + made by the resource provider. + + Although many URI schemes are named after protocols, this does not + imply that use of these URIs will result in access to the resource + via the named protocol. URIs are often used simply for the sake of + identification. Even when a URI is used to retrieve a representation + of a resource, that access might be through gateways, proxies, + caches, and name resolution services that are independent of the + protocol associated with the scheme name. 
The resolution of some + URIs may require the use of more than one protocol (e.g., both DNS + and HTTP are typically used to access an "http" URI's origin server + when a representation isn't found in a local cache). + +1.2.3. Hierarchical Identifiers + + The URI syntax is organized hierarchically, with components listed in + order of decreasing significance from left to right. For some URI + schemes, the visible hierarchy is limited to the scheme itself: + everything after the scheme component delimiter (":") is considered + opaque to URI processing. Other URI schemes make the hierarchy + explicit and visible to generic parsing algorithms. + + The generic syntax uses the slash ("/"), question mark ("?"), and + number sign ("#") characters to delimit components that are + significant to the generic parser's hierarchical interpretation of an + identifier. In addition to aiding the readability of such + identifiers through the consistent use of familiar syntax, this + uniform representation of hierarchy across naming schemes allows + scheme-independent references to be made relative to that hierarchy. + + It is often the case that a group or "tree" of documents has been + constructed to serve a common purpose, wherein the vast majority of + URI references in these documents point to resources within the tree + rather than outside it. Similarly, documents located at a particular + site are much more likely to refer to other resources at that site + than to resources at remote sites. Relative referencing of URIs + allows document trees to be partially independent of their location + and access scheme. For instance, it is possible for a single set of + hypertext documents to be simultaneously accessible and traversable + via each of the "file", "http", and "ftp" schemes if the documents + refer to each other with relative references. Furthermore, such + document trees can be moved, as a whole, without changing any of the + relative references. 
+ + A relative reference (Section 4.2) refers to a resource by describing + the difference within a hierarchical name space between the reference + context and the target URI. The reference resolution algorithm, + + + +Berners-Lee, et al. Standards Track [Page 10] + +RFC 3986 URI Generic Syntax January 2005 + + + presented in Section 5, defines how such a reference is transformed + to the target URI. As relative references can only be used within + the context of a hierarchical URI, designers of new URI schemes + should use a syntax consistent with the generic syntax's hierarchical + components unless there are compelling reasons to forbid relative + referencing within that scheme. + + NOTE: Previous specifications used the terms "partial URI" and + "relative URI" to denote a relative reference to a URI. As some + readers misunderstood those terms to mean that relative URIs are a + subset of URIs rather than a method of referencing URIs, this + specification simply refers to them as relative references. + + All URI references are parsed by generic syntax parsers when used. + However, because hierarchical processing has no effect on an absolute + URI used in a reference unless it contains one or more dot-segments + (complete path segments of "." or "..", as described in Section 3.3), + URI scheme specifications can define opaque identifiers by + disallowing use of slash characters, question mark characters, and + the URIs "scheme:." and "scheme:..". + +1.3. Syntax Notation + + This specification uses the Augmented Backus-Naur Form (ABNF) + notation of [RFC2234], including the following core ABNF syntax rules + defined by that specification: ALPHA (letters), CR (carriage return), + DIGIT (decimal digits), DQUOTE (double quote), HEXDIG (hexadecimal + digits), LF (line feed), and SP (space). The complete URI syntax is + collected in Appendix A. + +2. 
Characters + + The URI syntax provides a method of encoding data, presumably for the + sake of identifying a resource, as a sequence of characters. The URI + characters are, in turn, frequently encoded as octets for transport + or presentation. This specification does not mandate any particular + character encoding for mapping between URI characters and the octets + used to store or transmit those characters. When a URI appears in a + protocol element, the character encoding is defined by that protocol; + without such a definition, a URI is assumed to be in the same + character encoding as the surrounding text. + + The ABNF notation defines its terminal values to be non-negative + integers (codepoints) based on the US-ASCII coded character set + [ASCII]. Because a URI is a sequence of characters, we must invert + that relation in order to understand the URI syntax. Therefore, the + + + + + +Berners-Lee, et al. Standards Track [Page 11] + +RFC 3986 URI Generic Syntax January 2005 + + + integer values used by the ABNF must be mapped back to their + corresponding characters via US-ASCII in order to complete the syntax + rules. + + A URI is composed from a limited set of characters consisting of + digits, letters, and a few graphic symbols. A reserved subset of + those characters may be used to delimit syntax components within a + URI while the remaining characters, including both the unreserved set + and those reserved characters not acting as delimiters, define each + component's identifying data. + +2.1. Percent-Encoding + + A percent-encoding mechanism is used to represent a data octet in a + component when that octet's corresponding character is outside the + allowed set or is being used as a delimiter of, or within, the + component. A percent-encoded octet is encoded as a character + triplet, consisting of the percent character "%" followed by the two + hexadecimal digits representing that octet's numeric value. 
For + example, "%20" is the percent-encoding for the binary octet + "00100000" (ABNF: %x20), which in US-ASCII corresponds to the space + character (SP). Section 2.4 describes when percent-encoding and + decoding is applied. + + pct-encoded = "%" HEXDIG HEXDIG + + The uppercase hexadecimal digits 'A' through 'F' are equivalent to + the lowercase digits 'a' through 'f', respectively. If two URIs + differ only in the case of hexadecimal digits used in percent-encoded + octets, they are equivalent. For consistency, URI producers and + normalizers should use uppercase hexadecimal digits for all percent- + encodings. + +2.2. Reserved Characters + + URIs include components and subcomponents that are delimited by + characters in the "reserved" set. These characters are called + "reserved" because they may (or may not) be defined as delimiters by + the generic syntax, by each scheme-specific syntax, or by the + implementation-specific syntax of a URI's dereferencing algorithm. + If data for a URI component would conflict with a reserved + character's purpose as a delimiter, then the conflicting data must be + percent-encoded before the URI is formed. + + + + + + + + +Berners-Lee, et al. Standards Track [Page 12] + +RFC 3986 URI Generic Syntax January 2005 + + + reserved = gen-delims / sub-delims + + gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@" + + sub-delims = "!" / "$" / "&" / "'" / "(" / ")" + / "*" / "+" / "," / ";" / "=" + + The purpose of reserved characters is to provide a set of delimiting + characters that are distinguishable from other data within a URI. + URIs that differ in the replacement of a reserved character with its + corresponding percent-encoded octet are not equivalent. Percent- + encoding a reserved character, or decoding a percent-encoded octet + that corresponds to a reserved character, will change how the URI is + interpreted by most applications. 
Thus, characters in the reserved + set are protected from normalization and are therefore safe to be + used by scheme-specific and producer-specific algorithms for + delimiting data subcomponents within a URI. + + A subset of the reserved characters (gen-delims) is used as + delimiters of the generic URI components described in Section 3. A + component's ABNF syntax rule will not use the reserved or gen-delims + rule names directly; instead, each syntax rule lists the characters + allowed within that component (i.e., not delimiting it), and any of + those characters that are also in the reserved set are "reserved" for + use as subcomponent delimiters within the component. Only the most + common subcomponents are defined by this specification; other + subcomponents may be defined by a URI scheme's specification, or by + the implementation-specific syntax of a URI's dereferencing + algorithm, provided that such subcomponents are delimited by + characters in the reserved set allowed within that component. + + URI producing applications should percent-encode data octets that + correspond to characters in the reserved set unless these characters + are specifically allowed by the URI scheme to represent data in that + component. If a reserved character is found in a URI component and + no delimiting role is known for that character, then it must be + interpreted as representing the data octet corresponding to that + character's encoding in US-ASCII. + +2.3. Unreserved Characters + + Characters that are allowed in a URI but do not have a reserved + purpose are called unreserved. These include uppercase and lowercase + letters, decimal digits, hyphen, period, underscore, and tilde. + + unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" + + + + + +Berners-Lee, et al. 
Standards Track [Page 13] + +RFC 3986 URI Generic Syntax January 2005 + + + URIs that differ in the replacement of an unreserved character with + its corresponding percent-encoded US-ASCII octet are equivalent: they + identify the same resource. However, URI comparison implementations + do not always perform normalization prior to comparison (see Section + 6). For consistency, percent-encoded octets in the ranges of ALPHA + (%41-%5A and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E), + underscore (%5F), or tilde (%7E) should not be created by URI + producers and, when found in a URI, should be decoded to their + corresponding unreserved characters by URI normalizers. + +2.4. When to Encode or Decode + + Under normal circumstances, the only time when octets within a URI + are percent-encoded is during the process of producing the URI from + its component parts. This is when an implementation determines which + of the reserved characters are to be used as subcomponent delimiters + and which can be safely used as data. Once produced, a URI is always + in its percent-encoded form. + + When a URI is dereferenced, the components and subcomponents + significant to the scheme-specific dereferencing process (if any) + must be parsed and separated before the percent-encoded octets within + those components can be safely decoded, as otherwise the data may be + mistaken for component delimiters. The only exception is for + percent-encoded octets corresponding to characters in the unreserved + set, which can be decoded at any time. For example, the octet + corresponding to the tilde ("~") character is often encoded as "%7E" + by older URI processing implementations; the "%7E" can be replaced by + "~" without changing its interpretation. + + Because the percent ("%") character serves as the indicator for + percent-encoded octets, it must be percent-encoded as "%25" for that + octet to be used as data within a URI. 
Implementations must not + percent-encode or decode the same string more than once, as decoding + an already decoded string might lead to misinterpreting a percent + data octet as the beginning of a percent-encoding, or vice versa in + the case of percent-encoding an already percent-encoded string. + +2.5. Identifying Data + + URI characters provide identifying data for each of the URI + components, serving as an external interface for identification + between systems. Although the presence and nature of the URI + production interface is hidden from clients that use its URIs (and is + thus beyond the scope of the interoperability requirements defined by + this specification), it is a frequent source of confusion and errors + in the interpretation of URI character issues. Implementers have to + be aware that there are multiple character encodings involved in the + + + +Berners-Lee, et al. Standards Track [Page 14] + +RFC 3986 URI Generic Syntax January 2005 + + + production and transmission of URIs: local name and data encoding, + public interface encoding, URI character encoding, data format + encoding, and protocol encoding. + + Local names, such as file system names, are stored with a local + character encoding. URI producing applications (e.g., origin + servers) will typically use the local encoding as the basis for + producing meaningful names. The URI producer will transform the + local encoding to one that is suitable for a public interface and + then transform the public interface encoding into the restricted set + of URI characters (reserved, unreserved, and percent-encodings). + Those characters are, in turn, encoded as octets to be used as a + reference within a data format (e.g., a document charset), and such + data formats are often subsequently encoded for transmission over + Internet protocols. 
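The warning in Section 2.4 against encoding or decoding the same string more than once can be illustrated with a short Python sketch. It assumes Python's standard `urllib.parse` module as a stand-in URI producer and a hypothetical local name, `50% off.txt`; neither is part of this specification.

```python
from urllib.parse import quote, unquote

# Hypothetical local name; the "%" and the space are plain data here.
local_name = "50% off.txt"

# Produce the URI form exactly once: "%" becomes "%25", SP becomes "%20".
encoded_once = quote(local_name, safe="")

# Encoding the already-encoded string again corrupts it: every "%" that
# introduced a percent-encoding is itself rewritten as "%25".
encoded_twice = quote(encoded_once, safe="")

# Decoding once recovers the data. Decoding a second time can misread a
# literal "%" followed by two hex digits as a percent-encoded octet.
decoded_once = unquote("%2541")        # the three-character data "%41"
decoded_twice = unquote(decoded_once)  # wrongly collapses to "A"

print(encoded_once)
print(encoded_twice)
print(decoded_once, decoded_twice)
```

In this sketch the producer encodes at a single point and the consumer decodes at a single point, which is the discipline the text asks for.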
+ + For most systems, an unreserved character appearing within a URI + component is interpreted as representing the data octet corresponding + to that character's encoding in US-ASCII. Consumers of URIs assume + that the letter "X" corresponds to the octet "01011000", and even + when that assumption is incorrect, there is no harm in making it. A + system that internally provides identifiers in the form of a + different character encoding, such as EBCDIC, will generally perform + character translation of textual identifiers to UTF-8 [STD63] (or + some other superset of the US-ASCII character encoding) at an + internal interface, thereby providing more meaningful identifiers + than those resulting from simply percent-encoding the original + octets. + + For example, consider an information service that provides data, + stored locally using an EBCDIC-based file system, to clients on the + Internet through an HTTP server. When an author creates a file with + the name "Laguna Beach" on that file system, the "http" URI + corresponding to that resource is expected to contain the meaningful + string "Laguna%20Beach". If, however, that server produces URIs by + using an overly simplistic raw octet mapping, then the result would + be a URI containing "%D3%81%87%A4%95%81@%C2%85%81%83%88". An + internal transcoding interface fixes this problem by transcoding the + local name to a superset of US-ASCII prior to producing the URI. + Naturally, proper interpretation of an incoming URI on such an + interface requires that percent-encoded octets be decoded (e.g., + "%20" to SP) before the reverse transcoding is applied to obtain the + local name. + + In some cases, the internal interface between a URI component and the + identifying data that it has been crafted to represent is much less + direct than a character encoding translation. For example, portions + of a URI might reflect a query on non-ASCII data, or numeric + + + +Berners-Lee, et al. 
Standards Track [Page 15] + +RFC 3986 URI Generic Syntax January 2005 + + + coordinates on a map. Likewise, a URI scheme may define components + with additional encoding requirements that are applied prior to + forming the component and producing the URI. + + When a new URI scheme defines a component that represents textual + data consisting of characters from the Universal Character Set [UCS], + the data should first be encoded as octets according to the UTF-8 + character encoding [STD63]; then only those octets that do not + correspond to characters in the unreserved set should be percent- + encoded. For example, the character A would be represented as "A", + the character LATIN CAPITAL LETTER A WITH GRAVE would be represented + as "%C3%80", and the character KATAKANA LETTER A would be represented + as "%E3%82%A2". + +3. Syntax Components + + The generic URI syntax consists of a hierarchical sequence of + components referred to as the scheme, authority, path, query, and + fragment. + + URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ] + + hier-part = "//" authority path-abempty + / path-absolute + / path-rootless + / path-empty + + The scheme and path components are required, though the path may be + empty (no characters). When authority is present, the path must + either be empty or begin with a slash ("/") character. When + authority is not present, the path cannot begin with two slash + characters ("//"). These restrictions result in five different ABNF + rules for a path (Section 3.3), only one of which will match any + given URI reference. + + The following are two example URIs and their component parts: + + foo://example.com:8042/over/there?name=ferret#nose + \_/ \______________/\_________/ \_________/ \__/ + | | | | | + scheme authority path query fragment + | _____________________|__ + / \ / \ + urn:example:animal:ferret:nose + + + + + + + +Berners-Lee, et al. Standards Track [Page 16] + +RFC 3986 URI Generic Syntax January 2005 + + +3.1. 
Scheme + + Each URI begins with a scheme name that refers to a specification for + assigning identifiers within that scheme. As such, the URI syntax is + a federated and extensible naming system wherein each scheme's + specification may further restrict the syntax and semantics of + identifiers using that scheme. + + Scheme names consist of a sequence of characters beginning with a + letter and followed by any combination of letters, digits, plus + ("+"), period ("."), or hyphen ("-"). Although schemes are case- + insensitive, the canonical form is lowercase and documents that + specify schemes must do so with lowercase letters. An implementation + should accept uppercase letters as equivalent to lowercase in scheme + names (e.g., allow "HTTP" as well as "http") for the sake of + robustness but should only produce lowercase scheme names for + consistency. + + scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) + + Individual schemes are not specified by this document. The process + for registration of new URI schemes is defined separately by [BCP35]. + The scheme registry maintains the mapping between scheme names and + their specifications. Advice for designers of new URI schemes can be + found in [RFC2718]. URI scheme specifications must define their own + syntax so that all strings matching their scheme-specific syntax will + also match the <absolute-URI> grammar, as described in Section 4.3. + + When presented with a URI that violates one or more scheme-specific + restrictions, the scheme-specific resolution process should flag the + reference as an error rather than ignore the unused parts; doing so + reduces the number of equivalent URIs and helps detect abuses of the + generic syntax, which might indicate that the URI has been + constructed to mislead the user (Section 7.6). + +3.2. 
Authority + + Many URI schemes include a hierarchical element for a naming + authority so that governance of the name space defined by the + remainder of the URI is delegated to that authority (which may, in + turn, delegate it further). The generic syntax provides a common + means for distinguishing an authority based on a registered name or + server address, along with optional port and user information. + + The authority component is preceded by a double slash ("//") and is + terminated by the next slash ("/"), question mark ("?"), or number + sign ("#") character, or by the end of the URI. + + + + +Berners-Lee, et al. Standards Track [Page 17] + +RFC 3986 URI Generic Syntax January 2005 + + + authority = [ userinfo "@" ] host [ ":" port ] + + URI producers and normalizers should omit the ":" delimiter that + separates host from port if the port component is empty. Some + schemes do not allow the userinfo and/or port subcomponents. + + If a URI contains an authority component, then the path component + must either be empty or begin with a slash ("/") character. Non- + validating parsers (those that merely separate a URI reference into + its major components) will often ignore the subcomponent structure of + authority, treating it as an opaque string from the double-slash to + the first terminating delimiter, until such time as the URI is + dereferenced. + +3.2.1. User Information + + The userinfo subcomponent may consist of a user name and, optionally, + scheme-specific information about how to gain authorization to access + the resource. The user information, if present, is followed by a + commercial at-sign ("@") that delimits it from the host. + + userinfo = *( unreserved / pct-encoded / sub-delims / ":" ) + + Use of the format "user:password" in the userinfo field is + deprecated. 
Applications should not render as clear text any data + after the first colon (":") character found within a userinfo + subcomponent unless the data after the colon is the empty string + (indicating no password). Applications may choose to ignore or + reject such data when it is received as part of a reference and + should reject the storage of such data in unencrypted form. The + passing of authentication information in clear text has proven to be + a security risk in almost every case where it has been used. + + Applications that render a URI for the sake of user feedback, such as + in graphical hypertext browsing, should render userinfo in a way that + is distinguished from the rest of a URI, when feasible. Such + rendering will assist the user in cases where the userinfo has been + misleadingly crafted to look like a trusted domain name + (Section 7.6). + +3.2.2. Host + + The host subcomponent of authority is identified by an IP literal + encapsulated within square brackets, an IPv4 address in dotted- + decimal form, or a registered name. The host subcomponent is case- + insensitive. The presence of a host subcomponent within a URI does + not imply that the scheme requires access to the given host on the + Internet. In many cases, the host syntax is used only for the sake + + + +Berners-Lee, et al. Standards Track [Page 18] + +RFC 3986 URI Generic Syntax January 2005 + + + of reusing the existing registration process created and deployed for + DNS, thus obtaining a globally unique name without the cost of + deploying another registry. However, such use comes with its own + costs: domain name ownership may change over time for reasons not + anticipated by the URI producer. In other cases, the data within the + host component identifies a registered name that has nothing to do + with an Internet host. We use the name "host" for the ABNF rule + because that is its most common purpose, not its only purpose. 
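The Section 3.2.1 advice not to render data after the first colon in userinfo might be applied as in the following sketch. The function name `redact_userinfo` and the `****` mask are illustrative assumptions, and Python's `urllib.parse.urlsplit` is used here only as a convenient non-validating splitter.

```python
from urllib.parse import urlsplit

def redact_userinfo(uri: str) -> str:
    """Return uri with any data after the first ":" in userinfo masked.

    Hypothetical helper: the mask text is an arbitrary choice, and an
    empty password ("user:@host") is left as-is because the text above
    permits rendering it.
    """
    parts = urlsplit(uri)
    if "@" not in parts.netloc:
        return uri  # no userinfo subcomponent present
    userinfo, _, hostport = parts.netloc.rpartition("@")
    user, colon, secret = userinfo.partition(":")
    if colon and secret:
        userinfo = user + ":****"
    return parts._replace(netloc=userinfo + "@" + hostport).geturl()

print(redact_userinfo("ftp://alice:s3cret@example.com/pub"))
```

An application could equally drop the colon and everything after it; the sketch keeps the colon so that the reader can still see that a credential was present.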
+ + host = IP-literal / IPv4address / reg-name + + The syntax rule for host is ambiguous because it does not completely + distinguish between an IPv4address and a reg-name. In order to + disambiguate the syntax, we apply the "first-match-wins" algorithm: + If host matches the rule for IPv4address, then it should be + considered an IPv4 address literal and not a reg-name. Although host + is case-insensitive, producers and normalizers should use lowercase + for registered names and hexadecimal addresses for the sake of + uniformity, while only using uppercase letters for percent-encodings. + + A host identified by an Internet Protocol literal address, version 6 + [RFC3513] or later, is distinguished by enclosing the IP literal + within square brackets ("[" and "]"). This is the only place where + square bracket characters are allowed in the URI syntax. In + anticipation of future, as-yet-undefined IP literal address formats, + an implementation may use an optional version flag to indicate such a + format explicitly rather than rely on heuristic determination. + + IP-literal = "[" ( IPv6address / IPvFuture ) "]" + + IPvFuture = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" ) + + The version flag does not indicate the IP version; rather, it + indicates future versions of the literal format. As such, + implementations must not provide the version flag for the existing + IPv4 and IPv6 literal address forms described below. If a URI + containing an IP-literal that starts with "v" (case-insensitive), + indicating that the version flag is present, is dereferenced by an + application that does not know the meaning of that version flag, then + the application should return an appropriate error for "address + mechanism not supported". + + A host identified by an IPv6 literal address is represented inside + the square brackets without a preceding version flag. 
The ABNF + provided here is a translation of the text definition of an IPv6 + literal address provided in [RFC3513]. This syntax does not support + IPv6 scoped addressing zone identifiers. + + + + +Berners-Lee, et al. Standards Track [Page 19] + +RFC 3986 URI Generic Syntax January 2005 + + + A 128-bit IPv6 address is divided into eight 16-bit pieces. Each + piece is represented numerically in case-insensitive hexadecimal, + using one to four hexadecimal digits (leading zeroes are permitted). + The eight encoded pieces are given most-significant first, separated + by colon characters. Optionally, the least-significant two pieces + may instead be represented in IPv4 address textual format. A + sequence of one or more consecutive zero-valued 16-bit pieces within + the address may be elided, omitting all their digits and leaving + exactly two consecutive colons in their place to mark the elision. + + IPv6address = 6( h16 ":" ) ls32 + / "::" 5( h16 ":" ) ls32 + / [ h16 ] "::" 4( h16 ":" ) ls32 + / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32 + / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32 + / [ *3( h16 ":" ) h16 ] "::" h16 ":" ls32 + / [ *4( h16 ":" ) h16 ] "::" ls32 + / [ *5( h16 ":" ) h16 ] "::" h16 + / [ *6( h16 ":" ) h16 ] "::" + + ls32 = ( h16 ":" h16 ) / IPv4address + ; least-significant 32 bits of address + + h16 = 1*4HEXDIG + ; 16 bits of address represented in hexadecimal + + A host identified by an IPv4 literal address is represented in + dotted-decimal notation (a sequence of four decimal numbers in the + range 0 to 255, separated by "."), as described in [RFC1123] by + reference to [RFC0952]. Note that other forms of dotted notation may + be interpreted on some platforms, as described in Section 7.4, but + only the dotted-decimal form of four octets is allowed by this + grammar. + + IPv4address = dec-octet "." dec-octet "." dec-octet "." 
dec-octet + + dec-octet = DIGIT ; 0-9 + / %x31-39 DIGIT ; 10-99 + / "1" 2DIGIT ; 100-199 + / "2" %x30-34 DIGIT ; 200-249 + / "25" %x30-35 ; 250-255 + + A host identified by a registered name is a sequence of characters + usually intended for lookup within a locally defined host or service + name registry, though the URI's scheme-specific semantics may require + that a specific registry (or fixed name table) be used instead. The + most common name registry mechanism is the Domain Name System (DNS). + A registered name intended for lookup in the DNS uses the syntax + + + +Berners-Lee, et al. Standards Track [Page 20] + +RFC 3986 URI Generic Syntax January 2005 + + + defined in Section 3.5 of [RFC1034] and Section 2.1 of [RFC1123]. + Such a name consists of a sequence of domain labels separated by ".", + each domain label starting and ending with an alphanumeric character + and possibly also containing "-" characters. The rightmost domain + label of a fully qualified domain name in DNS may be followed by a + single "." and should be if it is necessary to distinguish between + the complete domain name and some local domain. + + reg-name = *( unreserved / pct-encoded / sub-delims ) + + If the URI scheme defines a default for host, then that default + applies when the host subcomponent is undefined or when the + registered name is empty (zero length). For example, the "file" URI + scheme is defined so that no authority, an empty host, and + "localhost" all mean the end-user's machine, whereas the "http" + scheme considers a missing authority or empty host invalid. + + This specification does not mandate a particular registered name + lookup technology and therefore does not restrict the syntax of reg- + name beyond what is necessary for interoperability. 
Instead, it + delegates the issue of registered name syntax conformance to the + operating system of each application performing URI resolution, and + that operating system decides what it will allow for the purpose of + host identification. A URI resolution implementation might use DNS, + host tables, yellow pages, NetInfo, WINS, or any other system for + lookup of registered names. However, a globally scoped naming + system, such as DNS fully qualified domain names, is necessary for + URIs intended to have global scope. URI producers should use names + that conform to the DNS syntax, even when use of DNS is not + immediately apparent, and should limit these names to no more than + 255 characters in length. + + The reg-name syntax allows percent-encoded octets in order to + represent non-ASCII registered names in a uniform way that is + independent of the underlying name resolution technology. Non-ASCII + characters must first be encoded according to UTF-8 [STD63], and then + each octet of the corresponding UTF-8 sequence must be percent- + encoded to be represented as URI characters. URI producing + applications must not use percent-encoding in host unless it is used + to represent a UTF-8 character sequence. When a non-ASCII registered + name represents an internationalized domain name intended for + resolution via the DNS, the name must be transformed to the IDNA + encoding [RFC3490] prior to name lookup. URI producers should + provide these registered names in the IDNA encoding, rather than a + percent-encoding, if they wish to maximize interoperability with + legacy URI resolvers. + + + + + +Berners-Lee, et al. Standards Track [Page 21] + +RFC 3986 URI Generic Syntax January 2005 + + +3.2.3. Port + + The port subcomponent of authority is designated by an optional port + number in decimal following the host and delimited from it by a + single colon (":") character. + + port = *DIGIT + + A scheme may define a default port. 
For example, the "http" scheme + defines a default port of "80", corresponding to its reserved TCP + port number. The type of port designated by the port number (e.g., + TCP, UDP, SCTP) is defined by the URI scheme. URI producers and + normalizers should omit the port component and its ":" delimiter if + port is empty or if its value would be the same as that of the + scheme's default. + +3.3. Path + + The path component contains data, usually organized in hierarchical + form, that, along with data in the non-hierarchical query component + (Section 3.4), serves to identify a resource within the scope of the + URI's scheme and naming authority (if any). The path is terminated + by the first question mark ("?") or number sign ("#") character, or + by the end of the URI. + + If a URI contains an authority component, then the path component + must either be empty or begin with a slash ("/") character. If a URI + does not contain an authority component, then the path cannot begin + with two slash characters ("//"). In addition, a URI reference + (Section 4.1) may be a relative-path reference, in which case the + first path segment cannot contain a colon (":") character. The ABNF + requires five separate rules to disambiguate these cases, only one of + which will match the path substring within a given URI reference. We + use the generic term "path component" to describe the URI substring + matched by the parser to one of these rules. + + path = path-abempty ; begins with "/" or is empty + / path-absolute ; begins with "/" but not "//" + / path-noscheme ; begins with a non-colon segment + / path-rootless ; begins with a segment + / path-empty ; zero characters + + path-abempty = *( "/" segment ) + path-absolute = "/" [ segment-nz *( "/" segment ) ] + path-noscheme = segment-nz-nc *( "/" segment ) + path-rootless = segment-nz *( "/" segment ) + path-empty = 0<pchar> + + + + +Berners-Lee, et al. 
Standards Track [Page 22] + +RFC 3986 URI Generic Syntax January 2005 + + + segment = *pchar + segment-nz = 1*pchar + segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" ) + ; non-zero-length segment without any colon ":" + + pchar = unreserved / pct-encoded / sub-delims / ":" / "@" + + A path consists of a sequence of path segments separated by a slash + ("/") character. A path is always defined for a URI, though the + defined path may be empty (zero length). Use of the slash character + to indicate hierarchy is only required when a URI will be used as the + context for relative references. For example, the URI + <mailto:fred@example.com> has a path of "fred@example.com", whereas + the URI <foo://info.example.com?fred> has an empty path. + + The path segments "." and "..", also known as dot-segments, are + defined for relative reference within the path name hierarchy. They + are intended for use at the beginning of a relative-path reference + (Section 4.2) to indicate relative position within the hierarchical + tree of names. This is similar to their role within some operating + systems' file directory structures to indicate the current directory + and parent directory, respectively. However, unlike in a file + system, these dot-segments are only interpreted within the URI path + hierarchy and are removed as part of the resolution process (Section + 5.2). + + Aside from dot-segments in hierarchical paths, a path segment is + considered opaque by the generic syntax. URI producing applications + often use the reserved characters allowed in a segment to delimit + scheme-specific or dereference-handler-specific subcomponents. For + example, the semicolon (";") and equals ("=") reserved characters are + often used to delimit parameters and parameter values applicable to + that segment. The comma (",") reserved character is often used for + similar purposes. 
For example, one URI producer might use a segment + such as "name;v=1.1" to indicate a reference to version 1.1 of + "name", whereas another might use a segment such as "name,1.1" to + indicate the same. Parameter types may be defined by scheme-specific + semantics, but in most cases the syntax of a parameter is specific to + the implementation of the URI's dereferencing algorithm. + +3.4. Query + + The query component contains non-hierarchical data that, along with + data in the path component (Section 3.3), serves to identify a + resource within the scope of the URI's scheme and naming authority + (if any). The query component is indicated by the first question + mark ("?") character and terminated by a number sign ("#") character + or by the end of the URI. + + + +Berners-Lee, et al. Standards Track [Page 23] + +RFC 3986 URI Generic Syntax January 2005 + + + query = *( pchar / "/" / "?" ) + + The characters slash ("/") and question mark ("?") may represent data + within the query component. Beware that some older, erroneous + implementations may not handle such data correctly when it is used as + the base URI for relative references (Section 5.1), apparently + because they fail to distinguish query data from path data when + looking for hierarchical separators. However, as query components + are often used to carry identifying information in the form of + "key=value" pairs and one frequently used value is a reference to + another URI, it is sometimes better for usability to avoid percent- + encoding those characters. + +3.5. Fragment + + The fragment identifier component of a URI allows indirect + identification of a secondary resource by reference to a primary + resource and additional identifying information. The identified + secondary resource may be some portion or subset of the primary + resource, some view on representations of the primary resource, or + some other resource defined or described by those representations. 
A + fragment identifier component is indicated by the presence of a + number sign ("#") character and terminated by the end of the URI. + + fragment = *( pchar / "/" / "?" ) + + The semantics of a fragment identifier are defined by the set of + representations that might result from a retrieval action on the + primary resource. The fragment's format and resolution is therefore + dependent on the media type [RFC2046] of a potentially retrieved + representation, even though such a retrieval is only performed if the + URI is dereferenced. If no such representation exists, then the + semantics of the fragment are considered unknown and are effectively + unconstrained. Fragment identifier semantics are independent of the + URI scheme and thus cannot be redefined by scheme specifications. + + Individual media types may define their own restrictions on or + structures within the fragment identifier syntax for specifying + different types of subsets, views, or external references that are + identifiable as secondary resources by that media type. If the + primary resource has multiple representations, as is often the case + for resources whose representation is selected based on attributes of + the retrieval request (a.k.a., content negotiation), then whatever is + identified by the fragment should be consistent across all of those + representations. Each representation should either define the + fragment so that it corresponds to the same secondary resource, + regardless of how it is represented, or should leave the fragment + undefined (i.e., not found). + + + +Berners-Lee, et al. Standards Track [Page 24] + +RFC 3986 URI Generic Syntax January 2005 + + + As with any URI, use of a fragment identifier component does not + imply that a retrieval action will take place. A URI with a fragment + identifier may be used to refer to the secondary resource without any + implication that the primary resource is accessible or will ever be + accessed. 
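As a rough illustration of the fragment rule in Section 3.5, the sketch below splits a URI at its first number sign using `urldefrag` from Python's standard `urllib.parse` module. The example URI is hypothetical, and the sketch does not validate the characters against the `fragment` rule.

```python
from urllib.parse import urldefrag

# Hypothetical URI whose fragment carries "/" and "?" as data, which the
# fragment rule permits.
uri = "http://example.com/doc?rev=2#section/3?view=full"

# Everything after the first "#" up to the end of the URI is the fragment;
# the remainder identifies the primary resource.
primary, fragment = urldefrag(uri)

print(primary)
print(fragment)
```

Only `primary` would be handed to a scheme-specific retrieval step; `fragment` stays with the client for interpretation against whatever representation is obtained.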
+ + Fragment identifiers have a special role in information retrieval + systems as the primary form of client-side indirect referencing, + allowing an author to specifically identify aspects of an existing + resource that are only indirectly provided by the resource owner. As + such, the fragment identifier is not used in the scheme-specific + processing of a URI; instead, the fragment identifier is separated + from the rest of the URI prior to a dereference, and thus the + identifying information within the fragment itself is dereferenced + solely by the user agent, regardless of the URI scheme. Although + this separate handling is often perceived to be a loss of + information, particularly for accurate redirection of references as + resources move over time, it also serves to prevent information + providers from denying reference authors the right to refer to + information within a resource selectively. Indirect referencing also + provides additional flexibility and extensibility to systems that use + URIs, as new media types are easier to define and deploy than new + schemes of identification. + + The characters slash ("/") and question mark ("?") are allowed to + represent data within the fragment identifier. Beware that some + older, erroneous implementations may not handle this data correctly + when it is used as the base URI for relative references (Section + 5.1). + +4. Usage + + When applications make reference to a URI, they do not always use the + full form of reference defined by the "URI" syntax rule. To save + space and take advantage of hierarchical locality, many Internet + protocol elements and media type formats allow an abbreviation of a + URI, whereas others restrict the syntax to a particular form of URI. + We define the most common forms of reference syntax in this + specification because they impact and depend upon the design of the + generic syntax, requiring a uniform parsing algorithm in order to be + interpreted consistently. + +4.1. 
URI Reference + + URI-reference is used to denote the most common usage of a resource + identifier. + + URI-reference = URI / relative-ref + + + +Berners-Lee, et al. Standards Track [Page 25] + +RFC 3986 URI Generic Syntax January 2005 + + + A URI-reference is either a URI or a relative reference. If the + URI-reference's prefix does not match the syntax of a scheme followed + by its colon separator, then the URI-reference is a relative + reference. + + A URI-reference is typically parsed first into the five URI + components, in order to determine what components are present and + whether the reference is relative. Then, each component is parsed + for its subparts and their validation. The ABNF of URI-reference, + along with the "first-match-wins" disambiguation rule, is sufficient + to define a validating parser for the generic syntax. Readers + familiar with regular expressions should see Appendix B for an + example of a non-validating URI-reference parser that will take any + given string and extract the URI components. + +4.2. Relative Reference + + A relative reference takes advantage of the hierarchical syntax + (Section 1.2.3) to express a URI reference relative to the name space + of another hierarchical URI. + + relative-ref = relative-part [ "?" query ] [ "#" fragment ] + + relative-part = "//" authority path-abempty + / path-absolute + / path-noscheme + / path-empty + + The URI referred to by a relative reference, also known as the target + URI, is obtained by applying the reference resolution algorithm of + Section 5. + + A relative reference that begins with two slash characters is termed + a network-path reference; such references are rarely used. A + relative reference that begins with a single slash character is + termed an absolute-path reference. A relative reference that does + not begin with a slash character is termed a relative-path reference. 
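The three kinds of relative reference named above can be told apart by their leading characters. The following is a small sketch, assuming the input is already known to be a relative reference (that is, it carries no scheme); the function name classify_relative_ref is illustrative only.

```python
def classify_relative_ref(ref):
    """Name the kind of relative reference by its leading slashes.

    Assumes ref has already been identified as a relative reference.
    """
    if ref.startswith("//"):
        return "network-path"
    if ref.startswith("/"):
        return "absolute-path"
    # Anything else, including the empty reference, does not begin
    # with a slash character.
    return "relative-path"
```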
+ + A path segment that contains a colon character (e.g., "this:that") + cannot be used as the first segment of a relative-path reference, as + it would be mistaken for a scheme name. Such a segment must be + preceded by a dot-segment (e.g., "./this:that") to make a relative- + path reference. + + + + + + + + +Berners-Lee, et al. Standards Track [Page 26] + +RFC 3986 URI Generic Syntax January 2005 + + +4.3. Absolute URI + + Some protocol elements allow only the absolute form of a URI without + a fragment identifier. For example, defining a base URI for later + use by relative references calls for an absolute-URI syntax rule that + does not allow a fragment. + + absolute-URI = scheme ":" hier-part [ "?" query ] + + URI scheme specifications must define their own syntax so that all + strings matching their scheme-specific syntax will also match the + <absolute-URI> grammar. Scheme specifications will not define + fragment identifier syntax or usage, regardless of its applicability + to resources identifiable via that scheme, as fragment identification + is orthogonal to scheme definition. However, scheme specifications + are encouraged to include a wide range of examples, including + examples that show use of the scheme's URIs with fragment identifiers + when such usage is appropriate. + +4.4. Same-Document Reference + + When a URI reference refers to a URI that is, aside from its fragment + component (if any), identical to the base URI (Section 5.1), that + reference is called a "same-document" reference. The most frequent + examples of same-document references are relative references that are + empty or include only the number sign ("#") separator followed by a + fragment identifier. + + When a same-document reference is dereferenced for a retrieval + action, the target of that reference is defined to be within the same + entity (representation, document, or message) as the reference; + therefore, a dereference should not result in a new retrieval action. 
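The same-document test described above can be sketched as a comparison that ignores the fragment on both sides. This sketch assumes the reference has already been resolved to its target URI and that no normalization is applied; the function name is_same_document is hypothetical.

```python
def is_same_document(target_uri, base_uri):
    """Return True if target_uri equals base_uri aside from any fragment.

    Both arguments are plain strings; target_uri is assumed to be the
    already-resolved target of the reference.
    """
    target_without_fragment = target_uri.partition("#")[0]
    base_without_fragment = base_uri.partition("#")[0]
    return target_without_fragment == base_without_fragment
```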
+ + Normalization of the base and target URIs prior to their comparison, + as described in Sections 6.2.2 and 6.2.3, is allowed but rarely + performed in practice. Normalization may increase the set of same- + document references, which may be of benefit to some caching + applications. As such, reference authors should not assume that a + slightly different, though equivalent, reference URI will (or will + not) be interpreted as a same-document reference by any given + application. + +4.5. Suffix Reference + + The URI syntax is designed for unambiguous reference to resources and + extensibility via the URI scheme. However, as URI identification and + usage have become commonplace, traditional media (television, radio, + newspapers, billboards, etc.) have increasingly used a suffix of the + + + +Berners-Lee, et al. Standards Track [Page 27] + +RFC 3986 URI Generic Syntax January 2005 + + + URI as a reference, consisting of only the authority and path + portions of the URI, such as + + www.w3.org/Addressing/ + + or simply a DNS registered name on its own. Such references are + primarily intended for human interpretation rather than for machines, + with the assumption that context-based heuristics are sufficient to + complete the URI (e.g., most registered names beginning with "www" + are likely to have a URI prefix of "http://"). Although there is no + standard set of heuristics for disambiguating a URI suffix, many + client implementations allow them to be entered by the user and + heuristically resolved. + + Although this practice of using suffix references is common, it + should be avoided whenever possible and should never be used in + situations where long-term references are expected. The heuristics + noted above will change over time, particularly when a new URI scheme + becomes popular, and are often incorrect when used out of context. + Furthermore, they can lead to security issues along the lines of + those described in [RFC1535]. 
+ + As a URI suffix has the same syntax as a relative-path reference, a + suffix reference cannot be used in contexts where a relative + reference is expected. As a result, suffix references are limited to + places where there is no defined base URI, such as dialog boxes and + off-line advertisements. + +5. Reference Resolution + + This section defines the process of resolving a URI reference within + a context that allows relative references so that the result is a + string matching the <URI> syntax rule of Section 3. + +5.1. Establishing a Base URI + + The term "relative" implies that a "base URI" exists against which + the relative reference is applied. Aside from fragment-only + references (Section 4.4), relative references are only usable when a + base URI is known. A base URI must be established by the parser + prior to parsing URI references that might be relative. A base URI + must conform to the <absolute-URI> syntax rule (Section 4.3). If the + base URI is obtained from a URI reference, then that reference must + be converted to absolute form and stripped of any fragment component + prior to its use as a base URI. + + + + + + +Berners-Lee, et al. Standards Track [Page 28] + +RFC 3986 URI Generic Syntax January 2005 + + + The base URI of a reference can be established in one of four ways, + discussed below in order of precedence. The order of precedence can + be thought of in terms of layers, where the innermost defined base + URI has the highest precedence. This can be visualized graphically + as follows: + + .----------------------------------------------------------. + | .----------------------------------------------------. | + | | .----------------------------------------------. | | + | | | .----------------------------------------. | | | + | | | | .----------------------------------. 
| | | | + | | | | | <relative-reference> | | | | | + | | | | `----------------------------------' | | | | + | | | | (5.1.1) Base URI embedded in content | | | | + | | | `----------------------------------------' | | | + | | | (5.1.2) Base URI of the encapsulating entity | | | + | | | (message, representation, or none) | | | + | | `----------------------------------------------' | | + | | (5.1.3) URI used to retrieve the entity | | + | `----------------------------------------------------' | + | (5.1.4) Default Base URI (application-dependent) | + `----------------------------------------------------------' + +5.1.1. Base URI Embedded in Content + + Within certain media types, a base URI for relative references can be + embedded within the content itself so that it can be readily obtained + by a parser. This can be useful for descriptive documents, such as + tables of contents, which may be transmitted to others through + protocols other than their usual retrieval context (e.g., email or + USENET news). + + It is beyond the scope of this specification to specify how, for each + media type, a base URI can be embedded. The appropriate syntax, when + available, is described by the data format specification associated + with each media type. + +5.1.2. Base URI from the Encapsulating Entity + + If no base URI is embedded, the base URI is defined by the + representation's retrieval context. For a document that is enclosed + within another entity, such as a message or archive, the retrieval + context is that entity. Thus, the default base URI of a + representation is the base URI of the entity in which the + representation is encapsulated. + + + + + + +Berners-Lee, et al. Standards Track [Page 29] + +RFC 3986 URI Generic Syntax January 2005 + + + A mechanism for embedding a base URI within MIME container types + (e.g., the message and multipart types) is defined by MHTML + [RFC2557]. 
Protocols that do not use the MIME message header syntax, + but that do allow some form of tagged metadata to be included within + messages, may define their own syntax for defining a base URI as part + of a message. + +5.1.3. Base URI from the Retrieval URI + + If no base URI is embedded and the representation is not encapsulated + within some other entity, then, if a URI was used to retrieve the + representation, that URI shall be considered the base URI. Note that + if the retrieval was the result of a redirected request, the last URI + used (i.e., the URI that resulted in the actual retrieval of the + representation) is the base URI. + +5.1.4. Default Base URI + + If none of the conditions described above apply, then the base URI is + defined by the context of the application. As this definition is + necessarily application-dependent, failing to define a base URI by + using one of the other methods may result in the same content being + interpreted differently by different types of applications. + + A sender of a representation containing relative references is + responsible for ensuring that a base URI for those references can be + established. Aside from fragment-only references, relative + references can only be used reliably in situations where the base URI + is well defined. + +5.2. Relative Resolution + + This section describes an algorithm for converting a URI reference + that might be relative to a given base URI into the parsed components + of the reference's target. The components can then be recomposed, as + described in Section 5.3, to form the target URI. This algorithm + provides definitive results that can be used to test the output of + other implementations. Applications may implement relative reference + resolution by using some other algorithm, provided that the results + match what would be given by this one. + + + + + + + + + + + +Berners-Lee, et al. Standards Track [Page 30] + +RFC 3986 URI Generic Syntax January 2005 + + +5.2.1. 
Pre-parse the Base URI + + The base URI (Base) is established according to the procedure of + Section 5.1 and parsed into the five main components described in + Section 3. Note that only the scheme component is required to be + present in a base URI; the other components may be empty or + undefined. A component is undefined if its associated delimiter does + not appear in the URI reference; the path component is never + undefined, though it may be empty. + + Normalization of the base URI, as described in Sections 6.2.2 and + 6.2.3, is optional. A URI reference must be transformed to its + target URI before it can be normalized. + +5.2.2. Transform References + + For each URI reference (R), the following pseudocode describes an + algorithm for transforming R into its target URI (T): + + -- The URI reference is parsed into the five URI components + -- + (R.scheme, R.authority, R.path, R.query, R.fragment) = parse(R); + + -- A non-strict parser may ignore a scheme in the reference + -- if it is identical to the base URI's scheme. + -- + if ((not strict) and (R.scheme == Base.scheme)) then + undefine(R.scheme); + endif; + + + + + + + + + + + + + + + + + + + + + + +Berners-Lee, et al. 
Standards Track [Page 31] + +RFC 3986 URI Generic Syntax January 2005 + + + if defined(R.scheme) then + T.scheme = R.scheme; + T.authority = R.authority; + T.path = remove_dot_segments(R.path); + T.query = R.query; + else + if defined(R.authority) then + T.authority = R.authority; + T.path = remove_dot_segments(R.path); + T.query = R.query; + else + if (R.path == "") then + T.path = Base.path; + if defined(R.query) then + T.query = R.query; + else + T.query = Base.query; + endif; + else + if (R.path starts-with "/") then + T.path = remove_dot_segments(R.path); + else + T.path = merge(Base.path, R.path); + T.path = remove_dot_segments(T.path); + endif; + T.query = R.query; + endif; + T.authority = Base.authority; + endif; + T.scheme = Base.scheme; + endif; + + T.fragment = R.fragment; + +5.2.3. Merge Paths + + The pseudocode above refers to a "merge" routine for merging a + relative-path reference with the path of the base URI. This is + accomplished as follows: + + o If the base URI has a defined authority component and an empty + path, then return a string consisting of "/" concatenated with the + reference's path; otherwise, + + + + + + + + +Berners-Lee, et al. Standards Track [Page 32] + +RFC 3986 URI Generic Syntax January 2005 + + + o return a string consisting of the reference's path component + appended to all but the last segment of the base URI's path (i.e., + excluding any characters after the right-most "/" in the base URI + path, or excluding the entire base URI path if it does not contain + any "/" characters). + +5.2.4. Remove Dot Segments + + The pseudocode also refers to a "remove_dot_segments" routine for + interpreting and removing the special "." and ".." complete path + segments from a referenced path. This is done after the path is + extracted from a reference, whether or not the path was relative, in + order to remove any invalid or extraneous dot-segments prior to + forming the target URI. 
Although there are many ways to accomplish + this removal process, we describe a simple method using two string + buffers. + + 1. The input buffer is initialized with the now-appended path + components and the output buffer is initialized to the empty + string. + + 2. While the input buffer is not empty, loop as follows: + + A. If the input buffer begins with a prefix of "../" or "./", + then remove that prefix from the input buffer; otherwise, + + B. if the input buffer begins with a prefix of "/./" or "/.", + where "." is a complete path segment, then replace that + prefix with "/" in the input buffer; otherwise, + + C. if the input buffer begins with a prefix of "/../" or "/..", + where ".." is a complete path segment, then replace that + prefix with "/" in the input buffer and remove the last + segment and its preceding "/" (if any) from the output + buffer; otherwise, + + D. if the input buffer consists only of "." or "..", then remove + that from the input buffer; otherwise, + + E. move the first path segment in the input buffer to the end of + the output buffer, including the initial "/" character (if + any) and any subsequent characters up to, but not including, + the next "/" character or the end of the input buffer. + + 3. Finally, the output buffer is returned as the result of + remove_dot_segments. + + + + + +Berners-Lee, et al. Standards Track [Page 33] + +RFC 3986 URI Generic Syntax January 2005 + + + Note that dot-segments are intended for use in URI references to + express an identifier relative to the hierarchy of names in the base + URI. The remove_dot_segments algorithm respects that hierarchy by + removing extra dot-segments rather than treat them as an error or + leaving them to be misinterpreted by dereference implementations. + + The following illustrates how the above steps are applied for two + examples of merged paths, showing the state of the two buffers after + each step. 
+ + STEP OUTPUT BUFFER INPUT BUFFER + + 1 : /a/b/c/./../../g + 2E: /a /b/c/./../../g + 2E: /a/b /c/./../../g + 2E: /a/b/c /./../../g + 2B: /a/b/c /../../g + 2C: /a/b /../g + 2C: /a /g + 2E: /a/g + + STEP OUTPUT BUFFER INPUT BUFFER + + 1 : mid/content=5/../6 + 2E: mid /content=5/../6 + 2E: mid/content=5 /../6 + 2C: mid /6 + 2E: mid/6 + + Some applications may find it more efficient to implement the + remove_dot_segments algorithm by using two segment stacks rather than + strings. + + Note: Beware that some older, erroneous implementations will fail + to separate a reference's query component from its path component + prior to merging the base and reference paths, resulting in an + interoperability failure if the query component contains the + strings "/../" or "/./". + + + + + + + + + + + + + +Berners-Lee, et al. Standards Track [Page 34] + +RFC 3986 URI Generic Syntax January 2005 + + +5.3. Component Recomposition + + Parsed URI components can be recomposed to obtain the corresponding + URI reference string. Using pseudocode, this would be: + + result = "" + + if defined(scheme) then + append scheme to result; + append ":" to result; + endif; + + if defined(authority) then + append "//" to result; + append authority to result; + endif; + + append path to result; + + if defined(query) then + append "?" to result; + append query to result; + endif; + + if defined(fragment) then + append "#" to result; + append fragment to result; + endif; + + return result; + + Note that we are careful to preserve the distinction between a + component that is undefined, meaning that its separator was not + present in the reference, and a component that is empty, meaning that + the separator was present and was immediately followed by the next + component separator or the end of the reference. + +5.4. Reference Resolution Examples + + Within a representation with a well defined base URI of + + http://a/b/c/d;p?q + + a relative reference is transformed to its target URI as follows. 
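The procedure of Sections 5.2 and 5.3 can be sketched in Python as shown here. This is one possible rendering of the pseudocode, not a reference implementation: the function names mirror the pseudocode, None stands for an undefined component, parsing uses the non-validating regular expression from Appendix B, and the output buffer is kept as a list of segments rather than a string.

```python
import re

# Non-validating component parser, after the regular expression in Appendix B.
_URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def parse(ref):
    """Return (scheme, authority, path, query, fragment); undefined is None."""
    m = _URI_RE.match(ref)
    return m.group(2), m.group(4), m.group(5), m.group(7), m.group(9)


def remove_dot_segments(path):
    """Section 5.2.4; each output entry is a segment with its leading "/"."""
    output = []
    while path:
        if path.startswith("../"):                      # step 2A
            path = path[3:]
        elif path.startswith("./"):                     # step 2A
            path = path[2:]
        elif path.startswith("/./"):                    # step 2B
            path = path[2:]
        elif path == "/.":                              # step 2B
            path = "/"
        elif path.startswith("/../") or path == "/..":  # step 2C
            path = "/" + path[4:]
            if output:
                output.pop()
        elif path in (".", ".."):                       # step 2D
            path = ""
        else:                                           # step 2E
            end = path.find("/", 1)
            end = len(path) if end == -1 else end
            output.append(path[:end])
            path = path[end:]
    return "".join(output)


def merge(base_authority, base_path, ref_path):
    """Section 5.2.3: join a relative path onto the base path."""
    if base_authority is not None and base_path == "":
        return "/" + ref_path
    # rfind returns -1 when there is no "/", which drops the whole base path.
    return base_path[: base_path.rfind("/") + 1] + ref_path


def recompose(scheme, authority, path, query, fragment):
    """Section 5.3: rebuild the string, keeping undefined and empty distinct."""
    result = ""
    if scheme is not None:
        result += scheme + ":"
    if authority is not None:
        result += "//" + authority
    result += path
    if query is not None:
        result += "?" + query
    if fragment is not None:
        result += "#" + fragment
    return result


def resolve(base, ref, strict=True):
    """Section 5.2.2: transform reference ref into its target URI."""
    b_scheme, b_auth, b_path, b_query, _ = parse(base)
    scheme, auth, path, query, fragment = parse(ref)
    if not strict and scheme == b_scheme:
        scheme = None
    if scheme is not None:
        path = remove_dot_segments(path)
    else:
        scheme = b_scheme
        if auth is not None:
            path = remove_dot_segments(path)
        else:
            auth = b_auth
            if path == "":
                path = b_path
                if query is None:
                    query = b_query
            elif path.startswith("/"):
                path = remove_dot_segments(path)
            else:
                path = remove_dot_segments(merge(b_auth, b_path, path))
    return recompose(scheme, auth, path, query, fragment)
```

With the base URI given above, resolve("http://a/b/c/d;p?q", "../g") returns "http://a/b/g". Passing strict=False selects the backward-compatible treatment of a reference whose scheme matches the base scheme.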
+ + + + + + + +Berners-Lee, et al. Standards Track [Page 35] + +RFC 3986 URI Generic Syntax January 2005 + + +5.4.1. Normal Examples + + "g:h" = "g:h" + "g" = "http://a/b/c/g" + "./g" = "http://a/b/c/g" + "g/" = "http://a/b/c/g/" + "/g" = "http://a/g" + "//g" = "http://g" + "?y" = "http://a/b/c/d;p?y" + "g?y" = "http://a/b/c/g?y" + "#s" = "http://a/b/c/d;p?q#s" + "g#s" = "http://a/b/c/g#s" + "g?y#s" = "http://a/b/c/g?y#s" + ";x" = "http://a/b/c/;x" + "g;x" = "http://a/b/c/g;x" + "g;x?y#s" = "http://a/b/c/g;x?y#s" + "" = "http://a/b/c/d;p?q" + "." = "http://a/b/c/" + "./" = "http://a/b/c/" + ".." = "http://a/b/" + "../" = "http://a/b/" + "../g" = "http://a/b/g" + "../.." = "http://a/" + "../../" = "http://a/" + "../../g" = "http://a/g" + +5.4.2. Abnormal Examples + + Although the following abnormal examples are unlikely to occur in + normal practice, all URI parsers should be capable of resolving them + consistently. Each example uses the same base as that above. + + Parsers must be careful in handling cases where there are more ".." + segments in a relative-path reference than there are hierarchical + levels in the base URI's path. Note that the ".." syntax cannot be + used to change the authority component of a URI. + + "../../../g" = "http://a/g" + "../../../../g" = "http://a/g" + + + + + + + + + + + + +Berners-Lee, et al. Standards Track [Page 36] + +RFC 3986 URI Generic Syntax January 2005 + + + Similarly, parsers must remove the dot-segments "." and ".." when + they are complete components of a path, but not when they are only + part of a segment. + + "/./g" = "http://a/g" + "/../g" = "http://a/g" + "g." = "http://a/b/c/g." + ".g" = "http://a/b/c/.g" + "g.." = "http://a/b/c/g.." + "..g" = "http://a/b/c/..g" + + Less likely are cases where the relative reference uses unnecessary + or nonsensical forms of the "." and ".." complete path segments. + + "./../g" = "http://a/b/g" + "./g/." 
= "http://a/b/c/g/" + "g/./h" = "http://a/b/c/g/h" + "g/../h" = "http://a/b/c/h" + "g;x=1/./y" = "http://a/b/c/g;x=1/y" + "g;x=1/../y" = "http://a/b/c/y" + + Some applications fail to separate the reference's query and/or + fragment components from the path component before merging it with + the base path and removing dot-segments. This error is rarely + noticed, as typical usage of a fragment never includes the hierarchy + ("/") character and the query component is not normally used within + relative references. + + "g?y/./x" = "http://a/b/c/g?y/./x" + "g?y/../x" = "http://a/b/c/g?y/../x" + "g#s/./x" = "http://a/b/c/g#s/./x" + "g#s/../x" = "http://a/b/c/g#s/../x" + + Some parsers allow the scheme name to be present in a relative + reference if it is the same as the base URI scheme. This is + considered to be a loophole in prior specifications of partial URI + [RFC1630]. Its use should be avoided but is allowed for backward + compatibility. + + "http:g" = "http:g" ; for strict parsers + / "http://a/b/c/g" ; for backward compatibility + + + + + + + + + + +Berners-Lee, et al. Standards Track [Page 37] + +RFC 3986 URI Generic Syntax January 2005 + + +6. Normalization and Comparison + + One of the most common operations on URIs is simple comparison: + determining whether two URIs are equivalent without using the URIs to + access their respective resource(s). A comparison is performed every + time a response cache is accessed, a browser checks its history to + color a link, or an XML parser processes tags within a namespace. + Extensive normalization prior to comparison of URIs is often used by + spiders and indexing engines to prune a search space or to reduce + duplication of request actions and response storage. + + URI comparison is performed for some particular purpose. 
Protocols + or implementations that compare URIs for different purposes will + often be subject to differing design trade-offs in regards to how + much effort should be spent in reducing aliased identifiers. This + section describes various methods that may be used to compare URIs, + the trade-offs between them, and the types of applications that might + use them. + +6.1. Equivalence + + Because URIs exist to identify resources, presumably they should be + considered equivalent when they identify the same resource. However, + this definition of equivalence is not of much practical use, as there + is no way for an implementation to compare two resources unless it + has full knowledge or control of them. For this reason, + determination of equivalence or difference of URIs is based on string + comparison, perhaps augmented by reference to additional rules + provided by URI scheme definitions. We use the terms "different" and + "equivalent" to describe the possible outcomes of such comparisons, + but there are many application-dependent versions of equivalence. + + Even though it is possible to determine that two URIs are equivalent, + URI comparison is not sufficient to determine whether two URIs + identify different resources. For example, an owner of two different + domain names could decide to serve the same resource from both, + resulting in two different URIs. Therefore, comparison methods are + designed to minimize false negatives while strictly avoiding false + positives. + + In testing for equivalence, applications should not directly compare + relative references; the references should be converted to their + respective target URIs before comparison. When URIs are compared to + select (or avoid) a network action, such as retrieval of a + representation, fragment components (if any) should be excluded from + the comparison. + + + + + +Berners-Lee, et al. Standards Track [Page 38] + +RFC 3986 URI Generic Syntax January 2005 + + +6.2. 
Comparison Ladder + + A variety of methods are used in practice to test URI equivalence. + These methods fall into a range, distinguished by the amount of + processing required and the degree to which the probability of false + negatives is reduced. As noted above, false negatives cannot be + eliminated. In practice, their probability can be reduced, but this + reduction requires more processing and is not cost-effective for all + applications. + + If this range of comparison practices is considered as a ladder, the + following discussion will climb the ladder, starting with practices + that are cheap but have a relatively higher chance of producing false + negatives, and proceeding to those that have higher computational + cost and lower risk of false negatives. + +6.2.1. Simple String Comparison + + If two URIs, when considered as character strings, are identical, + then it is safe to conclude that they are equivalent. This type of + equivalence test has very low computational cost and is in wide use + in a variety of applications, particularly in the domain of parsing. + + Testing strings for equivalence requires some basic precautions. + This procedure is often referred to as "bit-for-bit" or + "byte-for-byte" comparison, which is potentially misleading. Testing + strings for equality is normally based on pair comparison of the + characters that make up the strings, starting from the first and + proceeding until both strings are exhausted and all characters are + found to be equal, until a pair of characters compares unequal, or + until one of the strings is exhausted before the other. + + This character comparison requires that each pair of characters be + put in comparable form. For example, should one URI be stored in a + byte array in EBCDIC encoding and the second in a Java String object + (UTF-16), bit-for-bit comparisons applied naively will produce + errors. 
It is better to speak of equality on a character-for- + character basis rather than on a byte-for-byte or bit-for-bit basis. + In practical terms, character-by-character comparisons should be done + codepoint-by-codepoint after conversion to a common character + encoding. + + False negatives are caused by the production and use of URI aliases. + Unnecessary aliases can be reduced, regardless of the comparison + method, by consistently providing URI references in an already- + normalized form (i.e., a form identical to what would be produced + after normalization is applied, as described below). + + + + +Berners-Lee, et al. Standards Track [Page 39] + +RFC 3986 URI Generic Syntax January 2005 + + + Protocols and data formats often limit some URI comparisons to simple + string comparison, based on the theory that people and + implementations will, in their own best interest, be consistent in + providing URI references, or at least consistent enough to negate any + efficiency that might be obtained from further normalization. + +6.2.2. Syntax-Based Normalization + + Implementations may use logic based on the definitions provided by + this specification to reduce the probability of false negatives. + This processing is moderately higher in cost than character-for- + character string comparison. For example, an application using this + approach could reasonably consider the following two URIs equivalent: + + example://a/b/c/%7Bfoo%7D + eXAMPLE://a/./b/../b/%63/%7bfoo%7d + + Web user agents, such as browsers, typically apply this type of URI + normalization when determining whether a cached response is + available. Syntax-based normalization includes such techniques as + case normalization, percent-encoding normalization, and removal of + dot-segments. + +6.2.2.1. 
Case Normalization + + For all URIs, the hexadecimal digits within a percent-encoding + triplet (e.g., "%3a" versus "%3A") are case-insensitive and therefore + should be normalized to use uppercase letters for the digits A-F. + + When a URI uses components of the generic syntax, the component + syntax equivalence rules always apply; namely, that the scheme and + host are case-insensitive and therefore should be normalized to + lowercase. For example, the URI <HTTP://www.EXAMPLE.com/> is + equivalent to <http://www.example.com/>. The other generic syntax + components are assumed to be case-sensitive unless specifically + defined otherwise by the scheme (see Section 6.2.3). + +6.2.2.2. Percent-Encoding Normalization + + The percent-encoding mechanism (Section 2.1) is a frequent source of + variance among otherwise identical URIs. In addition to the case + normalization issue noted above, some URI producers percent-encode + octets that do not require percent-encoding, resulting in URIs that + are equivalent to their non-encoded counterparts. These URIs should + be normalized by decoding any percent-encoded octet that corresponds + to an unreserved character, as described in Section 2.3. + + + + + +Berners-Lee, et al. Standards Track [Page 40] + +RFC 3986 URI Generic Syntax January 2005 + + +6.2.2.3. Path Segment Normalization + + The complete path segments "." and ".." are intended only for use + within relative references (Section 4.1) and are removed as part of + the reference resolution process (Section 5.2). However, some + deployed implementations incorrectly assume that reference resolution + is not necessary when the reference is already a URI and thus fail to + remove dot-segments when they occur in non-relative paths. URI + normalizers should remove dot-segments by applying the + remove_dot_segments algorithm to the path, as described in + Section 5.2.4. + +6.2.3. 
Scheme-Based Normalization + + The syntax and semantics of URIs vary from scheme to scheme, as + described by the defining specification for each scheme. + Implementations may use scheme-specific rules, at further processing + cost, to reduce the probability of false negatives. For example, + because the "http" scheme makes use of an authority component, has a + default port of "80", and defines an empty path to be equivalent to + "/", the following four URIs are equivalent: + + http://example.com + http://example.com/ + http://example.com:/ + http://example.com:80/ + + In general, a URI that uses the generic syntax for authority with an + empty path should be normalized to a path of "/". Likewise, an + explicit ":port", for which the port is empty or the default for the + scheme, is equivalent to one where the port and its ":" delimiter are + elided and thus should be removed by scheme-based normalization. For + example, the second URI above is the normal form for the "http" + scheme. + + Another case where normalization varies by scheme is in the handling + of an empty authority component or empty host subcomponent. For many + scheme specifications, an empty authority or host is considered an + error; for others, it is considered equivalent to "localhost" or the + end-user's host. When a scheme defines a default for authority and a + URI reference to that default is desired, the reference should be + normalized to an empty authority for the sake of uniformity, brevity, + and internationalization. If, however, either the userinfo or port + subcomponents are non-empty, then the host should be given explicitly + even if it matches the default. + + Normalization should not remove delimiters when their associated + component is empty unless licensed to do so by the scheme + + + +Berners-Lee, et al. Standards Track [Page 41] + +RFC 3986 URI Generic Syntax January 2005 + + + specification. For example, the URI "http://example.com/?" 
cannot be + assumed to be equivalent to any of the examples above. Likewise, the + presence or absence of delimiters within a userinfo subcomponent is + usually significant to its interpretation. The fragment component is + not subject to any scheme-based normalization; thus, two URIs that + differ only by the suffix "#" are considered different regardless of + the scheme. + + Some schemes define additional subcomponents that consist of case- + insensitive data, giving an implicit license to normalizers to + convert this data to a common case (e.g., all lowercase). For + example, URI schemes that define a subcomponent of path to contain an + Internet hostname, such as the "mailto" URI scheme, cause that + subcomponent to be case-insensitive and thus subject to case + normalization (e.g., "mailto:Joe@Example.COM" is equivalent to + "mailto:Joe@example.com", even though the generic syntax considers + the path component to be case-sensitive). + + Other scheme-specific normalizations are possible. + +6.2.4. Protocol-Based Normalization + + Substantial effort to reduce the incidence of false negatives is + often cost-effective for web spiders. Therefore, they implement even + more aggressive techniques in URI comparison. For example, if they + observe that a URI such as + + http://example.com/data + + redirects to a URI differing only in the trailing slash + + http://example.com/data/ + + they will likely regard the two as equivalent in the future. This + kind of technique is only appropriate when equivalence is clearly + indicated by both the result of accessing the resources and the + common conventions of their scheme's dereference algorithm (in this + case, use of redirection by HTTP origin servers to avoid problems + with relative references). + + + + + + + + + + + + +Berners-Lee, et al. Standards Track [Page 42] + +RFC 3986 URI Generic Syntax January 2005 + + +7. Security Considerations + + A URI does not in itself pose a security threat. 
However, as URIs + are often used to provide a compact set of instructions for access to + network resources, care must be taken to properly interpret the data + within a URI, to prevent that data from causing unintended access, + and to avoid including data that should not be revealed in plain + text. + +7.1. Reliability and Consistency + + There is no guarantee that once a URI has been used to retrieve + information, the same information will be retrievable by that URI in + the future. Nor is there any guarantee that the information + retrievable via that URI in the future will be observably similar to + that retrieved in the past. The URI syntax does not constrain how a + given scheme or authority apportions its namespace or maintains it + over time. Such guarantees can only be obtained from the person(s) + controlling that namespace and the resource in question. A specific + URI scheme may define additional semantics, such as name persistence, + if those semantics are required of all naming authorities for that + scheme. + +7.2. Malicious Construction + + It is sometimes possible to construct a URI so that an attempt to + perform a seemingly harmless, idempotent operation, such as the + retrieval of a representation, will in fact cause a possibly damaging + remote operation. The unsafe URI is typically constructed by + specifying a port number other than that reserved for the network + protocol in question. The client unwittingly contacts a site running + a different protocol service, and data within the URI contains + instructions that, when interpreted according to this other protocol, + cause an unexpected operation. A frequent example of such abuse has + been the use of a protocol-based scheme with a port component of + "25", thereby fooling user agent software into sending an unintended + or impersonating message via an SMTP server. 
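The port-based abuse described above suggests a simple pre-dereference check. The following is a minimal sketch in Python, not part of the specification: the function name `port_is_suspicious` and the small `SCHEME_DEFAULT_PORT` table are assumptions made for illustration, and a real application would make the table user-configurable.

```python
from urllib.parse import urlsplit

# Hypothetical, deliberately tiny table of scheme default ports.
SCHEME_DEFAULT_PORT = {"http": 80, "https": 443, "ftp": 21}


def port_is_suspicious(uri: str) -> bool:
    """Return True if the URI names a well-known port (0-1023) that is
    not the default port of its own scheme."""
    parts = urlsplit(uri)
    port = parts.port  # None when the port is absent or empty
    if port is None or port > 1023:
        return False
    return SCHEME_DEFAULT_PORT.get(parts.scheme) != port
```

Under these assumptions, "http://example.com:25/" is flagged, while "http://example.com:80/" and "http://example.com:8080/" are not.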
+ + Applications should prevent dereference of a URI that specifies a TCP + port number within the "well-known port" range (0 - 1023) unless the + protocol being used to dereference that URI is compatible with the + protocol expected on that well-known port. Although IANA maintains a + registry of well-known ports, applications should make such + restrictions user-configurable to avoid preventing the deployment of + new services. + + + + + + +Berners-Lee, et al. Standards Track [Page 43] + +RFC 3986 URI Generic Syntax January 2005 + + + When a URI contains percent-encoded octets that match the delimiters + for a given resolution or dereference protocol (for example, CR and + LF characters for the TELNET protocol), these percent-encodings must + not be decoded before transmission across that protocol. Transfer of + the percent-encoding, which might violate the protocol, is less + harmful than allowing decoded octets to be interpreted as additional + operations or parameters, perhaps triggering an unexpected and + possibly harmful remote operation. + +7.3. Back-End Transcoding + + When a URI is dereferenced, the data within it is often parsed by + both the user agent and one or more servers. In HTTP, for example, a + typical user agent will parse a URI into its five major components, + access the authority's server, and send it the data within the + authority, path, and query components. A typical server will take + that information, parse the path into segments and the query into + key/value pairs, and then invoke implementation-specific handlers to + respond to the request. As a result, a common security concern for + server implementations that handle a URI, either as a whole or split + into separate components, is proper interpretation of the octet data + represented by the characters and percent-encodings within that URI. + + Percent-encoded octets must be decoded at some point during the + dereference process. 
Applications must split the URI into its + components and subcomponents prior to decoding the octets, as + otherwise the decoded octets might be mistaken for delimiters. + Security checks of the data within a URI should be applied after + decoding the octets. Note, however, that the "%00" percent-encoding + (NUL) may require special handling and should be rejected if the + application is not expecting to receive raw data within a component. + + Special care should be taken when the URI path interpretation process + involves the use of a back-end file system or related system + functions. File systems typically assign an operational meaning to + special characters, such as the "/", "\", ":", "[", and "]" + characters, and to special device names like ".", "..", "...", "aux", + "lpt", etc. In some cases, merely testing for the existence of such + a name will cause the operating system to pause or invoke unrelated + system calls, leading to significant security concerns regarding + denial of service and unintended data transfer. It would be + impossible for this specification to list all such significant + characters and device names. Implementers should research the + reserved names and characters for the types of storage device that + may be attached to their applications and restrict the use of data + obtained from URI components accordingly. + + + + + +Berners-Lee, et al. Standards Track [Page 44] + +RFC 3986 URI Generic Syntax January 2005 + + +7.4. Rare IP Address Formats + + Although the URI syntax for IPv4address only allows the common + dotted-decimal form of IPv4 address literal, many implementations + that process URIs make use of platform-dependent system routines, + such as gethostbyname() and inet_aton(), to translate the string + literal to an actual IP address. Unfortunately, such system routines + often allow and process a much larger set of formats than those + described in Section 3.2.2. 
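One way to avoid depending on permissive system routines is to validate the literal against the IPv4address grammar before handing it to the platform. The sketch below is an illustration in Python; the names `DEC_OCTET`, `IPV4ADDRESS`, and `is_strict_ipv4` are hypothetical, and the pattern is a transcription of the dec-octet rule from the collected ABNF rather than code taken from any implementation.

```python
import re

# dec-octet: 0-9 / 10-99 / 100-199 / 200-249 / 250-255 (no leading zeros)
DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"
# IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
IPV4ADDRESS = re.compile(rf"{DEC_OCTET}(?:\.{DEC_OCTET}){{3}}")


def is_strict_ipv4(host: str) -> bool:
    """Accept only the four-part dotted-decimal form allowed in a URI."""
    return IPV4ADDRESS.fullmatch(host) is not None
```

With this check, "10.0.0.1" is accepted, whereas shortened, octal-looking, hexadecimal, or single-number forms such as "127.1", "010.0.0.1", "0x7f.0.0.1", and "2130706433" are rejected.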
+ + For example, many implementations allow dotted forms of three + numbers, wherein the last part is interpreted as a 16-bit quantity + and placed in the right-most two bytes of the network address (e.g., + a Class B network). Likewise, a dotted form of two numbers means + that the last part is interpreted as a 24-bit quantity and placed in + the right-most three bytes of the network address (Class A), and a + single number (without dots) is interpreted as a 32-bit quantity and + stored directly in the network address. Adding further to the + confusion, some implementations allow each dotted part to be + interpreted as decimal, octal, or hexadecimal, as specified in the C + language (i.e., a leading 0x or 0X implies hexadecimal; a leading 0 + implies octal; otherwise, the number is interpreted as decimal). + + These additional IP address formats are not allowed in the URI syntax + due to differences between platform implementations. However, they + can become a security concern if an application attempts to filter + access to resources based on the IP address in string literal format. + If this filtering is performed, literals should be converted to + numeric form and filtered based on the numeric value, and not on a + prefix or suffix of the string form. + +7.5. Sensitive Information + + URI producers should not provide a URI that contains a username or + password that is intended to be secret. URIs are frequently + displayed by browsers, stored in clear text bookmarks, and logged by + user agent history and intermediary applications (proxies). A + password appearing within the userinfo component is deprecated and + should be considered an error (or simply ignored) except in those + rare cases where the 'password' parameter is intended to be public. + +7.6. 
Semantic Attacks + + Because the userinfo subcomponent is rarely used and appears before + the host in the authority component, it can be used to construct a + URI intended to mislead a human user by appearing to identify one + (trusted) naming authority while actually identifying a different + authority hidden behind the noise. For example + + + +Berners-Lee, et al. Standards Track [Page 45] + +RFC 3986 URI Generic Syntax January 2005 + + + ftp://cnn.example.com&story=breaking_news@10.0.0.1/top_story.htm + + might lead a human user to assume that the host is 'cnn.example.com', + whereas it is actually '10.0.0.1'. Note that a misleading userinfo + subcomponent could be much longer than the example above. + + A misleading URI, such as that above, is an attack on the user's + preconceived notions about the meaning of a URI rather than an attack + on the software itself. User agents may be able to reduce the impact + of such attacks by distinguishing the various components of the URI + when they are rendered, such as by using a different color or tone to + render userinfo if any is present, though there is no panacea. More + information on URI-based semantic attacks can be found in [Siedzik]. + +8. IANA Considerations + + URI scheme names, as defined by <scheme> in Section 3.1, form a + registered namespace that is managed by IANA according to the + procedures defined in [BCP35]. No IANA actions are required by this + document. + +9. Acknowledgements + + This specification is derived from RFC 2396 [RFC2396], RFC 1808 + [RFC1808], and RFC 1738 [RFC1738]; the acknowledgements in those + documents still apply. It also incorporates the update (with + corrections) for IPv6 literals in the host syntax, as defined by + Robert M. Hinden, Brian E. Carpenter, and Larry Masinter in + [RFC2732]. In addition, contributions by Gisle Aas, Reese Anschultz, + Daniel Barclay, Tim Bray, Mike Brown, Rob Cameron, Jeremy Carroll, + Dan Connolly, Adam M. 
Costello, John Cowan, Jason Diamond, Martin + Duerst, Stefan Eissing, Clive D.W. Feather, Al Gilman, Tony Hammond, + Elliotte Harold, Pat Hayes, Henry Holtzman, Ian B. Jacobs, Michael + Kay, John C. Klensin, Graham Klyne, Dan Kohn, Bruce Lilly, Andrew + Main, Dave McAlpin, Ira McDonald, Michael Mealling, Ray Merkert, + Stephen Pollei, Julian Reschke, Tomas Rokicki, Miles Sabin, Kai + Schaetzl, Mark Thomson, Ronald Tschalaer, Norm Walsh, Marc Warne, + Stuart Williams, and Henry Zongaro are gratefully acknowledged. + +10. References + +10.1. Normative References + + [ASCII] American National Standards Institute, "Coded Character + Set -- 7-bit American Standard Code for Information + Interchange", ANSI X3.4, 1986. + + + + + +Berners-Lee, et al. Standards Track [Page 46] + +RFC 3986 URI Generic Syntax January 2005 + + + [RFC2234] Crocker, D. and P. Overell, "Augmented BNF for Syntax + Specifications: ABNF", RFC 2234, November 1997. + + [STD63] Yergeau, F., "UTF-8, a transformation format of + ISO 10646", STD 63, RFC 3629, November 2003. + + [UCS] International Organization for Standardization, + "Information Technology - Universal Multiple-Octet Coded + Character Set (UCS)", ISO/IEC 10646:2003, December 2003. + +10.2. Informative References + + [BCP19] Freed, N. and J. Postel, "IANA Charset Registration + Procedures", BCP 19, RFC 2978, October 2000. + + [BCP35] Petke, R. and I. King, "Registration Procedures for URL + Scheme Names", BCP 35, RFC 2717, November 1999. + + [RFC0952] Harrenstien, K., Stahl, M., and E. Feinler, "DoD Internet + host table specification", RFC 952, October 1985. + + [RFC1034] Mockapetris, P., "Domain names - concepts and facilities", + STD 13, RFC 1034, November 1987. + + [RFC1123] Braden, R., "Requirements for Internet Hosts - Application + and Support", STD 3, RFC 1123, October 1989. + + [RFC1535] Gavron, E., "A Security Problem and Proposed Correction + With Widely Deployed DNS Software", RFC 1535, + October 1993. 
+ + [RFC1630] Berners-Lee, T., "Universal Resource Identifiers in WWW: A + Unifying Syntax for the Expression of Names and Addresses + of Objects on the Network as used in the World-Wide Web", + RFC 1630, June 1994. + + [RFC1736] Kunze, J., "Functional Recommendations for Internet + Resource Locators", RFC 1736, February 1995. + + [RFC1737] Sollins, K. and L. Masinter, "Functional Requirements for + Uniform Resource Names", RFC 1737, December 1994. + + [RFC1738] Berners-Lee, T., Masinter, L., and M. McCahill, "Uniform + Resource Locators (URL)", RFC 1738, December 1994. + + [RFC1808] Fielding, R., "Relative Uniform Resource Locators", + RFC 1808, June 1995. + + + + +Berners-Lee, et al. Standards Track [Page 47] + +RFC 3986 URI Generic Syntax January 2005 + + + [RFC2046] Freed, N. and N. Borenstein, "Multipurpose Internet Mail + Extensions (MIME) Part Two: Media Types", RFC 2046, + November 1996. + + [RFC2141] Moats, R., "URN Syntax", RFC 2141, May 1997. + + [RFC2396] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform + Resource Identifiers (URI): Generic Syntax", RFC 2396, + August 1998. + + [RFC2518] Goland, Y., Whitehead, E., Faizi, A., Carter, S., and D. + Jensen, "HTTP Extensions for Distributed Authoring -- + WEBDAV", RFC 2518, February 1999. + + [RFC2557] Palme, J., Hopmann, A., and N. Shelness, "MIME + Encapsulation of Aggregate Documents, such as HTML + (MHTML)", RFC 2557, March 1999. + + [RFC2718] Masinter, L., Alvestrand, H., Zigmond, D., and R. Petke, + "Guidelines for new URL Schemes", RFC 2718, November 1999. + + [RFC2732] Hinden, R., Carpenter, B., and L. Masinter, "Format for + Literal IPv6 Addresses in URL's", RFC 2732, December 1999. + + [RFC3305] Mealling, M. and R. Denenberg, "Report from the Joint + W3C/IETF URI Planning Interest Group: Uniform Resource + Identifiers (URIs), URLs, and Uniform Resource Names + (URNs): Clarifications and Recommendations", RFC 3305, + August 2002. + + [RFC3490] Faltstrom, P., Hoffman, P., and A. 
Costello, + "Internationalizing Domain Names in Applications (IDNA)", + RFC 3490, March 2003. + + [RFC3513] Hinden, R. and S. Deering, "Internet Protocol Version 6 + (IPv6) Addressing Architecture", RFC 3513, April 2003. + + [Siedzik] Siedzik, R., "Semantic Attacks: What's in a URL?", + April 2001, <http://www.giac.org/practical/gsec/ + Richard_Siedzik_GSEC.pdf>. + + + + + + + + + + + +Berners-Lee, et al. Standards Track [Page 48] + +RFC 3986 URI Generic Syntax January 2005 + + +Appendix A. Collected ABNF for URI + + URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ] + + hier-part = "//" authority path-abempty + / path-absolute + / path-rootless + / path-empty + + URI-reference = URI / relative-ref + + absolute-URI = scheme ":" hier-part [ "?" query ] + + relative-ref = relative-part [ "?" query ] [ "#" fragment ] + + relative-part = "//" authority path-abempty + / path-absolute + / path-noscheme + / path-empty + + scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) + + authority = [ userinfo "@" ] host [ ":" port ] + userinfo = *( unreserved / pct-encoded / sub-delims / ":" ) + host = IP-literal / IPv4address / reg-name + port = *DIGIT + + IP-literal = "[" ( IPv6address / IPvFuture ) "]" + + IPvFuture = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" ) + + IPv6address = 6( h16 ":" ) ls32 + / "::" 5( h16 ":" ) ls32 + / [ h16 ] "::" 4( h16 ":" ) ls32 + / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32 + / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32 + / [ *3( h16 ":" ) h16 ] "::" h16 ":" ls32 + / [ *4( h16 ":" ) h16 ] "::" ls32 + / [ *5( h16 ":" ) h16 ] "::" h16 + / [ *6( h16 ":" ) h16 ] "::" + + h16 = 1*4HEXDIG + ls32 = ( h16 ":" h16 ) / IPv4address + IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet + + + + + + + +Berners-Lee, et al. 
Standards Track                    [Page 49]
+
+RFC 3986                   URI Generic Syntax               January 2005
+
+
+   dec-octet     = DIGIT                 ; 0-9
+                 / %x31-39 DIGIT         ; 10-99
+                 / "1" 2DIGIT            ; 100-199
+                 / "2" %x30-34 DIGIT     ; 200-249
+                 / "25" %x30-35          ; 250-255
+
+   reg-name      = *( unreserved / pct-encoded / sub-delims )
+
+   path          = path-abempty    ; begins with "/" or is empty
+                 / path-absolute   ; begins with "/" but not "//"
+                 / path-noscheme   ; begins with a non-colon segment
+                 / path-rootless   ; begins with a segment
+                 / path-empty      ; zero characters
+
+   path-abempty  = *( "/" segment )
+   path-absolute = "/" [ segment-nz *( "/" segment ) ]
+   path-noscheme = segment-nz-nc *( "/" segment )
+   path-rootless = segment-nz *( "/" segment )
+   path-empty    = 0<pchar>
+
+   segment       = *pchar
+   segment-nz    = 1*pchar
+   segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
+                 ; non-zero-length segment without any colon ":"
+
+   pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
+
+   query         = *( pchar / "/" / "?" )
+
+   fragment      = *( pchar / "/" / "?" )
+
+   pct-encoded   = "%" HEXDIG HEXDIG
+
+   unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
+   reserved      = gen-delims / sub-delims
+   gen-delims    = ":" / "/" / "?" / "#" / "[" / "]" / "@"
+   sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
+                 / "*" / "+" / "," / ";" / "="
+
+Appendix B.  Parsing a URI Reference with a Regular Expression
+
+   As the "first-match-wins" algorithm is identical to the "greedy"
+   disambiguation method used by POSIX regular expressions, it is
+   natural and commonplace to use a regular expression for parsing the
+   potential five components of a URI reference.
+
+   The following line is the regular expression for breaking-down a
+   well-formed URI reference into its components.
+
+
+
+Berners-Lee, et al.         Standards Track                    [Page 50]
+
+RFC 3986                   URI Generic Syntax               January 2005
+
+
+      ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
+       12            3  4          5       6  7        8 9
+
+   The numbers in the second line above are only to assist readability;
+   they indicate the reference points for each subexpression (i.e., each
+   paired parenthesis).  We refer to the value matched for subexpression
+   <n> as $<n>.  For example, matching the above expression to
+
+      http://www.ics.uci.edu/pub/ietf/uri/#Related
+
+   results in the following subexpression matches:
+
+      $1 = http:
+      $2 = http
+      $3 = //www.ics.uci.edu
+      $4 = www.ics.uci.edu
+      $5 = /pub/ietf/uri/
+      $6 = <undefined>
+      $7 = <undefined>
+      $8 = #Related
+      $9 = Related
+
+   where <undefined> indicates that the component is not present, as is
+   the case for the query component in the above example.  Therefore, we
+   can determine the value of the five components as
+
+      scheme    = $2
+      authority = $4
+      path      = $5
+      query     = $7
+      fragment  = $9
+
+   Going in the opposite direction, we can recreate a URI reference from
+   its components by using the algorithm of Section 5.3.
+
+Appendix C.  Delimiting a URI in Context
+
+   URIs are often transmitted through formats that do not provide a
+   clear context for their interpretation.  For example, there are many
+   occasions when a URI is included in plain text; examples include text
+   sent in email, USENET news, and on printed paper.  In such cases, it
+   is important to be able to delimit the URI from the rest of the text,
+   and in particular from punctuation marks that might be mistaken for
+   part of the URI.
+
+   In practice, URIs are delimited in a variety of ways, but usually
+   within double-quotes "http://example.com/", angle brackets
+   <http://example.com/>, or just by using whitespace:
+
+
+
+Berners-Lee, et al.         Standards Track                    [Page 51]
+
+RFC 3986                   URI Generic Syntax               January 2005
+
+
+      http://example.com/
+
+   These wrappers do not form part of the URI.
+
+   In some cases, extra whitespace (spaces, line-breaks, tabs, etc.) may
+   have to be added to break a long URI across lines.  The whitespace
+   should be ignored when the URI is extracted.
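The extraction step just described and the Appendix B expression can be combined in a few lines. The following Python sketch is illustrative only: the helper names `undelimit` and `split_uri_reference` are assumptions, the regular expression is copied from Appendix B, and no attempt is made to handle a hyphen introduced at a line break.

```python
import re

# Regular expression from Appendix B, copied verbatim.
URI_REF = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def undelimit(text: str) -> str:
    """Strip double-quote or angle-bracket wrappers, then drop all
    whitespace that was added to break the URI across lines."""
    text = text.strip()
    if len(text) >= 2 and (text[0], text[-1]) in {('"', '"'), ("<", ">")}:
        text = text[1:-1]
    return "".join(text.split())


def split_uri_reference(ref: str) -> dict:
    """Map subexpressions $2, $4, $5, $7, $9 to the five components.
    A component that is not present is reported as None."""
    m = URI_REF.match(ref)  # always matches: every part is optional
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }
```

For instance, `undelimit` turns a bracketed reference broken after "ftp://foo.example." into "ftp://foo.example.com/rfc/", and `split_uri_reference` applied to "http://www.ics.uci.edu/pub/ietf/uri/#Related" reports no query and the fragment "Related".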
+ + No whitespace should be introduced after a hyphen ("-") character. + Because some typesetters and printers may (erroneously) introduce a + hyphen at the end of line when breaking it, the interpreter of a URI + containing a line break immediately after a hyphen should ignore all + whitespace around the line break and should be aware that the hyphen + may or may not actually be part of the URI. + + Using <> angle brackets around each URI is especially recommended as + a delimiting style for a reference that contains embedded whitespace. + + The prefix "URL:" (with or without a trailing space) was formerly + recommended as a way to help distinguish a URI from other bracketed + designators, though it is not commonly used in practice and is no + longer recommended. + + For robustness, software that accepts user-typed URI should attempt + to recognize and strip both delimiters and embedded whitespace. + + For example, the text + + Yes, Jim, I found it under "http://www.w3.org/Addressing/", + but you can probably pick it up from <ftp://foo.example. + com/rfc/>. Note the warning in <http://www.ics.uci.edu/pub/ + ietf/uri/historical.html#WARNING>. + + contains the URI references + + http://www.w3.org/Addressing/ + ftp://foo.example.com/rfc/ + http://www.ics.uci.edu/pub/ietf/uri/historical.html#WARNING + + + + + + + + + + + + + +Berners-Lee, et al. Standards Track [Page 52] + +RFC 3986 URI Generic Syntax January 2005 + + +Appendix D. Changes from RFC 2396 + +D.1. Additions + + An ABNF rule for URI has been introduced to correspond to one common + usage of the term: an absolute URI with optional fragment. + + IPv6 (and later) literals have been added to the list of possible + identifiers for the host portion of an authority component, as + described by [RFC2732], with the addition of "[" and "]" to the + reserved set and a version flag to anticipate future versions of IP + literals. 
Square brackets are now specified as reserved within the + authority component and are not allowed outside their use as + delimiters for an IP literal within host. In order to make this + change without changing the technical definition of the path, query, + and fragment components, those rules were redefined to directly + specify the characters allowed. + + As [RFC2732] defers to [RFC3513] for definition of an IPv6 literal + address, which, unfortunately, lacks an ABNF description of + IPv6address, we created a new ABNF rule for IPv6address that matches + the text representations defined by Section 2.2 of [RFC3513]. + Likewise, the definition of IPv4address has been improved in order to + limit each decimal octet to the range 0-255. + + Section 6, on URI normalization and comparison, has been completely + rewritten and extended by using input from Tim Bray and discussion + within the W3C Technical Architecture Group. + +D.2. Modifications + + The ad-hoc BNF syntax of RFC 2396 has been replaced with the ABNF of + [RFC2234]. This change required all rule names that formerly + included underscore characters to be renamed with a dash instead. In + addition, a number of syntax rules have been eliminated or simplified + to make the overall grammar more comprehensible. Specifications that + refer to the obsolete grammar rules may be understood by replacing + those rules according to the following table: + + + + + + + + + + + + + +Berners-Lee, et al. Standards Track [Page 53] + +RFC 3986 URI Generic Syntax January 2005 + + + +----------------+--------------------------------------------------+ + | obsolete rule | translation | + +----------------+--------------------------------------------------+ + | absoluteURI | absolute-URI | + | relativeURI | relative-part [ "?" query ] | + | hier_part | ( "//" authority path-abempty / | + | | path-absolute ) [ "?" query ] | + | | | + | opaque_part | path-rootless [ "?" 
query ] | + | net_path | "//" authority path-abempty | + | abs_path | path-absolute | + | rel_path | path-rootless | + | rel_segment | segment-nz-nc | + | reg_name | reg-name | + | server | authority | + | hostport | host [ ":" port ] | + | hostname | reg-name | + | path_segments | path-abempty | + | param | *<pchar excluding ";"> | + | | | + | uric | unreserved / pct-encoded / ";" / "?" / ":" | + | | / "@" / "&" / "=" / "+" / "$" / "," / "/" | + | | | + | uric_no_slash | unreserved / pct-encoded / ";" / "?" / ":" | + | | / "@" / "&" / "=" / "+" / "$" / "," | + | | | + | mark | "-" / "_" / "." / "!" / "~" / "*" / "'" | + | | / "(" / ")" | + | | | + | escaped | pct-encoded | + | hex | HEXDIG | + | alphanum | ALPHA / DIGIT | + +----------------+--------------------------------------------------+ + + Use of the above obsolete rules for the definition of scheme-specific + syntax is deprecated. + + Section 2, on characters, has been rewritten to explain what + characters are reserved, when they are reserved, and why they are + reserved, even when they are not used as delimiters by the generic + syntax. The mark characters that are typically unsafe to decode, + including the exclamation mark ("!"), asterisk ("*"), single-quote + ("'"), and open and close parentheses ("(" and ")"), have been moved + to the reserved set in order to clarify the distinction between + reserved and unreserved and, hopefully, to answer the most common + question of scheme designers. Likewise, the section on + percent-encoded characters has been rewritten, and URI normalizers + are now given license to decode any percent-encoded octets + + + +Berners-Lee, et al. Standards Track [Page 54] + +RFC 3986 URI Generic Syntax January 2005 + + + corresponding to unreserved characters. In general, the terms + "escaped" and "unescaped" have been replaced with "percent-encoded" + and "decoded", respectively, to reduce confusion with other forms of + escape mechanisms. 
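The license given to normalizers, decoding percent-encoded octets that correspond to unreserved characters, can be sketched as follows. This is an illustration in Python under stated assumptions: the name `decode_unreserved` is hypothetical, the unreserved set is transcribed from the collected ABNF, and every other percent-encoding is left exactly as found.

```python
import re
import string

# unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
PCT_ENCODED = re.compile(r"%([0-9A-Fa-f]{2})")


def decode_unreserved(uri: str) -> str:
    """Decode only those percent-encodings whose octet is an unreserved
    character; reserved and other octets stay encoded."""
    def replace(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else match.group(0)

    return PCT_ENCODED.sub(replace, uri)
```

Applied to "http://example.com/%7Euser/%41%2Fb", the sketch yields "http://example.com/~user/A%2Fb": the tilde and the letter are decoded, while the encoded slash is preserved because "/" is reserved.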
+ + The ABNF for URI and URI-reference has been redesigned to make them + more friendly to LALR parsers and to reduce complexity. As a result, + the layout form of syntax description has been removed, along with + the uric, uric_no_slash, opaque_part, net_path, abs_path, rel_path, + path_segments, rel_segment, and mark rules. All references to + "opaque" URIs have been replaced with a better description of how the + path component may be opaque to hierarchy. The relativeURI rule has + been replaced with relative-ref to avoid unnecessary confusion over + whether they are a subset of URI. The ambiguity regarding the + parsing of URI-reference as a URI or a relative-ref with a colon in + the first segment has been eliminated through the use of five + separate path matching rules. + + The fragment identifier has been moved back into the section on + generic syntax components and within the URI and relative-ref rules, + though it remains excluded from absolute-URI. The number sign ("#") + character has been moved back to the reserved set as a result of + reintegrating the fragment syntax. + + The ABNF has been corrected to allow the path component to be empty. + This also allows an absolute-URI to consist of nothing after the + "scheme:", as is present in practice with the "dav:" namespace + [RFC2518] and with the "about:" scheme used internally by many WWW + browser implementations. The ambiguity regarding the boundary + between authority and path has been eliminated through the use of + five separate path matching rules. + + Registry-based naming authorities that use the generic syntax are now + defined within the host rule. This change allows current + implementations, where whatever name provided is simply fed to the + local name resolution mechanism, to be consistent with the + specification. It also removes the need to re-specify DNS name + formats here. 
Furthermore, it allows the host component to contain + percent-encoded octets, which is necessary to enable + internationalized domain names to be provided in URIs, processed in + their native character encodings at the application layers above URI + processing, and passed to an IDNA library as a registered name in the + UTF-8 character encoding. The server, hostport, hostname, + domainlabel, toplabel, and alphanum rules have been removed. + + The resolving relative references algorithm of [RFC2396] has been + rewritten with pseudocode for this revision to improve clarity and + fix the following issues: + + + +Berners-Lee, et al. Standards Track [Page 55] + +RFC 3986 URI Generic Syntax January 2005 + + + o [RFC2396] section 5.2, step 6a, failed to account for a base URI + with no path. + + o Restored the behavior of [RFC1808] where, if the reference + contains an empty path and a defined query component, the target + URI inherits the base URI's path component. + + o The determination of whether a URI reference is a same-document + reference has been decoupled from the URI parser, simplifying the + URI processing interface within applications in a way consistent + with the internal architecture of deployed URI processing + implementations. The determination is now based on comparison to + the base URI after transforming a reference to absolute form, + rather than on the format of the reference itself. This change + may result in more references being considered "same-document" + under this specification than there would be under the rules given + in RFC 2396, especially when normalization is used to reduce + aliases. However, it does not change the status of existing + same-document references. + + o Separated the path merge routine into two routines: merge, for + describing combination of the base URI path with a relative-path + reference, and remove_dot_segments, for describing how to remove + the special "." and ".." segments from a composed path. 
The + remove_dot_segments algorithm is now applied to all URI reference + paths in order to match common implementations and to improve the + normalization of URIs in practice. This change only impacts the + parsing of abnormal references and same-scheme references wherein + the base URI has a non-hierarchical path. + +Index + + A + ABNF 11 + absolute 27 + absolute-path 26 + absolute-URI 27 + access 9 + authority 17, 18 + + B + base URI 28 + + C + character encoding 4 + character 4 + characters 8, 11 + coded character set 4 + + + +Berners-Lee, et al. Standards Track [Page 56] + +RFC 3986 URI Generic Syntax January 2005 + + + D + dec-octet 20 + dereference 9 + dot-segments 23 + + F + fragment 16, 24 + + G + gen-delims 13 + generic syntax 6 + + H + h16 20 + hier-part 16 + hierarchical 10 + host 18 + + I + identifier 5 + IP-literal 19 + IPv4 20 + IPv4address 19, 20 + IPv6 19 + IPv6address 19, 20 + IPvFuture 19 + + L + locator 7 + ls32 20 + + M + merge 32 + + N + name 7 + network-path 26 + + P + path 16, 22, 26 + path-abempty 22 + path-absolute 22 + path-empty 22 + path-noscheme 22 + path-rootless 22 + path-abempty 16, 22, 26 + path-absolute 16, 22, 26 + path-empty 16, 22, 26 + + + +Berners-Lee, et al. Standards Track [Page 57] + +RFC 3986 URI Generic Syntax January 2005 + + + path-rootless 16, 22 + pchar 23 + pct-encoded 12 + percent-encoding 12 + port 22 + + Q + query 16, 23 + + R + reg-name 21 + registered name 20 + relative 10, 28 + relative-path 26 + relative-ref 26 + remove_dot_segments 33 + representation 9 + reserved 12 + resolution 9, 28 + resource 5 + retrieval 9 + + S + same-document 27 + sameness 9 + scheme 16, 17 + segment 22, 23 + segment-nz 23 + segment-nz-nc 23 + sub-delims 13 + suffix 27 + + T + transcription 8 + + U + uniform 4 + unreserved 13 + URI grammar + absolute-URI 27 + ALPHA 11 + authority 18 + CR 11 + dec-octet 20 + DIGIT 11 + DQUOTE 11 + fragment 24 + gen-delims 13 + + + +Berners-Lee, et al. 
Standards Track [Page 58] + +RFC 3986 URI Generic Syntax January 2005 + + + h16 20 + HEXDIG 11 + hier-part 16 + host 19 + IP-literal 19 + IPv4address 20 + IPv6address 20 + IPvFuture 19 + LF 11 + ls32 20 + OCTET 11 + path 22 + path-abempty 22 + path-absolute 22 + path-empty 22 + path-noscheme 22 + path-rootless 22 + pchar 23 + pct-encoded 12 + port 22 + query 24 + reg-name 21 + relative-ref 26 + reserved 13 + scheme 17 + segment 23 + segment-nz 23 + segment-nz-nc 23 + SP 11 + sub-delims 13 + unreserved 13 + URI 16 + URI-reference 25 + userinfo 18 + URI 16 + URI-reference 25 + URL 7 + URN 7 + userinfo 18 + + + + + + + + + + + + +Berners-Lee, et al. Standards Track [Page 59] + +RFC 3986 URI Generic Syntax January 2005 + + +Authors' Addresses + + Tim Berners-Lee + World Wide Web Consortium + Massachusetts Institute of Technology + 77 Massachusetts Avenue + Cambridge, MA 02139 + USA + + Phone: +1-617-253-5702 + Fax: +1-617-258-5999 + EMail: timbl@w3.org + URI: http://www.w3.org/People/Berners-Lee/ + + + Roy T. Fielding + Day Software + 5251 California Ave., Suite 110 + Irvine, CA 92617 + USA + + Phone: +1-949-679-2960 + Fax: +1-949-679-2972 + EMail: fielding@gbiv.com + URI: http://roy.gbiv.com/ + + + Larry Masinter + Adobe Systems Incorporated + 345 Park Ave + San Jose, CA 95110 + USA + + Phone: +1-408-536-3024 + EMail: LMM@acm.org + URI: http://larry.masinter.net/ + + + + + + + + + + + + + + + +Berners-Lee, et al. Standards Track [Page 60] + +RFC 3986 URI Generic Syntax January 2005 + + +Full Copyright Statement + + Copyright (C) The Internet Society (2005). + + This document is subject to the rights, licenses and restrictions + contained in BCP 78, and except as set forth therein, the authors + retain all their rights. 
+ + This document and the information contained herein are provided on an + "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS + OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET + ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, + INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE + INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED + WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. + +Intellectual Property + + The IETF takes no position regarding the validity or scope of any + Intellectual Property Rights or other rights that might be claimed to + pertain to the implementation or use of the technology described in + this document or the extent to which any license under such rights + might or might not be available; nor does it represent that it has + made any independent effort to identify any such rights. Information + on the IETF's procedures with respect to rights in IETF Documents can + be found in BCP 78 and BCP 79. + + Copies of IPR disclosures made to the IETF Secretariat and any + assurances of licenses to be made available, or the result of an + attempt made to obtain a general license or permission for the use of + such proprietary rights by implementers or users of this + specification can be obtained from the IETF on-line IPR repository at + http://www.ietf.org/ipr. + + The IETF invites any interested party to bring to its attention any + copyrights, patents or patent applications, or other proprietary + rights that may cover technology that may be required to implement + this standard. Please address the information to the IETF at ietf- + ipr@ietf.org. + + +Acknowledgement + + Funding for the RFC Editor function is currently provided by the + Internet Society. + + + + + + +Berners-Lee, et al. 
Standards Track [Page 61] + diff --git a/txt/rfc3987.txt b/txt/rfc3987.txt new file mode 100644 index 00000000..f0b1513b --- /dev/null +++ b/txt/rfc3987.txt @@ -0,0 +1,2579 @@ + + + + + + +Network Working Group M. Duerst +Request for Comments: 3987 W3C +Category: Standards Track M. Suignard + Microsoft Corporation + January 2005 + + + Internationalized Resource Identifiers (IRIs) + +Status of This Memo + + This document specifies an Internet standards track protocol for the + Internet community, and requests discussion and suggestions for + improvements. Please refer to the current edition of the "Internet + Official Protocol Standards" (STD 1) for the standardization state + and status of this protocol. Distribution of this memo is unlimited. + +Copyright Notice + + Copyright (C) The Internet Society (2005). + +Abstract + + This document defines a new protocol element, the Internationalized + Resource Identifier (IRI), as a complement to the Uniform Resource + Identifier (URI). An IRI is a sequence of characters from the + Universal Character Set (Unicode/ISO 10646). A mapping from IRIs to + URIs is defined, which means that IRIs can be used instead of URIs, + where appropriate, to identify resources. + + The approach of defining a new protocol element was chosen instead of + extending or changing the definition of URIs. This was done in order + to allow a clear distinction and to avoid incompatibilities with + existing software. Guidelines are provided for the use and + deployment of IRIs in various protocols, formats, and software + components that currently deal with URIs. + +Table of Contents + + 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 + 1.1. Overview and Motivation . . . . . . . . . . . . . . . . 3 + 1.2. Applicability . . . . . . . . . . . . . . . . . . . . . 3 + 1.3. Definitions . . . . . . . . . . . . . . . . . . . . . . 4 + 1.4. Notation . . . . . . . . . . . . . . . . . . . . . . . . 5 + 2. IRI Syntax . . . . . . . . . . . . . 
. . . . . . . . . . . . . 6 + 2.1. Summary of IRI Syntax . . . . . . . . . . . . . . . . . 6 + 2.2. ABNF for IRI References and IRIs . . . . . . . . . . . . 7 + + + + +Duerst & Suignard Standards Track [Page 1] + +RFC 3987 Internationalized Resource Identifiers January 2005 + + + 3. Relationship between IRIs and URIs . . . . . . . . . . . . . . 10 + 3.1. Mapping of IRIs to URIs . . . . . . . . . . . . . . . . 10 + 3.2. Converting URIs to IRIs . . . . . . . . . . . . . . . . 14 + 3.2.1. Examples . . . . . . . . . . . . . . . . . . . . 15 + 4. Bidirectional IRIs for Right-to-Left Languages. . . . . . . . 16 + 4.1. Logical Storage and Visual Presentation . . . . . . . . 17 + 4.2. Bidi IRI Structure . . . . . . . . . . . . . . . . . . . 18 + 4.3. Input of Bidi IRIs . . . . . . . . . . . . . . . . . . . 19 + 4.4. Examples . . . . . . . . . . . . . . . . . . . . . . . . 19 + 5. Normalization and Comparison . . . . . . . . . . . . . . . . . 21 + 5.1. Equivalence . . . . . . . . . . . . . . . . . . . . . . 22 + 5.2. Preparation for Comparison . . . . . . . . . . . . . . . 22 + 5.3. Comparison Ladder . . . . . . . . . . . . . . . . . . . 23 + 5.3.1. Simple String Comparison . . . . . . . . . . . . 23 + 5.3.2. Syntax-Based Normalization . . . . . . . . . . . 24 + 5.3.3. Scheme-Based Normalization . . . . . . . . . . . 27 + 5.3.4. Protocol-Based Normalization . . . . . . . . . . 28 + 6. Use of IRIs . . . . . . . . . . . . . . . . . . . . . . . . . 29 + 6.1. Limitations on UCS Characters Allowed in IRIs . . . . . 29 + 6.2. Software Interfaces and Protocols . . . . . . . . . . . 29 + 6.3. Format of URIs and IRIs in Documents and Protocols . . . 30 + 6.4. Use of UTF-8 for Encoding Original Characters .. . . . . 30 + 6.5. Relative IRI References . . . . . . . . . . . . . . . . 32 + 7. URI/IRI Processing Guidelines (informative) . . . . . . . . . 32 + 7.1. URI/IRI Software Interfaces . . . . . . . . . . . . . . 32 + 7.2. URI/IRI Entry . . . . . . . . . . . . . . . . . . . . . 
33 + 7.3. URI/IRI Transfer between Applications . . . . . . . . . 33 + 7.4. URI/IRI Generation . . . . . . . . . . . . . . . . . . . 34 + 7.5. URI/IRI Selection . . . . . . . . . . . . . . . . . . . 34 + 7.6. Display of URIs/IRIs . . . . . . . . . . . . . . . . . . 35 + 7.7. Interpretation of URIs and IRIs . . . . . . . . . . . . 36 + 7.8. Upgrading Strategy . . . . . . . . . . . . . . . . . . . 36 + 8. Security Considerations . . . . . . . . . . . . . . . . . . . 37 + 9. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 39 + 10. References . . . . . . . . . . . . . . . . . . . . . . . . . . 40 + 10.1. Normative References . . . . . . . . . . . . . . . . . . 40 + 10.2. Informative References . . . . . . . . . . . . . . . . . 41 + A. Design Alternatives . . . . . . . . . . . . . . . . . . . . . 44 + A.1. New Scheme(s) . . . . . . . . . . . . . . . . . . . . . 44 + A.2. Character Encodings Other Than UTF-8 . . . . . . . . . . 44 + A.3. New Encoding Convention . . . . . . . . . . . . . . . . 44 + A.4. Indicating Character Encodings in the URI/IRI . . . . . 45 + Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 45 + Full Copyright Statement . . . . . . . . . . . . . . . . . . . . . 46 + + + + + + + +Duerst & Suignard Standards Track [Page 2] + +RFC 3987 Internationalized Resource Identifiers January 2005 + + +1. Introduction + +1.1. Overview and Motivation + + A Uniform Resource Identifier (URI) is defined in [RFC3986] as a + sequence of characters chosen from a limited subset of the repertoire + of US-ASCII [ASCII] characters. + + The characters in URIs are frequently used for representing words of + natural languages. This usage has many advantages: Such URIs are + easier to memorize, easier to interpret, easier to transcribe, easier + to create, and easier to guess. For most languages other than + English, however, the natural script uses characters other than A - + Z. 
For many people, handling Latin characters is as difficult as + handling the characters of other scripts is for those who use only + the Latin alphabet. Many languages with non-Latin scripts are + transcribed with Latin letters. These transcriptions are now often + used in URIs, but they introduce additional ambiguities. + + The infrastructure for the appropriate handling of characters from + local scripts is now widely deployed in local versions of operating + system and application software. Software that can handle a wide + variety of scripts and languages at the same time is increasingly + common. Also, increasing numbers of protocols and formats can carry + a wide range of characters. + + This document defines a new protocol element called Internationalized + Resource Identifier (IRI) by extending the syntax of URIs to a much + wider repertoire of characters. It also defines "internationalized" + versions corresponding to other constructs from [RFC3986], such as + URI references. The syntax of IRIs is defined in section 2, and the + relationship between IRIs and URIs in section 3. + + Using characters outside of A - Z in IRIs brings some difficulties. + Section 4 discusses the special case of bidirectional IRIs, section 5 + various forms of equivalence between IRIs, and section 6 the use of + IRIs in different situations. Section 7 gives additional informative + guidelines, and section 8 security considerations. + +1.2. Applicability + + IRIs are designed to be compatible with recommendations for new URI + schemes [RFC2718]. The compatibility is provided by specifying a + well-defined and deterministic mapping from the IRI character + sequence to the functionally equivalent URI character sequence. + Practical use of IRIs (or IRI references) in place of URIs (or URI + references) depends on the following conditions being met: + + + + +Duerst & Suignard Standards Track [Page 3] + +RFC 3987 Internationalized Resource Identifiers January 2005 + + + a. 
A protocol or format element should be explicitly designated to + be able to carry IRIs. The intent is not to introduce IRIs into + contexts that are not defined to accept them. For example, XML + schema [XMLSchema] has an explicit type "anyURI" that includes + IRIs and IRI references. Therefore, IRIs and IRI references can + be in attributes and elements of type "anyURI". On the other + hand, in the HTTP protocol [RFC2616], the Request URI is defined + as a URI, which means that direct use of IRIs is not allowed in + HTTP requests. + + b. The protocol or format carrying the IRIs should have a mechanism + to represent the wide range of characters used in IRIs, either + natively or by some protocol- or format-specific escaping + mechanism (for example, numeric character references in [XML1]). + + c. The URI corresponding to the IRI in question has to encode + original characters into octets using UTF-8. For new URI + schemes, this is recommended in [RFC2718]. It can apply to a + whole scheme (e.g., IMAP URLs [RFC2192] and POP URLs [RFC2384], + or the URN syntax [RFC2141]). It can apply to a specific part of + a URI, such as the fragment identifier (e.g., [XPointer]). It + can apply to a specific URI or part(s) thereof. For details, + please see section 6.4. + +1.3. Definitions + + The following definitions are used in this document; they follow the + terms in [RFC2130], [RFC2277], and [ISO10646]. + + character: A member of a set of elements used for the organization, + control, or representation of data. For example, "LATIN CAPITAL + LETTER A" names a character. + + octet: An ordered sequence of eight bits considered as a unit. + + character repertoire: A set of characters (in the mathematical + sense). + + sequence of characters: A sequence of characters (one after another). + + sequence of octets: A sequence of octets (one after another). + + character encoding: A method of representing a sequence of characters + as a sequence of octets (maybe with variants). 
Also, a method of + (unambiguously) converting a sequence of octets into a sequence of + characters. + + + + + +Duerst & Suignard Standards Track [Page 4] + +RFC 3987 Internationalized Resource Identifiers January 2005 + + + charset: The name of a parameter or attribute used to identify a + character encoding. + + UCS: Universal Character Set. The coded character set defined by + ISO/IEC 10646 [ISO10646] and the Unicode Standard [UNIV4]. + + IRI reference: Denotes the common usage of an Internationalized + Resource Identifier. An IRI reference may be absolute or + relative. However, the "IRI" that results from such a reference + only includes absolute IRIs; any relative IRI references are + resolved to their absolute form. Note that in [RFC2396] URIs did + not include fragment identifiers, but in [RFC3986] fragment + identifiers are part of URIs. + + running text: Human text (paragraphs, sentences, phrases) with syntax + according to orthographic conventions of a natural language, as + opposed to syntax defined for ease of processing by machines + (e.g., markup, programming languages). + + protocol element: Any portion of a message that affects processing of + that message by the protocol in question. + + presentation element: A presentation form corresponding to a protocol + element; for example, using a wider range of characters. + + create (a URI or IRI): With respect to URIs and IRIs, the term is + used for the initial creation. This may be the initial creation + of a resource with a certain identifier, or the initial exposition + of a resource under a particular identifier. + + generate (a URI or IRI): With respect to URIs and IRIs, the term is + used when the IRI is generated by derivation from other + information. + +1.4. Notation + + RFCs and Internet Drafts currently do not allow any characters + outside the US-ASCII repertoire. Therefore, this document uses + various special notations to denote such characters in examples. 
+ + In text, characters outside US-ASCII are sometimes referenced by + using a prefix of 'U+', followed by four to six hexadecimal digits. + + To represent characters outside US-ASCII in examples, this document + uses two notations: 'XML Notation' and 'Bidi Notation'. + + + + + + +Duerst & Suignard Standards Track [Page 5] + +RFC 3987 Internationalized Resource Identifiers January 2005 + + + XML Notation uses a leading '&#x', a trailing ';', and the + hexadecimal number of the character in the UCS in between. For + example, &#x44F; stands for CYRILLIC CAPITAL LETTER YA. In this + notation, an actual '&' is denoted by '&amp;'. + + Bidi Notation is used for bidirectional examples: Lowercase letters + stand for Latin letters or other letters that are written left to + right, whereas uppercase letters represent Arabic or Hebrew letters + that are written right to left. + + To denote actual octets in examples (as opposed to percent-encoded + octets), the two hex digits denoting the octet are enclosed in "<" + and ">". For example, the octet often denoted as 0xc9 is denoted + here as <c9>. + + In this document, the key words "MUST", "MUST NOT", "REQUIRED", + "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", + and "OPTIONAL" are to be interpreted as described in [RFC2119]. + +2. IRI Syntax + + This section defines the syntax of Internationalized Resource + Identifiers (IRIs). + + As with URIs, an IRI is defined as a sequence of characters, not as a + sequence of octets. This definition accommodates the fact that IRIs + may be written on paper or read over the radio as well as stored or + transmitted digitally. The same IRI may be represented as different + sequences of octets in different protocols or documents if these + protocols or documents use different character encodings (and/or + transfer encodings).
Using the same character encoding as the + containing protocol or document ensures that the characters in the + IRI can be handled (e.g., searched, converted, displayed) in the same + way as the rest of the protocol or document. + +2.1. Summary of IRI Syntax + + IRIs are defined similarly to URIs in [RFC3986], but the class of + unreserved characters is extended by adding the characters of the UCS + (Universal Character Set, [ISO10646]) beyond U+007F, subject to the + limitations given in the syntax rules below and in section 6.1. + + Otherwise, the syntax and use of components and reserved characters + is the same as that in [RFC3986]. All the operations defined in + [RFC3986], such as the resolution of relative references, can be + applied to IRIs by IRI-processing software in exactly the same way as + they are for URIs by URI-processing software. + + + + +Duerst & Suignard Standards Track [Page 6] + +RFC 3987 Internationalized Resource Identifiers January 2005 + + + Characters outside the US-ASCII repertoire are not reserved and + therefore MUST NOT be used for syntactical purposes, such as to + delimit components in newly defined schemes. For example, U+00A2, + CENT SIGN, is not allowed as a delimiter in IRIs, because it is in + the 'iunreserved' category. This is similar to the fact that it is + not possible to use '-' as a delimiter in URIs, because it is in the + 'unreserved' category. + +2.2. ABNF for IRI References and IRIs + + Although it might be possible to define IRI references and IRIs + merely by their transformation to URI references and URIs, they can + also be accepted and processed directly. Therefore, an ABNF + definition for IRI references (which are the most general concept and + the start of the grammar) and IRIs is given here. The syntax of this + ABNF is described in [RFC2234]. Character numbers are taken from the + UCS, without implying any actual binary encoding. Terminals in the + ABNF are characters, not bytes. 
+ + The following grammar closely follows the URI grammar in [RFC3986], + except that the range of unreserved characters is expanded to include + UCS characters, with the restriction that private UCS characters can + occur only in query parts. The grammar is split into two parts: + Rules that differ from [RFC3986] because of the above-mentioned + expansion, and rules that are the same as those in [RFC3986]. For + rules that are different than those in [RFC3986], the names of the + non-terminals have been changed as follows. If the non-terminal + contains 'URI', this has been changed to 'IRI'. Otherwise, an 'i' + has been prefixed. + + The following rules are different from those in [RFC3986]: + + IRI = scheme ":" ihier-part [ "?" iquery ] + [ "#" ifragment ] + + ihier-part = "//" iauthority ipath-abempty + / ipath-absolute + / ipath-rootless + / ipath-empty + + IRI-reference = IRI / irelative-ref + + absolute-IRI = scheme ":" ihier-part [ "?" iquery ] + + irelative-ref = irelative-part [ "?" 
iquery ] [ "#" ifragment ] + + irelative-part = "//" iauthority ipath-abempty + / ipath-absolute + + + +Duerst & Suignard Standards Track [Page 7] + +RFC 3987 Internationalized Resource Identifiers January 2005 + + + / ipath-noscheme + / ipath-empty + + iauthority = [ iuserinfo "@" ] ihost [ ":" port ] + iuserinfo = *( iunreserved / pct-encoded / sub-delims / ":" ) + ihost = IP-literal / IPv4address / ireg-name + + ireg-name = *( iunreserved / pct-encoded / sub-delims ) + + ipath = ipath-abempty ; begins with "/" or is empty + / ipath-absolute ; begins with "/" but not "//" + / ipath-noscheme ; begins with a non-colon segment + / ipath-rootless ; begins with a segment + / ipath-empty ; zero characters + + ipath-abempty = *( "/" isegment ) + ipath-absolute = "/" [ isegment-nz *( "/" isegment ) ] + ipath-noscheme = isegment-nz-nc *( "/" isegment ) + ipath-rootless = isegment-nz *( "/" isegment ) + ipath-empty = 0<ipchar> + + isegment = *ipchar + isegment-nz = 1*ipchar + isegment-nz-nc = 1*( iunreserved / pct-encoded / sub-delims + / "@" ) + ; non-zero-length segment without any colon ":" + + ipchar = iunreserved / pct-encoded / sub-delims / ":" + / "@" + + iquery = *( ipchar / iprivate / "/" / "?" ) + + ifragment = *( ipchar / "/" / "?" ) + + iunreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar + + ucschar = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF + / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD + / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD + / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD + / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD + / %xD0000-DFFFD / %xE1000-EFFFD + + iprivate = %xE000-F8FF / %xF0000-FFFFD / %x100000-10FFFD + + Some productions are ambiguous. The "first-match-wins" (a.k.a. + "greedy") algorithm applies. For details, see [RFC3986]. 
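The character classes introduced by the grammar above can be checked mechanically. The following is a minimal Python sketch, not part of the RFC text: the names UCSCHAR, IPRIVATE, is_ucschar, is_iprivate, and is_iunreserved are illustrative assumptions, and only the code point ranges are taken from the 'ucschar', 'iprivate', and 'iunreserved' rules.

```python
# Code point ranges transcribed from the 'ucschar' and 'iprivate' rules.
UCSCHAR = [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
# Planes 1 through 13 each contribute the range %xP0000-PFFFD.
UCSCHAR += [(p << 16, (p << 16) | 0xFFFD) for p in range(0x1, 0xE)]
UCSCHAR.append((0xE1000, 0xEFFFD))
IPRIVATE = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]


def _in_ranges(ranges, ch):
    """Return True if the single character ch falls in any (lo, hi) range."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)


def is_ucschar(ch):
    return _in_ranges(UCSCHAR, ch)


def is_iprivate(ch):
    return _in_ranges(IPRIVATE, ch)


def is_iunreserved(ch):
    # iunreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar
    ascii_ok = ch.isascii() and (ch.isalnum() or ch in "-._~")
    return ascii_ok or is_ucschar(ch)
```

With these helpers, U+00A2 (CENT SIGN) is reported as 'iunreserved', while U+E000 is accepted only by is_iprivate, matching the restriction that private-use characters occur only in query parts.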
+ + + + +Duerst & Suignard Standards Track [Page 8] + +RFC 3987 Internationalized Resource Identifiers January 2005 + + + The following rules are the same as those in [RFC3986]: + + scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) + + port = *DIGIT + + IP-literal = "[" ( IPv6address / IPvFuture ) "]" + + IPvFuture = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" ) + + IPv6address = 6( h16 ":" ) ls32 + / "::" 5( h16 ":" ) ls32 + / [ h16 ] "::" 4( h16 ":" ) ls32 + / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32 + / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32 + / [ *3( h16 ":" ) h16 ] "::" h16 ":" ls32 + / [ *4( h16 ":" ) h16 ] "::" ls32 + / [ *5( h16 ":" ) h16 ] "::" h16 + / [ *6( h16 ":" ) h16 ] "::" + + h16 = 1*4HEXDIG + ls32 = ( h16 ":" h16 ) / IPv4address + + IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet + + dec-octet = DIGIT ; 0-9 + / %x31-39 DIGIT ; 10-99 + / "1" 2DIGIT ; 100-199 + / "2" %x30-34 DIGIT ; 200-249 + / "25" %x30-35 ; 250-255 + + pct-encoded = "%" HEXDIG HEXDIG + + unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" + reserved = gen-delims / sub-delims + gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@" + sub-delims = "!" / "$" / "&" / "'" / "(" / ")" + / "*" / "+" / "," / ";" / "=" + + This syntax does not support IPv6 scoped addressing zone identifiers. + + + + + + + + + + + +Duerst & Suignard Standards Track [Page 9] + +RFC 3987 Internationalized Resource Identifiers January 2005 + + +3. Relationship between IRIs and URIs + + IRIs are meant to replace URIs in identifying resources for + protocols, formats, and software components that use a UCS-based + character repertoire. These protocols and components may never need + to use URIs directly, especially when the resource identifier is used + simply for identification purposes. 
However, when the resource + identifier is used for resource retrieval, it is in many cases + necessary to determine the associated URI, because currently most + retrieval mechanisms are only defined for URIs. In this case, IRIs + can serve as presentation elements for URI protocol elements. An + example would be an address bar in a Web user agent. (Additional + rationale is given in section 3.1.) + +3.1. Mapping of IRIs to URIs + + This section defines how to map an IRI to a URI. Everything in this + section also applies to IRI references and URI references, as well as + to components thereof (for example, fragment identifiers). + + This mapping has two purposes: + + Syntaxical. Many URI schemes and components define additional + syntactical restrictions not captured in section 2.2. + Scheme-specific restrictions are applied to IRIs by converting + IRIs to URIs and checking the URIs against the scheme-specific + restrictions. + + Interpretational. URIs identify resources in various ways. IRIs also + identify resources. When the IRI is used solely for + identification purposes, it is not necessary to map the IRI to a + URI (see section 5). However, when an IRI is used for resource + retrieval, the resource that the IRI locates is the same as the + one located by the URI obtained after converting the IRI according + to the procedure defined here. This means that there is no need + to define resolution separately on the IRI level. + + Applications MUST map IRIs to URIs by using the following two steps. + + Step 1. Generate a UCS character sequence from the original IRI + format. This step has the following three variants, + depending on the form of the input: + + a. If the IRI is written on paper, read aloud, or otherwise + represented as a sequence of characters independent of + any character encoding, represent the IRI as a sequence + of characters from the UCS normalized according to + Normalization Form C (NFC, [UTR15]). 
+ + + +Duerst & Suignard Standards Track [Page 10] + +RFC 3987 Internationalized Resource Identifiers January 2005 + + + b. If the IRI is in some digital representation (e.g., an + octet stream) in some known non-Unicode character + encoding, convert the IRI to a sequence of characters + from the UCS normalized according to NFC. + + c. If the IRI is in a Unicode-based character encoding (for + example, UTF-8 or UTF-16), do not normalize (see section + 5.3.2.2 for details). Apply step 2 directly to the + encoded Unicode character sequence. + + Step 2. For each character in 'ucschar' or 'iprivate', apply steps + 2.1 through 2.3 below. + + 2.1. Convert the character to a sequence of one or more octets + using UTF-8 [RFC3629]. + + 2.2. Convert each octet to %HH, where HH is the hexadecimal + notation of the octet value. Note that this is identical + to the percent-encoding mechanism in section 2.1 of + [RFC3986]. To reduce variability, the hexadecimal notation + SHOULD use uppercase letters. + + 2.3. Replace the original character with the resulting character + sequence (i.e., a sequence of %HH triplets). + + The above mapping from IRIs to URIs produces URIs fully conforming to + [RFC3986]. The mapping is also an identity transformation for URIs + and is idempotent; applying the mapping a second time will not + change anything. Every URI is by definition an IRI. + + Systems accepting IRIs MAY convert the ireg-name component of an IRI + as follows (before step 2 above) for schemes known to use domain + names in ireg-name, if the scheme definition does not allow + percent-encoding for ireg-name: + + Replace the ireg-name part of the IRI by the part converted using the + ToASCII operation specified in section 4.1 of [RFC3490] on each + dot-separated label, and by using U+002E (FULL STOP) as a label + separator, with the flag UseSTD3ASCIIRules set to TRUE, and with the + flag AllowUnassigned set to FALSE for creating IRIs and set to TRUE + otherwise. 
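Step 2 of the mapping above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reference implementation: the function name iri_to_uri and the helper names are hypothetical, step 1 (NFC normalization of non-Unicode input) is assumed to have been done by the caller, and the optional ToASCII conversion of ireg-name is not applied.

```python
# Ranges transcribed from the 'ucschar' and 'iprivate' rules in section 2.2.
_RANGES = [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
_RANGES += [(p << 16, (p << 16) | 0xFFFD) for p in range(0x1, 0xE)]
_RANGES += [(0xE1000, 0xEFFFD)]
_RANGES += [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]


def _needs_encoding(ch):
    """True for characters in 'ucschar' or 'iprivate'."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in _RANGES)


def iri_to_uri(iri):
    """Apply steps 2.1 through 2.3 to every matching character."""
    out = []
    for ch in iri:
        if _needs_encoding(ch):
            # 2.1: UTF-8 octets; 2.2: %HH with uppercase hex digits.
            out.append("".join("%%%02X" % b for b in ch.encode("utf-8")))
        else:
            # Everything else, including existing %HH triplets, is untouched.
            out.append(ch)
    return "".join(out)
```

Because existing percent-encodings and plain US-ASCII characters pass through unchanged, applying the function to its own output returns the same string, which is the idempotence property stated above.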
+ + + + + + + + + + +Duerst & Suignard Standards Track [Page 11] + +RFC 3987 Internationalized Resource Identifiers January 2005 + + + The ToASCII operation may fail, but this would mean that the IRI + cannot be resolved. This conversion SHOULD be used when the goal is + to maximize interoperability with legacy URI resolvers. For example, + the IRI + + "http://r&#xE9;sum&#xE9;.example.org" + + may be converted to + + "http://xn--rsum-bpad.example.org" + + instead of + + "http://r%C3%A9sum%C3%A9.example.org". + + An IRI with a scheme that is known to use domain names in ireg-name, + but where the scheme definition does not allow percent-encoding for + ireg-name, meets scheme-specific restrictions if either the + straightforward conversion or the conversion using the ToASCII + operation on ireg-name result in an URI that meets the scheme- + specific restrictions. + + Such an IRI resolves to the URI obtained after converting the IRI and + uses the ToASCII operation on ireg-name. Implementations do not have + to do this conversion as long as they produce the same result. + + Note: The difference between variants b and c in step 1 (using + normalization with NFC, versus not using any normalization) + accounts for the fact that in many non-Unicode character + encodings, some text cannot be represented directly. For example, + the word "Vietnam" is natively written "Vi&#x1EC7;t Nam" + (containing a LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW) + in NFC, but a direct transcoding from the windows-1258 character + encoding leads to "Vi&#xEA;&#x323;t Nam" (containing a LATIN SMALL + LETTER E WITH CIRCUMFLEX followed by a COMBINING DOT BELOW). + Direct transcoding of other 8-bit encodings of Vietnamese may lead + to other representations. + + Note: The uniform treatment of the whole IRI in step 2 is important + to make processing independent of URI scheme. See [Gettys] for an + in-depth discussion.
+ + Note: In practice, whether the general mapping (steps 1 and 2) or the + ToASCII operation of [RFC3490] is used for ireg-name will not be + noticed if mapping from IRI to URI and resolution is tightly + integrated (e.g., carried out in the same user agent). But + + + + + +Duerst & Suignard Standards Track [Page 12] + +RFC 3987 Internationalized Resource Identifiers January 2005 + + + conversion using [RFC3490] may be able to better deal with + backwards compatibility issues in case mapping and resolution are + separated, as in the case of using an HTTP proxy. + + Note: Internationalized Domain Names may be contained in parts of an + IRI other than the ireg-name part. It is the responsibility of + scheme-specific implementations (if the Internationalized Domain + Name is part of the scheme syntax) or of server-side + implementations (if the Internationalized Domain Name is part of + 'iquery') to apply the necessary conversions at the appropriate + point. Example: Trying to validate the Web page at + http://r&#xE9;sum&#xE9;.example.org would lead to an IRI of + http://validator.w3.org/check?uri=http%3A%2F%2Fr&#xE9;sum&#xE9;. + example.org, which would convert to a URI of + http://validator.w3.org/check?uri=http%3A%2F%2Fr%C3%A9sum%C3%A9. + example.org. The server side implementation would be responsible + for making the necessary conversions to be able to retrieve the + Web page. + + Systems accepting IRIs MAY also deal with the printable characters in + US-ASCII that are not allowed in URIs, namely "<", ">", '"', space, + "{", "}", "|", "\", "^", and "`", in step 2 above. If these + characters are found but are not converted, then the conversion + SHOULD fail. Please note that the number sign ("#"), the percent + sign ("%"), and the square bracket characters ("[", "]") are not part + of the above list and MUST NOT be converted.
Protocols and formats + that have used earlier definitions of IRIs including these characters + MAY require percent-encoding of these characters as a preprocessing + step to extract the actual IRI from a given field. This + preprocessing MAY also be used by applications allowing the user to + enter an IRI. + + Note: In this process (in step 2.3), characters allowed in URI + references and existing percent-encoded sequences are not encoded + further. (This mapping is similar to, but different from, the + encoding applied when arbitrary content is included in some part + of a URI.) For example, an IRI of + "http://www.example.org/red%09ros&#xE9;#red" (in XML notation) is + converted to + "http://www.example.org/red%09ros%C3%A9#red", not to something + like + "http%3A%2F%2Fwww.example.org%2Fred%2509ros%C3%A9%23red". + + Note: Some older software transcoding to UTF-8 may produce illegal + output for some input, in particular for characters outside the + BMP (Basic Multilingual Plane). As an example, for the IRI with + non-BMP characters (in XML Notation): + "http://example.com/&#x10300;&#x10301;&#x10302;" + + + +Duerst & Suignard Standards Track [Page 13] + +RFC 3987 Internationalized Resource Identifiers January 2005 + + + which contains the first three letters of the Old Italic alphabet, + the correct conversion to a URI is + "http://example.com/%F0%90%8C%80%F0%90%8C%81%F0%90%8C%82" + +3.2. Converting URIs to IRIs + + In some situations, converting a URI into an equivalent IRI may be + desirable. This section gives a procedure for this conversion. The + conversion described in this section will always result in an IRI + that maps back to the URI used as an input for the conversion (except + for potential case differences in percent-encoding and for potential + percent-encoded unreserved characters). However, the IRI resulting + from this conversion may not be exactly the same as the original IRI + (if there ever was one).
+ + URI-to-IRI conversion removes percent-encodings, but not all + percent-encodings can be eliminated. There are several reasons for + this: + + 1. Some percent-encodings are necessary to distinguish percent- + encoded and unencoded uses of reserved characters. + + 2. Some percent-encodings cannot be interpreted as sequences of + UTF-8 octets. + + (Note: The octet patterns of UTF-8 are highly regular. + Therefore, there is a very high probability, but no guarantee, + that percent-encodings that can be interpreted as sequences of + UTF-8 octets actually originated from UTF-8. For a detailed + discussion, see [Duerst97].) + + 3. The conversion may result in a character that is not appropriate + in an IRI. See sections 2.2, 4.1, and 6.1 for further details. + + Conversion from a URI to an IRI is done by using the following steps + (or any other algorithm that produces the same result): + + 1. Represent the URI as a sequence of octets in US-ASCII. + + 2. Convert all percent-encodings ("%" followed by two hexadecimal + digits) to the corresponding octets, except those corresponding + to "%", characters in "reserved", and characters in US-ASCII not + allowed in URIs. + + 3. Re-percent-encode any octet produced in step 2 that is not part + of a strictly legal UTF-8 octet sequence. + + + + + +Duerst & Suignard Standards Track [Page 14] + +RFC 3987 Internationalized Resource Identifiers January 2005 + + + 4. Re-percent-encode all octets produced in step 3 that in UTF-8 + represent characters that are not appropriate according to + sections 2.2, 4.1, and 6.1. + + 5. Interpret the resulting octet sequence as a sequence of characters + encoded in UTF-8. + + This procedure will convert as many percent-encoded characters as + possible to characters in an IRI. Because there are some choices + when step 4 is applied (see section 6.1), results may vary. 
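The five conversion steps above can be sketched as follows in Python. This is a hedged illustration, not the procedure's definitive form: the name uri_to_iri and all helpers are hypothetical, step 4 checks only the 'ucschar' ranges of section 2.2 and the bidirectional formatting characters forbidden by section 4.1 (the further choices of section 6.1 are not modelled), and private-use characters are conservatively left percent-encoded everywhere.

```python
import re

_UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
# LRM, RLM, LRE, RLE, PDF, LRO, RLO (assumed list for section 4.1).
_BIDI_FORMATTING = {0x200E, 0x200F, 0x202A, 0x202B, 0x202C, 0x202D, 0x202E}
_UCSCHAR = [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
_UCSCHAR += [(p << 16, (p << 16) | 0xFFFD) for p in range(0x1, 0xE)]
_UCSCHAR += [(0xE1000, 0xEFFFD)]
_PCT_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def _pct(octets):
    return "".join("%%%02X" % b for b in octets)


def _appropriate(ch):
    cp = ord(ch)
    in_ucschar = any(lo <= cp <= hi for lo, hi in _UCSCHAR)
    return in_ucschar and cp not in _BIDI_FORMATTING


def _convert_run(match):
    # Step 2: turn a run of %HH triplets into octets.
    octets = bytes.fromhex(match.group(0).replace("%", ""))
    out, i = [], 0
    while i < len(octets):
        b = octets[i]
        if b < 0x80:
            # "%", reserved, and disallowed US-ASCII stay percent-encoded.
            ch = chr(b)
            out.append(ch if ch in _UNRESERVED else _pct([b]))
            i += 1
            continue
        for n in (2, 3, 4):
            chunk = octets[i:i + n]
            try:
                ch = chunk.decode("utf-8")  # strict decoding
            except UnicodeDecodeError:
                continue
            # Step 4: re-encode characters that are not appropriate.
            out.append(ch if _appropriate(ch) else _pct(chunk))
            i += n
            break
        else:
            # Step 3: octet is not part of a legal UTF-8 sequence.
            out.append(_pct([b]))
            i += 1
    return "".join(out)


def uri_to_iri(uri):
    # Steps 1 and 5 are implicit: input and output are Python strings.
    return _PCT_RUN.sub(_convert_run, uri)
```

Python's strict UTF-8 decoder rejects overlong forms and surrogates, which is what "strictly legal" requires here. Octets that are re-percent-encoded come out with uppercase hex digits, so the case of the original triplets is not always preserved.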
+ + Conversions from URIs to IRIs MUST NOT use any character encoding + other than UTF-8 in steps 3 and 4, even if it might be possible to + guess from the context that another character encoding than UTF-8 was + used in the URI. For example, the URI + "http://www.example.org/r%E9sum%E9.html" might with some guessing be + interpreted to contain two e-acute characters encoded as iso-8859-1. + It must not be converted to an IRI containing these e-acute + characters. Otherwise, in the future the IRI will be mapped to + "http://www.example.org/r%C3%A9sum%C3%A9.html", which is a different + URI from "http://www.example.org/r%E9sum%E9.html". + +3.2.1. Examples + + This section shows various examples of converting URIs to IRIs. Each + example shows the result after each of the steps 1 through 5 is + applied. XML Notation is used for the final result. Octets are + denoted by "<" followed by two hexadecimal digits followed by ">". + + The following example contains the sequence "%C3%BC", which is a + strictly legal UTF-8 sequence, and which is converted into the actual + character U+00FC, LATIN SMALL LETTER U WITH DIAERESIS (also known as + u-umlaut). + + 1. http://www.example.org/D%C3%BCrst + + 2. http://www.example.org/D<c3><bc>rst + + 3. http://www.example.org/D<c3><bc>rst + + 4. http://www.example.org/D<c3><bc>rst + + 5. http://www.example.org/Dürst + + The following example contains the sequence "%FC", which might + represent U+00FC, LATIN SMALL LETTER U WITH DIAERESIS, in the + iso-8859-1 character encoding. (It might represent other characters + in other character encodings. For example, the octet <fc> in + + + +Duerst & Suignard Standards Track [Page 15] + +RFC 3987 Internationalized Resource Identifiers January 2005 + + + iso-8859-5 represents U+045C, CYRILLIC SMALL LETTER KJE.) Because + <fc> is not part of a strictly legal UTF-8 sequence, it is + re-percent-encoded in step 3. + + 1. http://www.example.org/D%FCrst + + 2. http://www.example.org/D<fc>rst + + 3. 
http://www.example.org/D%FCrst + + 4. http://www.example.org/D%FCrst + + 5. http://www.example.org/D%FCrst + + The following example contains "%e2%80%ae", which is the percent- + encoded UTF-8 character encoding of U+202E, RIGHT-TO-LEFT OVERRIDE. + Section 4.1 forbids the direct use of this character in an IRI. + Therefore, the corresponding octets are re-percent-encoded in step 4. + This example shows that the case (upper- or lowercase) of letters + used in percent-encodings may not be preserved. The example also + contains a punycode-encoded domain name label (xn--99zt52a), which is + not converted. + + 1. http://xn--99zt52a.example.org/%e2%80%ae + + 2. http://xn--99zt52a.example.org/<e2><80><ae> + + 3. http://xn--99zt52a.example.org/<e2><80><ae> + + 4. http://xn--99zt52a.example.org/%E2%80%AE + + 5. http://xn--99zt52a.example.org/%E2%80%AE + + Implementations with scheme-specific knowledge MAY convert + punycode-encoded domain name labels to the corresponding characters + by using the ToUnicode procedure. Thus, for the example above, the + label "xn--99zt52a" may be converted to U+7D0D U+8C46 (Japanese + Natto), leading to the overall IRI of + "http://納豆.example.org/%E2%80%AE". + +4. Bidirectional IRIs for Right-to-Left Languages + + Some UCS characters, such as those used in the Arabic and Hebrew + scripts, have an inherent right-to-left (rtl) writing direction. + IRIs containing these characters (called bidirectional IRIs or Bidi + IRIs) require additional attention because of the non-trivial + + + + + +Duerst & Suignard Standards Track [Page 16] + +RFC 3987 Internationalized Resource Identifiers January 2005 + + + relation between logical representation (used for digital + representation and for reading/spelling) and visual representation + (used for display/printing). + + Because of the complex interaction between the logical + representation, the visual representation, and the syntax of a Bidi + IRI, a balance is needed between various requirements. 
The main + requirements are + + 1. user-predictable conversion between visual and logical + representation; + + 2. the ability to include a wide range of characters in various + parts of the IRI; and + + 3. minor or no changes or restrictions for implementations. + +4.1. Logical Storage and Visual Presentation + + When stored or transmitted in digital representation, bidirectional + IRIs MUST be in full logical order and MUST conform to the IRI syntax + rules (which includes the rules relevant to their scheme). This + ensures that bidirectional IRIs can be processed in the same way as + other IRIs. + + Bidirectional IRIs MUST be rendered by using the Unicode + Bidirectional Algorithm [UNIV4], [UNI9]. Bidirectional IRIs MUST be + rendered in the same way as they would be if they were in a + left-to-right embedding; i.e., as if they were preceded by U+202A, + LEFT-TO-RIGHT EMBEDDING (LRE), and followed by U+202C, POP + DIRECTIONAL FORMATTING (PDF). Setting the embedding direction can + also be done in a higher-level protocol (e.g., the dir='ltr' + attribute in HTML). + + There is no requirement to use the above embedding if the display is + still the same without the embedding. For example, a bidirectional + IRI in a text with left-to-right base directionality (such as used + for English or Cyrillic) that is preceded and followed by whitespace + and strong left-to-right characters does not need an embedding. + Also, a bidirectional relative IRI reference that only contains + strong right-to-left characters and weak characters and that starts + and ends with a strong right-to-left character and appears in a text + with right-to-left base directionality (such as used for Arabic or + Hebrew) and is preceded and followed by whitespace and strong + characters does not need an embedding. 
+ + + + + + +Duerst & Suignard Standards Track [Page 17] + +RFC 3987 Internationalized Resource Identifiers January 2005 + + + In some other cases, using U+200E, LEFT-TO-RIGHT MARK (LRM), may be + sufficient to force the correct display behavior. However, the + details of the Unicode Bidirectional algorithm are not always easy to + understand. Implementers are strongly advised to err on the side of + caution and to use embedding in all cases where they are not + completely sure that the display behavior is unaffected without the + embedding. + + The Unicode Bidirectional Algorithm ([UNI9], section 4.3) permits + higher-level protocols to influence bidirectional rendering. Such + changes by higher-level protocols MUST NOT be used if they change the + rendering of IRIs. + + The bidirectional formatting characters that may be used before or + after the IRI to ensure correct display are not themselves part of + the IRI. IRIs MUST NOT contain bidirectional formatting characters + (LRM, RLM, LRE, RLE, LRO, RLO, and PDF). They affect the visual + rendering of the IRI but do not appear themselves. It would + therefore not be possible to input an IRI with such characters + correctly. + +4.2. Bidi IRI Structure + + The Unicode Bidirectional Algorithm is designed mainly for running + text. To make sure that it does not affect the rendering of + bidirectional IRIs too much, some restrictions on bidirectional IRIs + are necessary. These restrictions are given in terms of delimiters + (structural characters, mostly punctuation such as "@", ".", ":", and + "/") and components (usually consisting mostly of letters and + digits). + + The following syntax rules from section 2.2 correspond to components + for the purpose of Bidi behavior: iuserinfo, ireg-name, isegment, + isegment-nz, isegment-nz-nc, ireg-name, iquery, and ifragment. 
+ + Specifications that define the syntax of any of the above components + MAY divide them further and define smaller parts to be components + according to this document. As an example, the restrictions of + [RFC3490] on bidirectional domain names correspond to treating each + label of a domain name as a component for schemes with ireg-name as a + domain name. Even where the components are not defined formally, it + may be helpful to think about some syntax in terms of components and + to apply the relevant restrictions. For example, for the usual + name/value syntax in query parts, it is convenient to treat each name + and each value as a component. As another example, the extensions in + a resource name can be treated as separate components. + + + + + +Duerst & Suignard Standards Track [Page 18] + +RFC 3987 Internationalized Resource Identifiers January 2005 + + + For each component, the following restrictions apply: + + 1. A component SHOULD NOT use both right-to-left and left-to-right + characters. + + 2. A component using right-to-left characters SHOULD start and end + with right-to-left characters. + + The above restrictions are given as shoulds, rather than as musts. + For IRIs that are never presented visually, they are not relevant. + However, for IRIs in general, they are very important to ensure + consistent conversion between visual presentation and logical + representation, in both directions. + + Note: In some components, the above restrictions may actually be + strictly enforced. For example, [RFC3490] requires that these + restrictions apply to the labels of a host name for those schemes + where ireg-name is a host name. In some other components (for + example, path components) following these restrictions may not be + too difficult. For other components, such as parts of the query + part, it may be very difficult to enforce the restrictions because + the values of query parameters may be arbitrary character + sequences. 
+ + If the above restrictions cannot be satisfied otherwise, the affected + component can always be mapped to URI notation as described in + section 3.1. Please note that the whole component has to be mapped + (see also Example 9 below). + +4.3. Input of Bidi IRIs + + Bidi input methods MUST generate Bidi IRIs in logical order while + rendering them according to section 4.1. During input, rendering + SHOULD be updated after every new character is input to avoid end- + user confusion. + +4.4. Examples + + This section gives examples of bidirectional IRIs, in Bidi Notation. + It shows legal IRIs with the relationship between logical and visual + representation and explains how certain phenomena in this + relationship may look strange to somebody not familiar with + bidirectional behavior, but familiar to users of Arabic and Hebrew. + It also shows what happens if the restrictions given in section 4.2 + are not followed. The examples below can be seen at [BidiEx], in + Arabic, Hebrew, and Bidi Notation variants. + + + + + +Duerst & Suignard Standards Track [Page 19] + +RFC 3987 Internationalized Resource Identifiers January 2005 + + + To read the bidi text in the examples, read the visual representation + from left to right until you encounter a block of rtl text. Read the + rtl block (including slashes and other special characters) from right + to left, then continue at the next unread ltr character. + + Example 1: A single component with rtl characters is inverted: + Logical representation: "http://ab.CDEFGH.ij/kl/mn/op.html" + Visual representation: "http://ab.HGFEDC.ij/kl/mn/op.html" + Components can be read one by one, and each component can be read in + its natural direction. 
+ + Example 2: More than one consecutive component with rtl characters is + inverted as a whole: + Logical representation: "http://ab.CDE.FGH/ij/kl/mn/op.html" + Visual representation: "http://ab.HGF.EDC/ij/kl/mn/op.html" + A sequence of rtl components is read rtl, in the same way as a + sequence of rtl words is read rtl in a bidi text. + + Example 3: All components of an IRI (except for the scheme) are rtl. + All rtl components are inverted overall: + Logical representation: "http://AB.CD.EF/GH/IJ/KL?MN=OP;QR=ST#UV" + Visual representation: "http://VU#TS=RQ;PO=NM?LK/JI/HG/FE.DC.BA" + The whole IRI (except the scheme) is read rtl. Delimiters between + rtl components stay between the respective components; delimiters + between ltr and rtl components don't move. + + Example 4: Each of several sequences of rtl components is inverted on + its own: + Logical representation: "http://AB.CD.ef/gh/IJ/KL.html" + Visual representation: "http://DC.BA.ef/gh/LK/JI.html" + Each sequence of rtl components is read rtl, in the same way as each + sequence of rtl words in an ltr text is read rtl. + + Example 5: Example 2, applied to components of different kinds: + Logical representation: "http://ab.cd.EF/GH/ij/kl.html" + Visual representation: "http://ab.cd.HG/FE/ij/kl.html" + The inversion of the domain name label and the path component may be + unexpected, but it is consistent with other bidi behavior. For + reassurance that the domain component really is "ab.cd.EF", it may be + helpful to read aloud the visual representation following the bidi + algorithm. After "http://ab.cd." one reads the RTL block + "E-F-slash-G-H", which corresponds to the logical representation. + + Example 6: Same as Example 5, with more rtl components: + Logical representation: "http://ab.CD.EF/GH/IJ/kl.html" + Visual representation: "http://ab.JI/HG/FE.DC/kl.html" + The inversion of the domain name labels and the path components may + be easier to identify because the delimiters also move. 
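Examples 1 through 6 above can be reproduced with a deliberately simplified model of the reordering, sketched here in Python. It is not the Unicode Bidirectional Algorithm: it assumes the Bidi Notation convention that uppercase letters are rtl and lowercase letters are ltr, treats every other character as a neutral delimiter, assumes a left-to-right embedding, and does not handle digits. The function name visual_order is hypothetical.

```python
def visual_order(logical: str) -> str:
    """Toy logical-to-visual reordering for Bidi Notation strings."""
    # Classify: uppercase = rtl (R), lowercase = ltr (L), rest = neutral (N).
    classes = []
    for ch in logical:
        if ch.isascii() and ch.isalpha():
            classes.append("R" if ch.isupper() else "L")
        else:
            classes.append("N")

    # Resolve neutrals: a run of delimiters becomes rtl only when it sits
    # between two rtl characters; otherwise it takes the ltr embedding.
    n, i = len(classes), 0
    while i < n:
        if classes[i] != "N":
            i += 1
            continue
        j = i
        while j < n and classes[j] == "N":
            j += 1
        before = classes[i - 1] if i > 0 else "L"
        after = classes[j] if j < n else "L"
        resolved = "R" if before == "R" and after == "R" else "L"
        for k in range(i, j):
            classes[k] = resolved
        i = j

    # Reverse each maximal rtl run; ltr runs keep their order.
    out, i = [], 0
    while i < n:
        j = i
        while j < n and classes[j] == classes[i]:
            j += 1
        run = logical[i:j]
        out.append(run[::-1] if classes[i] == "R" else run)
        i = j
    return "".join(out)
```

The model makes the rule stated in Example 3 visible: delimiters between rtl components travel with the reversed run, while delimiters between ltr and rtl components stay where they are.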
+ + + +Duerst & Suignard Standards Track [Page 20] + +RFC 3987 Internationalized Resource Identifiers January 2005 + + + Example 7: A single rtl component includes digits: + Logical representation: "http://ab.CDE123FGH.ij/kl/mn/op.html" + Visual representation: "http://ab.HGF123EDC.ij/kl/mn/op.html" + Numbers are written ltr in all cases but are treated as an additional + embedding inside a run of rtl characters. This is completely + consistent with usual bidirectional text. + + Example 8 (not allowed): Numbers are at the start or end of an rtl + component: + Logical representation: "http://ab.cd.ef/GH1/2IJ/KL.html" + Visual representation: "http://ab.cd.ef/LK/JI1/2HG.html" + The sequence "1/2" is interpreted by the bidi algorithm as a + fraction, fragmenting the components and leading to confusion. There + are other characters that are interpreted in a special way close to + numbers; in particular, "+", "-", "#", "$", "%", ",", ".", and ":". + + Example 9 (not allowed): The numbers in the previous example are + percent-encoded: + Logical representation: "http://ab.cd.ef/GH%31/%32IJ/KL.html", + Visual representation (Hebrew): "http://ab.cd.ef/%31HG/LK/JI%32.html" + Visual representation (Arabic): "http://ab.cd.ef/31%HG/%LK/JI32.html" + Depending on whether the uppercase letters represent Arabic or + Hebrew, the visual representation is different. + + Example 10 (allowed but not recommended): + Logical representation: "http://ab.CDEFGH.123/kl/mn/op.html" + Visual representation: "http://ab.123.HGFEDC/kl/mn/op.html" + Components consisting of only numbers are allowed (it would be rather + difficult to prohibit them), but these may interact with adjacent RTL + components in ways that are not easy to predict. + +5. Normalization and Comparison + + Note: The structure and much of the material for this section is + taken from section 6 of [RFC3986]; the differences are due to the + specifics of IRIs. 
+ + One of the most common operations on IRIs is simple comparison: + Determining whether two IRIs are equivalent without using the IRIs or + the mapped URIs to access their respective resource(s). A comparison + is performed whenever a response cache is accessed, a browser checks + its history to color a link, or an XML parser processes tags within a + namespace. Extensive normalization prior to comparison of IRIs may + be used by spiders and indexing engines to prune a search space or + reduce duplication of request actions and response storage. + + + + + + +Duerst & Suignard Standards Track [Page 21] + +RFC 3987 Internationalized Resource Identifiers January 2005 + + + IRI comparison is performed for some particular purpose. Protocols + or implementations that compare IRIs for different purposes will + often be subject to differing design trade-offs in regards to how + much effort should be spent in reducing aliased identifiers. This + section describes various methods that may be used to compare IRIs, + the trade-offs between them, and the types of applications that might + use them. + +5.1. Equivalence + + Because IRIs exist to identify resources, presumably they should be + considered equivalent when they identify the same resource. However, + this definition of equivalence is not of much practical use, as there + is no way for an implementation to compare two resources unless it + has full knowledge or control of them. For this reason, determination + of equivalence or difference of IRIs is based on string comparison, + perhaps augmented by reference to additional rules provided by URI + scheme definitions. We use the terms "different" and "equivalent" to + describe the possible outcomes of such comparisons, but there are + many application-dependent versions of equivalence. + + Even though it is possible to determine that two IRIs are equivalent, + IRI comparison is not sufficient to determine whether two IRIs + identify different resources. 
For example, an owner of two different + domain names could decide to serve the same resource from both, + resulting in two different IRIs. Therefore, comparison methods are + designed to minimize false negatives while strictly avoiding false + positives. + + In testing for equivalence, applications should not directly compare + relative references; the references should be converted to their + respective target IRIs before comparison. When IRIs are compared to + select (or avoid) a network action, such as retrieval of a + representation, fragment components (if any) should be excluded from + the comparison. + + Applications using IRIs as identity tokens with no relationship to a + protocol MUST use the Simple String Comparison (see section 5.3.1). + All other applications MUST select one of the comparison practices + from the Comparison Ladder (see section 5.3 or, after IRI-to-URI + conversion, select one of the comparison practices from the URI + comparison ladder in [RFC3986], section 6.2) + +5.2. Preparation for Comparison + + Any kind of IRI comparison REQUIRES that all escapings or encodings + in the protocol or format that carries an IRI are resolved. This is + usually done when the protocol or format is parsed. Examples of such + + + +Duerst & Suignard Standards Track [Page 22] + +RFC 3987 Internationalized Resource Identifiers January 2005 + + + escapings or encodings are entities and numeric character references + in [HTML4] and [XML1]. As an example, + "http://example.org/ros&eacute;" (in HTML), + "http://example.org/ros&#233;" (in HTML or XML), and + "http://example.org/ros&#xE9;" (in HTML or XML) are all resolved into + what is denoted in this document (see section 1.4) as + "http://example.org/ros&#xE9;" (the "&#xE9;" here standing for the + actual e-acute character, to compensate for the fact that this + document cannot contain non-ASCII characters). 
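As a small illustration of the resolution step described above, Python's standard html.unescape resolves a named entity, a decimal character reference, and a hexadecimal character reference to the same character. The variable names are hypothetical, and a real HTML or XML parser would normally perform this step while parsing the carrying document.

```python
import html

# The three spellings of the same IRI as they might appear in markup.
spellings = [
    "http://example.org/ros&eacute;",  # named entity (HTML)
    "http://example.org/ros&#233;",    # decimal reference (HTML or XML)
    "http://example.org/ros&#xE9;",    # hexadecimal reference (HTML or XML)
]

# After resolution only one distinct IRI remains.
resolved = {html.unescape(s) for s in spellings}
```

Only after this resolution does it make sense to apply any of the comparison methods to the resulting IRI.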
+ + Similar considerations apply to encodings such as Transfer Codings in + HTTP (see [RFC2616]) and Content Transfer Encodings in MIME + ([RFC2045]), although in these cases, the encoding is based not on + characters but on octets, and additional care is required to make + sure that characters, and not just arbitrary octets, are compared + (see section 5.3.1). + +5.3. Comparison Ladder + + In practice, a variety of methods are used, to test IRI equivalence. + These methods fall into a range distinguished by the amount of + processing required and the degree to which the probability of false + negatives is reduced. As noted above, false negatives cannot be + eliminated. In practice, their probability can be reduced, but this + reduction requires more processing and is not cost-effective for all + applications. + + If this range of comparison practices is considered as a ladder, the + following discussion will climb the ladder, starting with practices + that are cheap but have a relatively higher chance of producing false + negatives, and proceeding to those that have higher computational + cost and lower risk of false negatives. + +5.3.1. Simple String Comparison + + If two IRIs, when considered as character strings, are identical, + then it is safe to conclude that they are equivalent. This type of + equivalence test has very low computational cost and is in wide use + in a variety of applications, particularly in the domain of parsing. + It is also used when a definitive answer to the question of IRI + equivalence is needed that is independent of the scheme used and that + can be calculated quickly and without accessing a network. An + example of such a case is XML Namespaces ([XMLNamespace]). + + Testing strings for equivalence requires some basic precautions. This + procedure is often referred to as "bit-for-bit" or "byte-for-byte" + comparison, which is potentially misleading. 
Testing strings for + equality is normally based on pair comparison of the characters that + + + +Duerst & Suignard Standards Track [Page 23] + +RFC 3987 Internationalized Resource Identifiers January 2005 + + + make up the strings, starting from the first and proceeding until + both strings are exhausted and all characters are found to be equal, + until a pair of characters compares unequal, or until one of the + strings is exhausted before the other. + + This character comparison requires that each pair of characters be + put in comparable encoding form. For example, should one IRI be + stored in a byte array in UTF-8 encoding form and the second in a + UTF-16 encoding form, bit-for-bit comparisons applied naively will + produce errors. It is better to speak of equality on a + character-for-character rather than on a byte-for-byte or bit-for-bit + basis. In practical terms, character-by-character comparisons should + be done codepoint by codepoint after conversion to a common character + encoding form. When comparing character by character, the comparison + function MUST NOT map IRIs to URIs, because such a mapping would + create additional spurious equivalences. It follows that an IRI + SHOULD NOT be modified when being transported if there is any chance + that this IRI might be used as an identifier. + + False negatives are caused by the production and use of IRI aliases. + Unnecessary aliases can be reduced, regardless of the comparison + method, by consistently providing IRI references in an already + normalized form (i.e., a form identical to what would be produced + after normalization is applied, as described below). Protocols and + data formats often limit some IRI comparisons to simple string + comparison, based on the theory that people and implementations will, + in their own best interest, be consistent in providing IRI + references, or at least be consistent enough to negate any efficiency + that might be obtained from further normalization. 
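The precautions above can be sketched in Python, where decoding to str yields a sequence of code points. The helper name same_iri and the choice of encodings are assumptions made for illustration only.

```python
def same_iri(a: bytes, a_encoding: str, b: bytes, b_encoding: str) -> bool:
    """Simple String Comparison on code points, not on raw octets."""
    # Bring both operands to a common form (Python str) before comparing.
    return a.decode(a_encoding) == b.decode(b_encoding)


iri = "http://example.org/ros\u00e9"
as_utf8 = iri.encode("utf-8")
as_utf16 = iri.encode("utf-16-le")

# A naive byte-for-byte test reports a difference between the two
# stored forms, although they hold the same IRI.
bytes_differ = as_utf8 != as_utf16
chars_equal = same_iri(as_utf8, "utf-8", as_utf16, "utf-16-le")

# No IRI-to-URI mapping is applied, so the percent-encoded form is a
# different string under this comparison.
mapped = "http://example.org/ros%C3%A9".encode("utf-8")
mapped_equal = same_iri(as_utf8, "utf-8", mapped, "utf-8")
```

The last comparison is false by design: treating the mapped URI as equal would introduce exactly the spurious equivalence the text warns against.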
+ +5.3.2. Syntax-Based Normalization + + Implementations may use logic based on the definitions provided by + this specification to reduce the probability of false negatives. This + processing is moderately higher in cost than character-for-character + string comparison. For example, an application using this approach + could reasonably consider the following two IRIs equivalent: + + example://a/b/c/%7Bfoo%7D/rosé + eXAMPLE://a/./b/../b/%63/%7bfoo%7d/ros%C3%A9 + + Web user agents, such as browsers, typically apply this type of IRI + normalization when determining whether a cached response is + available. Syntax-based normalization includes such techniques as + case normalization, character normalization, percent-encoding + normalization, and removal of dot-segments. + + + + + +Duerst & Suignard Standards Track [Page 24] + +RFC 3987 Internationalized Resource Identifiers January 2005 + + +5.3.2.1. Case Normalization + + For all IRIs, the hexadecimal digits within a percent-encoding + triplet (e.g., "%3a" versus "%3A") are case-insensitive and therefore + should be normalized to use uppercase letters for the digits A - F. + + When an IRI uses components of the generic syntax, the component + syntax equivalence rules always apply; namely, that the scheme and + US-ASCII only host are case insensitive and therefore should be + normalized to lowercase. For example, the URI + "HTTP://www.EXAMPLE.com/" is equivalent to "http://www.example.com/". + Case equivalence for non-ASCII characters in IRI components that are + IDNs are discussed in section 5.3.3. The other generic syntax + components are assumed to be case sensitive unless specifically + defined otherwise by the scheme. + + Creating schemes that allow case-insensitive syntax components + containing non-ASCII characters should be avoided. Case normalization + of non-ASCII characters can be culturally dependent and is always a + complex operation. 
The only exception concerns non-ASCII host names + for which the character normalization includes a mapping step derived + from case folding. + +5.3.2.2. Character Normalization + + The Unicode Standard [UNIV4] defines various equivalences between + sequences of characters for various purposes. Unicode Standard Annex + #15 [UTR15] defines various Normalization Forms for these + equivalences, in particular Normalization Form C (NFC, Canonical + Decomposition, followed by Canonical Composition) and Normalization + Form KC (NFKC, Compatibility Decomposition, followed by Canonical + Composition). + + Equivalence of IRIs MUST rely on the assumption that IRIs are + appropriately pre-character-normalized rather than apply character + normalization when comparing two IRIs. The exceptions are conversion + from a non-digital form, and conversion from a non-UCS-based + character encoding to a UCS-based character encoding. In these cases, + NFC or a normalizing transcoder using NFC MUST be used for + interoperability. To avoid false negatives and problems with + transcoding, IRIs SHOULD be created by using NFC. Using NFKC may + avoid even more problems; for example, by choosing half-width Latin + letters instead of full-width ones, and full-width instead of + half-width Katakana. + + As an example, "http://www.example.org/r&#xE9;sum&#xE9;.html" (in XML + Notation) is in NFC. On the other hand, + "http://www.example.org/re&#x301;sume&#x301;.html" is not in NFC. + + + +Duerst & Suignard Standards Track [Page 25] + +RFC 3987 Internationalized Resource Identifiers January 2005 + + + The former uses precombined e-acute characters, and the latter uses + "e" characters followed by combining acute accents. Both usages are + defined as canonically equivalent in [UNIV4]. + + Note: Because it is unknown how a particular sequence of characters + is being treated with respect to character normalization, it would + be inappropriate to allow third parties to normalize an IRI + arbitrarily. 
This does not contradict the recommendation that + when a resource is created, its IRI should be as character + normalized as possible (i.e., NFC or even NFKC). This is similar + to the uppercase/lowercase problems. Some parts of a URI are case + insensitive (domain name). For others, it is unclear whether they + are case sensitive, case insensitive, or something in between + (e.g., case sensitive, but with a multiple choice selection if the + wrong case is used, instead of a direct negative result). The + best recipe is that the creator use a reasonable capitalization + and, when transferring the URI, capitalization never be changed. + + Various IRI schemes may allow the usage of Internationalized Domain + Names (IDN) [RFC3490] either in the ireg-name part or elsewhere. + Character Normalization also applies to IDNs, as discussed in section + 5.3.3. + +5.3.2.3. Percent-Encoding Normalization + + The percent-encoding mechanism (section 2.1 of [RFC3986]) is a + frequent source of variance among otherwise identical IRIs. In + addition to the case normalization issue noted above, some IRI + producers percent-encode octets that do not require percent-encoding, + resulting in IRIs that are equivalent to their non encoded + counterparts. These IRIs should be normalized by decoding any + percent-encoded octet sequence that corresponds to an unreserved + character, as described in section 2.3 of [RFC3986]. + + For actual resolution, differences in percent-encoding (except for + the percent-encoding of reserved characters) MUST always result in + the same resource. For example, "http://example.org/~user", + "http://example.org/%7euser", and "http://example.org/%7Euser", must + resolve to the same resource. 
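The percent-encoding normalization described above can be sketched in Python as follows. The function name normalize_percent_encoding is hypothetical; the sketch decodes only triplets that stand for unreserved characters and uppercases the hexadecimal digits of every other triplet.

```python
import re
import string

_UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
_TRIPLET = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_percent_encoding(iri: str) -> str:
    """Decode unreserved triplets; uppercase the hex digits of the rest."""

    def fix(match):
        ch = chr(int(match.group(1), 16))
        if ch in _UNRESERVED:
            return ch  # e.g. "%7E" and "%7e" both become "~"
        return "%" + match.group(1).upper()  # e.g. "%3a" becomes "%3A"

    return _TRIPLET.sub(fix, iri)


forms = [
    "http://example.org/~user",
    "http://example.org/%7euser",
    "http://example.org/%7Euser",
]
# All three spellings collapse to a single normalized string.
normalized = {normalize_percent_encoding(f) for f in forms}
```

A normalized copy of this kind is meant for local comparison; the IRI as originally received is what should be stored or passed on.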
+ + If this kind of equivalence is to be tested, the percent-encoding of + both IRIs to be compared has to be aligned; for example, by + converting both IRIs to URIs (see section 3.1), eliminating escape + differences in the resulting URIs, and making sure that the case of + the hexadecimal characters in the percent-encoding is always the same + (preferably uppercase). If the IRI is to be passed to another + + + + + +Duerst & Suignard Standards Track [Page 26] + +RFC 3987 Internationalized Resource Identifiers January 2005 + + + application or used further in some other way, its original form MUST + be preserved. The conversion described here should be performed only + for local comparison. + +5.3.2.4. Path Segment Normalization + + The complete path segments "." and ".." are intended only for use + within relative references (section 4.1 of [RFC3986]) and are removed + as part of the reference resolution process (section 5.2 of + [RFC3986]). However, some implementations may incorrectly assume + that reference resolution is not necessary when the reference is + already an IRI, and thus fail to remove dot-segments when they occur + in non-relative paths. IRI normalizers should remove dot-segments by + applying the remove_dot_segments algorithm to the path, as described + in section 5.2.4 of [RFC3986]. + +5.3.3. Scheme-Based Normalization + + The syntax and semantics of IRIs vary from scheme to scheme, as + described by the defining specification for each scheme. + Implementations may use scheme-specific rules, at further processing + cost, to reduce the probability of false negatives. 
For example, + because the "http" scheme makes use of an authority component, has a + default port of "80", and defines an empty path to be equivalent to + "/", the following four IRIs are equivalent: + + http://example.com + http://example.com/ + http://example.com:/ + http://example.com:80/ + + In general, an IRI that uses the generic syntax for authority with an + empty path should be normalized to a path of "/". Likewise, an + explicit ":port", for which the port is empty or the default for the + scheme, is equivalent to one where the port and its ":" delimiter are + elided and thus should be removed by scheme-based normalization. For + example, the second IRI above is the normal form for the "http" + scheme. + + Another case where normalization varies by scheme is in the handling + of an empty authority component or empty host subcomponent. For many + scheme specifications, an empty authority or host is considered an + error; for others, it is considered equivalent to "localhost" or the + end-user's host. When a scheme defines a default for authority and + an IRI reference to that default is desired, the reference should be + normalized to an empty authority for the sake of uniformity, brevity, + + + + + +Duerst & Suignard Standards Track [Page 27] + +RFC 3987 Internationalized Resource Identifiers January 2005 + + + and internationalization. If, however, either the userinfo or port + subcomponents are non-empty, then the host should be given explicitly + even if it matches the default. + + Normalization should not remove delimiters when their associated + component is empty unless it is licensed to do so by the scheme + specification. For example, the IRI "http://example.com/?" cannot be + assumed to be equivalent to any of the examples above. Likewise, the + presence or absence of delimiters within a userinfo subcomponent is + usually significant to its interpretation. 
The fragment component is + not subject to any scheme-based normalization; thus, two IRIs that + differ only by the suffix "#" are considered different regardless of + the scheme. + + Some IRI schemes may allow the usage of Internationalized Domain + Names (IDN) [RFC3490] either in their ireg-name part or elsewhere. + When in use in IRIs, those names SHOULD be validated by using the + ToASCII operation defined in [RFC3490], with the flags + "UseSTD3ASCIIRules" and "AllowUnassigned". An IRI containing an + invalid IDN cannot successfully be resolved. Validated IDN + components of IRIs SHOULD be character normalized by using the + Nameprep process [RFC3491]; however, for legibility purposes, they + SHOULD NOT be converted into ASCII Compatible Encoding (ACE). + + Scheme-based normalization may also consider IDN components and their + conversions to punycode as equivalent. As an example, + "http://résumé.example.org" may be considered equivalent to + "http://xn--rsum-bpad.example.org". + + Other scheme-specific normalizations are possible. + +5.3.4. Protocol-Based Normalization + + Substantial effort to reduce the incidence of false negatives is + often cost-effective for web spiders. Consequently, they implement + even more aggressive techniques in IRI comparison. For example, if + they observe that an IRI such as + + http://example.com/data + + redirects to an IRI differing only in the trailing slash + + http://example.com/data/ + + they will likely regard the two as equivalent in the future. This + kind of technique is only appropriate when equivalence is clearly + indicated by both the result of accessing the resources and the + + + + +Duerst & Suignard Standards Track [Page 28] + +RFC 3987 Internationalized Resource Identifiers January 2005 + + + common conventions of their scheme's dereference algorithm (in this + case, use of redirection by HTTP origin servers to avoid problems + with relative references). + +6. Use of IRIs + +6.1. 
Limitations on UCS Characters Allowed in IRIs + + This section discusses limitations on characters and character + sequences usable for IRIs beyond those given in section 2.2 and + section 4.1. The considerations in this section are relevant when + IRIs are created and when URIs are converted to IRIs. + + a. The repertoire of characters allowed in each IRI component is + limited by the definition of that component. For example, the + definition of the scheme component does not allow characters + beyond US-ASCII. + + (Note: In accordance with URI practice, generic IRI software + cannot and should not check for such limitations.) + + b. The UCS contains many areas of characters for which there are + strong visual look-alikes. Because of the likelihood of + transcription errors, these also should be avoided. This + includes the full-width equivalents of Latin characters, + half-width Katakana characters for Japanese, and many others. It + also includes many look-alikes of "space", "delims", and + "unwise", characters excluded in [RFC3491]. + + Additional information is available from [UNIXML]. [UNIXML] is + written in the context of running text rather than in that of + identifiers. Nevertheless, it discusses many of the categories of + characters not appropriate for IRIs. + +6.2. Software Interfaces and Protocols + + Although an IRI is defined as a sequence of characters, software + interfaces for URIs typically function on sequences of octets or + other kinds of code units. Thus, software interfaces and protocols + MUST define which character encoding is used. + + Intermediate software interfaces between IRI-capable components and + URI-only components MUST map the IRIs per section 3.1, when + transferring from IRI-capable to URI-only components. This mapping + SHOULD be applied as late as possible. It SHOULD NOT be applied + between components that are known to be able to handle IRIs. 
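As a rough illustration of the section 3.1 mapping that such an intermediate interface would apply, the following Python sketch percent-encodes each non-ASCII character as its UTF-8 octets. The function name iri_to_uri is hypothetical. The sketch assumes the input is already a Unicode string in NFC, and it deliberately skips the optional ToASCII handling of the ireg-name part, so it should not be read as a complete implementation of the mapping.

```python
def iri_to_uri(iri: str) -> str:
    """Hypothetical helper: percent-encode non-ASCII characters as UTF-8 octets.

    Assumption: `iri` is a Unicode string already in NFC. IDN handling of
    the ireg-name part is intentionally left out of this sketch.
    """
    out = []
    for ch in iri:
        if ord(ch) < 0x80:
            # US-ASCII characters (including existing %XX escapes) pass through.
            out.append(ch)
        else:
            # One %XX triplet per UTF-8 octet, with uppercase hex digits.
            out.extend("%{:02X}".format(octet) for octet in ch.encode("utf-8"))
    return "".join(out)


print(iri_to_uri("http://www.example.org/r\u00e9sum\u00e9.html"))
```

Applying the function only at the boundary to a URI-only component, and not between IRI-capable components, matches the "as late as possible" advice above.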
+ + + + + +Duerst & Suignard Standards Track [Page 29] + +RFC 3987 Internationalized Resource Identifiers January 2005 + + +6.3. Format of URIs and IRIs in Documents and Protocols + + Document formats that transport URIs may have to be upgraded to allow + the transport of IRIs. In cases where the document as a whole has a + native character encoding, IRIs MUST also be encoded in this + character encoding and converted accordingly by a parser or + interpreter. IRI characters not expressible in the native character + encoding SHOULD be escaped by using the escaping conventions of the + document format if such conventions are available. Alternatively, + they MAY be percent-encoded according to section 3.1. For example, in + HTML or XML, numeric character references SHOULD be used. If a + document as a whole has a native character encoding and that + character encoding is not UTF-8, then IRIs MUST NOT be placed into + the document in the UTF-8 character encoding. + + Note: Some formats already accommodate IRIs, although they use + different terminology. HTML 4.0 [HTML4] defines the conversion from + IRIs to URIs as error-avoiding behavior. XML 1.0 [XML1], XLink + [XLink], XML Schema [XMLSchema], and specifications based upon them + allow IRIs. Also, it is expected that all relevant new W3C formats + and protocols will be required to handle IRIs [CharMod]. + +6.4. Use of UTF-8 for Encoding Original Characters + + This section discusses details and gives examples for point c) in + section 1.2. To be able to use IRIs, the URI corresponding to the + IRI in question has to encode original characters into octets by + using UTF-8. This can be specified for all URIs of a URI scheme or + can apply to individual URIs for schemes that do not specify how to + encode original characters. It can apply to the whole URI, or only + to some part. For background information on encoding characters into + URIs, see also section 2.5 of [RFC3986]. 
+ + For new URI schemes, using UTF-8 is recommended in [RFC2718]. + Examples where UTF-8 is already used are the URN syntax [RFC2141], + IMAP URLs [RFC2192], and POP URLs [RFC2384]. On the other hand, + because the HTTP URL scheme does not specify how to encode original + characters, only some HTTP URLs can have corresponding but different + IRIs. + + For example, for a document with a URI of + "http://www.example.org/r%C3%A9sum%C3%A9.html", it is possible to + construct a corresponding IRI (in XML notation, see section 1.4): + "http://www.example.org/r&#xE9;sum&#xE9;.html" ("&#xE9;" stands for + the e-acute character, and "%C3%A9" is the UTF-8 encoded and + percent-encoded representation of that character). On the other + hand, for a document with a URI of + "http://www.example.org/r%E9sum%E9.html", the percent-encoding octets + cannot be converted to actual characters in an IRI, as the + percent-encoding is not based on UTF-8. + + This means that for most URI schemes, there is no need to upgrade + their scheme definition in order for them to work with IRIs. The + main case where upgrading makes sense is when a scheme definition, or + a particular component of a scheme, is strictly limited to the use of + US-ASCII characters with no provision to include non-ASCII + characters/octets via percent-encoding, or if a scheme definition + currently uses highly scheme-specific provisions for the encoding of + non-ASCII characters. An example of this is the mailto: scheme + [RFC2368]. + + This specification does not upgrade any scheme specifications in any + way; this has to be done separately. Also, note that there is no + such thing as an "IRI scheme"; all IRIs use URI schemes, and all URI + schemes can be used with IRIs, even though in some cases only by + using URIs directly as IRIs, without any conversion.
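The difference between the two example URIs discussed in this section (one with "%C3%A9", one with "%E9") can be sketched in Python as follows. The helper name uri_to_iri is hypothetical, and the logic is a conservative simplification, not the full conversion procedure: a run of percent-escapes is decoded only if its octets form strictly valid UTF-8 and contain no US-ASCII octets; everything else is left percent-encoded. Further checks a real converter would need (for example, refusing to produce characters that are not allowed in IRIs) are omitted.

```python
import re

# A run of one or more consecutive %XX escapes.
_ESCAPE_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def uri_to_iri(uri: str) -> str:
    """Hypothetical helper: decode %XX runs only when they are valid UTF-8."""

    def decode_run(match):
        run = match.group(0)
        octets = bytes.fromhex(run.replace("%", ""))
        try:
            # Strict decoding rejects invalid and overlong sequences.
            text = octets.decode("utf-8")
        except UnicodeDecodeError:
            return run  # e.g. "%E9" from iso-8859-1: keep the escape
        if any(ord(ch) < 0x80 for ch in text):
            return run  # never unescape US-ASCII (may be reserved characters)
        return text

    return _ESCAPE_RUN.sub(decode_run, uri)


print(uri_to_iri("http://www.example.org/r%C3%A9sum%C3%A9.html"))
print(uri_to_iri("http://www.example.org/r%E9sum%E9.html"))
```

The first call yields an IRI containing the e-acute character directly; the second returns its input unchanged, because the octet 0xE9 on its own is not valid UTF-8.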
+ + URI schemes can impose restrictions on the syntax of scheme-specific + URIs; i.e., URIs that are admissible under the generic URI syntax + [RFC3986] may not be admissible due to narrower syntactic constraints + imposed by a URI scheme specification. URI scheme definitions cannot + broaden the syntactic restrictions of the generic URI syntax; + otherwise, it would be possible to generate URIs that satisfied the + scheme-specific syntactic constraints without satisfying the + syntactic constraints of the generic URI syntax. However, additional + syntactic constraints imposed by URI scheme specifications are + applicable to IRI, as the corresponding URI resulting from the + mapping defined in section 3.1 MUST be a valid URI under the + syntactic restrictions of generic URI syntax and any narrower + restrictions imposed by the corresponding URI scheme specification. + + The requirement for the use of UTF-8 applies to all parts of a URI + (with the potential exception of the ireg-name part; see section + 3.1). However, it is possible that the capability of IRIs to + represent a wide range of characters directly is used just in some + parts of the IRI (or IRI reference). The other parts of the IRI may + only contain US-ASCII characters, or they may not be based on UTF-8. + They may be based on another character encoding, or they may directly + encode raw binary data (see also [RFC2397]). + + For example, it is possible to have a URI reference of + "http://www.example.org/r%E9sum%E9.xml#r%C3%A9sum%C3%A9", where the + document name is encoded in iso-8859-1 based on server settings, but + where the fragment identifier is encoded in UTF-8 according to + [XPointer]. The IRI corresponding to the above URI would be (in XML + notation) + "http://www.example.org/r%E9sum%E9.xml#r&#xE9;sum&#xE9;". + + Similar considerations apply to query parts.
The functionality of + IRIs (namely, to be able to include non-ASCII characters) can only be + used if the query part is encoded in UTF-8. + +6.5. Relative IRI References + + Processing of relative IRI references against a base is handled + straightforwardly; the algorithms of [RFC3986] can be applied + directly, treating the characters additionally allowed in IRI + references in the same way that unreserved characters are in URI + references. + +7. URI/IRI Processing Guidelines (Informative) + + This informative section provides guidelines for supporting IRIs in + the same software components and operations that currently process + URIs: Software interfaces that handle URIs, software that allows + users to enter URIs, software that creates or generates URIs, + software that displays URIs, formats and protocols that transport + URIs, and software that interprets URIs. These may all require + modification before functioning properly with IRIs. The + considerations in this section also apply to URI references and IRI + references. + +7.1. URI/IRI Software Interfaces + + Software interfaces that handle URIs, such as URI-handling APIs and + protocols transferring URIs, need interfaces and protocol elements + that are designed to carry IRIs. + + In case the current handling in an API or protocol is based on + US-ASCII, UTF-8 is recommended as the character encoding for IRIs, as + it is compatible with US-ASCII, is in accordance with the + recommendations of [RFC2277], and makes converting to URIs easy. In + any case, the API or protocol definition must clearly define the + character encoding to be used. + + The transfer from URI-only to IRI-capable components requires no + mapping, although the conversion described in section 3.2 above may + be performed. It is preferable not to perform this inverse + conversion when there is a chance that this cannot be done correctly. 
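The statement in section 6.5 that the [RFC3986] resolution algorithm carries over to relative IRI references can be tried with an existing URI library. The sketch below assumes Python's urllib.parse.urljoin, which works on Unicode strings and leaves non-ASCII path characters untouched; the base and reference values are made up for illustration.

```python
from urllib.parse import urljoin

# Hypothetical base IRI and relative IRI reference.
base = "http://www.example.org/docs/index.html"
reference = "../r\u00e9sum\u00e9.html"

# Dot-segments are resolved exactly as for a URI reference; the
# non-ASCII characters are carried along like unreserved characters.
resolved = urljoin(base, reference)
print(resolved)
```

No conversion to a URI is needed before resolution; the mapping of section 3.1 can still be applied afterwards if a URI-only component has to receive the result.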
+ + + + + + +Duerst & Suignard Standards Track [Page 32] + +RFC 3987 Internationalized Resource Identifiers January 2005 + + +7.2. URI/IRI Entry + + Some components allow users to enter URIs into the system by typing + or dictation, for example. This software must be updated to allow + for IRI entry. + + A person viewing a visual representation of an IRI (as a sequence of + glyphs, in some order, in some visual display) or hearing an IRI will + use an entry method for characters in the user's language to input + the IRI. Depending on the script and the input method used, this may + be a more or less complicated process. + + The process of IRI entry must ensure, as much as possible, that the + restrictions defined in section 2.2 are met. This may be done by + choosing appropriate input methods or variants/settings thereof, by + appropriately converting the characters being input, by eliminating + characters that cannot be converted, and/or by issuing a warning or + error message to the user. + + As an example of variant settings, input method editors for East + Asian Languages usually allow the input of Latin letters and related + characters in full-width or half-width versions. For IRI input, the + input method editor should be set so that it produces half-width + Latin letters and punctuation and full-width Katakana. + + An input field primarily or solely used for the input of URIs/IRIs + may allow the user to view an IRI as it is mapped to a URI. Places + where the input of IRIs is frequent may provide the possibility for + viewing an IRI as mapped to a URI. This will help users when some of + the software they use does not yet accept IRIs. + + An IRI input component interfacing to components that handle URIs, + but not IRIs, must map the IRI to a URI before passing it to these + components. + + For the input of IRIs with right-to-left characters, please see + section 4.3. + +7.3. 
URI/IRI Transfer between Applications + + Many applications, particularly mail user agents, try to detect URIs + appearing in plain text. For this, they use some heuristics based on + URI syntax. They then allow the user to click on such URIs and + retrieve the corresponding resource in an appropriate (usually + scheme-dependent) application. + + + + + + +Duerst & Suignard Standards Track [Page 33] + +RFC 3987 Internationalized Resource Identifiers January 2005 + + + Such applications have to be upgraded to use the IRI syntax as a base + for heuristics. In particular, a non-ASCII character should not be + taken as the indication of the end of an IRI. Such applications also + have to make sure that they correctly convert the detected IRI from + the character encoding of the document or application where the IRI + appears to the character encoding used by the system-wide IRI + invocation mechanism, or to a URI (according to section 3.1) if the + system-wide invocation mechanism only accepts URIs. + + The clipboard is another frequently used way to transfer URIs and + IRIs from one application to another. On most platforms, the + clipboard is able to store and transfer text in many languages and + scripts. Correctly used, the clipboard transfers characters, not + bytes, which will do the right thing with IRIs. + +7.4. URI/IRI Generation + + Systems that offer resources through the Internet, where those + resources have logical names, sometimes automatically generate URIs + for the resources they offer. For example, some HTTP servers can + generate a directory listing for a file directory and then respond to + the generated URIs with the files. + + Many legacy character encodings are in use in various file systems. + Many currently deployed systems do not transform the local character + representation of the underlying system before generating URIs. 
+ + For maximum interoperability, systems that generate resource + identifiers should make the appropriate transformations. For + example, if a file system contains a file named + "résumé.html", a server should expose this as + "r%C3%A9sum%C3%A9.html" in a URI, which allows use of + "résumé.html" in an IRI, even if locally the file name is + kept in a character encoding other than UTF-8. + + This recommendation particularly applies to HTTP servers. For FTP + servers, similar considerations apply; see [RFC2640]. + +7.5. URI/IRI Selection + + In some cases, resource owners and publishers have control over the + IRIs used to identify their resources. This control is mostly + executed by controlling the resource names, such as file names, + directly. + + + + + + + +Duerst & Suignard Standards Track [Page 34] + +RFC 3987 Internationalized Resource Identifiers January 2005 + + + In these cases, it is recommended to avoid choosing IRIs that are + easily confused. For example, for US-ASCII, the lower-case ell ("l") + is easily confused with the digit one ("1"), and the upper-case oh + ("O") is easily confused with the digit zero ("0"). Publishers + should avoid confusing users with "br0ken" or "1ame" identifiers. + + Outside the US-ASCII repertoire, there are many more opportunities + for confusion; a complete set of guidelines is too lengthy to include + here. As long as names are limited to characters from a single + script, native writers of a given script or language will know best + when ambiguities can appear, and how they can be avoided. What may + look ambiguous to a stranger may be completely obvious to the average + native user. On the other hand, in some cases, the UCS contains + variants for compatibility reasons; for example, for typographic + purposes. These should be avoided wherever possible. Although there + may be exceptions, newly created resource names should generally be + in NFKC [UTR15] (which means that they are also in NFC). 
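A publisher who wants to check the NFKC recommendation above before creating a resource name could use something like the following Python sketch. It relies on the standard unicodedata module; the helper name is_nfkc and the sample names are hypothetical.

```python
import unicodedata


def is_nfkc(name: str) -> bool:
    """Hypothetical helper: True if `name` is unchanged by NFKC normalization."""
    return unicodedata.normalize("NFKC", name) == name


# Precomposed e-acute (U+00E9) is already in NFKC.
print(is_nfkc("r\u00e9sum\u00e9.html"))

# "e" followed by a combining acute accent (U+0301) is not.
print(is_nfkc("re\u0301sume\u0301.html"))

# The compatibility ligature U+FB01 is replaced by the letters "f" and "i".
print(unicodedata.normalize("NFKC", "\ufb01le.html"))
```

Because NFKC folds compatibility variants such as ligatures and full-width Latin letters, a name that passes this check avoids one class of the look-alike problems described in this section, though not confusables across different scripts.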
+ + As an example, the UCS contains the "fi" ligature at U+FB01 for + compatibility reasons. Wherever possible, IRIs should use the two + letters "f" and "i" rather than the "fi" ligature. An example where + the latter may be used is in the query part of an IRI for an explicit + search for a word written containing the "fi" ligature. + + In certain cases, there is a chance that characters from different + scripts look the same. The best known example is the similarity of + the Latin "A", the Greek "Alpha", and the Cyrillic "A". To avoid + such cases, only IRIs should be created where all the characters in a + single component are used together in a given language. This usually + means that all of these characters will be from the same script, but + there are languages that mix characters from different scripts (such + as Japanese). This is similar to the heuristics used to distinguish + between letters and numbers in the examples above. Also, for Latin, + Greek, and Cyrillic, using lowercase letters results in fewer + ambiguities than using uppercase letters would. + +7.6. Display of URIs/IRIs + + In situations where the rendering software is not expected to display + non-ASCII parts of the IRI correctly using the available layout and + font resources, these parts should be percent-encoded before being + displayed. + + For display of Bidi IRIs, please see section 4.1. + + + + + + + +Duerst & Suignard Standards Track [Page 35] + +RFC 3987 Internationalized Resource Identifiers January 2005 + + +7.7. Interpretation of URIs and IRIs + + Software that interprets IRIs as the names of local resources should + accept IRIs in multiple forms and convert and match them with the + appropriate local resource names. + + First, multiple representations include both IRIs in the native + character encoding of the protocol and also their URI counterparts. + + Second, it may include URIs constructed based on character encodings + other than UTF-8. 
These URIs may be produced by user agents that do + not conform to this specification and that use legacy character + encodings to convert non-ASCII characters to URIs. Whether this is + necessary, and what character encodings to cover, depends on a number + of factors, such as the legacy character encodings used locally and + the distribution of various versions of user agents. For example, + software for Japanese may accept URIs in Shift_JIS and/or EUC-JP in + addition to UTF-8. + + Third, it may include additional mappings to be more user-friendly + and robust against transmission errors. These would be similar to + how some servers currently treat URIs as case insensitive or perform + additional matching to account for spelling errors. For characters + beyond the US-ASCII repertoire, this may, for example, include + ignoring the accents on received IRIs or resource names. Please note + that such mappings, including case mappings, are language dependent. + + It can be difficult to identify a resource unambiguously if too many + mappings are taken into consideration. However, percent-encoded and + not percent-encoded parts of IRIs can always be clearly + distinguished. Also, the regularity of UTF-8 (see [Duerst97]) makes + the potential for collisions lower than it may seem at first. + +7.8. Upgrading Strategy + + Where this recommendation places further constraints on software for + which many instances are already deployed, it is important to + introduce upgrades carefully and to be aware of the various + interdependencies. + + If IRIs cannot be interpreted correctly, they should not be created, + generated, or transported. This suggests that upgrading URI + interpreting software to accept IRIs should have highest priority. + + On the other hand, a single IRI is interpreted only by a single or + very few interpreters that are known in advance, although it may be + entered and transported very widely. 
+ + + + +Duerst & Suignard Standards Track [Page 36] + +RFC 3987 Internationalized Resource Identifiers January 2005 + + + Therefore, IRIs benefit most from a broad upgrade of software to be + able to enter and transport IRIs. However, before an individual IRI + is published, care should be taken to upgrade the corresponding + interpreting software in order to cover the forms expected to be + received by various versions of entry and transport software. + + The upgrade of generating software to generate IRIs instead of using + a local character encoding should happen only after the service is + upgraded to accept IRIs. Similarly, IRIs should only be generated + when the service accepts IRIs and the intervening infrastructure and + protocol is known to transport them safely. + + Software converting from URIs to IRIs for display should be upgraded + only after upgraded entry software has been widely deployed to the + population that will see the displayed result. + + Where there is a free choice of character encodings, it is often + possible to reduce the effort and dependencies for upgrading to IRIs + by using UTF-8 rather than another encoding. For example, when a new + file-based Web server is set up, using UTF-8 as the character + encoding for file names will make the transition to IRIs easier. + Likewise, when a new Web form is set up using UTF-8 as the character + encoding of the form page, the returned query URIs will use UTF-8 as + the character encoding (unless the user, for whatever reason, changes + the character encoding) and will therefore be compatible with IRIs. + + These recommendations, when taken together, will allow for the + extension from URIs to IRIs in order to handle characters other than + US-ASCII while minimizing interoperability problems. For + considerations regarding the upgrade of URI scheme definitions, see + section 6.4. + +8. Security Considerations + + The security considerations discussed in [RFC3986] also apply to + IRIs. 
In addition, the following issues require particular care for + IRIs. + + Incorrect encoding or decoding can lead to security problems. In + particular, some UTF-8 decoders do not check against overlong byte + sequences. As an example, a "/" is encoded with the byte 0x2F both + in UTF-8 and in US-ASCII, but some UTF-8 decoders also wrongly + interpret the sequence 0xC0 0xAF as a "/". A sequence such as + + + + + + + + +Duerst & Suignard Standards Track [Page 37] + +RFC 3987 Internationalized Resource Identifiers January 2005 + + + "%C0%AF.." may pass some security tests and then be interpreted as + "/.." in a path if UTF-8 decoders are fault-tolerant, if conversion + and checking are not done in the right order, and/or if reserved + characters and unreserved characters are not clearly distinguished. + + There are various ways in which "spoofing" can occur with IRIs. + "Spoofing" means that somebody may add a resource name that looks the + same or similar to the user, but that points to a different resource. + The added resource may pretend to be the real resource by looking + very similar but may contain all kinds of changes that may be + difficult to spot and that can cause all kinds of problems. Most + spoofing possibilities for IRIs are extensions of those for URIs. + + Spoofing can occur for various reasons. First, a user's + normalization expectations or actual normalization when entering an + IRI or transcoding an IRI from a legacy character encoding do not + match the normalization used on the server side. Conceptually, this + is no different from the problems surrounding the use of + case-insensitive web servers. For example, a popular web page with a + mixed-case name ("http://big.example.com/PopularPage.html") might be + "spoofed" by someone who is able to create + "http://big.example.com/popularpage.html". However, the use of + unnormalized character sequences, and of additional mappings for user + convenience, may increase the chance for spoofing. 
Protocols and + servers that allow the creation of resources with names that are not + normalized are particularly vulnerable to such attacks. This is an + inherent security problem of the relevant protocol, server, or + resource and is not specific to IRIs, but it is mentioned here for + completeness. + + Spoofing can occur in various IRI components, such as the domain name + part or a path part. For considerations specific to the domain name + part, see [RFC3491]. For the path part, administrators of sites that + allow independent users to create resources in the same sub area may + have to be careful to check for spoofing. + + Spoofing can occur because in the UCS many characters look very + similar. Details are discussed in Section 7.5. Again, this is very + similar to spoofing possibilities on US-ASCII, e.g., using "br0ken" + or "1ame" URIs. + + Spoofing can occur when URIs with percent-encodings based on various + character encodings are accepted to deal with older user agents. In + some cases, particularly for Latin-based resource names, this is + usually easy to detect because UTF-8-encoded names, when interpreted + and viewed as legacy character encodings, produce mostly garbage. + + + + + +Duerst & Suignard Standards Track [Page 38] + +RFC 3987 Internationalized Resource Identifiers January 2005 + + + When concurrently used character encodings have a similar structure + but there are no characters that have exactly the same encoding, + detection is more difficult. + + Spoofing can occur with bidirectional IRIs, if the restrictions in + section 4.2 are not followed. The same visual representation may be + interpreted as different logical representations, and vice versa. It + is also very important that a correct Unicode bidirectional + implementation be used. + +9. Acknowledgements + + We would like to thank Larry Masinter for his work as coauthor of + many earlier versions of this document (draft-masinter-url-i18n-xx). 
+ + The discussion on the issue addressed here started a long time ago. + There was a thread in the HTML working group in August 1995 (under + the topic of "Globalizing URIs") and in the www-international mailing + list in July 1996 (under the topic of "Internationalization and + URLs"), and there were ad-hoc meetings at the Unicode conferences in + September 1995 and September 1997. + + Many thanks go to Francois Yergeau, Matitiahu Allouche, Roy Fielding, + Tim Berners-Lee, Mark Davis, M.T. Carrasco Benitez, James Clark, Tim + Bray, Chris Wendt, Yaron Goland, Andrea Vine, Misha Wolf, Leslie + Daigle, Ted Hardie, Bill Fenner, Margaret Wasserman, Russ Housley, + Makoto MURATA, Steven Atkin, Ryan Stansifer, Tex Texin, Graham Klyne, + Bjoern Hoehrmann, Chris Lilley, Ian Jacobs, Adam Costello, Dan + Oscarson, Elliotte Rusty Harold, Mike J. Brown, Roy Badami, Jonathan + Rosenne, Asmus Freytag, Simon Josefsson, Carlos Viegas Damasio, Chris + Haynes, Walter Underwood, and many others for help with understanding + the issues and possible solutions, and with getting the details + right. + + This document is a product of the Internationalization Working Group + (I18N WG) of the World Wide Web Consortium (W3C). Thanks to the + members of the W3C I18N Working Group and Interest Group for their + contributions and their work on [CharMod]. Thanks also go to the + members of many other W3C Working Groups for adopting IRIs, and to + the members of the Montreal IAB Workshop on Internationalization and + Localization for their review. + + + + + + + + + + +Duerst & Suignard Standards Track [Page 39] + +RFC 3987 Internationalized Resource Identifiers January 2005 + + +10. References + +10.1. Normative References + + [ASCII] American National Standards Institute, "Coded + Character Set -- 7-bit American Standard Code for + Information Interchange", ANSI X3.4, 1986. 
+ + [ISO10646] International Organization for Standardization, + "ISO/IEC 10646:2003: Information Technology - + Universal Multiple-Octet Coded Character Set (UCS)", + ISO Standard 10646, December 2003. + + [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate + Requirement Levels", BCP 14, RFC 2119, March 1997. + + [RFC2234] Crocker, D. and P. Overell, "Augmented BNF for Syntax + Specifications: ABNF", RFC 2234, November 1997. + + [RFC3490] Faltstrom, P., Hoffman, P., and A. Costello, + "Internationalizing Domain Names in Applications + (IDNA)", RFC 3490, March 2003. + + [RFC3491] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep + Profile for Internationalized Domain Names (IDN)", RFC + 3491, March 2003. + + [RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO + 10646", STD 63, RFC 3629, November 2003. + + [RFC3986] Berners-Lee, T., Fielding, R., and L. Masinter, + "Uniform Resource Identifier (URI): Generic Syntax", + STD 66, RFC 3986, January 2005. + + [UNI9] Davis, M., "The Bidirectional Algorithm", Unicode + Standard Annex #9, March 2004, + <http://www.unicode.org/reports/tr9/tr9-13.html>. + + [UNIV4] The Unicode Consortium, "The Unicode Standard, Version + 4.0.1, defined by: The Unicode Standard, Version 4.0 + (Reading, MA, Addison-Wesley, 2003. ISBN + 0-321-18578-1), as amended by Unicode 4.0.1 + (http://www.unicode.org/versions/Unicode4.0.1/)", + March 2004. + + + + + + + +Duerst & Suignard Standards Track [Page 40] + +RFC 3987 Internationalized Resource Identifiers January 2005 + + + [UTR15] Davis, M. and M. Duerst, "Unicode Normalization + Forms", Unicode Standard Annex #15, April 2003, + <http://www.unicode.org/unicode/reports/ + tr15/tr15-23.html>. + +10.2. Informative References + + [BidiEx] "Examples of bidirectional IRIs", + <http://www.w3.org/International/iri-edit/ + BidiExamples>. + + [CharMod] Duerst, M., Yergeau, F., Ishida, R., Wolf, M., and T. 
+ Texin, "Character Model for the World Wide Web: + Resource Identifiers", World Wide Web Consortium + Candidate Recommendation, November 2004, + <http://www.w3.org/TR/charmod-resid>. + + [Duerst97] Duerst, M., "The Properties and Promises of UTF-8", + Proc. 11th International Unicode Conference, San Jose + , September 1997, + <http://www.ifi.unizh.ch/mml/mduerst/papers/ + PDF/IUC11-UTF-8.pdf>. + + [Gettys] Gettys, J., "URI Model Consequences", + <http://www.w3.org/DesignIssues/ModelConsequences>. + + [HTML4] Raggett, D., Le Hors, A., and I. Jacobs, "HTML 4.01 + Specification", World Wide Web Consortium + Recommendation, December 1999, + <http://www.w3.org/TR/html401/appendix/ + notes.html#h-B.2>. + + [RFC2045] Freed, N. and N. Borenstein, "Multipurpose Internet + Mail Extensions (MIME) Part One: Format of Internet + Message Bodies", RFC 2045, November 1996. + + [RFC2130] Weider, C., Preston, C., Simonsen, K., Alvestrand, H., + Atkinson, R., Crispin, M., and P. Svanberg, "The + Report of the IAB Character Set Workshop held 29 + February - 1 March, 1996", RFC 2130, April 1997. + + [RFC2141] Moats, R., "URN Syntax", RFC 2141, May 1997. + + [RFC2192] Newman, C., "IMAP URL Scheme", RFC 2192, September + 1997. + + [RFC2277] Alvestrand, H., "IETF Policy on Character Sets and + Languages", BCP 18, RFC 2277, January 1998. + + + +Duerst & Suignard Standards Track [Page 41] + +RFC 3987 Internationalized Resource Identifiers January 2005 + + + [RFC2368] Hoffman, P., Masinter, L., and J. Zawinski, "The + mailto URL scheme", RFC 2368, July 1998. + + [RFC2384] Gellens, R., "POP URL Scheme", RFC 2384, August 1998. + + [RFC2396] Berners-Lee, T., Fielding, R., and L. Masinter, + "Uniform Resource Identifiers (URI): Generic Syntax", + RFC 2396, August 1998. + + [RFC2397] Masinter, L., "The "data" URL scheme", RFC 2397, + August 1998. + + [RFC2616] Fielding, R., Gettys, J., Mogul, J., Frystyk, H., + Masinter, L., Leach, P., and T. 
Berners-Lee, + "Hypertext Transfer Protocol -- HTTP/1.1", RFC 2616, + June 1999. + + [RFC2640] Curtin, B., "Internationalization of the File Transfer + Protocol", RFC 2640, July 1999. + + [RFC2718] Masinter, L., Alvestrand, H., Zigmond, D., and R. + Petke, "Guidelines for new URL Schemes", RFC 2718, + November 1999. + + [UNIXML] Duerst, M. and A. Freytag, "Unicode in XML and other + Markup Languages", Unicode Technical Report #20, World + Wide Web Consortium Note, June 2003, + <http://www.w3.org/TR/unicode-xml/>. + + [XLink] DeRose, S., Maler, E., and D. Orchard, "XML Linking + Language (XLink) Version 1.0", World Wide Web + Consortium Recommendation, June 2001, + <http://www.w3.org/TR/xlink/#link-locators>. + + [XML1] Bray, T., Paoli, J., Sperberg-McQueen, C., Maler, E., + and F. Yergeau, "Extensible Markup Language (XML) 1.0 + (Third Edition)", World Wide Web Consortium + Recommendation, February 2004, + <http://www.w3.org/TR/REC-xml#sec-external-ent>. + + [XMLNamespace] Bray, T., Hollander, D., and A. Layman, "Namespaces in + XML", World Wide Web Consortium Recommendation, + January 1999, <http://www.w3.org/TR/REC-xml-names>. + + [XMLSchema] Biron, P. and A. Malhotra, "XML Schema Part 2: + Datatypes", World Wide Web Consortium Recommendation, + May 2001, <http://www.w3.org/TR/xmlschema-2/#anyURI>. + + + + +Duerst & Suignard Standards Track [Page 42] + +RFC 3987 Internationalized Resource Identifiers January 2005 + + + [XPointer] Grosso, P., Maler, E., Marsh, J. and N. Walsh, + "XPointer Framework", World Wide Web Consortium + Recommendation, March 2003, + <http://www.w3.org/TR/xptr-framework/#escaping>. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +Duerst & Suignard Standards Track [Page 43] + +RFC 3987 Internationalized Resource Identifiers January 2005 + + +Appendix A. Design Alternatives + + This section shortly summarizes major design alternatives and the + reasons for why they were not chosen. 
+ +Appendix A.1. New Scheme(s) + + Introducing new schemes (for example, httpi:, ftpi:,...) or a new + metascheme (e.g., i:, leading to URI/IRI prefixes such as i:http:, + i:ftp:,...) was proposed to make IRI-to-URI conversion scheme + dependent or to distinguish between percent-encodings resulting from + IRI-to-URI conversion and percent-encodings from legacy character + encodings. + + New schemes are not needed to distinguish URIs from true IRIs (i.e., + IRIs that contain non-ASCII characters). The benefit of being able + to detect the origin of percent-encodings is marginal, as UTF-8 can + be detected with very high reliability. Deploying new schemes is + extremely hard, so not requiring new schemes for IRIs makes + deployment of IRIs vastly easier. Making conversion scheme dependent + is highly inadvisable and would be encouraged by separate schemes for + IRIs. Using a uniform convention for conversion from IRIs to URIs + makes IRI implementation orthogonal to the introduction of actual new + schemes. + +Appendix A.2. Character Encodings Other Than UTF-8 + + At an early stage, UTF-7 was considered as an alternative to UTF-8 + when IRIs are converted to URIs. UTF-7 would not have needed + percent-encoding and in most cases would have been shorter than + percent-encoded UTF-8. + + Using UTF-8 avoids a double layering and overloading of the use of + the "+" character. UTF-8 is fully compatible with US-ASCII and has + therefore been recommended by the IETF, and is being used widely. + + UTF-7 has never been used much and is now clearly being discouraged. + Requiring implementations to convert from UTF-8 to UTF-7 and back + would be an additional implementation burden. + +Appendix A.3. New Encoding Convention + + Instead of using the existing percent-encoding convention of URIs, + which is based on octets, the idea was to create a new encoding + convention; for example, to use "%u" to introduce UCS code points. 
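As a rough illustration of the octet-based convention discussed in Appendices A.1 to A.3, the following Python sketch percent-encodes the UTF-8 octets of non-ASCII characters and checks whether a percent-encoded string decodes as strict UTF-8. The function names `percent_encode_utf8` and `looks_like_utf8` are hypothetical and are not taken from the RFC; the sketch also ignores reserved characters and every other step of a full IRI-to-URI conversion.

```python
from urllib.parse import unquote_to_bytes


def percent_encode_utf8(text: str) -> str:
    """Percent-encode the UTF-8 octets of every non-ASCII character.

    Hypothetical helper: ASCII characters pass through unchanged, which is
    a simplification of real IRI-to-URI conversion.
    """
    pieces = []
    for ch in text:
        if ord(ch) < 0x80:
            pieces.append(ch)
        else:
            # One "%XX" triplet per UTF-8 octet of the character.
            pieces.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(pieces)


def looks_like_utf8(percent_encoded: str) -> bool:
    """Return True if the percent-decoded octets form valid UTF-8."""
    try:
        unquote_to_bytes(percent_encoded).decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    print(percent_encode_utf8("Dürst"))   # D%C3%BCrst
    print(looks_like_utf8("D%C3%BCrst"))  # True
    print(looks_like_utf8("ros%E9"))      # False: a lone iso-8859-1 octet
```

The last call hints at why detecting the origin of percent-encodings is considered a marginal benefit: a legacy iso-8859-1 octet such as `%E9` is not valid UTF-8 on its own, so a strict decode usually tells the two apart.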
+ + + + + + +Duerst & Suignard Standards Track [Page 44] + +RFC 3987 Internationalized Resource Identifiers January 2005 + + + Using the existing octet-based percent-encoding mechanism does not + need an upgrade of the URI syntax and does not need corresponding + server upgrades. + +Appendix A.4. Indicating Character Encodings in the URI/IRI + + Some proposals suggested indicating the character encodings used in + an URI or IRI with some new syntactic convention in the URI itself, + similar to the "charset" parameter for e-mails and Web pages. As an + example, the label in square brackets in + "http://www.example.org/ros[iso-8859-1]é" indicated that the + following "é" had to be interpreted as iso-8859-1. + + If UTF-8 is used exclusively, an upgrade to the URI syntax is not + needed. It avoids potentially multiple labels that have to be copied + correctly in all cases, even on the side of a bus or on a napkin, + leading to usability problems (and being prohibitively annoying). + Exclusively using UTF-8 also reduces transcoding errors and + confusion. + +Authors' Addresses + + Martin Duerst (Note: Please write "Duerst" with u-umlaut wherever + possible, for example as "Dürst" in XML and + HTML.) + World Wide Web Consortium + 5322 Endo + Fujisawa, Kanagawa 252-8520 + Japan + + Phone: +81 466 49 1170 + Fax: +81 466 49 1171 + EMail: duerst@w3.org + URI: http://www.w3.org/People/D%C3%BCrst/ + (Note: This is the percent-encoded form of an IRI.) + + + Michel Suignard + Microsoft Corporation + One Microsoft Way + Redmond, WA 98052 + U.S.A. + + Phone: +1 425 882-8080 + EMail: michelsu@microsoft.com + URI: http://www.suignard.com + + + + + +Duerst & Suignard Standards Track [Page 45] + +RFC 3987 Internationalized Resource Identifiers January 2005 + + +Full Copyright Statement + + Copyright (C) The Internet Society (2005). 
+ + This document is subject to the rights, licenses and restrictions + contained in BCP 78, and except as set forth therein, the authors + retain all their rights. + + This document and the information contained herein are provided on an + "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS + OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET + ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, + INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE + INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED + WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. + +Intellectual Property + + The IETF takes no position regarding the validity or scope of any + Intellectual Property Rights or other rights that might be claimed to + pertain to the implementation or use of the technology described in + this document or the extent to which any license under such rights + might or might not be available; nor does it represent that it has + made any independent effort to identify any such rights. Information + on the IETF's procedures with respect to rights in IETF Documents can + be found in BCP 78 and BCP 79. + + Copies of IPR disclosures made to the IETF Secretariat and any + assurances of licenses to be made available, or the result of an + attempt made to obtain a general license or permission for the use of + such proprietary rights by implementers or users of this + specification can be obtained from the IETF on-line IPR repository at + http://www.ietf.org/ipr. + + The IETF invites any interested party to bring to its attention any + copyrights, patents or patent applications, or other proprietary + rights that may cover technology that may be required to implement + this standard. Please address the information to the IETF at ietf- + ipr@ietf.org. + + +Acknowledgement + + Funding for the RFC Editor function is currently provided by the + Internet Society. 
+ + + + + + +Duerst & Suignard Standards Track [Page 46] + |