author Alexander Larsson <alexl@src.gnome.org> 2009-03-16 11:43:23 +0000
committer Alexander Larsson <alexl@src.gnome.org> 2009-03-16 11:43:23 +0000
commit 4ad537c5c3e17e1efe289020d7dc6cd0efae42c5
tree 891f2ec720f5ae321762965a00d352ad0a1592a2 /trunk/txt
parent 4c59b80ab2b0e942bd45ff12f238038293d21821
Tagged for release 1.2.0 (tag: GVFS_1_2_0)
svn path=/tags/GVFS_1_2_0/; revision=2331
Diffstat (limited to 'trunk/txt')
-rw-r--r--  trunk/txt/gvfs_dbus.txt    65
-rw-r--r--  trunk/txt/ops.txt         140
-rw-r--r--  trunk/txt/rfc3986.txt    3419
-rw-r--r--  trunk/txt/rfc3987.txt    2579
-rw-r--r--  trunk/txt/vfs-ideas.txt   425
-rw-r--r--  trunk/txt/vfs-names.txt   142
6 files changed, 6770 insertions, 0 deletions
diff --git a/trunk/txt/gvfs_dbus.txt b/trunk/txt/gvfs_dbus.txt
new file mode 100644
index 00000000..86ac19ac
--- /dev/null
+++ b/trunk/txt/gvfs_dbus.txt
@@ -0,0 +1,65 @@
+how to chain to simple stuff
+
+how to parse uris (i.e. map to mounts)
+
+what connections do we have:
+shared dbus connection
+connection to main daemon
+connection to each mount daemon
+
+"fast ops" (uri->gfile) vs blocking ops (read, open etc) and how to avoid slow ops blocking fast ones
+
+
+each thread has, on demand:
+connection to main daemon
+connection to some mount daemons
+
+global state:
+cache of previously used mountpoints
+
+
+how to mount
+
+how to store/restore permanent mounts with the session => store as drives (mountpoints), not volumes!
+
+Don't always want to log in to all mounts on login? (mountpoints!)
+
+computer:// handled in main daemon?
+
+No volume monitor in public API, only computer:// ?
+Problems:
+* mounted (desktop/computer:, trash dir)
+* unmounted/pre_unmount (desktop/computer:, close windows on unmounted volumes, trash dir)
+* map path to volume (close windows on unmounted volumes, check for readonly mount, get volume name)
+* get all drives/volumes (detecting where to show eject, mount, unmount menu items,
+ tree view, places sidebar, display volume icon in pathbar)
+* eject/unmount ops
+* needs eject
+
+unmounted URI => return a mountpoint object?
+
+GMountOperation, async mount operation object
+signals => passwd, question, keyring?
+
+GFile mountpoint => GMountOperation
+
+What process calls gnome-keyring?
+
+
+
+--------------------
+
+GFile creation => decompose URI, no i/o
+
+on i/o:
+ * figure out mountpoint (for now, always toplevel uri location)
+ * if we have a local dbus connection to that, use it, otherwise:
+ + create (if needed) local session dbus connection
+ + ask for mount daemon for new session
+ - If not existing, error on i/o, return mountpoint type on get_info
+ + set up new local connection with the mount daemon
+ * send dbus message
+ * receive answer; if it has the magic flag, it is followed by an fd passed via sendmsg() (created by socketpair())
+
+
+
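The last step of the "on i/o" list in gvfs_dbus.txt (an answer that, when flagged, is followed by a file descriptor passed with sendmsg() over a socketpair()) can be sketched roughly as below. This is only an illustration of the general Unix fd-passing technique (SCM_RIGHTS ancillary data), not the gvfs implementation. The function names, the one-byte b"F" "magic flag" payload, and the pipe standing in for an opened file are all assumptions made for the example.

```python
import array
import os
import socket


def send_fd(sock, fd, payload=b"F"):
    """Send a small payload plus one fd as SCM_RIGHTS ancillary data.

    The b"F" payload is a hypothetical stand-in for the "magic flag".
    """
    ancillary = [(socket.SOL_SOCKET, socket.SCM_RIGHTS, array.array("i", [fd]))]
    sock.sendmsg([payload], ancillary)


def recv_fd(sock):
    """Receive the payload and the single fd that accompanies it."""
    fds = array.array("i")
    msg, ancdata, _flags, _addr = sock.recvmsg(1024, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            # Ignore any trailing partial integer in the control data.
            usable = len(data) - (len(data) % fds.itemsize)
            fds.frombytes(data[:usable])
    if not fds:
        raise RuntimeError("no file descriptor received")
    return msg, fds[0]


def demo():
    """Pass the read end of a pipe from a 'daemon' socket to a 'client' socket."""
    daemon_end, client_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    read_end, write_end = os.pipe()  # stands in for a file opened by the daemon
    try:
        send_fd(daemon_end, read_end)
        os.close(read_end)           # the daemon's copy is no longer needed
        msg, fd = recv_fd(client_end)
        os.write(write_end, b"file data")
        os.close(write_end)
        data = os.read(fd, 64)       # the client reads through the received fd
        os.close(fd)
        return msg, data
    finally:
        daemon_end.close()
        client_end.close()
```

Calling demo() returns the flag payload together with the bytes read through the transferred descriptor. The point of the design is that bulk data then flows over the passed fd rather than through D-Bus messages.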
diff --git a/trunk/txt/ops.txt b/trunk/txt/ops.txt
new file mode 100644
index 00000000..c1f04c34
--- /dev/null
+++ b/trunk/txt/ops.txt
@@ -0,0 +1,140 @@
+type: File, Folder, Symlink, Shortcut, Mountable, special (fifo, socket, chardev, blockdev)
+flags: hidden,
+GFileInfo {
+ type get_type()
+ char *get_name()
+ char *get_display_name()
+ char *get_icon() /* string? what about win32, remote icons etc */
+
+ gint64 get_file_size()
+ char *get_mime_type()
+ char *get_link_target()
+ can_read()/write()/delete()/rename()/maybe: move()/copy()
+ flags get_flags()
+ time_t get_modification_time()
+ gboolean get_unix_stat ()
+ char *get_attribute()
+ char **get_attributes(char *namespace)
+ char **get_all_attributes() /* form namespace:attrname -> string */
+}
+
+GFSInfo {
+ char *get_fs_type()
+ gint64 get_free_space()
+ gint64 get_total_space()
+ char * get_hal_uid()
+ can_unmount()
+ can_eject()
+ must_eject()
+
+}
+
+GFile *g_file_for_path (char *path)
+GFile *g_file_for_uri (char *uri)
+GFile *g_file_parse_display_name (char *display_name)
+
+GFile {
+ char *get_path()
+ is_native() => is_file:///
+
+ char *get_uri ();
+ char *get_absolute_display_name ()
+
+ set_keep_open(boolean keep_open)
+
+ GFile *get_parent ()
+ GFile *get_child (char *name)
+ GFileEnumerator *enumerate_children(flags, attributes... "*", "vfs:*;dav:*;foo:bar")
+ GFileInfo *get_info (flags, attributes...)
+
+ void reload()
+ GInputStream *read()
+
+ GOutputStream *append_to() /* optional (not on webdav) */
+ GOutputStream *create()
+ GSaveStream *replace(mtime, backup_name, )
+/* permissions are all set minus umask, except replace which
+ saves old permissions */
+
+/* ?? */
+ GFile *resolve_symlink(char *symlink_target);
+
+/* output ops */
+ write/save
+ rename
+ move
+ copy
+ delete
+ mkdir
+ rmdir
+ display name -> filename (for new files)
+ set attrs
+
+ /* other ops: */
+ monitor(flags) + signals
+ mount/unmount
+ list volumes
+
+ Maybe:
+ GFile *new_from_uri(path, flags) (file:/// uris)
+
+}
+
+
+names:
+ URIs == raw filename (no encoding), all escaped
+ We generate display absolute paths as filenames if possible, otherwise
+ as IRIs. This means we can display nice URIs for native utf8 backends
+ and filenames. However, URIs for non-utf8 shares will look bad. If we know
+ the encoding we can still get nice non-absolute display names though.
+
+ In client we store names as mountpoint + non-escaped no-encoding string.
+ Non-uri display name handling done in daemon
+
+GStatable iface for fstat() support
+GSaveStream, with get_final_file_info()
+
+open for writing:
+
+append vs truncate
+fail on existing or replace
+
+mtime match
+mtime return
+backup (suffix+prefix)
+create filename from display name
+unique name
+keep inode or be atomic?
+
+filename_for_display_name()
+write_append() /* optional (not on webdav) */
+write_new()
+write_replace()
+
+
+
+
+ftp supports:
+ overwrite
+ append
+ generate unique name
+
+http+webdav supports:
+ overwrite
+ append in recent versions
+ get mtime, length, mimetype, atime on read open
+
+
+
+
+
+async thread work:
+
+function to run in thread
+data to pass to thread
+cancel identifier
+pass in cancel func + data
+way for function to communicate with mainloop (of specific context)
+do mainloop notifiers block on ack?
+
+
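The "async thread work" checklist at the end of ops.txt (a function to run in a thread, data to pass to it, a cancel identifier, and a way for the function to talk to the main loop) could look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the design gvfs adopted: CancelToken, run_in_thread, count_bytes, and the use of a queue as the main-loop channel are all hypothetical names and choices. On the open question of whether notifiers block on ack, this sketch takes the non-blocking route: the worker posts its result and does not wait for the main loop to acknowledge it.

```python
import queue
import threading


class CancelToken:
    """Hypothetical 'cancel identifier' shared by caller and worker."""

    def __init__(self):
        self._event = threading.Event()

    def cancel(self):
        self._event.set()

    def is_cancelled(self):
        return self._event.is_set()


def run_in_thread(func, data, token, notify_queue):
    """Run func(data, token) in a worker thread.

    The worker posts one (status, value) tuple to notify_queue, which the
    main loop is assumed to drain; the worker never waits for an ack.
    """

    def worker():
        try:
            result = func(data, token)
            status = "cancelled" if token.is_cancelled() else "done"
            notify_queue.put((status, result))
        except Exception as exc:  # report failures instead of losing them
            notify_queue.put(("error", exc))

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread


def count_bytes(chunks, token):
    """Example job: sum chunk lengths, checking for cancellation per chunk."""
    total = 0
    for chunk in chunks:
        if token.is_cancelled():
            break
        total += len(chunk)
    return total
```

A caller would create one queue per main-loop context, start jobs with run_in_thread, and poll or block on the queue from the main loop. Calling cancel() on the token asks the job to stop at its next check.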
diff --git a/trunk/txt/rfc3986.txt b/trunk/txt/rfc3986.txt
new file mode 100644
index 00000000..c56ed4eb
--- /dev/null
+++ b/trunk/txt/rfc3986.txt
@@ -0,0 +1,3419 @@
+
+
+
+
+
+
+Network Working Group T. Berners-Lee
+Request for Comments: 3986 W3C/MIT
+STD: 66 R. Fielding
+Updates: 1738 Day Software
+Obsoletes: 2732, 2396, 1808 L. Masinter
+Category: Standards Track Adobe Systems
+ January 2005
+
+
+ Uniform Resource Identifier (URI): Generic Syntax
+
+Status of This Memo
+
+ This document specifies an Internet standards track protocol for the
+ Internet community, and requests discussion and suggestions for
+ improvements. Please refer to the current edition of the "Internet
+ Official Protocol Standards" (STD 1) for the standardization state
+ and status of this protocol. Distribution of this memo is unlimited.
+
+Copyright Notice
+
+ Copyright (C) The Internet Society (2005).
+
+Abstract
+
+ A Uniform Resource Identifier (URI) is a compact sequence of
+ characters that identifies an abstract or physical resource. This
+ specification defines the generic URI syntax and a process for
+ resolving URI references that might be in relative form, along with
+ guidelines and security considerations for the use of URIs on the
+ Internet. The URI syntax defines a grammar that is a superset of all
+ valid URIs, allowing an implementation to parse the common components
+ of a URI reference without knowing the scheme-specific requirements
+ of every possible identifier. This specification does not define a
+ generative grammar for URIs; that task is performed by the individual
+ specifications of each URI scheme.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Berners-Lee, et al. Standards Track [Page 1]
+
+RFC 3986 URI Generic Syntax January 2005
+
+
+Table of Contents
+
+ 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4
+ 1.1. Overview of URIs . . . . . . . . . . . . . . . . . . . . 4
+ 1.1.1. Generic Syntax . . . . . . . . . . . . . . . . . 6
+ 1.1.2. Examples . . . . . . . . . . . . . . . . . . . . 7
+ 1.1.3. URI, URL, and URN . . . . . . . . . . . . . . . 7
+ 1.2. Design Considerations . . . . . . . . . . . . . . . . . 8
+ 1.2.1. Transcription . . . . . . . . . . . . . . . . . 8
+ 1.2.2. Separating Identification from Interaction . . . 9
+ 1.2.3. Hierarchical Identifiers . . . . . . . . . . . . 10
+ 1.3. Syntax Notation . . . . . . . . . . . . . . . . . . . . 11
+ 2. Characters . . . . . . . . . . . . . . . . . . . . . . . . . . 11
+ 2.1. Percent-Encoding . . . . . . . . . . . . . . . . . . . . 12
+ 2.2. Reserved Characters . . . . . . . . . . . . . . . . . . 12
+ 2.3. Unreserved Characters . . . . . . . . . . . . . . . . . 13
+ 2.4. When to Encode or Decode . . . . . . . . . . . . . . . . 14
+ 2.5. Identifying Data . . . . . . . . . . . . . . . . . . . . 14
+ 3. Syntax Components . . . . . . . . . . . . . . . . . . . . . . 16
+ 3.1. Scheme . . . . . . . . . . . . . . . . . . . . . . . . . 17
+ 3.2. Authority . . . . . . . . . . . . . . . . . . . . . . . 17
+ 3.2.1. User Information . . . . . . . . . . . . . . . . 18
+ 3.2.2. Host . . . . . . . . . . . . . . . . . . . . . . 18
+ 3.2.3. Port . . . . . . . . . . . . . . . . . . . . . . 22
+ 3.3. Path . . . . . . . . . . . . . . . . . . . . . . . . . . 22
+ 3.4. Query . . . . . . . . . . . . . . . . . . . . . . . . . 23
+ 3.5. Fragment . . . . . . . . . . . . . . . . . . . . . . . . 24
+ 4. Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
+ 4.1. URI Reference . . . . . . . . . . . . . . . . . . . . . 25
+ 4.2. Relative Reference . . . . . . . . . . . . . . . . . . . 26
+ 4.3. Absolute URI . . . . . . . . . . . . . . . . . . . . . . 27
+ 4.4. Same-Document Reference . . . . . . . . . . . . . . . . 27
+ 4.5. Suffix Reference . . . . . . . . . . . . . . . . . . . . 27
+ 5. Reference Resolution . . . . . . . . . . . . . . . . . . . . . 28
+ 5.1. Establishing a Base URI . . . . . . . . . . . . . . . . 28
+ 5.1.1. Base URI Embedded in Content . . . . . . . . . . 29
+ 5.1.2. Base URI from the Encapsulating Entity . . . . . 29
+ 5.1.3. Base URI from the Retrieval URI . . . . . . . . 30
+ 5.1.4. Default Base URI . . . . . . . . . . . . . . . . 30
+ 5.2. Relative Resolution . . . . . . . . . . . . . . . . . . 30
+ 5.2.1. Pre-parse the Base URI . . . . . . . . . . . . . 31
+ 5.2.2. Transform References . . . . . . . . . . . . . . 31
+ 5.2.3. Merge Paths . . . . . . . . . . . . . . . . . . 32
+ 5.2.4. Remove Dot Segments . . . . . . . . . . . . . . 33
+ 5.3. Component Recomposition . . . . . . . . . . . . . . . . 35
+ 5.4. Reference Resolution Examples . . . . . . . . . . . . . 35
+ 5.4.1. Normal Examples . . . . . . . . . . . . . . . . 36
+ 5.4.2. Abnormal Examples . . . . . . . . . . . . . . . 36
+
+
+
+Berners-Lee, et al. Standards Track [Page 2]
+
+RFC 3986 URI Generic Syntax January 2005
+
+
+ 6. Normalization and Comparison . . . . . . . . . . . . . . . . . 38
+ 6.1. Equivalence . . . . . . . . . . . . . . . . . . . . . . 38
+ 6.2. Comparison Ladder . . . . . . . . . . . . . . . . . . . 39
+ 6.2.1. Simple String Comparison . . . . . . . . . . . . 39
+ 6.2.2. Syntax-Based Normalization . . . . . . . . . . . 40
+ 6.2.3. Scheme-Based Normalization . . . . . . . . . . . 41
+ 6.2.4. Protocol-Based Normalization . . . . . . . . . . 42
+ 7. Security Considerations . . . . . . . . . . . . . . . . . . . 43
+ 7.1. Reliability and Consistency . . . . . . . . . . . . . . 43
+ 7.2. Malicious Construction . . . . . . . . . . . . . . . . . 43
+ 7.3. Back-End Transcoding . . . . . . . . . . . . . . . . . . 44
+ 7.4. Rare IP Address Formats . . . . . . . . . . . . . . . . 45
+ 7.5. Sensitive Information . . . . . . . . . . . . . . . . . 45
+ 7.6. Semantic Attacks . . . . . . . . . . . . . . . . . . . . 45
+ 8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 46
+ 9. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 46
+ 10. References . . . . . . . . . . . . . . . . . . . . . . . . . . 46
+ 10.1. Normative References . . . . . . . . . . . . . . . . . . 46
+ 10.2. Informative References . . . . . . . . . . . . . . . . . 47
+ A. Collected ABNF for URI . . . . . . . . . . . . . . . . . . . . 49
+ B. Parsing a URI Reference with a Regular Expression . . . . . . 50
+ C. Delimiting a URI in Context . . . . . . . . . . . . . . . . . 51
+ D. Changes from RFC 2396 . . . . . . . . . . . . . . . . . . . . 53
+ D.1. Additions . . . . . . . . . . . . . . . . . . . . . . . 53
+ D.2. Modifications . . . . . . . . . . . . . . . . . . . . . 53
+ Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
+ Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 60
+ Full Copyright Statement . . . . . . . . . . . . . . . . . . . . . 61
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Berners-Lee, et al. Standards Track [Page 3]
+
+RFC 3986 URI Generic Syntax January 2005
+
+
+1. Introduction
+
+ A Uniform Resource Identifier (URI) provides a simple and extensible
+ means for identifying a resource. This specification of URI syntax
+ and semantics is derived from concepts introduced by the World Wide
+ Web global information initiative, whose use of these identifiers
+ dates from 1990 and is described in "Universal Resource Identifiers
+ in WWW" [RFC1630]. The syntax is designed to meet the
+ recommendations laid out in "Functional Recommendations for Internet
+ Resource Locators" [RFC1736] and "Functional Requirements for Uniform
+ Resource Names" [RFC1737].
+
+ This document obsoletes [RFC2396], which merged "Uniform Resource
+ Locators" [RFC1738] and "Relative Uniform Resource Locators"
+ [RFC1808] in order to define a single, generic syntax for all URIs.
+ It obsoletes [RFC2732], which introduced syntax for an IPv6 address.
+ It excludes portions of RFC 1738 that defined the specific syntax of
+ individual URI schemes; those portions will be updated as separate
+ documents. The process for registration of new URI schemes is
+ defined separately by [BCP35]. Advice for designers of new URI
+ schemes can be found in [RFC2718]. All significant changes from RFC
+ 2396 are noted in Appendix D.
+
+ This specification uses the terms "character" and "coded character
+ set" in accordance with the definitions provided in [BCP19], and
+ "character encoding" in place of what [BCP19] refers to as a
+ "charset".
+
+1.1. Overview of URIs
+
+ URIs are characterized as follows:
+
+ Uniform
+
+ Uniformity provides several benefits. It allows different types
+ of resource identifiers to be used in the same context, even when
+ the mechanisms used to access those resources may differ. It
+ allows uniform semantic interpretation of common syntactic
+ conventions across different types of resource identifiers. It
+ allows introduction of new types of resource identifiers without
+ interfering with the way that existing identifiers are used. It
+ allows the identifiers to be reused in many different contexts,
+ thus permitting new applications or protocols to leverage a pre-
+ existing, large, and widely used set of resource identifiers.
+
+
+
+
+
+
+
+Berners-Lee, et al. Standards Track [Page 4]
+
+RFC 3986 URI Generic Syntax January 2005
+
+
+ Resource
+
+ This specification does not limit the scope of what might be a
+ resource; rather, the term "resource" is used in a general sense
+ for whatever might be identified by a URI. Familiar examples
+ include an electronic document, an image, a source of information
+ with a consistent purpose (e.g., "today's weather report for Los
+ Angeles"), a service (e.g., an HTTP-to-SMS gateway), and a
+ collection of other resources. A resource is not necessarily
+ accessible via the Internet; e.g., human beings, corporations, and
+ bound books in a library can also be resources. Likewise,
+ abstract concepts can be resources, such as the operators and
+ operands of a mathematical equation, the types of a relationship
+ (e.g., "parent" or "employee"), or numeric values (e.g., zero,
+ one, and infinity).
+
+ Identifier
+
+ An identifier embodies the information required to distinguish
+ what is being identified from all other things within its scope of
+ identification. Our use of the terms "identify" and "identifying"
+ refer to this purpose of distinguishing one resource from all
+ other resources, regardless of how that purpose is accomplished
+ (e.g., by name, address, or context). These terms should not be
+ mistaken as an assumption that an identifier defines or embodies
+ the identity of what is referenced, though that may be the case
+ for some identifiers. Nor should it be assumed that a system
+ using URIs will access the resource identified: in many cases,
+ URIs are used to denote resources without any intention that they
+ be accessed. Likewise, the "one" resource identified might not be
+ singular in nature (e.g., a resource might be a named set or a
+ mapping that varies over time).
+
+ A URI is an identifier consisting of a sequence of characters
+ matching the syntax rule named <URI> in Section 3. It enables
+ uniform identification of resources via a separately defined
+ extensible set of naming schemes (Section 3.1). How that
+ identification is accomplished, assigned, or enabled is delegated to
+ each scheme specification.
+
+ This specification does not place any limits on the nature of a
+ resource, the reasons why an application might seek to refer to a
+ resource, or the kinds of systems that might use URIs for the sake of
+ identifying resources. This specification does not require that a
+ URI persists in identifying the same resource over time, though that
+ is a common goal of all URI schemes. Nevertheless, nothing in this
+
+
+
+
+
+Berners-Lee, et al. Standards Track [Page 5]
+
+RFC 3986 URI Generic Syntax January 2005
+
+
+ specification prevents an application from limiting itself to
+ particular types of resources, or to a subset of URIs that maintains
+ characteristics desired by that application.
+
+ URIs have a global scope and are interpreted consistently regardless
+ of context, though the result of that interpretation may be in
+ relation to the end-user's context. For example, "http://localhost/"
+ has the same interpretation for every user of that reference, even
+ though the network interface corresponding to "localhost" may be
+ different for each end-user: interpretation is independent of access.
+ However, an action made on the basis of that reference will take
+ place in relation to the end-user's context, which implies that an
+ action intended to refer to a globally unique thing must use a URI
+ that distinguishes that resource from all other things. URIs that
+ identify in relation to the end-user's local context should only be
+ used when the context itself is a defining aspect of the resource,
+ such as when an on-line help manual refers to a file on the end-
+ user's file system (e.g., "file:///etc/hosts").
+
+1.1.1. Generic Syntax
+
+ Each URI begins with a scheme name, as defined in Section 3.1, that
+ refers to a specification for assigning identifiers within that
+ scheme. As such, the URI syntax is a federated and extensible naming
+ system wherein each scheme's specification may further restrict the
+ syntax and semantics of identifiers using that scheme.
+
+ This specification defines those elements of the URI syntax that are
+ required of all URI schemes or are common to many URI schemes. It
+ thus defines the syntax and semantics needed to implement a scheme-
+ independent parsing mechanism for URI references, by which the
+ scheme-dependent handling of a URI can be postponed until the
+ scheme-dependent semantics are needed. Likewise, protocols and data
+ formats that make use of URI references can refer to this
+ specification as a definition for the range of syntax allowed for all
+ URIs, including those schemes that have yet to be defined. This
+ decouples the evolution of identification schemes from the evolution
+ of protocols, data formats, and implementations that make use of
+ URIs.
+
+ A parser of the generic URI syntax can parse any URI reference into
+ its major components. Once the scheme is determined, further
+ scheme-specific parsing can be performed on the components. In other
+ words, the URI generic syntax is a superset of the syntax of all URI
+ schemes.
+
+
+
+
+
+
+Berners-Lee, et al. Standards Track [Page 6]
+
+RFC 3986 URI Generic Syntax January 2005
+
+
+1.1.2. Examples
+
+ The following example URIs illustrate several URI schemes and
+ variations in their common syntax components:
+
+ ftp://ftp.is.co.za/rfc/rfc1808.txt
+
+ http://www.ietf.org/rfc/rfc2396.txt
+
+ ldap://[2001:db8::7]/c=GB?objectClass?one
+
+ mailto:John.Doe@example.com
+
+ news:comp.infosystems.www.servers.unix
+
+ tel:+1-816-555-1212
+
+ telnet://192.0.2.16:80/
+
+ urn:oasis:names:specification:docbook:dtd:xml:4.1.2
+
+
+1.1.3. URI, URL, and URN
+
+ A URI can be further classified as a locator, a name, or both. The
+ term "Uniform Resource Locator" (URL) refers to the subset of URIs
+ that, in addition to identifying a resource, provide a means of
+ locating the resource by describing its primary access mechanism
+ (e.g., its network "location"). The term "Uniform Resource Name"
+ (URN) has been used historically to refer to both URIs under the
+ "urn" scheme [RFC2141], which are required to remain globally unique
+ and persistent even when the resource ceases to exist or becomes
+ unavailable, and to any other URI with the properties of a name.
+
+ An individual scheme does not have to be classified as being just one
+ of "name" or "locator". Instances of URIs from any given scheme may
+ have the characteristics of names or locators or both, often
+ depending on the persistence and care in the assignment of
+ identifiers by the naming authority, rather than on any quality of
+ the scheme. Future specifications and related documentation should
+ use the general term "URI" rather than the more restrictive terms
+ "URL" and "URN" [RFC3305].
+
+
+
+
+
+
+
+
+
+Berners-Lee, et al. Standards Track [Page 7]
+
+RFC 3986 URI Generic Syntax January 2005
+
+
+1.2. Design Considerations
+
+1.2.1. Transcription
+
+ The URI syntax has been designed with global transcription as one of
+ its main considerations. A URI is a sequence of characters from a
+ very limited set: the letters of the basic Latin alphabet, digits,
+ and a few special characters. A URI may be represented in a variety
+ of ways; e.g., ink on paper, pixels on a screen, or a sequence of
+ character encoding octets. The interpretation of a URI depends only
+ on the characters used and not on how those characters are
+ represented in a network protocol.
+
+ The goal of transcription can be described by a simple scenario.
+ Imagine two colleagues, Sam and Kim, sitting in a pub at an
+ international conference and exchanging research ideas. Sam asks Kim
+ for a location to get more information, so Kim writes the URI for the
+ research site on a napkin. Upon returning home, Sam takes out the
+ napkin and types the URI into a computer, which then retrieves the
+ information to which Kim referred.
+
+ There are several design considerations revealed by the scenario:
+
+ o A URI is a sequence of characters that is not always represented
+ as a sequence of octets.
+
+ o A URI might be transcribed from a non-network source and thus
+ should consist of characters that are most likely able to be
+ entered into a computer, within the constraints imposed by
+ keyboards (and related input devices) across languages and
+ locales.
+
+ o A URI often has to be remembered by people, and it is easier for
+ people to remember a URI when it consists of meaningful or
+ familiar components.
+
+ These design considerations are not always in alignment. For
+ example, it is often the case that the most meaningful name for a URI
+ component would require characters that cannot be typed into some
+ systems. The ability to transcribe a resource identifier from one
+ medium to another has been considered more important than having a
+ URI consist of the most meaningful of components.
+
+ In local or regional contexts and with improving technology, users
+ might benefit from being able to use a wider range of characters;
+ such use is not defined by this specification. Percent-encoded
+ octets (Section 2.1) may be used within a URI to represent characters
+ outside the range of the US-ASCII coded character set if this
+
+
+
+Berners-Lee, et al. Standards Track [Page 8]
+
+RFC 3986 URI Generic Syntax January 2005
+
+
+ representation is allowed by the scheme or by the protocol element in
+ which the URI is referenced. Such a definition should specify the
+ character encoding used to map those characters to octets prior to
+ being percent-encoded for the URI.
+
+1.2.2. Separating Identification from Interaction
+
+ A common misunderstanding of URIs is that they are only used to refer
+ to accessible resources. The URI itself only provides
+ identification; access to the resource is neither guaranteed nor
+ implied by the presence of a URI. Instead, any operation associated
+ with a URI reference is defined by the protocol element, data format
+ attribute, or natural language text in which it appears.
+
+ Given a URI, a system may attempt to perform a variety of operations
+ on the resource, as might be characterized by words such as "access",
+ "update", "replace", or "find attributes". Such operations are
+ defined by the protocols that make use of URIs, not by this
+ specification. However, we do use a few general terms for describing
+ common operations on URIs. URI "resolution" is the process of
+ determining an access mechanism and the appropriate parameters
+ necessary to dereference a URI; this resolution may require several
+ iterations. To use that access mechanism to perform an action on the
+ URI's resource is to "dereference" the URI.
+
+ When URIs are used within information retrieval systems to identify
+ sources of information, the most common form of URI dereference is
+ "retrieval": making use of a URI in order to retrieve a
+ representation of its associated resource. A "representation" is a
+ sequence of octets, along with representation metadata describing
+ those octets, that constitutes a record of the state of the resource
+ at the time when the representation is generated. Retrieval is
+ achieved by a process that might include using the URI as a cache key
+ to check for a locally cached representation, resolution of the URI
+ to determine an appropriate access mechanism (if any), and
+ dereference of the URI for the sake of applying a retrieval
+ operation. Depending on the protocols used to perform the retrieval,
+ additional information might be supplied about the resource (resource
+ metadata) and its relation to other resources.
+
+ URI references in information retrieval systems are designed to be
+ late-binding: the result of an access is generally determined when it
+ is accessed and may vary over time or due to other aspects of the
+ interaction. These references are created in order to be used in the
+ future: what is being identified is not some specific result that was
+ obtained in the past, but rather some characteristic that is expected
+ to be true for future results. In such cases, the resource referred
+ to by the URI is actually a sameness of characteristics as observed
+
+
+
+Berners-Lee, et al. Standards Track [Page 9]
+
+RFC 3986 URI Generic Syntax January 2005
+
+
+ over time, perhaps elucidated by additional comments or assertions
+ made by the resource provider.
+
+ Although many URI schemes are named after protocols, this does not
+ imply that use of these URIs will result in access to the resource
+ via the named protocol. URIs are often used simply for the sake of
+ identification. Even when a URI is used to retrieve a representation
+ of a resource, that access might be through gateways, proxies,
+ caches, and name resolution services that are independent of the
+ protocol associated with the scheme name. The resolution of some
+ URIs may require the use of more than one protocol (e.g., both DNS
+ and HTTP are typically used to access an "http" URI's origin server
+ when a representation isn't found in a local cache).
+
+1.2.3. Hierarchical Identifiers
+
+ The URI syntax is organized hierarchically, with components listed in
+ order of decreasing significance from left to right. For some URI
+ schemes, the visible hierarchy is limited to the scheme itself:
+ everything after the scheme component delimiter (":") is considered
+ opaque to URI processing. Other URI schemes make the hierarchy
+ explicit and visible to generic parsing algorithms.
+
+ The generic syntax uses the slash ("/"), question mark ("?"), and
+ number sign ("#") characters to delimit components that are
+ significant to the generic parser's hierarchical interpretation of an
+ identifier. In addition to aiding the readability of such
+ identifiers through the consistent use of familiar syntax, this
+ uniform representation of hierarchy across naming schemes allows
+ scheme-independent references to be made relative to that hierarchy.
+
+ It is often the case that a group or "tree" of documents has been
+ constructed to serve a common purpose, wherein the vast majority of
+ URI references in these documents point to resources within the tree
+ rather than outside it. Similarly, documents located at a particular
+ site are much more likely to refer to other resources at that site
+ than to resources at remote sites. Relative referencing of URIs
+ allows document trees to be partially independent of their location
+ and access scheme. For instance, it is possible for a single set of
+ hypertext documents to be simultaneously accessible and traversable
+ via each of the "file", "http", and "ftp" schemes if the documents
+ refer to each other with relative references. Furthermore, such
+ document trees can be moved, as a whole, without changing any of the
+ relative references.
+
+ A relative reference (Section 4.2) refers to a resource by describing
+ the difference within a hierarchical name space between the reference
+ context and the target URI. The reference resolution algorithm,
+
+
+
+Berners-Lee, et al. Standards Track [Page 10]
+
+RFC 3986 URI Generic Syntax January 2005
+
+
+ presented in Section 5, defines how such a reference is transformed
+ to the target URI. As relative references can only be used within
+ the context of a hierarchical URI, designers of new URI schemes
+ should use a syntax consistent with the generic syntax's hierarchical
+ components unless there are compelling reasons to forbid relative
+ referencing within that scheme.
+
+ NOTE: Previous specifications used the terms "partial URI" and
+ "relative URI" to denote a relative reference to a URI. As some
+ readers misunderstood those terms to mean that relative URIs are a
+ subset of URIs rather than a method of referencing URIs, this
+ specification simply refers to them as relative references.
+
+ All URI references are parsed by generic syntax parsers when used.
+ However, because hierarchical processing has no effect on an absolute
+ URI used in a reference unless it contains one or more dot-segments
+ (complete path segments of "." or "..", as described in Section 3.3),
+ URI scheme specifications can define opaque identifiers by
+ disallowing use of slash characters, question mark characters, and
+ the URIs "scheme:." and "scheme:..".
+
+1.3. Syntax Notation
+
+ This specification uses the Augmented Backus-Naur Form (ABNF)
+ notation of [RFC2234], including the following core ABNF syntax rules
+ defined by that specification: ALPHA (letters), CR (carriage return),
+ DIGIT (decimal digits), DQUOTE (double quote), HEXDIG (hexadecimal
+ digits), LF (line feed), and SP (space). The complete URI syntax is
+ collected in Appendix A.
+
+2. Characters
+
+ The URI syntax provides a method of encoding data, presumably for the
+ sake of identifying a resource, as a sequence of characters. The URI
+ characters are, in turn, frequently encoded as octets for transport
+ or presentation. This specification does not mandate any particular
+ character encoding for mapping between URI characters and the octets
+ used to store or transmit those characters. When a URI appears in a
+ protocol element, the character encoding is defined by that protocol;
+ without such a definition, a URI is assumed to be in the same
+ character encoding as the surrounding text.
+
+ The ABNF notation defines its terminal values to be non-negative
+ integers (codepoints) based on the US-ASCII coded character set
+ [ASCII]. Because a URI is a sequence of characters, we must invert
+ that relation in order to understand the URI syntax. Therefore, the
+ integer values used by the ABNF must be mapped back to their
+ corresponding characters via US-ASCII in order to complete the syntax
+ rules.
+
+ A URI is composed from a limited set of characters consisting of
+ digits, letters, and a few graphic symbols. A reserved subset of
+ those characters may be used to delimit syntax components within a
+ URI while the remaining characters, including both the unreserved set
+ and those reserved characters not acting as delimiters, define each
+ component's identifying data.
+
+2.1. Percent-Encoding
+
+ A percent-encoding mechanism is used to represent a data octet in a
+ component when that octet's corresponding character is outside the
+ allowed set or is being used as a delimiter of, or within, the
+ component. A percent-encoded octet is encoded as a character
+ triplet, consisting of the percent character "%" followed by the two
+ hexadecimal digits representing that octet's numeric value. For
+ example, "%20" is the percent-encoding for the binary octet
+ "00100000" (ABNF: %x20), which in US-ASCII corresponds to the space
+ character (SP). Section 2.4 describes when percent-encoding and
+ decoding is applied.
+
+ pct-encoded = "%" HEXDIG HEXDIG
+
+ The uppercase hexadecimal digits 'A' through 'F' are equivalent to
+ the lowercase digits 'a' through 'f', respectively. If two URIs
+ differ only in the case of hexadecimal digits used in percent-encoded
+ octets, they are equivalent. For consistency, URI producers and
+ normalizers should use uppercase hexadecimal digits for all percent-
+ encodings.
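As an illustration only, the pct-encoded rule and the uppercase-hex recommendation above might be sketched in Python as follows. The helper names `pct_encode_octet` and `uppercase_pct_triplets` are hypothetical and are not defined by this specification.

```python
import re

# pct-encoded = "%" HEXDIG HEXDIG (either case is accepted on input)
PCT_ENCODED = re.compile(r"%[0-9A-Fa-f]{2}")


def pct_encode_octet(octet: int) -> str:
    """Return the percent-encoded triplet for one octet, using uppercase hex."""
    if not 0 <= octet <= 0xFF:
        raise ValueError("an octet must be in the range 0..255")
    return "%{:02X}".format(octet)


def uppercase_pct_triplets(uri: str) -> str:
    """Uppercase the hex digits of every percent-encoded triplet in uri."""
    return PCT_ENCODED.sub(lambda m: m.group(0).upper(), uri)
```

Only the triplets are touched by `uppercase_pct_triplets`; the case of all other characters is left alone, since case may be significant elsewhere in a URI.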
+
+2.2. Reserved Characters
+
+ URIs include components and subcomponents that are delimited by
+ characters in the "reserved" set. These characters are called
+ "reserved" because they may (or may not) be defined as delimiters by
+ the generic syntax, by each scheme-specific syntax, or by the
+ implementation-specific syntax of a URI's dereferencing algorithm.
+ If data for a URI component would conflict with a reserved
+ character's purpose as a delimiter, then the conflicting data must be
+ percent-encoded before the URI is formed.
+
+ reserved = gen-delims / sub-delims
+
+ gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
+
+ sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
+ / "*" / "+" / "," / ";" / "="
+
+ The purpose of reserved characters is to provide a set of delimiting
+ characters that are distinguishable from other data within a URI.
+ URIs that differ in the replacement of a reserved character with its
+ corresponding percent-encoded octet are not equivalent. Percent-
+ encoding a reserved character, or decoding a percent-encoded octet
+ that corresponds to a reserved character, will change how the URI is
+ interpreted by most applications. Thus, characters in the reserved
+ set are protected from normalization and are therefore safe to be
+ used by scheme-specific and producer-specific algorithms for
+ delimiting data subcomponents within a URI.
+
+ A subset of the reserved characters (gen-delims) is used as
+ delimiters of the generic URI components described in Section 3. A
+ component's ABNF syntax rule will not use the reserved or gen-delims
+ rule names directly; instead, each syntax rule lists the characters
+ allowed within that component (i.e., not delimiting it), and any of
+ those characters that are also in the reserved set are "reserved" for
+ use as subcomponent delimiters within the component. Only the most
+ common subcomponents are defined by this specification; other
+ subcomponents may be defined by a URI scheme's specification, or by
+ the implementation-specific syntax of a URI's dereferencing
+ algorithm, provided that such subcomponents are delimited by
+ characters in the reserved set allowed within that component.
+
+ URI producing applications should percent-encode data octets that
+ correspond to characters in the reserved set unless these characters
+ are specifically allowed by the URI scheme to represent data in that
+ component. If a reserved character is found in a URI component and
+ no delimiting role is known for that character, then it must be
+ interpreted as representing the data octet corresponding to that
+ character's encoding in US-ASCII.
+
+2.3. Unreserved Characters
+
+ Characters that are allowed in a URI but do not have a reserved
+ purpose are called unreserved. These include uppercase and lowercase
+ letters, decimal digits, hyphen, period, underscore, and tilde.
+
+ unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
+
+ URIs that differ in the replacement of an unreserved character with
+ its corresponding percent-encoded US-ASCII octet are equivalent: they
+ identify the same resource. However, URI comparison implementations
+ do not always perform normalization prior to comparison (see Section
+ 6). For consistency, percent-encoded octets in the ranges of ALPHA
+ (%41-%5A and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E),
+ underscore (%5F), or tilde (%7E) should not be created by URI
+ producers and, when found in a URI, should be decoded to their
+ corresponding unreserved characters by URI normalizers.
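A normalizer following the recommendation above could, as a rough sketch, decode only those triplets that correspond to unreserved characters and leave every other triplet encoded. The function name `decode_unreserved` is an assumption made for this example.

```python
import re
import string

# The unreserved set: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def decode_unreserved(uri: str) -> str:
    """Decode percent-encoded unreserved characters; keep other triplets encoded."""
    def replace(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char
        # Reserved or non-ASCII octet: keep the triplet, with uppercase hex.
        return match.group(0).upper()

    return re.sub(r"%([0-9A-Fa-f]{2})", replace, uri)
```

For instance, `decode_unreserved("%7Euser/a%2fb")` yields `"~user/a%2Fb"`: the tilde is decoded, while the encoded slash stays encoded because decoding it would change how the path is split into segments.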
+
+2.4. When to Encode or Decode
+
+ Under normal circumstances, the only time when octets within a URI
+ are percent-encoded is during the process of producing the URI from
+ its component parts. This is when an implementation determines which
+ of the reserved characters are to be used as subcomponent delimiters
+ and which can be safely used as data. Once produced, a URI is always
+ in its percent-encoded form.
+
+ When a URI is dereferenced, the components and subcomponents
+ significant to the scheme-specific dereferencing process (if any)
+ must be parsed and separated before the percent-encoded octets within
+ those components can be safely decoded, as otherwise the data may be
+ mistaken for component delimiters. The only exception is for
+ percent-encoded octets corresponding to characters in the unreserved
+ set, which can be decoded at any time. For example, the octet
+ corresponding to the tilde ("~") character is often encoded as "%7E"
+ by older URI processing implementations; the "%7E" can be replaced by
+ "~" without changing its interpretation.
+
+ Because the percent ("%") character serves as the indicator for
+ percent-encoded octets, it must be percent-encoded as "%25" for that
+ octet to be used as data within a URI. Implementations must not
+ percent-encode or decode the same string more than once, as decoding
+ an already decoded string might lead to misinterpreting a percent
+ data octet as the beginning of a percent-encoding, or vice versa in
+ the case of percent-encoding an already percent-encoded string.
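The hazard of encoding or decoding twice can be seen in a small experiment. This sketch uses `quote` and `unquote` from Python's `urllib.parse`; the sample data string is hypothetical.

```python
from urllib.parse import quote, unquote

# Hypothetical data: the three literal characters "%", "4", "1".
data = "%41"

encoded_once = quote(data, safe="")            # the "%" becomes "%25"
decoded_once = unquote(encoded_once)           # recovers the original data
decoded_twice = unquote(decoded_once)          # wrong: "%41" is now read as "A"
encoded_twice = quote(encoded_once, safe="")   # wrong: "%25" becomes "%2525"
```

Here `decoded_once` equals the original data, whereas `decoded_twice` and `encoded_twice` no longer represent it.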
+
+2.5. Identifying Data
+
+ URI characters provide identifying data for each of the URI
+ components, serving as an external interface for identification
+ between systems. Although the presence and nature of the URI
+ production interface is hidden from clients that use its URIs (and is
+ thus beyond the scope of the interoperability requirements defined by
+ this specification), it is a frequent source of confusion and errors
+ in the interpretation of URI character issues. Implementers have to
+ be aware that there are multiple character encodings involved in the
+
+ production and transmission of URIs: local name and data encoding,
+ public interface encoding, URI character encoding, data format
+ encoding, and protocol encoding.
+
+ Local names, such as file system names, are stored with a local
+ character encoding. URI producing applications (e.g., origin
+ servers) will typically use the local encoding as the basis for
+ producing meaningful names. The URI producer will transform the
+ local encoding to one that is suitable for a public interface and
+ then transform the public interface encoding into the restricted set
+ of URI characters (reserved, unreserved, and percent-encodings).
+ Those characters are, in turn, encoded as octets to be used as a
+ reference within a data format (e.g., a document charset), and such
+ data formats are often subsequently encoded for transmission over
+ Internet protocols.
+
+ For most systems, an unreserved character appearing within a URI
+ component is interpreted as representing the data octet corresponding
+ to that character's encoding in US-ASCII. Consumers of URIs assume
+ that the letter "X" corresponds to the octet "01011000", and even
+ when that assumption is incorrect, there is no harm in making it. A
+ system that internally provides identifiers in the form of a
+ different character encoding, such as EBCDIC, will generally perform
+ character translation of textual identifiers to UTF-8 [STD63] (or
+ some other superset of the US-ASCII character encoding) at an
+ internal interface, thereby providing more meaningful identifiers
+ than those resulting from simply percent-encoding the original
+ octets.
+
+ For example, consider an information service that provides data,
+ stored locally using an EBCDIC-based file system, to clients on the
+ Internet through an HTTP server. When an author creates a file with
+ the name "Laguna Beach" on that file system, the "http" URI
+ corresponding to that resource is expected to contain the meaningful
+ string "Laguna%20Beach". If, however, that server produces URIs by
+ using an overly simplistic raw octet mapping, then the result would
+ be a URI containing "%D3%81%87%A4%95%81@%C2%85%81%83%88". An
+ internal transcoding interface fixes this problem by transcoding the
+ local name to a superset of US-ASCII prior to producing the URI.
+ Naturally, proper interpretation of an incoming URI on such an
+ interface requires that percent-encoded octets be decoded (e.g.,
+ "%20" to SP) before the reverse transcoding is applied to obtain the
+ local name.
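The "Laguna Beach" example can be reproduced approximately in Python. The sketch assumes that the local file system uses EBCDIC code page 037 (Python codec `cp037`) and that the public interface encoding is UTF-8; the text above names neither choice explicitly.

```python
from urllib.parse import quote, unquote_to_bytes

local_name = "Laguna Beach"

# Octets as stored locally (assumed code page: EBCDIC cp037).
raw_octets = local_name.encode("cp037")

# Overly simplistic mapping: percent-encode the raw local octets.
# The EBCDIC space is octet 0x40, which reads as "@" in US-ASCII.
naive = quote(raw_octets, safe="@")

# Transcode to a superset of US-ASCII first, then percent-encode.
transcoded = quote(local_name.encode("utf-8"), safe="")

# Incoming direction: decode the percent-encoding, then reverse-transcode.
recovered = unquote_to_bytes(transcoded).decode("utf-8").encode("cp037")
```

Under these assumptions `naive` is the opaque string quoted in the text, `transcoded` is the meaningful `Laguna%20Beach`, and `recovered` matches the octets of the local name.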
+
+ In some cases, the internal interface between a URI component and the
+ identifying data that it has been crafted to represent is much less
+ direct than a character encoding translation. For example, portions
+ of a URI might reflect a query on non-ASCII data, or numeric
+
+ coordinates on a map. Likewise, a URI scheme may define components
+ with additional encoding requirements that are applied prior to
+ forming the component and producing the URI.
+
+ When a new URI scheme defines a component that represents textual
+ data consisting of characters from the Universal Character Set [UCS],
+ the data should first be encoded as octets according to the UTF-8
+ character encoding [STD63]; then only those octets that do not
+ correspond to characters in the unreserved set should be percent-
+ encoded. For example, the character A would be represented as "A",
+ the character LATIN CAPITAL LETTER A WITH GRAVE would be represented
+ as "%C3%80", and the character KATAKANA LETTER A would be represented
+ as "%E3%82%A2".
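The three examples above can be checked with a short sketch. It relies on `urllib.parse.quote`, which leaves letters, digits, and "-", ".", "_", "~" unencoded in Python 3.7 and later; the helper name `encode_text` is hypothetical.

```python
from urllib.parse import quote


def encode_text(text: str) -> str:
    """Encode text as UTF-8 octets, then percent-encode every octet that
    does not correspond to an unreserved character."""
    return quote(text.encode("utf-8"), safe="")


latin_a = encode_text("A")               # unreserved, stays as is
a_with_grave = encode_text("\u00C0")     # LATIN CAPITAL LETTER A WITH GRAVE
katakana_a = encode_text("\u30A2")       # KATAKANA LETTER A
```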
+
+3. Syntax Components
+
+ The generic URI syntax consists of a hierarchical sequence of
+ components referred to as the scheme, authority, path, query, and
+ fragment.
+
+ URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
+
+ hier-part = "//" authority path-abempty
+ / path-absolute
+ / path-rootless
+ / path-empty
+
+ The scheme and path components are required, though the path may be
+ empty (no characters). When authority is present, the path must
+ either be empty or begin with a slash ("/") character. When
+ authority is not present, the path cannot begin with two slash
+ characters ("//"). These restrictions result in five different ABNF
+ rules for a path (Section 3.3), only one of which will match any
+ given URI reference.
+
+ The following are two example URIs and their component parts:
+
+    foo://example.com:8042/over/there?name=ferret#nose
+    \_/   \______________/\_________/ \_________/ \__/
+     |           |            |            |        |
+ scheme     authority       path        query   fragment
+     |   _____________________|__
+    / \ /                        \
+    urn:example:animal:ferret:nose
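A non-validating split into the five components can be sketched with the regular expression that this RFC gives in its Appendix B. The wrapper function `split_uri` and the dictionary layout are assumptions made for this example.

```python
import re

# Regular expression from Appendix B of RFC 3986.
URI_SPLIT = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_uri(uri: str) -> dict:
    """Split a URI reference into its five generic components.

    A component that is absent is reported as None; the path is always
    defined, though it may be the empty string.
    """
    m = URI_SPLIT.match(uri)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }
```

Applied to the two example URIs, this reports scheme "foo", authority "example.com:8042", path "/over/there", query "name=ferret", and fragment "nose" for the first, and only a scheme "urn" and path "example:animal:ferret:nose" for the second. The expression performs no validation; it merely separates the components.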
+
+3.1. Scheme
+
+ Each URI begins with a scheme name that refers to a specification for
+ assigning identifiers within that scheme. As such, the URI syntax is
+ a federated and extensible naming system wherein each scheme's
+ specification may further restrict the syntax and semantics of
+ identifiers using that scheme.
+
+ Scheme names consist of a sequence of characters beginning with a
+ letter and followed by any combination of letters, digits, plus
+ ("+"), period ("."), or hyphen ("-"). Although schemes are case-
+ insensitive, the canonical form is lowercase and documents that
+ specify schemes must do so with lowercase letters. An implementation
+ should accept uppercase letters as equivalent to lowercase in scheme
+ names (e.g., allow "HTTP" as well as "http") for the sake of
+ robustness but should only produce lowercase scheme names for
+ consistency.
+
+ scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
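As a minimal sketch, the scheme rule and the advice to accept uppercase but produce lowercase could look like this in Python. The function name `normalize_scheme` is hypothetical.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")


def normalize_scheme(scheme: str) -> str:
    """Validate a scheme name and return its canonical lowercase form."""
    if SCHEME.fullmatch(scheme) is None:
        raise ValueError("not a valid scheme name: {!r}".format(scheme))
    return scheme.lower()
```

With this sketch, `normalize_scheme("HTTP")` returns `"http"`, while a name such as `"1http"` is rejected because it does not begin with a letter.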
+
+ Individual schemes are not specified by this document. The process
+ for registration of new URI schemes is defined separately by [BCP35].
+ The scheme registry maintains the mapping between scheme names and
+ their specifications. Advice for designers of new URI schemes can be
+ found in [RFC2718]. URI scheme specifications must define their own
+ syntax so that all strings matching their scheme-specific syntax will
+ also match the <absolute-URI> grammar, as described in Section 4.3.
+
+ When presented with a URI that violates one or more scheme-specific
+ restrictions, the scheme-specific resolution process should flag the
+ reference as an error rather than ignore the unused parts; doing so
+ reduces the number of equivalent URIs and helps detect abuses of the
+ generic syntax, which might indicate that the URI has been
+ constructed to mislead the user (Section 7.6).
+
+3.2. Authority
+
+ Many URI schemes include a hierarchical element for a naming
+ authority so that governance of the name space defined by the
+ remainder of the URI is delegated to that authority (which may, in
+ turn, delegate it further). The generic syntax provides a common
+ means for distinguishing an authority based on a registered name or
+ server address, along with optional port and user information.
+
+ The authority component is preceded by a double slash ("//") and is
+ terminated by the next slash ("/"), question mark ("?"), or number
+ sign ("#") character, or by the end of the URI.
+
+ authority = [ userinfo "@" ] host [ ":" port ]
+
+ URI producers and normalizers should omit the ":" delimiter that
+ separates host from port if the port component is empty. Some
+ schemes do not allow the userinfo and/or port subcomponents.
+
+ If a URI contains an authority component, then the path component
+ must either be empty or begin with a slash ("/") character. Non-
+ validating parsers (those that merely separate a URI reference into
+ its major components) will often ignore the subcomponent structure of
+ authority, treating it as an opaque string from the double-slash to
+ the first terminating delimiter, until such time as the URI is
+ dereferenced.
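When the subcomponent structure is needed, an authority string might be separated as sketched below. This is an illustration under the assumption that the authority has already been isolated from the rest of the URI; the function name `split_authority` is hypothetical, and the host itself is not validated.

```python
def split_authority(authority: str):
    """Split an authority into (userinfo, host, port).

    userinfo and port are None when their delimiter is absent; port is
    the empty string when a ":" is present but no digits follow it.
    """
    userinfo, at, hostport = authority.rpartition("@")
    if not at:
        userinfo = None

    if hostport.startswith("["):
        # IP-literal: colons inside the brackets are not port delimiters.
        end = hostport.find("]")
        if end == -1:
            raise ValueError("unterminated IP-literal")
        host, rest = hostport[:end + 1], hostport[end + 1:]
        if rest and not rest.startswith(":"):
            raise ValueError("unexpected characters after IP-literal")
        port = rest[1:] if rest else None
    else:
        host, colon, port = hostport.rpartition(":")
        if not colon:
            host, port = hostport, None

    # port = *DIGIT
    if port is not None and any(c not in "0123456789" for c in port):
        raise ValueError("port must consist of decimal digits only")
    return userinfo, host, port
```

For example, `split_authority("user@[2001:db8::7]:80")` gives `("user", "[2001:db8::7]", "80")`.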
+
+3.2.1. User Information
+
+ The userinfo subcomponent may consist of a user name and, optionally,
+ scheme-specific information about how to gain authorization to access
+ the resource. The user information, if present, is followed by a
+ commercial at-sign ("@") that delimits it from the host.
+
+ userinfo = *( unreserved / pct-encoded / sub-delims / ":" )
+
+ Use of the format "user:password" in the userinfo field is
+ deprecated. Applications should not render as clear text any data
+ after the first colon (":") character found within a userinfo
+ subcomponent unless the data after the colon is the empty string
+ (indicating no password). Applications may choose to ignore or
+ reject such data when it is received as part of a reference and
+ should reject the storage of such data in unencrypted form. The
+ passing of authentication information in clear text has proven to be
+ a security risk in almost every case where it has been used.
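One possible way to honor the rendering advice above is to mask whatever follows the first colon before showing userinfo to a user. The function name `render_userinfo` and the mask string are assumptions of this sketch; an application could equally drop the data or reject the reference.

```python
def render_userinfo(userinfo: str, mask: str = "****") -> str:
    """Return userinfo in a form suitable for display.

    Data after the first ":" is replaced by a mask unless it is the
    empty string, which indicates that no password is present.
    """
    user, colon, secret = userinfo.partition(":")
    if colon and secret:
        return user + ":" + mask
    return userinfo
```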
+
+ Applications that render a URI for the sake of user feedback, such as
+ in graphical hypertext browsing, should render userinfo in a way that
+ is distinguished from the rest of a URI, when feasible. Such
+ rendering will assist the user in cases where the userinfo has been
+ misleadingly crafted to look like a trusted domain name
+ (Section 7.6).
+
+3.2.2. Host
+
+ The host subcomponent of authority is identified by an IP literal
+ encapsulated within square brackets, an IPv4 address in dotted-
+ decimal form, or a registered name. The host subcomponent is case-
+ insensitive. The presence of a host subcomponent within a URI does
+ not imply that the scheme requires access to the given host on the
+ Internet. In many cases, the host syntax is used only for the sake
+
+ of reusing the existing registration process created and deployed for
+ DNS, thus obtaining a globally unique name without the cost of
+ deploying another registry. However, such use comes with its own
+ costs: domain name ownership may change over time for reasons not
+ anticipated by the URI producer. In other cases, the data within the
+ host component identifies a registered name that has nothing to do
+ with an Internet host. We use the name "host" for the ABNF rule
+ because that is its most common purpose, not its only purpose.
+
+ host = IP-literal / IPv4address / reg-name
+
+ The syntax rule for host is ambiguous because it does not completely
+ distinguish between an IPv4address and a reg-name. In order to
+ disambiguate the syntax, we apply the "first-match-wins" algorithm:
+ If host matches the rule for IPv4address, then it should be
+ considered an IPv4 address literal and not a reg-name. Although host
+ is case-insensitive, producers and normalizers should use lowercase
+ for registered names and hexadecimal addresses for the sake of
+ uniformity, while only using uppercase letters for percent-encodings.
+
+ A host identified by an Internet Protocol literal address, version 6
+ [RFC3513] or later, is distinguished by enclosing the IP literal
+ within square brackets ("[" and "]"). This is the only place where
+ square bracket characters are allowed in the URI syntax. In
+ anticipation of future, as-yet-undefined IP literal address formats,
+ an implementation may use an optional version flag to indicate such a
+ format explicitly rather than rely on heuristic determination.
+
+ IP-literal = "[" ( IPv6address / IPvFuture ) "]"
+
+ IPvFuture = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )
+
+ The version flag does not indicate the IP version; rather, it
+ indicates future versions of the literal format. As such,
+ implementations must not provide the version flag for the existing
+ IPv4 and IPv6 literal address forms described below. If a URI
+ containing an IP-literal that starts with "v" (case-insensitive),
+ indicating that the version flag is present, is dereferenced by an
+ application that does not know the meaning of that version flag, then
+ the application should return an appropriate error for "address
+ mechanism not supported".
+
+ A host identified by an IPv6 literal address is represented inside
+ the square brackets without a preceding version flag. The ABNF
+ provided here is a translation of the text definition of an IPv6
+ literal address provided in [RFC3513]. This syntax does not support
+ IPv6 scoped addressing zone identifiers.
+
+ A 128-bit IPv6 address is divided into eight 16-bit pieces. Each
+ piece is represented numerically in case-insensitive hexadecimal,
+ using one to four hexadecimal digits (leading zeroes are permitted).
+ The eight encoded pieces are given most-significant first, separated
+ by colon characters. Optionally, the least-significant two pieces
+ may instead be represented in IPv4 address textual format. A
+ sequence of one or more consecutive zero-valued 16-bit pieces within
+ the address may be elided, omitting all their digits and leaving
+ exactly two consecutive colons in their place to mark the elision.
+
+ IPv6address =                            6( h16 ":" ) ls32
+             /                       "::" 5( h16 ":" ) ls32
+             / [               h16 ] "::" 4( h16 ":" ) ls32
+             / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
+             / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
+             / [ *3( h16 ":" ) h16 ] "::"    h16 ":"   ls32
+             / [ *4( h16 ":" ) h16 ] "::"              ls32
+             / [ *5( h16 ":" ) h16 ] "::"              h16
+             / [ *6( h16 ":" ) h16 ] "::"
+
+ ls32 = ( h16 ":" h16 ) / IPv4address
+ ; least-significant 32 bits of address
+
+ h16 = 1*4HEXDIG
+ ; 16 bits of address represented in hexadecimal
+
+ A host identified by an IPv4 literal address is represented in
+ dotted-decimal notation (a sequence of four decimal numbers in the
+ range 0 to 255, separated by "."), as described in [RFC1123] by
+ reference to [RFC0952]. Note that other forms of dotted notation may
+ be interpreted on some platforms, as described in Section 7.4, but
+ only the dotted-decimal form of four octets is allowed by this
+ grammar.
+
+ IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
+
+ dec-octet = DIGIT ; 0-9
+ / %x31-39 DIGIT ; 10-99
+ / "1" 2DIGIT ; 100-199
+ / "2" %x30-34 DIGIT ; 200-249
+ / "25" %x30-35 ; 250-255
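The dec-octet and IPv4address rules translate fairly directly into a regular expression. The sketch below also illustrates the "first-match-wins" disambiguation between IPv4address and reg-name described earlier in this section; the names `is_ipv4address` and `classify_host` are hypothetical, and the contents of an IP-literal are not checked.

```python
import re

# dec-octet: 0-9 / 10-99 / 100-199 / 200-249 / 250-255 (no leading zeros)
DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"
IPV4ADDRESS = re.compile(r"{0}\.{0}\.{0}\.{0}".format(DEC_OCTET))


def is_ipv4address(host: str) -> bool:
    """Return True if host matches the IPv4address rule exactly."""
    return IPV4ADDRESS.fullmatch(host) is not None


def classify_host(host: str) -> str:
    """Name the host alternative that applies, trying IPv4address before
    reg-name."""
    if host.startswith("["):
        return "IP-literal"
    if is_ipv4address(host):
        return "IPv4address"
    return "reg-name"
```

A string such as "256.0.0.1" fails the IPv4address rule and is therefore treated as a reg-name in this sketch, even though it looks like an address.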
+
+ A host identified by a registered name is a sequence of characters
+ usually intended for lookup within a locally defined host or service
+ name registry, though the URI's scheme-specific semantics may require
+ that a specific registry (or fixed name table) be used instead. The
+ most common name registry mechanism is the Domain Name System (DNS).
+ A registered name intended for lookup in the DNS uses the syntax
+
+ defined in Section 3.5 of [RFC1034] and Section 2.1 of [RFC1123].
+ Such a name consists of a sequence of domain labels separated by ".",
+ each domain label starting and ending with an alphanumeric character
+ and possibly also containing "-" characters. The rightmost domain
+ label of a fully qualified domain name in DNS may be followed by a
+ single "." and should be if it is necessary to distinguish between
+ the complete domain name and some local domain.
+
+ reg-name = *( unreserved / pct-encoded / sub-delims )
+
+ If the URI scheme defines a default for host, then that default
+ applies when the host subcomponent is undefined or when the
+ registered name is empty (zero length). For example, the "file" URI
+ scheme is defined so that no authority, an empty host, and
+ "localhost" all mean the end-user's machine, whereas the "http"
+ scheme considers a missing authority or empty host invalid.
+
+ This specification does not mandate a particular registered name
+ lookup technology and therefore does not restrict the syntax of reg-
+ name beyond what is necessary for interoperability. Instead, it
+ delegates the issue of registered name syntax conformance to the
+ operating system of each application performing URI resolution, and
+ that operating system decides what it will allow for the purpose of
+ host identification. A URI resolution implementation might use DNS,
+ host tables, yellow pages, NetInfo, WINS, or any other system for
+ lookup of registered names. However, a globally scoped naming
+ system, such as DNS fully qualified domain names, is necessary for
+ URIs intended to have global scope. URI producers should use names
+ that conform to the DNS syntax, even when use of DNS is not
+ immediately apparent, and should limit these names to no more than
+ 255 characters in length.
+
+ The reg-name syntax allows percent-encoded octets in order to
+ represent non-ASCII registered names in a uniform way that is
+ independent of the underlying name resolution technology. Non-ASCII
+ characters must first be encoded according to UTF-8 [STD63], and then
+ each octet of the corresponding UTF-8 sequence must be percent-
+ encoded to be represented as URI characters. URI producing
+ applications must not use percent-encoding in host unless it is used
+ to represent a UTF-8 character sequence. When a non-ASCII registered
+ name represents an internationalized domain name intended for
+ resolution via the DNS, the name must be transformed to the IDNA
+ encoding [RFC3490] prior to name lookup. URI producers should
+ provide these registered names in the IDNA encoding, rather than a
+ percent-encoding, if they wish to maximize interoperability with
+ legacy URI resolvers.
+
+3.2.3. Port
+
+ The port subcomponent of authority is designated by an optional port
+ number in decimal following the host and delimited from it by a
+ single colon (":") character.
+
+ port = *DIGIT
+
+ A scheme may define a default port. For example, the "http" scheme
+ defines a default port of "80", corresponding to its reserved TCP
+ port number. The type of port designated by the port number (e.g.,
+ TCP, UDP, SCTP) is defined by the URI scheme. URI producers and
+ normalizers should omit the port component and its ":" delimiter if
+ port is empty or if its value would be the same as that of the
+ scheme's default.
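The port normalization described above might be sketched as follows. The table `DEFAULT_PORTS` contains only the "http" default named in the text; a real implementation would need an entry for every scheme it handles. The function name `host_with_port` is hypothetical.

```python
# Assumed lookup table; only the default mentioned in the text is listed.
DEFAULT_PORTS = {"http": 80}


def host_with_port(scheme: str, host: str, port) -> str:
    """Join host and port, omitting the port and its ":" delimiter when
    the port is absent, empty, or equal to the scheme's default."""
    if not port:
        return host
    if int(port) == DEFAULT_PORTS.get(scheme.lower()):
        return host
    return host + ":" + port
```

For example, `host_with_port("http", "example.com", "80")` returns `"example.com"`, while a port of "8080" is kept.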
+
+3.3. Path
+
+ The path component contains data, usually organized in hierarchical
+ form, that, along with data in the non-hierarchical query component
+ (Section 3.4), serves to identify a resource within the scope of the
+ URI's scheme and naming authority (if any). The path is terminated
+ by the first question mark ("?") or number sign ("#") character, or
+ by the end of the URI.
+
+ If a URI contains an authority component, then the path component
+ must either be empty or begin with a slash ("/") character. If a URI
+ does not contain an authority component, then the path cannot begin
+ with two slash characters ("//"). In addition, a URI reference
+ (Section 4.1) may be a relative-path reference, in which case the
+ first path segment cannot contain a colon (":") character. The ABNF
+ requires five separate rules to disambiguate these cases, only one of
+ which will match the path substring within a given URI reference. We
+ use the generic term "path component" to describe the URI substring
+ matched by the parser to one of these rules.
+
+ path = path-abempty ; begins with "/" or is empty
+ / path-absolute ; begins with "/" but not "//"
+ / path-noscheme ; begins with a non-colon segment
+ / path-rootless ; begins with a segment
+ / path-empty ; zero characters
+
+ path-abempty = *( "/" segment )
+ path-absolute = "/" [ segment-nz *( "/" segment ) ]
+ path-noscheme = segment-nz-nc *( "/" segment )
+ path-rootless = segment-nz *( "/" segment )
+ path-empty = 0<pchar>
+
+ segment = *pchar
+ segment-nz = 1*pchar
+ segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
+ ; non-zero-length segment without any colon ":"
+
+ pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
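The two authority-dependent restrictions on the path can be checked with a few lines of code. This is a sketch only: `check_path` is a hypothetical name, `authority` is None when the URI has no authority component, and the individual path characters are not validated.

```python
def check_path(authority, path: str) -> str:
    """Return path unchanged if it satisfies the authority-dependent
    rules; raise ValueError otherwise."""
    if authority is not None:
        # With an authority, only path-abempty is possible.
        if path and not path.startswith("/"):
            raise ValueError("path must be empty or begin with '/'")
    elif path.startswith("//"):
        # Without an authority, "//" would be mistaken for one.
        raise ValueError("path cannot begin with '//'")
    return path
```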
+
+ A path consists of a sequence of path segments separated by a slash
+ ("/") character. A path is always defined for a URI, though the
+ defined path may be empty (zero length). Use of the slash character
+ to indicate hierarchy is only required when a URI will be used as the
+ context for relative references. For example, the URI
+ <mailto:fred@example.com> has a path of "fred@example.com", whereas
+ the URI <foo://info.example.com?fred> has an empty path.
+
+ The path segments "." and "..", also known as dot-segments, are
+ defined for relative reference within the path name hierarchy. They
+ are intended for use at the beginning of a relative-path reference
+ (Section 4.2) to indicate relative position within the hierarchical
+ tree of names. This is similar to their role within some operating
+ systems' file directory structures to indicate the current directory
+ and parent directory, respectively. However, unlike in a file
+ system, these dot-segments are only interpreted within the URI path
+ hierarchy and are removed as part of the resolution process (Section
+ 5.2).
+
+ Aside from dot-segments in hierarchical paths, a path segment is
+ considered opaque by the generic syntax. URI producing applications
+ often use the reserved characters allowed in a segment to delimit
+ scheme-specific or dereference-handler-specific subcomponents. For
+ example, the semicolon (";") and equals ("=") reserved characters are
+ often used to delimit parameters and parameter values applicable to
+ that segment. The comma (",") reserved character is often used for
+ similar purposes. For example, one URI producer might use a segment
+ such as "name;v=1.1" to indicate a reference to version 1.1 of
+ "name", whereas another might use a segment such as "name,1.1" to
+ indicate the same. Parameter types may be defined by scheme-specific
+ semantics, but in most cases the syntax of a parameter is specific to
+ the implementation of the URI's dereferencing algorithm.
+
+3.4. Query
+
+ The query component contains non-hierarchical data that, along with
+ data in the path component (Section 3.3), serves to identify a
+ resource within the scope of the URI's scheme and naming authority
+ (if any). The query component is indicated by the first question
+ mark ("?") character and terminated by a number sign ("#") character
+ or by the end of the URI.
+
+ query = *( pchar / "/" / "?" )
+
+ The characters slash ("/") and question mark ("?") may represent data
+ within the query component. Beware that some older, erroneous
+ implementations may not handle such data correctly when it is used as
+ the base URI for relative references (Section 5.1), apparently
+ because they fail to distinguish query data from path data when
+ looking for hierarchical separators. However, as query components
+ are often used to carry identifying information in the form of
+ "key=value" pairs and one frequently used value is a reference to
+ another URI, it is sometimes better for usability to avoid percent-
+ encoding those characters.
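As a sketch of the usability point above, a producer that uses the common "key=value" convention could leave "/", "?", ":" and "@" unencoded inside a value while still encoding the characters it uses as delimiters. The convention, the helper name `encode_query_value`, and the sample URI are assumptions of this example, not requirements of the generic syntax.

```python
from urllib.parse import quote


def encode_query_value(value: str) -> str:
    """Percent-encode a query value, keeping "/", "?", ":" and "@"
    readable; "&", "=", "#" and other characters are encoded."""
    return quote(value, safe="/?:@")


# Hypothetical query carrying another URI as its value.
query = "target=" + encode_query_value("http://example.org/a?b=1&c")
```

The embedded URI stays legible in `query`, and only its "=" and "&" characters are percent-encoded.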
+
+3.5. Fragment
+
+ The fragment identifier component of a URI allows indirect
+ identification of a secondary resource by reference to a primary
+ resource and additional identifying information. The identified
+ secondary resource may be some portion or subset of the primary
+ resource, some view on representations of the primary resource, or
+ some other resource defined or described by those representations. A
+ fragment identifier component is indicated by the presence of a
+ number sign ("#") character and terminated by the end of the URI.
+
+ fragment = *( pchar / "/" / "?" )
+
+ The semantics of a fragment identifier are defined by the set of
+ representations that might result from a retrieval action on the
+ primary resource. The fragment's format and resolution is therefore
+ dependent on the media type [RFC2046] of a potentially retrieved
+ representation, even though such a retrieval is only performed if the
+ URI is dereferenced. If no such representation exists, then the
+ semantics of the fragment are considered unknown and are effectively
+ unconstrained. Fragment identifier semantics are independent of the
+ URI scheme and thus cannot be redefined by scheme specifications.
+
+ Individual media types may define their own restrictions on or
+ structures within the fragment identifier syntax for specifying
+ different types of subsets, views, or external references that are
+ identifiable as secondary resources by that media type. If the
+ primary resource has multiple representations, as is often the case
+ for resources whose representation is selected based on attributes of
+ the retrieval request (a.k.a., content negotiation), then whatever is
+ identified by the fragment should be consistent across all of those
+ representations. Each representation should either define the
+ fragment so that it corresponds to the same secondary resource,
+ regardless of how it is represented, or should leave the fragment
+ undefined (i.e., not found).
+
+ As with any URI, use of a fragment identifier component does not
+ imply that a retrieval action will take place. A URI with a fragment
+ identifier may be used to refer to the secondary resource without any
+ implication that the primary resource is accessible or will ever be
+ accessed.
+
+ Fragment identifiers have a special role in information retrieval
+ systems as the primary form of client-side indirect referencing,
+ allowing an author to specifically identify aspects of an existing
+ resource that are only indirectly provided by the resource owner. As
+ such, the fragment identifier is not used in the scheme-specific
+ processing of a URI; instead, the fragment identifier is separated
+ from the rest of the URI prior to a dereference, and thus the
+ identifying information within the fragment itself is dereferenced
+ solely by the user agent, regardless of the URI scheme. Although
+ this separate handling is often perceived to be a loss of
+ information, particularly for accurate redirection of references as
+ resources move over time, it also serves to prevent information
+ providers from denying reference authors the right to refer to
+ information within a resource selectively. Indirect referencing also
+ provides additional flexibility and extensibility to systems that use
+ URIs, as new media types are easier to define and deploy than new
+ schemes of identification.
+
+ The characters slash ("/") and question mark ("?") are allowed to
+ represent data within the fragment identifier. Beware that some
+ older, erroneous implementations may not handle this data correctly
+ when it is used as the base URI for relative references (Section
+ 5.1).
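The separation of the fragment before a dereference can be sketched with Python's standard `urllib.parse.urldefrag`. The function name `split_for_dereference` and the example URI are assumptions made for illustration.

```python
from urllib.parse import urldefrag


def split_for_dereference(uri):
    """Return (uri_to_dereference, fragment_kept_by_the_user_agent)."""
    request_uri, fragment = urldefrag(uri)
    return request_uri, fragment


# The fragment may itself contain "/" and "?" as data.
print(split_for_dereference("http://example.com/doc?x=1#sec/2?a"))
```

Only the first element would be handed to scheme-specific processing; the second is interpreted by the user agent against whatever representation comes back.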
+
+4. Usage
+
+ When applications make reference to a URI, they do not always use the
+ full form of reference defined by the "URI" syntax rule. To save
+ space and take advantage of hierarchical locality, many Internet
+ protocol elements and media type formats allow an abbreviation of a
+ URI, whereas others restrict the syntax to a particular form of URI.
+ We define the most common forms of reference syntax in this
+ specification because they impact and depend upon the design of the
+ generic syntax, requiring a uniform parsing algorithm in order to be
+ interpreted consistently.
+
+4.1. URI Reference
+
+ URI-reference is used to denote the most common usage of a resource
+ identifier.
+
+ URI-reference = URI / relative-ref
+
+ A URI-reference is either a URI or a relative reference. If the
+ URI-reference's prefix does not match the syntax of a scheme followed
+ by its colon separator, then the URI-reference is a relative
+ reference.
+
+ A URI-reference is typically parsed first into the five URI
+ components, in order to determine what components are present and
+ whether the reference is relative. Then, each component is parsed
+ for its subparts and their validation. The ABNF of URI-reference,
+ along with the "first-match-wins" disambiguation rule, is sufficient
+ to define a validating parser for the generic syntax. Readers
+ familiar with regular expressions should see Appendix B for an
+ example of a non-validating URI-reference parser that will take any
+ given string and extract the URI components.
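The non-validating approach can be sketched in Python with the regular expression given in Appendix B of this RFC. The wrapper name `split_reference` is hypothetical; `None` stands for an undefined component and an empty string for a component that is present but empty.

```python
import re

# Regular expression from RFC 3986, Appendix B.
URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_reference(ref):
    """Return (scheme, authority, path, query, fragment) for any string."""
    m = URI_REFERENCE.match(ref)  # every part is optional, so this always matches
    return m.group(2), m.group(4), m.group(5), m.group(7), m.group(9)


print(split_reference("http://www.ics.uci.edu/pub/ietf/uri/#Related"))
```

Because the expression accepts any string, it only splits components; it does not check them against the ABNF.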
+
+4.2. Relative Reference
+
+ A relative reference takes advantage of the hierarchical syntax
+ (Section 1.2.3) to express a URI reference relative to the name space
+ of another hierarchical URI.
+
+ relative-ref = relative-part [ "?" query ] [ "#" fragment ]
+
+ relative-part = "//" authority path-abempty
+               / path-absolute
+               / path-noscheme
+               / path-empty
+
+ The URI referred to by a relative reference, also known as the target
+ URI, is obtained by applying the reference resolution algorithm of
+ Section 5.
+
+ A relative reference that begins with two slash characters is termed
+ a network-path reference; such references are rarely used. A
+ relative reference that begins with a single slash character is
+ termed an absolute-path reference. A relative reference that does
+ not begin with a slash character is termed a relative-path reference.
+
+ A path segment that contains a colon character (e.g., "this:that")
+ cannot be used as the first segment of a relative-path reference, as
+ it would be mistaken for a scheme name. Such a segment must be
+ preceded by a dot-segment (e.g., "./this:that") to make a relative-
+ path reference.
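The terminology above, and the colon pitfall, can be sketched as a small Python classifier. The function name `classify_reference` is hypothetical; the scheme pattern follows the scheme rule of the generic syntax (a letter followed by letters, digits, "+", "-", or ".").

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ), followed by ":"
_SCHEME_PREFIX = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:")


def classify_reference(ref):
    """Name the kind of URI reference according to its prefix."""
    if _SCHEME_PREFIX.match(ref):
        return "URI"
    if ref.startswith("//"):
        return "network-path reference"
    if ref.startswith("/"):
        return "absolute-path reference"
    return "relative-path reference"


print(classify_reference("this:that"))    # mistaken for a scheme name
print(classify_reference("./this:that"))  # the dot-segment avoids that
```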
+
+4.3. Absolute URI
+
+ Some protocol elements allow only the absolute form of a URI without
+ a fragment identifier. For example, defining a base URI for later
+ use by relative references calls for an absolute-URI syntax rule that
+ does not allow a fragment.
+
+ absolute-URI = scheme ":" hier-part [ "?" query ]
+
+ URI scheme specifications must define their own syntax so that all
+ strings matching their scheme-specific syntax will also match the
+ <absolute-URI> grammar. Scheme specifications will not define
+ fragment identifier syntax or usage, regardless of its applicability
+ to resources identifiable via that scheme, as fragment identification
+ is orthogonal to scheme definition. However, scheme specifications
+ are encouraged to include a wide range of examples, including
+ examples that show use of the scheme's URIs with fragment identifiers
+ when such usage is appropriate.
+
+4.4. Same-Document Reference
+
+ When a URI reference refers to a URI that is, aside from its fragment
+ component (if any), identical to the base URI (Section 5.1), that
+ reference is called a "same-document" reference. The most frequent
+ examples of same-document references are relative references that are
+ empty or include only the number sign ("#") separator followed by a
+ fragment identifier.
+
+ When a same-document reference is dereferenced for a retrieval
+ action, the target of that reference is defined to be within the same
+ entity (representation, document, or message) as the reference;
+ therefore, a dereference should not result in a new retrieval action.
+
+ Normalization of the base and target URIs prior to their comparison,
+ as described in Sections 6.2.2 and 6.2.3, is allowed but rarely
+ performed in practice. Normalization may increase the set of same-
+ document references, which may be of benefit to some caching
+ applications. As such, reference authors should not assume that a
+ slightly different, though equivalent, reference URI will (or will
+ not) be interpreted as a same-document reference by any given
+ application.
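A rough Python sketch of the same-document test, assuming no normalization beyond what `urllib.parse.urljoin` itself performs. The function name `is_same_document` is hypothetical.

```python
from urllib.parse import urldefrag, urljoin


def is_same_document(reference, base):
    """True if the target equals the base once fragments are set aside."""
    target = urljoin(base, reference)
    return urldefrag(target).url == urldefrag(base).url


base = "http://a/b/c/d;p?q"
print(is_same_document("#s", base))
print(is_same_document("g", base))
```

An application that normalizes both URIs first may report more references as same-document than this sketch does, which is the variability the paragraph above warns about.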
+
+4.5. Suffix Reference
+
+ The URI syntax is designed for unambiguous reference to resources and
+ extensibility via the URI scheme. However, as URI identification and
+ usage have become commonplace, traditional media (television, radio,
+ newspapers, billboards, etc.) have increasingly used a suffix of the
+ URI as a reference, consisting of only the authority and path
+ portions of the URI, such as
+
+ www.w3.org/Addressing/
+
+ or simply a DNS registered name on its own. Such references are
+ primarily intended for human interpretation rather than for machines,
+ with the assumption that context-based heuristics are sufficient to
+ complete the URI (e.g., most registered names beginning with "www"
+ are likely to have a URI prefix of "http://"). Although there is no
+ standard set of heuristics for disambiguating a URI suffix, many
+ client implementations allow them to be entered by the user and
+ heuristically resolved.
+
+ Although this practice of using suffix references is common, it
+ should be avoided whenever possible and should never be used in
+ situations where long-term references are expected. The heuristics
+ noted above will change over time, particularly when a new URI scheme
+ becomes popular, and are often incorrect when used out of context.
+ Furthermore, they can lead to security issues along the lines of
+ those described in [RFC1535].
+
+ As a URI suffix has the same syntax as a relative-path reference, a
+ suffix reference cannot be used in contexts where a relative
+ reference is expected. As a result, suffix references are limited to
+ places where there is no defined base URI, such as dialog boxes and
+ off-line advertisements.
+
+5. Reference Resolution
+
+ This section defines the process of resolving a URI reference within
+ a context that allows relative references so that the result is a
+ string matching the <URI> syntax rule of Section 3.
+
+5.1. Establishing a Base URI
+
+ The term "relative" implies that a "base URI" exists against which
+ the relative reference is applied. Aside from fragment-only
+ references (Section 4.4), relative references are only usable when a
+ base URI is known. A base URI must be established by the parser
+ prior to parsing URI references that might be relative. A base URI
+ must conform to the <absolute-URI> syntax rule (Section 4.3). If the
+ base URI is obtained from a URI reference, then that reference must
+ be converted to absolute form and stripped of any fragment component
+ prior to its use as a base URI.
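The last requirement can be sketched as follows, using Python's standard `urllib.parse`. The function name `as_base_uri` is hypothetical, and the check is deliberately minimal: it tests only for a scheme and does not validate against the full <absolute-URI> rule.

```python
from urllib.parse import urldefrag, urlsplit


def as_base_uri(uri):
    """Strip any fragment and insist on a scheme before use as a base URI."""
    without_fragment = urldefrag(uri).url
    if not urlsplit(without_fragment).scheme:
        raise ValueError("a base URI must be absolute (it needs a scheme)")
    return without_fragment


print(as_base_uri("http://a/b/c/d;p?q#frag"))
```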
+
+ The base URI of a reference can be established in one of four ways,
+ discussed below in order of precedence. The order of precedence can
+ be thought of in terms of layers, where the innermost defined base
+ URI has the highest precedence. This can be visualized graphically
+ as follows:
+
+ .----------------------------------------------------------.
+ |  .----------------------------------------------------.  |
+ |  |  .----------------------------------------------.  |  |
+ |  |  |  .----------------------------------------.  |  |  |
+ |  |  |  |  .----------------------------------.  |  |  |  |
+ |  |  |  |  |       <relative-reference>       |  |  |  |  |
+ |  |  |  |  `----------------------------------'  |  |  |  |
+ |  |  |  | (5.1.1) Base URI embedded in content   |  |  |  |
+ |  |  |  `----------------------------------------'  |  |  |
+ |  |  | (5.1.2) Base URI of the encapsulating entity |  |  |
+ |  |  |         (message, representation, or none)   |  |  |
+ |  |  `----------------------------------------------'  |  |
+ |  | (5.1.3) URI used to retrieve the entity            |  |
+ |  `----------------------------------------------------'  |
+ | (5.1.4) Default Base URI (application-dependent)         |
+ `----------------------------------------------------------'
+
+5.1.1. Base URI Embedded in Content
+
+ Within certain media types, a base URI for relative references can be
+ embedded within the content itself so that it can be readily obtained
+ by a parser. This can be useful for descriptive documents, such as
+ tables of contents, which may be transmitted to others through
+ protocols other than their usual retrieval context (e.g., email or
+ USENET news).
+
+ It is beyond the scope of this specification to specify how, for each
+ media type, a base URI can be embedded. The appropriate syntax, when
+ available, is described by the data format specification associated
+ with each media type.
+
+5.1.2. Base URI from the Encapsulating Entity
+
+ If no base URI is embedded, the base URI is defined by the
+ representation's retrieval context. For a document that is enclosed
+ within another entity, such as a message or archive, the retrieval
+ context is that entity. Thus, the default base URI of a
+ representation is the base URI of the entity in which the
+ representation is encapsulated.
+
+ A mechanism for embedding a base URI within MIME container types
+ (e.g., the message and multipart types) is defined by MHTML
+ [RFC2557]. Protocols that do not use the MIME message header syntax,
+ but that do allow some form of tagged metadata to be included within
+ messages, may define their own syntax for defining a base URI as part
+ of a message.
+
+5.1.3. Base URI from the Retrieval URI
+
+ If no base URI is embedded and the representation is not encapsulated
+ within some other entity, then, if a URI was used to retrieve the
+ representation, that URI shall be considered the base URI. Note that
+ if the retrieval was the result of a redirected request, the last URI
+ used (i.e., the URI that resulted in the actual retrieval of the
+ representation) is the base URI.
+
+5.1.4. Default Base URI
+
+ If none of the conditions described above apply, then the base URI is
+ defined by the context of the application. As this definition is
+ necessarily application-dependent, failing to define a base URI by
+ using one of the other methods may result in the same content being
+ interpreted differently by different types of applications.
+
+ A sender of a representation containing relative references is
+ responsible for ensuring that a base URI for those references can be
+ established. Aside from fragment-only references, relative
+ references can only be used reliably in situations where the base URI
+ is well defined.
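The order of precedence from Sections 5.1.1 through 5.1.4 can be sketched as a simple fallback chain. The function and parameter names are hypothetical; how each candidate is discovered depends on the media type, the protocol, and the application.

```python
def choose_base_uri(embedded=None, encapsulating=None, retrieval=None, default=None):
    """Pick the innermost base URI that is defined (5.1.1 through 5.1.4)."""
    for candidate in (embedded, encapsulating, retrieval, default):
        if candidate is not None:
            return candidate
    raise ValueError("no base URI can be established")


# No embedded base and no encapsulating entity: the retrieval URI wins.
print(choose_base_uri(retrieval="http://a/b/c/d;p?q", default="file:///tmp/"))
```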
+
+5.2. Relative Resolution
+
+ This section describes an algorithm for converting a URI reference
+ that might be relative to a given base URI into the parsed components
+ of the reference's target. The components can then be recomposed, as
+ described in Section 5.3, to form the target URI. This algorithm
+ provides definitive results that can be used to test the output of
+ other implementations. Applications may implement relative reference
+ resolution by using some other algorithm, provided that the results
+ match what would be given by this one.
+
+5.2.1. Pre-parse the Base URI
+
+ The base URI (Base) is established according to the procedure of
+ Section 5.1 and parsed into the five main components described in
+ Section 3. Note that only the scheme component is required to be
+ present in a base URI; the other components may be empty or
+ undefined. A component is undefined if its associated delimiter does
+ not appear in the URI reference; the path component is never
+ undefined, though it may be empty.
+
+ Normalization of the base URI, as described in Sections 6.2.2 and
+ 6.2.3, is optional. A URI reference must be transformed to its
+ target URI before it can be normalized.
+
+5.2.2. Transform References
+
+ For each URI reference (R), the following pseudocode describes an
+ algorithm for transforming R into its target URI (T):
+
+ -- The URI reference is parsed into the five URI components
+ --
+ (R.scheme, R.authority, R.path, R.query, R.fragment) = parse(R);
+
+ -- A non-strict parser may ignore a scheme in the reference
+ -- if it is identical to the base URI's scheme.
+ --
+ if ((not strict) and (R.scheme == Base.scheme)) then
+    undefine(R.scheme);
+ endif;
+
+ if defined(R.scheme) then
+    T.scheme    = R.scheme;
+    T.authority = R.authority;
+    T.path      = remove_dot_segments(R.path);
+    T.query     = R.query;
+ else
+    if defined(R.authority) then
+       T.authority = R.authority;
+       T.path      = remove_dot_segments(R.path);
+       T.query     = R.query;
+    else
+       if (R.path == "") then
+          T.path = Base.path;
+          if defined(R.query) then
+             T.query = R.query;
+          else
+             T.query = Base.query;
+          endif;
+       else
+          if (R.path starts-with "/") then
+             T.path = remove_dot_segments(R.path);
+          else
+             T.path = merge(Base.path, R.path);
+             T.path = remove_dot_segments(T.path);
+          endif;
+          T.query = R.query;
+       endif;
+       T.authority = Base.authority;
+    endif;
+    T.scheme = Base.scheme;
+ endif;
+
+ T.fragment = R.fragment;
+
+5.2.3. Merge Paths
+
+ The pseudocode above refers to a "merge" routine for merging a
+ relative-path reference with the path of the base URI. This is
+ accomplished as follows:
+
+ o If the base URI has a defined authority component and an empty
+ path, then return a string consisting of "/" concatenated with the
+ reference's path; otherwise,
+
+ o return a string consisting of the reference's path component
+ appended to all but the last segment of the base URI's path (i.e.,
+ excluding any characters after the right-most "/" in the base URI
+ path, or excluding the entire base URI path if it does not contain
+ any "/" characters).
+
+5.2.4. Remove Dot Segments
+
+ The pseudocode also refers to a "remove_dot_segments" routine for
+ interpreting and removing the special "." and ".." complete path
+ segments from a referenced path. This is done after the path is
+ extracted from a reference, whether or not the path was relative, in
+ order to remove any invalid or extraneous dot-segments prior to
+ forming the target URI. Although there are many ways to accomplish
+ this removal process, we describe a simple method using two string
+ buffers.
+
+ 1. The input buffer is initialized with the now-appended path
+ components and the output buffer is initialized to the empty
+ string.
+
+ 2. While the input buffer is not empty, loop as follows:
+
+ A. If the input buffer begins with a prefix of "../" or "./",
+ then remove that prefix from the input buffer; otherwise,
+
+ B. if the input buffer begins with a prefix of "/./" or "/.",
+ where "." is a complete path segment, then replace that
+ prefix with "/" in the input buffer; otherwise,
+
+ C. if the input buffer begins with a prefix of "/../" or "/..",
+ where ".." is a complete path segment, then replace that
+ prefix with "/" in the input buffer and remove the last
+ segment and its preceding "/" (if any) from the output
+ buffer; otherwise,
+
+ D. if the input buffer consists only of "." or "..", then remove
+ that from the input buffer; otherwise,
+
+ E. move the first path segment in the input buffer to the end of
+ the output buffer, including the initial "/" character (if
+ any) and any subsequent characters up to, but not including,
+ the next "/" character or the end of the input buffer.
+
+ 3. Finally, the output buffer is returned as the result of
+ remove_dot_segments.
+
+ Note that dot-segments are intended for use in URI references to
+ express an identifier relative to the hierarchy of names in the base
+ URI. The remove_dot_segments algorithm respects that hierarchy by
+ removing extra dot-segments rather than treat them as an error or
+ leaving them to be misinterpreted by dereference implementations.
+
+ The following illustrates how the above steps are applied for two
+ examples of merged paths, showing the state of the two buffers after
+ each step.
+
+ STEP   OUTPUT BUFFER         INPUT BUFFER
+
+  1 :                         /a/b/c/./../../g
+  2E:   /a                    /b/c/./../../g
+  2E:   /a/b                  /c/./../../g
+  2E:   /a/b/c                /./../../g
+  2B:   /a/b/c                /../../g
+  2C:   /a/b                  /../g
+  2C:   /a                    /g
+  2E:   /a/g
+
+ STEP   OUTPUT BUFFER         INPUT BUFFER
+
+  1 :                         mid/content=5/../6
+  2E:   mid                   /content=5/../6
+  2E:   mid/content=5         /../6
+  2C:   mid                   /6
+  2E:   mid/6
+
+ Some applications may find it more efficient to implement the
+ remove_dot_segments algorithm by using two segment stacks rather than
+ strings.
+
+ Note: Beware that some older, erroneous implementations will fail
+ to separate a reference's query component from its path component
+ prior to merging the base and reference paths, resulting in an
+ interoperability failure if the query component contains the
+ strings "/../" or "/./".
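Sections 5.2.2 through 5.2.4 can be sketched together in Python. This is one possible rendering, not a reference implementation: the function names mirror the pseudocode, `parse` uses the Appendix B regular expression, `None` stands for an undefined component, and the output buffer is kept as a list of segments instead of a string.

```python
import re

_SPLIT = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def parse(ref):
    """Split into (scheme, authority, path, query, fragment); None = undefined."""
    m = _SPLIT.match(ref)
    return m.group(2), m.group(4), m.group(5), m.group(7), m.group(9)


def merge(base_authority, base_path, ref_path):
    """Section 5.2.3: join ref_path onto all but the last base segment."""
    if base_authority is not None and base_path == "":
        return "/" + ref_path
    return base_path[: base_path.rfind("/") + 1] + ref_path


def remove_dot_segments(path):
    """Section 5.2.4, steps 2A to 2E; each output item keeps its leading "/"."""
    inp, out = path, []
    while inp:
        if inp.startswith("../"):                       # 2A
            inp = inp[3:]
        elif inp.startswith("./"):                      # 2A
            inp = inp[2:]
        elif inp.startswith("/./"):                     # 2B
            inp = inp[2:]
        elif inp == "/.":                               # 2B
            inp = "/"
        elif inp.startswith("/../") or inp == "/..":    # 2C
            inp = "/" + inp[4:]
            if out:
                out.pop()
        elif inp in (".", ".."):                        # 2D
            inp = ""
        else:                                           # 2E
            cut = inp.find("/", 1)
            cut = len(inp) if cut == -1 else cut
            out.append(inp[:cut])
            inp = inp[cut:]
    return "".join(out)


def resolve(base, ref, strict=True):
    """Section 5.2.2: return the target components T as a 5-tuple."""
    b_scheme, b_auth, b_path, b_query, _ = parse(base)
    scheme, auth, path, query, fragment = parse(ref)
    if not strict and scheme == b_scheme:
        scheme = None
    if scheme is not None:
        return scheme, auth, remove_dot_segments(path), query, fragment
    if auth is not None:
        return b_scheme, auth, remove_dot_segments(path), query, fragment
    if path == "":
        t_path = b_path
        t_query = query if query is not None else b_query
    else:
        if path.startswith("/"):
            t_path = remove_dot_segments(path)
        else:
            t_path = remove_dot_segments(merge(b_auth, b_path, path))
        t_query = query
    return b_scheme, b_auth, t_path, t_query, fragment


print(remove_dot_segments("/a/b/c/./../../g"))
print(resolve("http://a/b/c/d;p?q", "../g"))
```

The query and fragment are split off by `parse` before any path merging, which avoids the interoperability failure described in the note above.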
+
+5.3. Component Recomposition
+
+ Parsed URI components can be recomposed to obtain the corresponding
+ URI reference string. Using pseudocode, this would be:
+
+ result = ""
+
+ if defined(scheme) then
+    append scheme to result;
+    append ":" to result;
+ endif;
+
+ if defined(authority) then
+    append "//" to result;
+    append authority to result;
+ endif;
+
+ append path to result;
+
+ if defined(query) then
+    append "?" to result;
+    append query to result;
+ endif;
+
+ if defined(fragment) then
+    append "#" to result;
+    append fragment to result;
+ endif;
+
+ return result;
+
+ Note that we are careful to preserve the distinction between a
+ component that is undefined, meaning that its separator was not
+ present in the reference, and a component that is empty, meaning that
+ the separator was present and was immediately followed by the next
+ component separator or the end of the reference.
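The recomposition pseudocode translates almost line for line into Python. In this sketch, `None` models an undefined component and the empty string models an empty one; that mapping is a choice made for illustration.

```python
def recompose(scheme, authority, path, query, fragment):
    """Section 5.3: rebuild a URI reference from its five components."""
    result = ""
    if scheme is not None:
        result += scheme + ":"
    if authority is not None:
        result += "//" + authority
    result += path
    if query is not None:
        result += "?" + query
    if fragment is not None:
        result += "#" + fragment
    return result


print(recompose("http", "a", "/", "", None))    # empty query keeps its "?"
print(recompose("http", "a", "/", None, None))  # undefined query adds nothing
```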
+
+5.4. Reference Resolution Examples
+
+ Within a representation with a well defined base URI of
+
+ http://a/b/c/d;p?q
+
+ a relative reference is transformed to its target URI as follows.
+
+5.4.1. Normal Examples
+
+ "g:h" = "g:h"
+ "g" = "http://a/b/c/g"
+ "./g" = "http://a/b/c/g"
+ "g/" = "http://a/b/c/g/"
+ "/g" = "http://a/g"
+ "//g" = "http://g"
+ "?y" = "http://a/b/c/d;p?y"
+ "g?y" = "http://a/b/c/g?y"
+ "#s" = "http://a/b/c/d;p?q#s"
+ "g#s" = "http://a/b/c/g#s"
+ "g?y#s" = "http://a/b/c/g?y#s"
+ ";x" = "http://a/b/c/;x"
+ "g;x" = "http://a/b/c/g;x"
+ "g;x?y#s" = "http://a/b/c/g;x?y#s"
+ "" = "http://a/b/c/d;p?q"
+ "." = "http://a/b/c/"
+ "./" = "http://a/b/c/"
+ ".." = "http://a/b/"
+ "../" = "http://a/b/"
+ "../g" = "http://a/b/g"
+ "../.." = "http://a/"
+ "../../" = "http://a/"
+ "../../g" = "http://a/g"
+
+5.4.2. Abnormal Examples
+
+ Although the following abnormal examples are unlikely to occur in
+ normal practice, all URI parsers should be capable of resolving them
+ consistently. Each example uses the same base as that above.
+
+ Parsers must be careful in handling cases where there are more ".."
+ segments in a relative-path reference than there are hierarchical
+ levels in the base URI's path. Note that the ".." syntax cannot be
+ used to change the authority component of a URI.
+
+ "../../../g" = "http://a/g"
+ "../../../../g" = "http://a/g"
+
+ Similarly, parsers must remove the dot-segments "." and ".." when
+ they are complete components of a path, but not when they are only
+ part of a segment.
+
+ "/./g" = "http://a/g"
+ "/../g" = "http://a/g"
+ "g." = "http://a/b/c/g."
+ ".g" = "http://a/b/c/.g"
+ "g.." = "http://a/b/c/g.."
+ "..g" = "http://a/b/c/..g"
+
+ Less likely are cases where the relative reference uses unnecessary
+ or nonsensical forms of the "." and ".." complete path segments.
+
+ "./../g" = "http://a/b/g"
+ "./g/." = "http://a/b/c/g/"
+ "g/./h" = "http://a/b/c/g/h"
+ "g/../h" = "http://a/b/c/h"
+ "g;x=1/./y" = "http://a/b/c/g;x=1/y"
+ "g;x=1/../y" = "http://a/b/c/y"
+
+ Some applications fail to separate the reference's query and/or
+ fragment components from the path component before merging it with
+ the base path and removing dot-segments. This error is rarely
+ noticed, as typical usage of a fragment never includes the hierarchy
+ ("/") character and the query component is not normally used within
+ relative references.
+
+ "g?y/./x" = "http://a/b/c/g?y/./x"
+ "g?y/../x" = "http://a/b/c/g?y/../x"
+ "g#s/./x" = "http://a/b/c/g#s/./x"
+ "g#s/../x" = "http://a/b/c/g#s/../x"
+
+ Some parsers allow the scheme name to be present in a relative
+ reference if it is the same as the base URI scheme. This is
+ considered to be a loophole in prior specifications of partial URI
+ [RFC1630]. Its use should be avoided but is allowed for backward
+ compatibility.
+
+ "http:g" = "http:g" ; for strict parsers
+ / "http://a/b/c/g" ; for backward compatibility
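The examples in this section double as a regression table for any resolver. The sketch below checks a handful of them against `urllib.parse.urljoin` from Python's standard library, which is documented as following RFC 3986 semantics since Python 3.5. The names `EXAMPLES` and `mismatches` are made up for this sketch.

```python
from urllib.parse import urljoin

BASE = "http://a/b/c/d;p?q"

# A subset of the normal and abnormal examples: reference -> expected target.
EXAMPLES = {
    "g": "http://a/b/c/g",
    "/g": "http://a/g",
    "//g": "http://g",
    "?y": "http://a/b/c/d;p?y",
    "#s": "http://a/b/c/d;p?q#s",
    "../g": "http://a/b/g",
    "../../../g": "http://a/g",
    "g;x=1/../y": "http://a/b/c/y",
}


def mismatches(base=BASE, examples=EXAMPLES):
    """Return {reference: actual} for every example the resolver gets wrong."""
    return {
        ref: urljoin(base, ref)
        for ref, expected in examples.items()
        if urljoin(base, ref) != expected
    }


print(mismatches())
print(urljoin(BASE, "http:g"))
```

For the "http:g" case, `urljoin` takes the backward-compatible reading, not the strict one.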
+
+6. Normalization and Comparison
+
+ One of the most common operations on URIs is simple comparison:
+ determining whether two URIs are equivalent without using the URIs to
+ access their respective resource(s). A comparison is performed every
+ time a response cache is accessed, a browser checks its history to
+ color a link, or an XML parser processes tags within a namespace.
+ Extensive normalization prior to comparison of URIs is often used by
+ spiders and indexing engines to prune a search space or to reduce
+ duplication of request actions and response storage.
+
+ URI comparison is performed for some particular purpose. Protocols
+ or implementations that compare URIs for different purposes will
+ often be subject to differing design trade-offs in regards to how
+ much effort should be spent in reducing aliased identifiers. This
+ section describes various methods that may be used to compare URIs,
+ the trade-offs between them, and the types of applications that might
+ use them.
+
+6.1. Equivalence
+
+ Because URIs exist to identify resources, presumably they should be
+ considered equivalent when they identify the same resource. However,
+ this definition of equivalence is not of much practical use, as there
+ is no way for an implementation to compare two resources unless it
+ has full knowledge or control of them. For this reason,
+ determination of equivalence or difference of URIs is based on string
+ comparison, perhaps augmented by reference to additional rules
+ provided by URI scheme definitions. We use the terms "different" and
+ "equivalent" to describe the possible outcomes of such comparisons,
+ but there are many application-dependent versions of equivalence.
+
+ Even though it is possible to determine that two URIs are equivalent,
+ URI comparison is not sufficient to determine whether two URIs
+ identify different resources. For example, an owner of two different
+ domain names could decide to serve the same resource from both,
+ resulting in two different URIs. Therefore, comparison methods are
+ designed to minimize false negatives while strictly avoiding false
+ positives.
+
+ In testing for equivalence, applications should not directly compare
+ relative references; the references should be converted to their
+ respective target URIs before comparison. When URIs are compared to
+ select (or avoid) a network action, such as retrieval of a
+ representation, fragment components (if any) should be excluded from
+ the comparison.
+
+6.2. Comparison Ladder
+
+ A variety of methods are used in practice to test URI equivalence.
+ These methods fall into a range, distinguished by the amount of
+ processing required and the degree to which the probability of false
+ negatives is reduced. As noted above, false negatives cannot be
+ eliminated. In practice, their probability can be reduced, but this
+ reduction requires more processing and is not cost-effective for all
+ applications.
+
+ If this range of comparison practices is considered as a ladder, the
+ following discussion will climb the ladder, starting with practices
+ that are cheap but have a relatively higher chance of producing false
+ negatives, and proceeding to those that have higher computational
+ cost and lower risk of false negatives.
+
+6.2.1. Simple String Comparison
+
+ If two URIs, when considered as character strings, are identical,
+ then it is safe to conclude that they are equivalent. This type of
+ equivalence test has very low computational cost and is in wide use
+ in a variety of applications, particularly in the domain of parsing.
+
+ Testing strings for equivalence requires some basic precautions.
+ This procedure is often referred to as "bit-for-bit" or
+ "byte-for-byte" comparison, which is potentially misleading. Testing
+ strings for equality is normally based on pair comparison of the
+ characters that make up the strings, starting from the first and
+ proceeding until both strings are exhausted and all characters are
+ found to be equal, until a pair of characters compares unequal, or
+ until one of the strings is exhausted before the other.
+
+ This character comparison requires that each pair of characters be
+ put in comparable form. For example, should one URI be stored in a
+ byte array in EBCDIC encoding and the second in a Java String object
+ (UTF-16), bit-for-bit comparisons applied naively will produce
+ errors. It is better to speak of equality on a character-for-
+ character basis rather than on a byte-for-byte or bit-for-bit basis.
+ In practical terms, character-by-character comparisons should be done
+ codepoint-by-codepoint after conversion to a common character
+ encoding.
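The EBCDIC versus UTF-16 case can be sketched in Python, assuming the `cp037` codec as the EBCDIC variant and `utf-16-le` as the in-memory form. Both choices, and the helper name `same_characters`, are illustrative.

```python
def same_characters(a, a_encoding, b, b_encoding):
    """Compare two encoded URIs codepoint by codepoint after decoding."""
    return a.decode(a_encoding) == b.decode(b_encoding)


uri = "http://example.com/"
ebcdic_bytes = uri.encode("cp037")
utf16_bytes = uri.encode("utf-16-le")

print(ebcdic_bytes == utf16_bytes)  # naive byte-for-byte comparison
print(same_characters(ebcdic_bytes, "cp037", utf16_bytes, "utf-16-le"))
```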
+
+ False negatives are caused by the production and use of URI aliases.
+ Unnecessary aliases can be reduced, regardless of the comparison
+ method, by consistently providing URI references in an already-
+ normalized form (i.e., a form identical to what would be produced
+ after normalization is applied, as described below).
+
+ Protocols and data formats often limit some URI comparisons to simple
+ string comparison, based on the theory that people and
+ implementations will, in their own best interest, be consistent in
+ providing URI references, or at least consistent enough to negate any
+ efficiency that might be obtained from further normalization.
+
+6.2.2. Syntax-Based Normalization
+
+ Implementations may use logic based on the definitions provided by
+ this specification to reduce the probability of false negatives.
+ This processing is moderately higher in cost than character-for-
+ character string comparison. For example, an application using this
+ approach could reasonably consider the following two URIs equivalent:
+
+ example://a/b/c/%7Bfoo%7D
+ eXAMPLE://a/./b/../b/%63/%7bfoo%7d
+
+ Web user agents, such as browsers, typically apply this type of URI
+ normalization when determining whether a cached response is
+ available. Syntax-based normalization includes such techniques as
+ case normalization, percent-encoding normalization, and removal of
+ dot-segments.
+
+6.2.2.1. Case Normalization
+
+ For all URIs, the hexadecimal digits within a percent-encoding
+ triplet (e.g., "%3a" versus "%3A") are case-insensitive and therefore
+ should be normalized to use uppercase letters for the digits A-F.
+
+ When a URI uses components of the generic syntax, the component
+ syntax equivalence rules always apply; namely, that the scheme and
+ host are case-insensitive and therefore should be normalized to
+ lowercase. For example, the URI <HTTP://www.EXAMPLE.com/> is
+ equivalent to <http://www.example.com/>. The other generic syntax
+ components are assumed to be case-sensitive unless specifically
+ defined otherwise by the scheme (see Section 6.2.3).
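
As a non-normative illustration, the case rules above might be sketched in Python as follows. The function name `normalize_case` is an assumption of this sketch, as is the choice to split the reference with the regular expression from Appendix B; only the scheme and host are lowercased, and only the hexadecimal digits of percent-encoding triplets are uppercased.

```python
import re

# Regular expression from Appendix B, used here only to locate components.
_URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")
_PCT_RE = re.compile(r"%[0-9A-Fa-f]{2}")


def normalize_case(uri):
    """Lowercase scheme and host; uppercase hex digits in %XX triplets."""
    m = _URI_RE.match(uri)
    scheme, authority = m.group(2), m.group(4)
    out = []
    if scheme is not None:
        out.append(scheme.lower() + ":")
    if authority is not None:
        # userinfo is case-sensitive, so only the host[:port] part is lowered
        userinfo, at, hostport = authority.rpartition("@")
        out.append("//" + userinfo + at + hostport.lower())
    out.append(m.group(5))          # path, left as-is
    out.append(m.group(6) or "")    # "?query", if present
    out.append(m.group(8) or "")    # "#fragment", if present
    # Uppercase the triplets last so that lowering the host cannot undo it
    return _PCT_RE.sub(lambda t: t.group(0).upper(), "".join(out))


print(normalize_case("HTTP://www.EXAMPLE.com/"))
```

The path, query, and fragment are deliberately left untouched apart from the percent-encoding triplets, since the generic syntax treats them as case-sensitive.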
+
+6.2.2.2. Percent-Encoding Normalization
+
+ The percent-encoding mechanism (Section 2.1) is a frequent source of
+ variance among otherwise identical URIs. In addition to the case
+ normalization issue noted above, some URI producers percent-encode
+ octets that do not require percent-encoding, resulting in URIs that
+ are equivalent to their non-encoded counterparts. These URIs should
+ be normalized by decoding any percent-encoded octet that corresponds
+ to an unreserved character, as described in Section 2.3.
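
A minimal sketch of this step, assuming Python and a hypothetical helper name `decode_unreserved`: each percent-encoded octet is decoded only if it corresponds to an unreserved character (ALPHA, DIGIT, "-", ".", "_", "~"); every other triplet is kept, with its hex digits uppercased.

```python
import re
import string

# The unreserved set as defined by this specification
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def decode_unreserved(uri):
    """Decode %XX triplets that encode unreserved characters only."""
    def repl(match):
        ch = chr(int(match.group(1), 16))
        if ch in UNRESERVED:
            return ch
        # Reserved or non-ASCII octet: keep encoded, normalize hex case
        return match.group(0).upper()

    return re.sub(r"%([0-9A-Fa-f]{2})", repl, uri)


print(decode_unreserved("example://a/b/%63/%7bfoo%7d"))
```

Triplets such as "%2F" are intentionally left encoded, because decoding a reserved character could change how the URI is split into components.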
+
+6.2.2.3. Path Segment Normalization
+
+ The complete path segments "." and ".." are intended only for use
+ within relative references (Section 4.1) and are removed as part of
+ the reference resolution process (Section 5.2). However, some
+ deployed implementations incorrectly assume that reference resolution
+ is not necessary when the reference is already a URI and thus fail to
+ remove dot-segments when they occur in non-relative paths. URI
+ normalizers should remove dot-segments by applying the
+ remove_dot_segments algorithm to the path, as described in
+ Section 5.2.4.
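
For illustration only, the remove_dot_segments routine referenced above could be written in Python roughly as follows. This is a sketch that follows the input-buffer and output-buffer steps of Section 5.2.4; the list-based output buffer is an implementation choice of this sketch.

```python
def remove_dot_segments(path):
    """Remove "." and ".." segments from a path (Section 5.2.4 steps)."""
    output = []  # output buffer, one entry per emitted segment
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]            # replace "/./" with "/"
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]            # replace "/../" with "/"
            if output:
                output.pop()           # drop the last emitted segment
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with any leading "/") to the output
            start = 1 if path.startswith("/") else 0
            end = path.find("/", start)
            if end == -1:
                end = len(path)
            output.append(path[:end])
            path = path[end:]
    return "".join(output)


print(remove_dot_segments("/a/b/c/./../../g"))
```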
+
+6.2.3. Scheme-Based Normalization
+
+ The syntax and semantics of URIs vary from scheme to scheme, as
+ described by the defining specification for each scheme.
+ Implementations may use scheme-specific rules, at further processing
+ cost, to reduce the probability of false negatives. For example,
+ because the "http" scheme makes use of an authority component, has a
+ default port of "80", and defines an empty path to be equivalent to
+ "/", the following four URIs are equivalent:
+
+ http://example.com
+ http://example.com/
+ http://example.com:/
+ http://example.com:80/
+
+ In general, a URI that uses the generic syntax for authority with an
+ empty path should be normalized to a path of "/". Likewise, an
+ explicit ":port", for which the port is empty or the default for the
+ scheme, is equivalent to one where the port and its ":" delimiter are
+ elided and thus should be removed by scheme-based normalization. For
+ example, the second URI above is the normal form for the "http"
+ scheme.
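
The "http" rules above can be sketched in Python as shown below. The helper name `normalize_http` is hypothetical, and the sketch handles only the two rules just described (an empty or default port is removed, and an empty path becomes "/"); it does not lowercase the scheme or host and leaves non-"http" references untouched.

```python
import re

# Appendix B expression for splitting, plus a small host[:port] matcher
_URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")
_HOSTPORT_RE = re.compile(r"(\[[^\]]*\]|[^:]*)(?::([0-9]*))?")


def normalize_http(uri):
    """Apply the empty-path and default-port rules for the "http" scheme."""
    m = _URI_RE.match(uri)
    scheme, authority, path = m.group(2), m.group(4), m.group(5)
    if scheme is None or scheme.lower() != "http" or authority is None:
        return uri
    userinfo, at, hostport = authority.rpartition("@")
    hp = _HOSTPORT_RE.fullmatch(hostport)
    if hp and hp.group(2) in ("", "80"):
        hostport = hp.group(1)   # drop ":" or ":80"
    if path == "":
        path = "/"
    # Query and fragment delimiters are preserved even when empty
    return (scheme + "://" + userinfo + at + hostport + path
            + (m.group(6) or "") + (m.group(8) or ""))


for u in ("http://example.com", "http://example.com/",
          "http://example.com:/", "http://example.com:80/"):
    print(normalize_http(u))
```

The query and fragment are re-emitted exactly as matched, so a reference such as "http://example.com/?" keeps its "?" delimiter rather than being folded into the normal form.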
+
+ Another case where normalization varies by scheme is in the handling
+ of an empty authority component or empty host subcomponent. For many
+ scheme specifications, an empty authority or host is considered an
+ error; for others, it is considered equivalent to "localhost" or the
+ end-user's host. When a scheme defines a default for authority and a
+ URI reference to that default is desired, the reference should be
+ normalized to an empty authority for the sake of uniformity, brevity,
+ and internationalization. If, however, either the userinfo or port
+ subcomponents are non-empty, then the host should be given explicitly
+ even if it matches the default.
+
+ Normalization should not remove delimiters when their associated
+ component is empty unless licensed to do so by the scheme
+ specification. For example, the URI "http://example.com/?" cannot be
+ assumed to be equivalent to any of the examples above. Likewise, the
+ presence or absence of delimiters within a userinfo subcomponent is
+ usually significant to its interpretation. The fragment component is
+ not subject to any scheme-based normalization; thus, two URIs that
+ differ only by the suffix "#" are considered different regardless of
+ the scheme.
+
+ Some schemes define additional subcomponents that consist of case-
+ insensitive data, giving an implicit license to normalizers to
+ convert this data to a common case (e.g., all lowercase). For
+ example, URI schemes that define a subcomponent of path to contain an
+ Internet hostname, such as the "mailto" URI scheme, cause that
+ subcomponent to be case-insensitive and thus subject to case
+ normalization (e.g., "mailto:Joe@Example.COM" is equivalent to
+ "mailto:Joe@example.com", even though the generic syntax considers
+ the path component to be case-sensitive).
+
+ Other scheme-specific normalizations are possible.
+
+6.2.4. Protocol-Based Normalization
+
+ Substantial effort to reduce the incidence of false negatives is
+ often cost-effective for web spiders. Therefore, they implement even
+ more aggressive techniques in URI comparison. For example, if they
+ observe that a URI such as
+
+ http://example.com/data
+
+ redirects to a URI differing only in the trailing slash
+
+ http://example.com/data/
+
+ they will likely regard the two as equivalent in the future. This
+ kind of technique is only appropriate when equivalence is clearly
+ indicated by both the result of accessing the resources and the
+ common conventions of their scheme's dereference algorithm (in this
+ case, use of redirection by HTTP origin servers to avoid problems
+ with relative references).
+
+7. Security Considerations
+
+ A URI does not in itself pose a security threat. However, as URIs
+ are often used to provide a compact set of instructions for access to
+ network resources, care must be taken to properly interpret the data
+ within a URI, to prevent that data from causing unintended access,
+ and to avoid including data that should not be revealed in plain
+ text.
+
+7.1. Reliability and Consistency
+
+ There is no guarantee that once a URI has been used to retrieve
+ information, the same information will be retrievable by that URI in
+ the future. Nor is there any guarantee that the information
+ retrievable via that URI in the future will be observably similar to
+ that retrieved in the past. The URI syntax does not constrain how a
+ given scheme or authority apportions its namespace or maintains it
+ over time. Such guarantees can only be obtained from the person(s)
+ controlling that namespace and the resource in question. A specific
+ URI scheme may define additional semantics, such as name persistence,
+ if those semantics are required of all naming authorities for that
+ scheme.
+
+7.2. Malicious Construction
+
+ It is sometimes possible to construct a URI so that an attempt to
+ perform a seemingly harmless, idempotent operation, such as the
+ retrieval of a representation, will in fact cause a possibly damaging
+ remote operation. The unsafe URI is typically constructed by
+ specifying a port number other than that reserved for the network
+ protocol in question. The client unwittingly contacts a site running
+ a different protocol service, and data within the URI contains
+ instructions that, when interpreted according to this other protocol,
+ cause an unexpected operation. A frequent example of such abuse has
+ been the use of a protocol-based scheme with a port component of
+ "25", thereby fooling user agent software into sending an unintended
+ or impersonating message via an SMTP server.
+
+ Applications should prevent dereference of a URI that specifies a TCP
+ port number within the "well-known port" range (0 - 1023) unless the
+ protocol being used to dereference that URI is compatible with the
+ protocol expected on that well-known port. Although IANA maintains a
+ registry of well-known ports, applications should make such
+ restrictions user-configurable to avoid preventing the deployment of
+ new services.
+
+ When a URI contains percent-encoded octets that match the delimiters
+ for a given resolution or dereference protocol (for example, CR and
+ LF characters for the TELNET protocol), these percent-encodings must
+ not be decoded before transmission across that protocol. Transfer of
+ the percent-encoding, which might violate the protocol, is less
+ harmful than allowing decoded octets to be interpreted as additional
+ operations or parameters, perhaps triggering an unexpected and
+ possibly harmful remote operation.
+
+7.3. Back-End Transcoding
+
+ When a URI is dereferenced, the data within it is often parsed by
+ both the user agent and one or more servers. In HTTP, for example, a
+ typical user agent will parse a URI into its five major components,
+ access the authority's server, and send it the data within the
+ authority, path, and query components. A typical server will take
+ that information, parse the path into segments and the query into
+ key/value pairs, and then invoke implementation-specific handlers to
+ respond to the request. As a result, a common security concern for
+ server implementations that handle a URI, either as a whole or split
+ into separate components, is proper interpretation of the octet data
+ represented by the characters and percent-encodings within that URI.
+
+ Percent-encoded octets must be decoded at some point during the
+ dereference process. Applications must split the URI into its
+ components and subcomponents prior to decoding the octets, as
+ otherwise the decoded octets might be mistaken for delimiters.
+ Security checks of the data within a URI should be applied after
+ decoding the octets. Note, however, that the "%00" percent-encoding
+ (NUL) may require special handling and should be rejected if the
+ application is not expecting to receive raw data within a component.
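
The ordering requirement above (split first, decode afterwards) can be illustrated with a short Python sketch. The function names are hypothetical, and `urllib.parse.unquote` from the Python standard library is used for the decoding step.

```python
from urllib.parse import unquote


def split_then_decode(path):
    """Correct order: split on "/" first, then decode each segment."""
    return [unquote(segment) for segment in path.split("/")]


def decode_then_split(path):
    """Incorrect order: a decoded "%2F" is mistaken for a delimiter."""
    return unquote(path).split("/")


def contains_nul(segments):
    """Example security check applied after decoding (rejects "%00")."""
    return any("\x00" in segment for segment in segments)


print(split_then_decode("/a%2Fb/c"))   # three segments, one containing "/"
print(decode_then_split("/a%2Fb/c"))   # four segments: the structure changed
```

Here the second function silently turns one path segment into two, which is exactly the confusion between data and delimiters that the paragraph above warns about.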
+
+ Special care should be taken when the URI path interpretation process
+ involves the use of a back-end file system or related system
+ functions. File systems typically assign an operational meaning to
+ special characters, such as the "/", "\", ":", "[", and "]"
+ characters, and to special device names like ".", "..", "...", "aux",
+ "lpt", etc. In some cases, merely testing for the existence of such
+ a name will cause the operating system to pause or invoke unrelated
+ system calls, leading to significant security concerns regarding
+ denial of service and unintended data transfer. It would be
+ impossible for this specification to list all such significant
+ characters and device names. Implementers should research the
+ reserved names and characters for the types of storage device that
+ may be attached to their applications and restrict the use of data
+ obtained from URI components accordingly.
+
+7.4. Rare IP Address Formats
+
+ Although the URI syntax for IPv4address only allows the common
+ dotted-decimal form of IPv4 address literal, many implementations
+ that process URIs make use of platform-dependent system routines,
+ such as gethostbyname() and inet_aton(), to translate the string
+ literal to an actual IP address. Unfortunately, such system routines
+ often allow and process a much larger set of formats than those
+ described in Section 3.2.2.
+
+ For example, many implementations allow dotted forms of three
+ numbers, wherein the last part is interpreted as a 16-bit quantity
+ and placed in the right-most two bytes of the network address (e.g.,
+ a Class B network). Likewise, a dotted form of two numbers means
+ that the last part is interpreted as a 24-bit quantity and placed in
+ the right-most three bytes of the network address (Class A), and a
+ single number (without dots) is interpreted as a 32-bit quantity and
+ stored directly in the network address. Adding further to the
+ confusion, some implementations allow each dotted part to be
+ interpreted as decimal, octal, or hexadecimal, as specified in the C
+ language (i.e., a leading 0x or 0X implies hexadecimal; a leading 0
+ implies octal; otherwise, the number is interpreted as decimal).
+
+ These additional IP address formats are not allowed in the URI syntax
+ due to differences between platform implementations. However, they
+ can become a security concern if an application attempts to filter
+ access to resources based on the IP address in string literal format.
+ If this filtering is performed, literals should be converted to
+ numeric form and filtered based on the numeric value, and not on a
+ prefix or suffix of the string form.
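
To make the rare formats described above concrete, the following Python sketch converts such a literal to its 32-bit numeric value and filters on that value. The names `legacy_ipv4_to_int` and `in_network` are hypothetical, and the parsing rules are a simplified reading of the behavior described in this section, not a reimplementation of any particular platform's inet_aton().

```python
import ipaddress
import re

_PART_RE = re.compile(r"0[xX][0-9a-fA-F]+|[0-9]+")


def _parse_part(part):
    """Parse one dotted part using C-style decimal, octal, or hex rules."""
    if not _PART_RE.fullmatch(part):
        raise ValueError("invalid address part: %r" % part)
    if part[:2] in ("0x", "0X"):
        return int(part[2:], 16)
    if len(part) > 1 and part[0] == "0":
        return int(part[1:], 8)       # raises ValueError for digits 8 or 9
    return int(part, 10)


def legacy_ipv4_to_int(text):
    """Convert a 1- to 4-part address literal to a 32-bit integer."""
    parts = text.split(".")
    if not 1 <= len(parts) <= 4:
        raise ValueError("too many parts")
    nums = [_parse_part(p) for p in parts]
    value = 0
    for n in nums[:-1]:               # leading parts are one byte each
        if n > 0xFF:
            raise ValueError("part out of range")
        value = (value << 8) | n
    remaining_bits = 8 * (4 - len(nums) + 1)
    if nums[-1] >= 1 << remaining_bits:
        raise ValueError("last part out of range")
    return (value << remaining_bits) | nums[-1]


def in_network(text, cidr):
    """Filter on the numeric value rather than on the string form."""
    address = ipaddress.ip_address(legacy_ipv4_to_int(text))
    return address in ipaddress.ip_network(cidr)


for literal in ("10.0.0.1", "10.1", "012.0.0.1", "0x0A000001", "167772161"):
    print(literal, in_network(literal, "10.0.0.0/8"))
```

All five literals in the loop denote the same address under these rules, yet only the first would be caught by a filter that looks for the string prefix "10.".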
+
+7.5. Sensitive Information
+
+ URI producers should not provide a URI that contains a username or
+ password that is intended to be secret. URIs are frequently
+ displayed by browsers, stored in clear text bookmarks, and logged by
+ user agent history and intermediary applications (proxies). A
+ password appearing within the userinfo component is deprecated and
+ should be considered an error (or simply ignored) except in those
+ rare cases where the 'password' parameter is intended to be public.
+
+7.6. Semantic Attacks
+
+ Because the userinfo subcomponent is rarely used and appears before
+ the host in the authority component, it can be used to construct a
+ URI intended to mislead a human user by appearing to identify one
+ (trusted) naming authority while actually identifying a different
+ authority hidden behind the noise. For example
+
+ ftp://cnn.example.com&story=breaking_news@10.0.0.1/top_story.htm
+
+ might lead a human user to assume that the host is 'cnn.example.com',
+ whereas it is actually '10.0.0.1'. Note that a misleading userinfo
+ subcomponent could be much longer than the example above.
+
+ A misleading URI, such as that above, is an attack on the user's
+ preconceived notions about the meaning of a URI rather than an attack
+ on the software itself. User agents may be able to reduce the impact
+ of such attacks by distinguishing the various components of the URI
+ when they are rendered, such as by using a different color or tone to
+ render userinfo if any is present, though there is no panacea. More
+ information on URI-based semantic attacks can be found in [Siedzik].
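
As a hedged illustration of the point about distinguishing components, a user agent could extract the userinfo and host separately before rendering. The sketch below uses `urllib.parse.urlsplit` from the Python standard library, which is not a strict implementation of this specification but separates the authority subcomponents in the same way for this example; the function name `describe_authority` is hypothetical.

```python
from urllib.parse import urlsplit


def describe_authority(uri):
    """Return (userinfo, host) so each can be rendered distinctly."""
    parts = urlsplit(uri)
    return parts.username, parts.hostname


userinfo, host = describe_authority(
    "ftp://cnn.example.com&story=breaking_news@10.0.0.1/top_story.htm"
)
print("userinfo:", userinfo)
print("host:    ", host)
```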
+
+8. IANA Considerations
+
+ URI scheme names, as defined by <scheme> in Section 3.1, form a
+ registered namespace that is managed by IANA according to the
+ procedures defined in [BCP35]. No IANA actions are required by this
+ document.
+
+9. Acknowledgements
+
+ This specification is derived from RFC 2396 [RFC2396], RFC 1808
+ [RFC1808], and RFC 1738 [RFC1738]; the acknowledgements in those
+ documents still apply. It also incorporates the update (with
+ corrections) for IPv6 literals in the host syntax, as defined by
+ Robert M. Hinden, Brian E. Carpenter, and Larry Masinter in
+ [RFC2732]. In addition, contributions by Gisle Aas, Reese Anschultz,
+ Daniel Barclay, Tim Bray, Mike Brown, Rob Cameron, Jeremy Carroll,
+ Dan Connolly, Adam M. Costello, John Cowan, Jason Diamond, Martin
+ Duerst, Stefan Eissing, Clive D.W. Feather, Al Gilman, Tony Hammond,
+ Elliotte Harold, Pat Hayes, Henry Holtzman, Ian B. Jacobs, Michael
+ Kay, John C. Klensin, Graham Klyne, Dan Kohn, Bruce Lilly, Andrew
+ Main, Dave McAlpin, Ira McDonald, Michael Mealling, Ray Merkert,
+ Stephen Pollei, Julian Reschke, Tomas Rokicki, Miles Sabin, Kai
+ Schaetzl, Mark Thomson, Ronald Tschalaer, Norm Walsh, Marc Warne,
+ Stuart Williams, and Henry Zongaro are gratefully acknowledged.
+
+10. References
+
+10.1. Normative References
+
+ [ASCII] American National Standards Institute, "Coded Character
+ Set -- 7-bit American Standard Code for Information
+ Interchange", ANSI X3.4, 1986.
+
+ [RFC2234] Crocker, D. and P. Overell, "Augmented BNF for Syntax
+ Specifications: ABNF", RFC 2234, November 1997.
+
+ [STD63] Yergeau, F., "UTF-8, a transformation format of
+ ISO 10646", STD 63, RFC 3629, November 2003.
+
+ [UCS] International Organization for Standardization,
+ "Information Technology - Universal Multiple-Octet Coded
+ Character Set (UCS)", ISO/IEC 10646:2003, December 2003.
+
+10.2. Informative References
+
+ [BCP19] Freed, N. and J. Postel, "IANA Charset Registration
+ Procedures", BCP 19, RFC 2978, October 2000.
+
+ [BCP35] Petke, R. and I. King, "Registration Procedures for URL
+ Scheme Names", BCP 35, RFC 2717, November 1999.
+
+ [RFC0952] Harrenstien, K., Stahl, M., and E. Feinler, "DoD Internet
+ host table specification", RFC 952, October 1985.
+
+ [RFC1034] Mockapetris, P., "Domain names - concepts and facilities",
+ STD 13, RFC 1034, November 1987.
+
+ [RFC1123] Braden, R., "Requirements for Internet Hosts - Application
+ and Support", STD 3, RFC 1123, October 1989.
+
+ [RFC1535] Gavron, E., "A Security Problem and Proposed Correction
+ With Widely Deployed DNS Software", RFC 1535,
+ October 1993.
+
+ [RFC1630] Berners-Lee, T., "Universal Resource Identifiers in WWW: A
+ Unifying Syntax for the Expression of Names and Addresses
+ of Objects on the Network as used in the World-Wide Web",
+ RFC 1630, June 1994.
+
+ [RFC1736] Kunze, J., "Functional Recommendations for Internet
+ Resource Locators", RFC 1736, February 1995.
+
+ [RFC1737] Sollins, K. and L. Masinter, "Functional Requirements for
+ Uniform Resource Names", RFC 1737, December 1994.
+
+ [RFC1738] Berners-Lee, T., Masinter, L., and M. McCahill, "Uniform
+ Resource Locators (URL)", RFC 1738, December 1994.
+
+ [RFC1808] Fielding, R., "Relative Uniform Resource Locators",
+ RFC 1808, June 1995.
+
+ [RFC2046] Freed, N. and N. Borenstein, "Multipurpose Internet Mail
+ Extensions (MIME) Part Two: Media Types", RFC 2046,
+ November 1996.
+
+ [RFC2141] Moats, R., "URN Syntax", RFC 2141, May 1997.
+
+ [RFC2396] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
+ Resource Identifiers (URI): Generic Syntax", RFC 2396,
+ August 1998.
+
+ [RFC2518] Goland, Y., Whitehead, E., Faizi, A., Carter, S., and D.
+ Jensen, "HTTP Extensions for Distributed Authoring --
+ WEBDAV", RFC 2518, February 1999.
+
+ [RFC2557] Palme, J., Hopmann, A., and N. Shelness, "MIME
+ Encapsulation of Aggregate Documents, such as HTML
+ (MHTML)", RFC 2557, March 1999.
+
+ [RFC2718] Masinter, L., Alvestrand, H., Zigmond, D., and R. Petke,
+ "Guidelines for new URL Schemes", RFC 2718, November 1999.
+
+ [RFC2732] Hinden, R., Carpenter, B., and L. Masinter, "Format for
+ Literal IPv6 Addresses in URL's", RFC 2732, December 1999.
+
+ [RFC3305] Mealling, M. and R. Denenberg, "Report from the Joint
+ W3C/IETF URI Planning Interest Group: Uniform Resource
+ Identifiers (URIs), URLs, and Uniform Resource Names
+ (URNs): Clarifications and Recommendations", RFC 3305,
+ August 2002.
+
+ [RFC3490] Faltstrom, P., Hoffman, P., and A. Costello,
+ "Internationalizing Domain Names in Applications (IDNA)",
+ RFC 3490, March 2003.
+
+ [RFC3513] Hinden, R. and S. Deering, "Internet Protocol Version 6
+ (IPv6) Addressing Architecture", RFC 3513, April 2003.
+
+ [Siedzik] Siedzik, R., "Semantic Attacks: What's in a URL?",
+ April 2001, <http://www.giac.org/practical/gsec/
+ Richard_Siedzik_GSEC.pdf>.
+
+Appendix A. Collected ABNF for URI
+
+ URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
+
+ hier-part = "//" authority path-abempty
+ / path-absolute
+ / path-rootless
+ / path-empty
+
+ URI-reference = URI / relative-ref
+
+ absolute-URI = scheme ":" hier-part [ "?" query ]
+
+ relative-ref = relative-part [ "?" query ] [ "#" fragment ]
+
+ relative-part = "//" authority path-abempty
+ / path-absolute
+ / path-noscheme
+ / path-empty
+
+ scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
+
+ authority = [ userinfo "@" ] host [ ":" port ]
+ userinfo = *( unreserved / pct-encoded / sub-delims / ":" )
+ host = IP-literal / IPv4address / reg-name
+ port = *DIGIT
+
+ IP-literal = "[" ( IPv6address / IPvFuture ) "]"
+
+ IPvFuture = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )
+
+ IPv6address = 6( h16 ":" ) ls32
+ / "::" 5( h16 ":" ) ls32
+ / [ h16 ] "::" 4( h16 ":" ) ls32
+ / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
+ / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
+ / [ *3( h16 ":" ) h16 ] "::" h16 ":" ls32
+ / [ *4( h16 ":" ) h16 ] "::" ls32
+ / [ *5( h16 ":" ) h16 ] "::" h16
+ / [ *6( h16 ":" ) h16 ] "::"
+
+ h16 = 1*4HEXDIG
+ ls32 = ( h16 ":" h16 ) / IPv4address
+ IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
+
+ dec-octet = DIGIT ; 0-9
+ / %x31-39 DIGIT ; 10-99
+ / "1" 2DIGIT ; 100-199
+ / "2" %x30-34 DIGIT ; 200-249
+ / "25" %x30-35 ; 250-255
+
+ reg-name = *( unreserved / pct-encoded / sub-delims )
+
+ path = path-abempty ; begins with "/" or is empty
+ / path-absolute ; begins with "/" but not "//"
+ / path-noscheme ; begins with a non-colon segment
+ / path-rootless ; begins with a segment
+ / path-empty ; zero characters
+
+ path-abempty = *( "/" segment )
+ path-absolute = "/" [ segment-nz *( "/" segment ) ]
+ path-noscheme = segment-nz-nc *( "/" segment )
+ path-rootless = segment-nz *( "/" segment )
+ path-empty = 0<pchar>
+
+ segment = *pchar
+ segment-nz = 1*pchar
+ segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
+ ; non-zero-length segment without any colon ":"
+
+ pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
+
+ query = *( pchar / "/" / "?" )
+
+ fragment = *( pchar / "/" / "?" )
+
+ pct-encoded = "%" HEXDIG HEXDIG
+
+ unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
+ reserved = gen-delims / sub-delims
+ gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
+ sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
+ / "*" / "+" / "," / ";" / "="
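
As an informal example of how individual rules of this grammar translate into code, the dec-octet and IPv4address rules can be expressed as a Python regular expression. The names `DEC_OCTET`, `IPV4ADDRESS_RE`, and `is_ipv4address` are assumptions of this sketch; the alternatives are ordered longest-first so that each branch mirrors one line of the dec-octet rule.

```python
import re

# dec-octet: 250-255 / 200-249 / 100-199 / 10-99 / 0-9
DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"

# IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
IPV4ADDRESS_RE = re.compile(r"{0}\.{0}\.{0}\.{0}".format(DEC_OCTET))


def is_ipv4address(text):
    """True if text matches the IPv4address rule exactly."""
    return IPV4ADDRESS_RE.fullmatch(text) is not None


print(is_ipv4address("192.168.0.1"))
print(is_ipv4address("256.1.1.1"))
```

Note that this strict form rejects leading zeros and short forms such as "10.1", which is the intent of the rule.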
+
+Appendix B. Parsing a URI Reference with a Regular Expression
+
+ As the "first-match-wins" algorithm is identical to the "greedy"
+ disambiguation method used by POSIX regular expressions, it is
+ natural and commonplace to use a regular expression for parsing the
+ potential five components of a URI reference.
+
+ The following line is the regular expression for breaking down a
+ well-formed URI reference into its components.
+
+ ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
+  12            3  4          5       6  7        8 9
+
+ The numbers in the second line above are only to assist readability;
+ they indicate the reference points for each subexpression (i.e., each
+ paired parenthesis). We refer to the value matched for subexpression
+ <n> as $<n>. For example, matching the above expression to
+
+ http://www.ics.uci.edu/pub/ietf/uri/#Related
+
+ results in the following subexpression matches:
+
+ $1 = http:
+ $2 = http
+ $3 = //www.ics.uci.edu
+ $4 = www.ics.uci.edu
+ $5 = /pub/ietf/uri/
+ $6 = <undefined>
+ $7 = <undefined>
+ $8 = #Related
+ $9 = Related
+
+ where <undefined> indicates that the component is not present, as is
+ the case for the query component in the above example. Therefore, we
+ can determine the value of the five components as
+
+ scheme = $2
+ authority = $4
+ path = $5
+ query = $7
+ fragment = $9
+
+ Going in the opposite direction, we can recreate a URI reference from
+ its components by using the algorithm of Section 5.3.
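
The same expression can be used directly from Python's `re` module, as in the sketch below. The function name `split_uri_reference` and the dictionary layout are assumptions of this example; a component that is not present is reported as `None`, which plays the role of <undefined> above.

```python
import re

# The regular expression given in this appendix, unchanged
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def split_uri_reference(reference):
    """Split a URI reference into its five components."""
    m = URI_RE.match(reference)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


print(split_uri_reference("http://www.ics.uci.edu/pub/ietf/uri/#Related"))
```

Because every group in the expression is optional, the match always succeeds; the path is always a string (possibly empty), while the other four components may be `None`.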
+
+Appendix C. Delimiting a URI in Context
+
+ URIs are often transmitted through formats that do not provide a
+ clear context for their interpretation. For example, there are many
+ occasions when a URI is included in plain text; examples include text
+ sent in email, USENET news, and on printed paper. In such cases, it
+ is important to be able to delimit the URI from the rest of the text,
+ and in particular from punctuation marks that might be mistaken for
+ part of the URI.
+
+ In practice, URIs are delimited in a variety of ways, but usually
+ within double-quotes "http://example.com/", angle brackets
+ <http://example.com/>, or just by using whitespace:
+
+ http://example.com/
+
+ These wrappers do not form part of the URI.
+
+ In some cases, extra whitespace (spaces, line-breaks, tabs, etc.) may
+ have to be added to break a long URI across lines. The whitespace
+ should be ignored when the URI is extracted.
+
+ No whitespace should be introduced after a hyphen ("-") character.
+ Because some typesetters and printers may (erroneously) introduce a
+ hyphen at the end of line when breaking it, the interpreter of a URI
+ containing a line break immediately after a hyphen should ignore all
+ whitespace around the line break and should be aware that the hyphen
+ may or may not actually be part of the URI.
+
+ Using <> angle brackets around each URI is especially recommended as
+ a delimiting style for a reference that contains embedded whitespace.
+
+ The prefix "URL:" (with or without a trailing space) was formerly
+ recommended as a way to help distinguish a URI from other bracketed
+ designators, though it is not commonly used in practice and is no
+ longer recommended.
+
+ For robustness, software that accepts user-typed URIs should attempt
+ to recognize and strip both delimiters and embedded whitespace.
+
+ For example, the text
+
+ Yes, Jim, I found it under "http://www.w3.org/Addressing/",
+ but you can probably pick it up from <ftp://foo.example.
+ com/rfc/>. Note the warning in <http://www.ics.uci.edu/pub/
+ ietf/uri/historical.html#WARNING>.
+
+ contains the URI references
+
+ http://www.w3.org/Addressing/
+ ftp://foo.example.com/rfc/
+ http://www.ics.uci.edu/pub/ietf/uri/historical.html#WARNING
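
A simple extraction routine for the example above might look like the following Python sketch. It assumes, for brevity, that only double-quote and angle-bracket delimiters are recognized, and the function name `extract_uris` is hypothetical; whitespace-delimited URIs and the hyphenation ambiguity discussed earlier are not handled.

```python
import re

# Either <...> or "..." around a URI reference
_DELIMITED_RE = re.compile(r'<([^<>]*)>|"([^"]*)"')

SAMPLE = (
    'Yes, Jim, I found it under "http://www.w3.org/Addressing/",\n'
    "but you can probably pick it up from <ftp://foo.example.\n"
    "com/rfc/>.  Note the warning in <http://www.ics.uci.edu/pub/\n"
    "ietf/uri/historical.html#WARNING>.\n"
)


def extract_uris(text):
    """Return delimited URI references with embedded whitespace removed."""
    uris = []
    for m in _DELIMITED_RE.finditer(text):
        raw = m.group(1) if m.group(1) is not None else m.group(2)
        # Drop spaces, tabs, and line breaks introduced by line wrapping
        uris.append("".join(raw.split()))
    return uris


for uri in extract_uris(SAMPLE):
    print(uri)
```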
+
+Appendix D. Changes from RFC 2396
+
+D.1. Additions
+
+ An ABNF rule for URI has been introduced to correspond to one common
+ usage of the term: an absolute URI with optional fragment.
+
+ IPv6 (and later) literals have been added to the list of possible
+ identifiers for the host portion of an authority component, as
+ described by [RFC2732], with the addition of "[" and "]" to the
+ reserved set and a version flag to anticipate future versions of IP
+ literals. Square brackets are now specified as reserved within the
+ authority component and are not allowed outside their use as
+ delimiters for an IP literal within host. In order to make this
+ change without changing the technical definition of the path, query,
+ and fragment components, those rules were redefined to directly
+ specify the characters allowed.
+
+ As [RFC2732] defers to [RFC3513] for definition of an IPv6 literal
+ address, which, unfortunately, lacks an ABNF description of
+ IPv6address, we created a new ABNF rule for IPv6address that matches
+ the text representations defined by Section 2.2 of [RFC3513].
+ Likewise, the definition of IPv4address has been improved in order to
+ limit each decimal octet to the range 0-255.
+
+ Section 6, on URI normalization and comparison, has been completely
+ rewritten and extended by using input from Tim Bray and discussion
+ within the W3C Technical Architecture Group.
+
+D.2. Modifications
+
+ The ad-hoc BNF syntax of RFC 2396 has been replaced with the ABNF of
+ [RFC2234]. This change required all rule names that formerly
+ included underscore characters to be renamed with a dash instead. In
+ addition, a number of syntax rules have been eliminated or simplified
+ to make the overall grammar more comprehensible. Specifications that
+ refer to the obsolete grammar rules may be understood by replacing
+ those rules according to the following table:
+
+ +----------------+--------------------------------------------------+
+ | obsolete rule | translation |
+ +----------------+--------------------------------------------------+
+ | absoluteURI | absolute-URI |
+ | relativeURI | relative-part [ "?" query ] |
+ | hier_part | ( "//" authority path-abempty / |
+ | | path-absolute ) [ "?" query ] |
+ | | |
+ | opaque_part | path-rootless [ "?" query ] |
+ | net_path | "//" authority path-abempty |
+ | abs_path | path-absolute |
+ | rel_path | path-rootless |
+ | rel_segment | segment-nz-nc |
+ | reg_name | reg-name |
+ | server | authority |
+ | hostport | host [ ":" port ] |
+ | hostname | reg-name |
+ | path_segments | path-abempty |
+ | param | *<pchar excluding ";"> |
+ | | |
+ | uric | unreserved / pct-encoded / ";" / "?" / ":" |
+ | | / "@" / "&" / "=" / "+" / "$" / "," / "/" |
+ | | |
+ | uric_no_slash | unreserved / pct-encoded / ";" / "?" / ":" |
+ | | / "@" / "&" / "=" / "+" / "$" / "," |
+ | | |
+ | mark | "-" / "_" / "." / "!" / "~" / "*" / "'" |
+ | | / "(" / ")" |
+ | | |
+ | escaped | pct-encoded |
+ | hex | HEXDIG |
+ | alphanum | ALPHA / DIGIT |
+ +----------------+--------------------------------------------------+
+
+ Use of the above obsolete rules for the definition of scheme-specific
+ syntax is deprecated.
+
+ Section 2, on characters, has been rewritten to explain what
+ characters are reserved, when they are reserved, and why they are
+ reserved, even when they are not used as delimiters by the generic
+ syntax. The mark characters that are typically unsafe to decode,
+ including the exclamation mark ("!"), asterisk ("*"), single-quote
+ ("'"), and open and close parentheses ("(" and ")"), have been moved
+ to the reserved set in order to clarify the distinction between
+ reserved and unreserved and, hopefully, to answer the most common
+ question of scheme designers. Likewise, the section on
+ percent-encoded characters has been rewritten, and URI normalizers
+ are now given license to decode any percent-encoded octets
+ corresponding to unreserved characters. In general, the terms
+ "escaped" and "unescaped" have been replaced with "percent-encoded"
+ and "decoded", respectively, to reduce confusion with other forms of
+ escape mechanisms.
+
+ The ABNF for URI and URI-reference has been redesigned to make them
+ more friendly to LALR parsers and to reduce complexity. As a result,
+ the layout form of syntax description has been removed, along with
+ the uric, uric_no_slash, opaque_part, net_path, abs_path, rel_path,
+ path_segments, rel_segment, and mark rules. All references to
+ "opaque" URIs have been replaced with a better description of how the
+ path component may be opaque to hierarchy. The relativeURI rule has
+ been replaced with relative-ref to avoid unnecessary confusion over
+ whether they are a subset of URI. The ambiguity regarding the
+ parsing of URI-reference as a URI or a relative-ref with a colon in
+ the first segment has been eliminated through the use of five
+ separate path matching rules.
+
+ The fragment identifier has been moved back into the section on
+ generic syntax components and within the URI and relative-ref rules,
+ though it remains excluded from absolute-URI. The number sign ("#")
+ character has been moved back to the reserved set as a result of
+ reintegrating the fragment syntax.
+
+ The ABNF has been corrected to allow the path component to be empty.
+ This also allows an absolute-URI to consist of nothing after the
+ "scheme:", as is present in practice with the "dav:" namespace
+ [RFC2518] and with the "about:" scheme used internally by many WWW
+ browser implementations. The ambiguity regarding the boundary
+ between authority and path has been eliminated through the use of
+ five separate path matching rules.
+
+ Registry-based naming authorities that use the generic syntax are now
+ defined within the host rule. This change allows current
+ implementations, where whatever name is provided is simply fed to the
+ local name resolution mechanism, to be consistent with the
+ specification. It also removes the need to re-specify DNS name
+ formats here. Furthermore, it allows the host component to contain
+ percent-encoded octets, which is necessary to enable
+ internationalized domain names to be provided in URIs, processed in
+ their native character encodings at the application layers above URI
+ processing, and passed to an IDNA library as a registered name in the
+ UTF-8 character encoding. The server, hostport, hostname,
+ domainlabel, toplabel, and alphanum rules have been removed.
+
+ The algorithm for resolving relative references in [RFC2396] has been
+ rewritten with pseudocode for this revision to improve clarity and
+ fix the following issues:
+
+
+
+Berners-Lee, et al. Standards Track [Page 55]
+
+RFC 3986 URI Generic Syntax January 2005
+
+
+ o [RFC2396] section 5.2, step 6a, failed to account for a base URI
+ with no path.
+
+ o Restored the behavior of [RFC1808] where, if the reference
+ contains an empty path and a defined query component, the target
+ URI inherits the base URI's path component.
+
+ o The determination of whether a URI reference is a same-document
+ reference has been decoupled from the URI parser, simplifying the
+ URI processing interface within applications in a way consistent
+ with the internal architecture of deployed URI processing
+ implementations. The determination is now based on comparison to
+ the base URI after transforming a reference to absolute form,
+ rather than on the format of the reference itself. This change
+ may result in more references being considered "same-document"
+ under this specification than there would be under the rules given
+ in RFC 2396, especially when normalization is used to reduce
+ aliases. However, it does not change the status of existing
+ same-document references.
+
+ o Separated the path merge routine into two routines: merge, for
+ describing combination of the base URI path with a relative-path
+ reference, and remove_dot_segments, for describing how to remove
+ the special "." and ".." segments from a composed path. The
+ remove_dot_segments algorithm is now applied to all URI reference
+ paths in order to match common implementations and to improve the
+ normalization of URIs in practice. This change only impacts the
+ parsing of abnormal references and same-scheme references wherein
+ the base URI has a non-hierarchical path.
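+
+ The two routines named in the last item can be sketched in Python
+ as follows. This is a minimal sketch rather than the normative
+ pseudocode; the parameter names and the base_has_authority flag are
+ assumptions made for illustration.

```python
def merge(base_path: str, ref_path: str, base_has_authority: bool) -> str:
    """Combine the base URI path with a relative-path reference."""
    if base_has_authority and not base_path:
        # A base URI with an authority and an empty path acts like "/".
        return "/" + ref_path
    # Keep everything up to and including the last "/" of the base path.
    return base_path[: base_path.rfind("/") + 1] + ref_path


def remove_dot_segments(path: str) -> str:
    """Remove the special "." and ".." segments from a composed path."""
    output = []
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]
        elif path == "/.":
            path = "/"
        elif path.startswith("/../") or path == "/..":
            # Step back one segment in the output, if there is one.
            path = "/" + path[4:]
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with its leading "/") to the output.
            start = 1 if path.startswith("/") else 0
            end = path.find("/", start)
            if end == -1:
                end = len(path)
            output.append(path[:end])
            path = path[end:]
    return "".join(output)
```

+ With a base path of "/b/c/d;p" and a reference path of "../g", the
+ merge step would give "/b/c/../g", which remove_dot_segments then
+ reduces to "/b/g".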
+
+Index
+
+ A
+ ABNF 11
+ absolute 27
+ absolute-path 26
+ absolute-URI 27
+ access 9
+ authority 17, 18
+
+ B
+ base URI 28
+
+ C
+ character encoding 4
+ character 4
+ characters 8, 11
+ coded character set 4
+
+
+
+Berners-Lee, et al. Standards Track [Page 56]
+
+RFC 3986 URI Generic Syntax January 2005
+
+
+ D
+ dec-octet 20
+ dereference 9
+ dot-segments 23
+
+ F
+ fragment 16, 24
+
+ G
+ gen-delims 13
+ generic syntax 6
+
+ H
+ h16 20
+ hier-part 16
+ hierarchical 10
+ host 18
+
+ I
+ identifier 5
+ IP-literal 19
+ IPv4 20
+ IPv4address 19, 20
+ IPv6 19
+ IPv6address 19, 20
+ IPvFuture 19
+
+ L
+ locator 7
+ ls32 20
+
+ M
+ merge 32
+
+ N
+ name 7
+ network-path 26
+
+ P
+ path 16, 22, 26
+ path-abempty 22
+ path-absolute 22
+ path-empty 22
+ path-noscheme 22
+ path-rootless 22
+ path-abempty 16, 22, 26
+ path-absolute 16, 22, 26
+ path-empty 16, 22, 26
+
+
+
+Berners-Lee, et al. Standards Track [Page 57]
+
+RFC 3986 URI Generic Syntax January 2005
+
+
+ path-rootless 16, 22
+ pchar 23
+ pct-encoded 12
+ percent-encoding 12
+ port 22
+
+ Q
+ query 16, 23
+
+ R
+ reg-name 21
+ registered name 20
+ relative 10, 28
+ relative-path 26
+ relative-ref 26
+ remove_dot_segments 33
+ representation 9
+ reserved 12
+ resolution 9, 28
+ resource 5
+ retrieval 9
+
+ S
+ same-document 27
+ sameness 9
+ scheme 16, 17
+ segment 22, 23
+ segment-nz 23
+ segment-nz-nc 23
+ sub-delims 13
+ suffix 27
+
+ T
+ transcription 8
+
+ U
+ uniform 4
+ unreserved 13
+ URI grammar
+ absolute-URI 27
+ ALPHA 11
+ authority 18
+ CR 11
+ dec-octet 20
+ DIGIT 11
+ DQUOTE 11
+ fragment 24
+ gen-delims 13
+
+
+
+Berners-Lee, et al. Standards Track [Page 58]
+
+RFC 3986 URI Generic Syntax January 2005
+
+
+ h16 20
+ HEXDIG 11
+ hier-part 16
+ host 19
+ IP-literal 19
+ IPv4address 20
+ IPv6address 20
+ IPvFuture 19
+ LF 11
+ ls32 20
+ OCTET 11
+ path 22
+ path-abempty 22
+ path-absolute 22
+ path-empty 22
+ path-noscheme 22
+ path-rootless 22
+ pchar 23
+ pct-encoded 12
+ port 22
+ query 24
+ reg-name 21
+ relative-ref 26
+ reserved 13
+ scheme 17
+ segment 23
+ segment-nz 23
+ segment-nz-nc 23
+ SP 11
+ sub-delims 13
+ unreserved 13
+ URI 16
+ URI-reference 25
+ userinfo 18
+ URI 16
+ URI-reference 25
+ URL 7
+ URN 7
+ userinfo 18
+
+
+
+
+
+
+
+
+
+
+
+
+Berners-Lee, et al. Standards Track [Page 59]
+
+RFC 3986 URI Generic Syntax January 2005
+
+
+Authors' Addresses
+
+ Tim Berners-Lee
+ World Wide Web Consortium
+ Massachusetts Institute of Technology
+ 77 Massachusetts Avenue
+ Cambridge, MA 02139
+ USA
+
+ Phone: +1-617-253-5702
+ Fax: +1-617-258-5999
+ EMail: timbl@w3.org
+ URI: http://www.w3.org/People/Berners-Lee/
+
+
+ Roy T. Fielding
+ Day Software
+ 5251 California Ave., Suite 110
+ Irvine, CA 92617
+ USA
+
+ Phone: +1-949-679-2960
+ Fax: +1-949-679-2972
+ EMail: fielding@gbiv.com
+ URI: http://roy.gbiv.com/
+
+
+ Larry Masinter
+ Adobe Systems Incorporated
+ 345 Park Ave
+ San Jose, CA 95110
+ USA
+
+ Phone: +1-408-536-3024
+ EMail: LMM@acm.org
+ URI: http://larry.masinter.net/
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Berners-Lee, et al. Standards Track [Page 60]
+
+RFC 3986 URI Generic Syntax January 2005
+
+
+Full Copyright Statement
+
+ Copyright (C) The Internet Society (2005).
+
+ This document is subject to the rights, licenses and restrictions
+ contained in BCP 78, and except as set forth therein, the authors
+ retain all their rights.
+
+ This document and the information contained herein are provided on an
+ "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
+ OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
+ ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
+ INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
+ INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
+ WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
+
+Intellectual Property
+
+ The IETF takes no position regarding the validity or scope of any
+ Intellectual Property Rights or other rights that might be claimed to
+ pertain to the implementation or use of the technology described in
+ this document or the extent to which any license under such rights
+ might or might not be available; nor does it represent that it has
+ made any independent effort to identify any such rights. Information
+ on the IETF's procedures with respect to rights in IETF Documents can
+ be found in BCP 78 and BCP 79.
+
+ Copies of IPR disclosures made to the IETF Secretariat and any
+ assurances of licenses to be made available, or the result of an
+ attempt made to obtain a general license or permission for the use of
+ such proprietary rights by implementers or users of this
+ specification can be obtained from the IETF on-line IPR repository at
+ http://www.ietf.org/ipr.
+
+ The IETF invites any interested party to bring to its attention any
+ copyrights, patents or patent applications, or other proprietary
+ rights that may cover technology that may be required to implement
+ this standard. Please address the information to the IETF at
+ ietf-ipr@ietf.org.
+
+
+Acknowledgement
+
+ Funding for the RFC Editor function is currently provided by the
+ Internet Society.
+
+
+
+
+
+
+Berners-Lee, et al. Standards Track [Page 61]
+
diff --git a/trunk/txt/rfc3987.txt b/trunk/txt/rfc3987.txt
new file mode 100644
index 00000000..f0b1513b
--- /dev/null
+++ b/trunk/txt/rfc3987.txt
@@ -0,0 +1,2579 @@
+
+
+
+
+
+
+Network Working Group M. Duerst
+Request for Comments: 3987 W3C
+Category: Standards Track M. Suignard
+ Microsoft Corporation
+ January 2005
+
+
+ Internationalized Resource Identifiers (IRIs)
+
+Status of This Memo
+
+ This document specifies an Internet standards track protocol for the
+ Internet community, and requests discussion and suggestions for
+ improvements. Please refer to the current edition of the "Internet
+ Official Protocol Standards" (STD 1) for the standardization state
+ and status of this protocol. Distribution of this memo is unlimited.
+
+Copyright Notice
+
+ Copyright (C) The Internet Society (2005).
+
+Abstract
+
+ This document defines a new protocol element, the Internationalized
+ Resource Identifier (IRI), as a complement to the Uniform Resource
+ Identifier (URI). An IRI is a sequence of characters from the
+ Universal Character Set (Unicode/ISO 10646). A mapping from IRIs to
+ URIs is defined, which means that IRIs can be used instead of URIs,
+ where appropriate, to identify resources.
+
+ The approach of defining a new protocol element was chosen instead of
+ extending or changing the definition of URIs. This was done in order
+ to allow a clear distinction and to avoid incompatibilities with
+ existing software. Guidelines are provided for the use and
+ deployment of IRIs in various protocols, formats, and software
+ components that currently deal with URIs.
+
+Table of Contents
+
+ 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3
+ 1.1. Overview and Motivation . . . . . . . . . . . . . . . . 3
+ 1.2. Applicability . . . . . . . . . . . . . . . . . . . . . 3
+ 1.3. Definitions . . . . . . . . . . . . . . . . . . . . . . 4
+ 1.4. Notation . . . . . . . . . . . . . . . . . . . . . . . . 5
+ 2. IRI Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . 6
+ 2.1. Summary of IRI Syntax . . . . . . . . . . . . . . . . . 6
+ 2.2. ABNF for IRI References and IRIs . . . . . . . . . . . . 7
+
+
+
+
+Duerst & Suignard Standards Track [Page 1]
+
+RFC 3987 Internationalized Resource Identifiers January 2005
+
+
+ 3. Relationship between IRIs and URIs . . . . . . . . . . . . . . 10
+ 3.1. Mapping of IRIs to URIs . . . . . . . . . . . . . . . . 10
+ 3.2. Converting URIs to IRIs . . . . . . . . . . . . . . . . 14
+ 3.2.1. Examples . . . . . . . . . . . . . . . . . . . . 15
+ 4. Bidirectional IRIs for Right-to-Left Languages. . . . . . . . 16
+ 4.1. Logical Storage and Visual Presentation . . . . . . . . 17
+ 4.2. Bidi IRI Structure . . . . . . . . . . . . . . . . . . . 18
+ 4.3. Input of Bidi IRIs . . . . . . . . . . . . . . . . . . . 19
+ 4.4. Examples . . . . . . . . . . . . . . . . . . . . . . . . 19
+ 5. Normalization and Comparison . . . . . . . . . . . . . . . . . 21
+ 5.1. Equivalence . . . . . . . . . . . . . . . . . . . . . . 22
+ 5.2. Preparation for Comparison . . . . . . . . . . . . . . . 22
+ 5.3. Comparison Ladder . . . . . . . . . . . . . . . . . . . 23
+ 5.3.1. Simple String Comparison . . . . . . . . . . . . 23
+ 5.3.2. Syntax-Based Normalization . . . . . . . . . . . 24
+ 5.3.3. Scheme-Based Normalization . . . . . . . . . . . 27
+ 5.3.4. Protocol-Based Normalization . . . . . . . . . . 28
+ 6. Use of IRIs . . . . . . . . . . . . . . . . . . . . . . . . . 29
+ 6.1. Limitations on UCS Characters Allowed in IRIs . . . . . 29
+ 6.2. Software Interfaces and Protocols . . . . . . . . . . . 29
+ 6.3. Format of URIs and IRIs in Documents and Protocols . . . 30
+ 6.4. Use of UTF-8 for Encoding Original Characters . . . . . 30
+ 6.5. Relative IRI References . . . . . . . . . . . . . . . . 32
+ 7. URI/IRI Processing Guidelines (informative) . . . . . . . . . 32
+ 7.1. URI/IRI Software Interfaces . . . . . . . . . . . . . . 32
+ 7.2. URI/IRI Entry . . . . . . . . . . . . . . . . . . . . . 33
+ 7.3. URI/IRI Transfer between Applications . . . . . . . . . 33
+ 7.4. URI/IRI Generation . . . . . . . . . . . . . . . . . . . 34
+ 7.5. URI/IRI Selection . . . . . . . . . . . . . . . . . . . 34
+ 7.6. Display of URIs/IRIs . . . . . . . . . . . . . . . . . . 35
+ 7.7. Interpretation of URIs and IRIs . . . . . . . . . . . . 36
+ 7.8. Upgrading Strategy . . . . . . . . . . . . . . . . . . . 36
+ 8. Security Considerations . . . . . . . . . . . . . . . . . . . 37
+ 9. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 39
+ 10. References . . . . . . . . . . . . . . . . . . . . . . . . . . 40
+ 10.1. Normative References . . . . . . . . . . . . . . . . . . 40
+ 10.2. Informative References . . . . . . . . . . . . . . . . . 41
+ A. Design Alternatives . . . . . . . . . . . . . . . . . . . . . 44
+ A.1. New Scheme(s) . . . . . . . . . . . . . . . . . . . . . 44
+ A.2. Character Encodings Other Than UTF-8 . . . . . . . . . . 44
+ A.3. New Encoding Convention . . . . . . . . . . . . . . . . 44
+ A.4. Indicating Character Encodings in the URI/IRI . . . . . 45
+ Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 45
+ Full Copyright Statement . . . . . . . . . . . . . . . . . . . . . 46
+
+
+
+
+
+
+
+Duerst & Suignard Standards Track [Page 2]
+
+RFC 3987 Internationalized Resource Identifiers January 2005
+
+
+1. Introduction
+
+1.1. Overview and Motivation
+
+ A Uniform Resource Identifier (URI) is defined in [RFC3986] as a
+ sequence of characters chosen from a limited subset of the repertoire
+ of US-ASCII [ASCII] characters.
+
+ The characters in URIs are frequently used for representing words of
+ natural languages. This usage has many advantages: Such URIs are
+ easier to memorize, easier to interpret, easier to transcribe, easier
+ to create, and easier to guess. For most languages other than
+ English, however, the natural script uses characters other than A -
+ Z. For many people, handling Latin characters is as difficult as
+ handling the characters of other scripts is for those who use only
+ the Latin alphabet. Many languages with non-Latin scripts are
+ transcribed with Latin letters. These transcriptions are now often
+ used in URIs, but they introduce additional ambiguities.
+
+ The infrastructure for the appropriate handling of characters from
+ local scripts is now widely deployed in local versions of operating
+ system and application software. Software that can handle a wide
+ variety of scripts and languages at the same time is increasingly
+ common. Also, increasing numbers of protocols and formats can carry
+ a wide range of characters.
+
+ This document defines a new protocol element called Internationalized
+ Resource Identifier (IRI) by extending the syntax of URIs to a much
+ wider repertoire of characters. It also defines "internationalized"
+ versions corresponding to other constructs from [RFC3986], such as
+ URI references. The syntax of IRIs is defined in section 2, and the
+ relationship between IRIs and URIs in section 3.
+
+ Using characters outside of A - Z in IRIs brings some difficulties.
+ Section 4 discusses the special case of bidirectional IRIs, section 5
+ various forms of equivalence between IRIs, and section 6 the use of
+ IRIs in different situations. Section 7 gives additional informative
+ guidelines, and section 8 security considerations.
+
+1.2. Applicability
+
+ IRIs are designed to be compatible with recommendations for new URI
+ schemes [RFC2718]. The compatibility is provided by specifying a
+ well-defined and deterministic mapping from the IRI character
+ sequence to the functionally equivalent URI character sequence.
+ Practical use of IRIs (or IRI references) in place of URIs (or URI
+ references) depends on the following conditions being met:
+
+
+
+
+Duerst & Suignard Standards Track [Page 3]
+
+RFC 3987 Internationalized Resource Identifiers January 2005
+
+
+ a. A protocol or format element should be explicitly designated to
+ be able to carry IRIs. The intent is not to introduce IRIs into
+ contexts that are not defined to accept them. For example, XML
+ schema [XMLSchema] has an explicit type "anyURI" that includes
+ IRIs and IRI references. Therefore, IRIs and IRI references can
+ be in attributes and elements of type "anyURI". On the other
+ hand, in the HTTP protocol [RFC2616], the Request URI is defined
+ as a URI, which means that direct use of IRIs is not allowed in
+ HTTP requests.
+
+ b. The protocol or format carrying the IRIs should have a mechanism
+ to represent the wide range of characters used in IRIs, either
+ natively or by some protocol- or format-specific escaping
+ mechanism (for example, numeric character references in [XML1]).
+
+ c. The URI corresponding to the IRI in question has to encode
+ original characters into octets using UTF-8. For new URI
+ schemes, this is recommended in [RFC2718]. It can apply to a
+ whole scheme (e.g., IMAP URLs [RFC2192] and POP URLs [RFC2384],
+ or the URN syntax [RFC2141]). It can apply to a specific part of
+ a URI, such as the fragment identifier (e.g., [XPointer]). It
+ can apply to a specific URI or part(s) thereof. For details,
+ please see section 6.4.
+
+1.3. Definitions
+
+ The following definitions are used in this document; they follow the
+ terms in [RFC2130], [RFC2277], and [ISO10646].
+
+ character: A member of a set of elements used for the organization,
+ control, or representation of data. For example, "LATIN CAPITAL
+ LETTER A" names a character.
+
+ octet: An ordered sequence of eight bits considered as a unit.
+
+ character repertoire: A set of characters (in the mathematical
+ sense).
+
+ sequence of characters: A sequence of characters (one after another).
+
+ sequence of octets: A sequence of octets (one after another).
+
+ character encoding: A method of representing a sequence of characters
+ as a sequence of octets (maybe with variants). Also, a method of
+ (unambiguously) converting a sequence of octets into a sequence of
+ characters.
+
+
+
+
+
+Duerst & Suignard Standards Track [Page 4]
+
+RFC 3987 Internationalized Resource Identifiers January 2005
+
+
+ charset: The name of a parameter or attribute used to identify a
+ character encoding.
+
+ UCS: Universal Character Set. The coded character set defined by
+ ISO/IEC 10646 [ISO10646] and the Unicode Standard [UNIV4].
+
+ IRI reference: Denotes the common usage of an Internationalized
+ Resource Identifier. An IRI reference may be absolute or
+ relative. However, the "IRI" that results from such a reference
+ only includes absolute IRIs; any relative IRI references are
+ resolved to their absolute form. Note that in [RFC2396] URIs did
+ not include fragment identifiers, but in [RFC3986] fragment
+ identifiers are part of URIs.
+
+ running text: Human text (paragraphs, sentences, phrases) with syntax
+ according to orthographic conventions of a natural language, as
+ opposed to syntax defined for ease of processing by machines
+ (e.g., markup, programming languages).
+
+ protocol element: Any portion of a message that affects processing of
+ that message by the protocol in question.
+
+ presentation element: A presentation form corresponding to a protocol
+ element; for example, using a wider range of characters.
+
+ create (a URI or IRI): With respect to URIs and IRIs, the term is
+ used for the initial creation. This may be the initial creation
+ of a resource with a certain identifier, or the initial exposition
+ of a resource under a particular identifier.
+
+ generate (a URI or IRI): With respect to URIs and IRIs, the term is
+ used when the IRI is generated by derivation from other
+ information.
+
+1.4. Notation
+
+ RFCs and Internet Drafts currently do not allow any characters
+ outside the US-ASCII repertoire. Therefore, this document uses
+ various special notations to denote such characters in examples.
+
+ In text, characters outside US-ASCII are sometimes referenced by
+ using a prefix of 'U+', followed by four to six hexadecimal digits.
+
+ To represent characters outside US-ASCII in examples, this document
+ uses two notations: 'XML Notation' and 'Bidi Notation'.
+
+
+
+
+
+
+Duerst & Suignard Standards Track [Page 5]
+
+RFC 3987 Internationalized Resource Identifiers January 2005
+
+
+ XML Notation uses a leading '&#x', a trailing ';', and the
+ hexadecimal number of the character in the UCS in between. For
+ example, &#x44F; stands for CYRILLIC SMALL LETTER YA. In this
+ notation, an actual '&' is denoted by '&amp;'.
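+
+ As an aid to reading the examples, XML Notation could be expanded
+ to actual characters with a sketch such as the following. The
+ function name from_xml_notation is an assumption made for this
+ illustration and is not part of this document's notation.

```python
import re

# '&#x', hexadecimal digits, ';'
XML_NOTATION = re.compile(r"&#x([0-9A-Fa-f]+);")


def from_xml_notation(text: str) -> str:
    """Expand '&#xHHHH;' to the character it denotes, then '&amp;' to '&'."""
    expanded = XML_NOTATION.sub(lambda m: chr(int(m.group(1), 16)), text)
    # '&amp;' is handled last so that an escaped '&' is never re-read
    # as the start of a character reference.
    return expanded.replace("&amp;", "&")
```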
+
+ Bidi Notation is used for bidirectional examples: Lowercase letters
+ stand for Latin letters or other letters that are written left to
+ right, whereas uppercase letters represent Arabic or Hebrew letters
+ that are written right to left.
+
+ To denote actual octets in examples (as opposed to percent-encoded
+ octets), the two hex digits denoting the octet are enclosed in "<"
+ and ">". For example, the octet often denoted as 0xc9 is denoted
+ here as <c9>.
+
+ In this document, the key words "MUST", "MUST NOT", "REQUIRED",
+ "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY",
+ and "OPTIONAL" are to be interpreted as described in [RFC2119].
+
+2. IRI Syntax
+
+ This section defines the syntax of Internationalized Resource
+ Identifiers (IRIs).
+
+ As with URIs, an IRI is defined as a sequence of characters, not as a
+ sequence of octets. This definition accommodates the fact that IRIs
+ may be written on paper or read over the radio as well as stored or
+ transmitted digitally. The same IRI may be represented as different
+ sequences of octets in different protocols or documents if these
+ protocols or documents use different character encodings (and/or
+ transfer encodings). Using the same character encoding as the
+ containing protocol or document ensures that the characters in the
+ IRI can be handled (e.g., searched, converted, displayed) in the same
+ way as the rest of the protocol or document.
+
+2.1. Summary of IRI Syntax
+
+ IRIs are defined similarly to URIs in [RFC3986], but the class of
+ unreserved characters is extended by adding the characters of the UCS
+ (Universal Character Set, [ISO10646]) beyond U+007F, subject to the
+ limitations given in the syntax rules below and in section 6.1.
+
+ Otherwise, the syntax and use of components and reserved characters
+ is the same as that in [RFC3986]. All the operations defined in
+ [RFC3986], such as the resolution of relative references, can be
+ applied to IRIs by IRI-processing software in exactly the same way as
+ they are for URIs by URI-processing software.
+
+
+
+
+Duerst & Suignard Standards Track [Page 6]
+
+RFC 3987 Internationalized Resource Identifiers January 2005
+
+
+ Characters outside the US-ASCII repertoire are not reserved and
+ therefore MUST NOT be used for syntactical purposes, such as to
+ delimit components in newly defined schemes. For example, U+00A2,
+ CENT SIGN, is not allowed as a delimiter in IRIs, because it is in
+ the 'iunreserved' category. This is similar to the fact that it is
+ not possible to use '-' as a delimiter in URIs, because it is in the
+ 'unreserved' category.
+
+2.2. ABNF for IRI References and IRIs
+
+ Although it might be possible to define IRI references and IRIs
+ merely by their transformation to URI references and URIs, they can
+ also be accepted and processed directly. Therefore, an ABNF
+ definition for IRI references (which are the most general concept and
+ the start of the grammar) and IRIs is given here. The syntax of this
+ ABNF is described in [RFC2234]. Character numbers are taken from the
+ UCS, without implying any actual binary encoding. Terminals in the
+ ABNF are characters, not bytes.
+
+ The following grammar closely follows the URI grammar in [RFC3986],
+ except that the range of unreserved characters is expanded to include
+ UCS characters, with the restriction that private UCS characters can
+ occur only in query parts. The grammar is split into two parts:
+ Rules that differ from [RFC3986] because of the above-mentioned
+ expansion, and rules that are the same as those in [RFC3986]. For
+ rules that are different from those in [RFC3986], the names of the
+ non-terminals have been changed as follows. If the non-terminal
+ contains 'URI', this has been changed to 'IRI'. Otherwise, an 'i'
+ has been prefixed.
+
+ The following rules are different from those in [RFC3986]:
+
+ IRI = scheme ":" ihier-part [ "?" iquery ]
+ [ "#" ifragment ]
+
+ ihier-part = "//" iauthority ipath-abempty
+ / ipath-absolute
+ / ipath-rootless
+ / ipath-empty
+
+ IRI-reference = IRI / irelative-ref
+
+ absolute-IRI = scheme ":" ihier-part [ "?" iquery ]
+
+ irelative-ref = irelative-part [ "?" iquery ] [ "#" ifragment ]
+
+ irelative-part = "//" iauthority ipath-abempty
+ / ipath-absolute
+
+
+
+Duerst & Suignard Standards Track [Page 7]
+
+RFC 3987 Internationalized Resource Identifiers January 2005
+
+
+ / ipath-noscheme
+ / ipath-empty
+
+ iauthority = [ iuserinfo "@" ] ihost [ ":" port ]
+ iuserinfo = *( iunreserved / pct-encoded / sub-delims / ":" )
+ ihost = IP-literal / IPv4address / ireg-name
+
+ ireg-name = *( iunreserved / pct-encoded / sub-delims )
+
+ ipath = ipath-abempty ; begins with "/" or is empty
+ / ipath-absolute ; begins with "/" but not "//"
+ / ipath-noscheme ; begins with a non-colon segment
+ / ipath-rootless ; begins with a segment
+ / ipath-empty ; zero characters
+
+ ipath-abempty = *( "/" isegment )
+ ipath-absolute = "/" [ isegment-nz *( "/" isegment ) ]
+ ipath-noscheme = isegment-nz-nc *( "/" isegment )
+ ipath-rootless = isegment-nz *( "/" isegment )
+ ipath-empty = 0<ipchar>
+
+ isegment = *ipchar
+ isegment-nz = 1*ipchar
+ isegment-nz-nc = 1*( iunreserved / pct-encoded / sub-delims
+ / "@" )
+ ; non-zero-length segment without any colon ":"
+
+ ipchar = iunreserved / pct-encoded / sub-delims / ":"
+ / "@"
+
+ iquery = *( ipchar / iprivate / "/" / "?" )
+
+ ifragment = *( ipchar / "/" / "?" )
+
+ iunreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar
+
+ ucschar = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
+ / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
+ / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
+ / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
+ / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
+ / %xD0000-DFFFD / %xE1000-EFFFD
+
+ iprivate = %xE000-F8FF / %xF0000-FFFFD / %x100000-10FFFD
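+
+ The 'ucschar' and 'iprivate' rules above amount to range checks on
+ code points. A minimal Python sketch follows; the names
+ UCSCHAR_RANGES, IPRIVATE_RANGES, is_ucschar, and is_iprivate are
+ assumptions made for illustration.

```python
# Ranges copied from the 'ucschar' rule; planes 1 through 13 all have
# the form %xN0000-NFFFD and are generated rather than listed.
UCSCHAR_RANGES = (
    [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
    + [(plane << 16, (plane << 16) | 0xFFFD) for plane in range(0x1, 0xE)]
    + [(0xE1000, 0xEFFFD)]
)

# Ranges copied from the 'iprivate' rule (allowed only in iquery).
IPRIVATE_RANGES = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]


def _in_ranges(code_point: int, ranges) -> bool:
    return any(low <= code_point <= high for low, high in ranges)


def is_ucschar(char: str) -> bool:
    """True if the single character matches the 'ucschar' rule."""
    return _in_ranges(ord(char), UCSCHAR_RANGES)


def is_iprivate(char: str) -> bool:
    """True if the single character matches the 'iprivate' rule."""
    return _in_ranges(ord(char), IPRIVATE_RANGES)
```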
+
+ Some productions are ambiguous. The "first-match-wins" (a.k.a.
+ "greedy") algorithm applies. For details, see [RFC3986].
+
+
+
+
+Duerst & Suignard Standards Track [Page 8]
+
+RFC 3987 Internationalized Resource Identifiers January 2005
+
+
+ The following rules are the same as those in [RFC3986]:
+
+ scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
+
+ port = *DIGIT
+
+ IP-literal = "[" ( IPv6address / IPvFuture ) "]"
+
+ IPvFuture = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )
+
+ IPv6address = 6( h16 ":" ) ls32
+ / "::" 5( h16 ":" ) ls32
+ / [ h16 ] "::" 4( h16 ":" ) ls32
+ / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
+ / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
+ / [ *3( h16 ":" ) h16 ] "::" h16 ":" ls32
+ / [ *4( h16 ":" ) h16 ] "::" ls32
+ / [ *5( h16 ":" ) h16 ] "::" h16
+ / [ *6( h16 ":" ) h16 ] "::"
+
+ h16 = 1*4HEXDIG
+ ls32 = ( h16 ":" h16 ) / IPv4address
+
+ IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
+
+ dec-octet = DIGIT ; 0-9
+ / %x31-39 DIGIT ; 10-99
+ / "1" 2DIGIT ; 100-199
+ / "2" %x30-34 DIGIT ; 200-249
+ / "25" %x30-35 ; 250-255
+
+ pct-encoded = "%" HEXDIG HEXDIG
+
+ unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
+ reserved = gen-delims / sub-delims
+ gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
+ sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
+ / "*" / "+" / "," / ";" / "="
+
+ This syntax does not support IPv6 scoped addressing zone identifiers.
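+
+ For illustration, the 'dec-octet' and 'IPv4address' rules can be
+ transcribed into a regular expression. The Python sketch below is
+ one possible transcription; the names DEC_OCTET, IPV4ADDRESS, and
+ is_ipv4address are assumptions and do not appear in the grammar.

```python
import re

# dec-octet alternatives, longest first: 250-255, 200-249, 100-199,
# 10-99, 0-9. Leading zeros are not allowed by the grammar.
DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"

# IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
IPV4ADDRESS = re.compile(rf"{DEC_OCTET}(?:\.{DEC_OCTET}){{3}}")


def is_ipv4address(text: str) -> bool:
    """True if the whole string matches the 'IPv4address' rule."""
    return IPV4ADDRESS.fullmatch(text) is not None
```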
+
+
+
+
+
+
+
+
+
+
+
+Duerst & Suignard Standards Track [Page 9]
+
+RFC 3987 Internationalized Resource Identifiers January 2005
+
+
+3. Relationship between IRIs and URIs
+
+ IRIs are meant to replace URIs in identifying resources for
+ protocols, formats, and software components that use a UCS-based
+ character repertoire. These protocols and components may never need
+ to use URIs directly, especially when the resource identifier is used
+ simply for identification purposes. However, when the resource
+ identifier is used for resource retrieval, it is in many cases
+ necessary to determine the associated URI, because currently most
+ retrieval mechanisms are only defined for URIs. In this case, IRIs
+ can serve as presentation elements for URI protocol elements. An
+ example would be an address bar in a Web user agent. (Additional
+ rationale is given in section 3.1.)
+
+3.1. Mapping of IRIs to URIs
+
+ This section defines how to map an IRI to a URI. Everything in this
+ section also applies to IRI references and URI references, as well as
+ to components thereof (for example, fragment identifiers).
+
+ This mapping has two purposes:
+
+ Syntactical. Many URI schemes and components define additional
+ syntactical restrictions not captured in section 2.2.
+ Scheme-specific restrictions are applied to IRIs by converting
+ IRIs to URIs and checking the URIs against the scheme-specific
+ restrictions.
+
+ Interpretational. URIs identify resources in various ways. IRIs also
+ identify resources. When the IRI is used solely for
+ identification purposes, it is not necessary to map the IRI to a
+ URI (see section 5). However, when an IRI is used for resource
+ retrieval, the resource that the IRI locates is the same as the
+ one located by the URI obtained after converting the IRI according
+ to the procedure defined here. This means that there is no need
+ to define resolution separately on the IRI level.
+
+ Applications MUST map IRIs to URIs by using the following two steps.
+
+ Step 1. Generate a UCS character sequence from the original IRI
+ format. This step has the following three variants,
+ depending on the form of the input:
+
+ a. If the IRI is written on paper, read aloud, or otherwise
+ represented as a sequence of characters independent of
+ any character encoding, represent the IRI as a sequence
+ of characters from the UCS normalized according to
+ Normalization Form C (NFC, [UTR15]).
+
+
+
+Duerst & Suignard Standards Track [Page 10]
+
+RFC 3987 Internationalized Resource Identifiers January 2005
+
+
+ b. If the IRI is in some digital representation (e.g., an
+ octet stream) in some known non-Unicode character
+ encoding, convert the IRI to a sequence of characters
+ from the UCS normalized according to NFC.
+
+ c. If the IRI is in a Unicode-based character encoding (for
+ example, UTF-8 or UTF-16), do not normalize (see section
+ 5.3.2.2 for details). Apply step 2 directly to the
+ encoded Unicode character sequence.
+
+ Step 2. For each character in 'ucschar' or 'iprivate', apply steps
+ 2.1 through 2.3 below.
+
+ 2.1. Convert the character to a sequence of one or more octets
+ using UTF-8 [RFC3629].
+
+ 2.2. Convert each octet to %HH, where HH is the hexadecimal
+ notation of the octet value. Note that this is identical
+ to the percent-encoding mechanism in section 2.1 of
+ [RFC3986]. To reduce variability, the hexadecimal notation
+ SHOULD use uppercase letters.
+
+ 2.3. Replace the original character with the resulting character
+ sequence (i.e., a sequence of %HH triplets).
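+
+ The two steps can be sketched in Python as follows. This is an
+ illustrative sketch only: the function name iri_to_uri and the
+ 'normalize' parameter are assumptions. The input is assumed to be
+ a Python string, which is already a sequence of UCS characters, so
+ the default corresponds to variant c of step 1; passing
+ normalize=True approximates variants a and b.

```python
import unicodedata

# Code point ranges of 'ucschar' and 'iprivate' from section 2.2.
_ENCODE_RANGES = (
    [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
    + [(plane << 16, (plane << 16) | 0xFFFD) for plane in range(0x1, 0xE)]
    + [(0xE1000, 0xEFFFD)]
    + [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]
)


def _needs_encoding(char: str) -> bool:
    code_point = ord(char)
    return any(low <= code_point <= high for low, high in _ENCODE_RANGES)


def iri_to_uri(iri: str, normalize: bool = False) -> str:
    """Map an IRI to a URI by percent-encoding 'ucschar' and 'iprivate'."""
    if normalize:
        # Step 1, variants a and b: normalize to NFC first.
        iri = unicodedata.normalize("NFC", iri)
    parts = []
    for char in iri:
        if _needs_encoding(char):
            # Steps 2.1 to 2.3: UTF-8 octets, each written as %HH
            # with uppercase hexadecimal digits.
            parts.append("".join(f"%{octet:02X}" for octet in char.encode("utf-8")))
        else:
            parts.append(char)
    return "".join(parts)
```

+ Characters outside 'ucschar' and 'iprivate', including existing
+ %HH triplets, pass through this sketch unchanged.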
+
+ The above mapping from IRIs to URIs produces URIs fully conforming to
+ [RFC3986]. The mapping is also an identity transformation for URIs
+ and is idempotent; applying the mapping a second time will not
+ change anything. Every URI is by definition an IRI.
+
+ Systems accepting IRIs MAY convert the ireg-name component of an IRI
+ as follows (before step 2 above) for schemes known to use domain
+ names in ireg-name, if the scheme definition does not allow
+ percent-encoding for ireg-name:
+
+ Replace the ireg-name part of the IRI by the part converted using the
+ ToASCII operation specified in section 4.1 of [RFC3490] on each
+ dot-separated label, and by using U+002E (FULL STOP) as a label
+ separator, with the flag UseSTD3ASCIIRules set to TRUE, and with the
+ flag AllowUnassigned set to FALSE for creating IRIs and set to TRUE
+ otherwise.
+
+ The ToASCII operation may fail, but this would mean that the IRI
+ cannot be resolved. This conversion SHOULD be used when the goal is
+ to maximize interoperability with legacy URI resolvers. For example,
+ the IRI
+
+ "http://r&#xE9;sum&#xE9;.example.org"
+
+ may be converted to
+
+ "http://xn--rsum-bpad.example.org"
+
+ instead of
+
+ "http://r%C3%A9sum%C3%A9.example.org".
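+
+ As an illustration of the ToASCII route, the sketch below relies on
+ the "idna" codec in the Python standard library, which implements
+ the IDNA of [RFC3490]. The helper name ireg_name_to_ascii is
+ hypothetical. The codec does not expose the UseSTD3ASCIIRules and
+ AllowUnassigned flags, so this should be read as an approximation
+ of the operation described above rather than an exact rendering.
+

```python
def ireg_name_to_ascii(ireg_name: str) -> str:
    """Convert each dot-separated label with the stdlib IDNA (RFC 3490) codec.

    Raises UnicodeError when a label cannot be converted, which
    corresponds to the case where ToASCII fails.
    """
    return ireg_name.encode("idna").decode("ascii")


print(ireg_name_to_ascii("r\u00e9sum\u00e9.example.org"))
# prints xn--rsum-bpad.example.org
```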
+
+ An IRI with a scheme that is known to use domain names in ireg-name,
+ but where the scheme definition does not allow percent-encoding for
+ ireg-name, meets scheme-specific restrictions if either the
+ straightforward conversion or the conversion using the ToASCII
+ operation on ireg-name results in a URI that meets the
+ scheme-specific restrictions.
+
+ Such an IRI resolves to the URI obtained by converting the IRI using
+ the ToASCII operation on ireg-name. Implementations do not have to
+ do this conversion as long as they produce the same result.
+
+ Note: The difference between variants b and c in step 1 (using
+ normalization with NFC, versus not using any normalization)
+ accounts for the fact that in many non-Unicode character
+ encodings, some text cannot be represented directly. For example,
+ the word "Vietnam" is natively written "Vi&#x1EC7;t Nam"
+ (containing a LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW)
+ in NFC, but a direct transcoding from the windows-1258 character
+ encoding leads to "Vi&#xEA;&#x323;t Nam" (containing a LATIN SMALL
+ LETTER E WITH CIRCUMFLEX followed by a COMBINING DOT BELOW).
+ Direct transcoding of other 8-bit encodings of Vietnamese may lead
+ to other representations.
+
+ Note: The uniform treatment of the whole IRI in step 2 is important
+ to make processing independent of URI scheme. See [Gettys] for an
+ in-depth discussion.
+
+ Note: In practice, whether the general mapping (steps 1 and 2) or the
+ ToASCII operation of [RFC3490] is used for ireg-name will not be
+ noticed if mapping from IRI to URI and resolution are tightly
+ integrated (e.g., carried out in the same user agent). But
+ conversion using [RFC3490] may be able to better deal with
+ backwards compatibility issues in case mapping and resolution are
+ separated, as in the case of using an HTTP proxy.
+
+ Note: Internationalized Domain Names may be contained in parts of an
+ IRI other than the ireg-name part. It is the responsibility of
+ scheme-specific implementations (if the Internationalized Domain
+ Name is part of the scheme syntax) or of server-side
+ implementations (if the Internationalized Domain Name is part of
+ 'iquery') to apply the necessary conversions at the appropriate
+ point. Example: Trying to validate the Web page at
+ http://r&#xE9;sum&#xE9;.example.org would lead to an IRI of
+ http://validator.w3.org/check?uri=http%3A%2F%2Fr&#xE9;sum&#xE9;.
+ example.org, which would convert to a URI of
+ http://validator.w3.org/check?uri=http%3A%2F%2Fr%C3%A9sum%C3%A9.
+ example.org. The server side implementation would be responsible
+ for making the necessary conversions to be able to retrieve the
+ Web page.
+
+ Systems accepting IRIs MAY also deal with the printable characters in
+ US-ASCII that are not allowed in URIs, namely "<", ">", '"', space,
+ "{", "}", "|", "\", "^", and "`", in step 2 above. If these
+ characters are found but are not converted, then the conversion
+ SHOULD fail. Please note that the number sign ("#"), the percent
+ sign ("%"), and the square bracket characters ("[", "]") are not part
+ of the above list and MUST NOT be converted. Protocols and formats
+ that have used earlier definitions of IRIs including these characters
+ MAY require percent-encoding of these characters as a preprocessing
+ step to extract the actual IRI from a given field. This
+ preprocessing MAY also be used by applications allowing the user to
+ enter an IRI.
+
+ Note: In this process (in step 2.3), characters allowed in URI
+ references and existing percent-encoded sequences are not encoded
+ further. (This mapping is similar to, but different from, the
+ encoding applied when arbitrary content is included in some part
+ of a URI.) For example, an IRI of
+ "http://www.example.org/red%09ros&#xE9;#red" (in XML notation) is
+ converted to
+ "http://www.example.org/red%09ros%C3%A9#red", not to something
+ like
+ "http%3A%2F%2Fwww.example.org%2Fred%2509ros%C3%A9%23red".
+
+ Note: Some older software transcoding to UTF-8 may produce illegal
+ output for some input, in particular for characters outside the
+ BMP (Basic Multilingual Plane). As an example, for the IRI with
+ non-BMP characters (in XML Notation):
+ "http://example.com/&#x10300;&#x10301;&#x10302;"
+ which contains the first three letters of the Old Italic alphabet,
+ the correct conversion to a URI is
+ "http://example.com/%F0%90%8C%80%F0%90%8C%81%F0%90%8C%82"
+
+3.2. Converting URIs to IRIs
+
+ In some situations, converting a URI into an equivalent IRI may be
+ desirable. This section gives a procedure for this conversion. The
+ conversion described in this section will always result in an IRI
+ that maps back to the URI used as an input for the conversion (except
+ for potential case differences in percent-encoding and for potential
+ percent-encoded unreserved characters). However, the IRI resulting
+ from this conversion may not be exactly the same as the original IRI
+ (if there ever was one).
+
+ URI-to-IRI conversion removes percent-encodings, but not all
+ percent-encodings can be eliminated. There are several reasons for
+ this:
+
+ 1. Some percent-encodings are necessary to distinguish percent-
+ encoded and unencoded uses of reserved characters.
+
+ 2. Some percent-encodings cannot be interpreted as sequences of
+ UTF-8 octets.
+
+ (Note: The octet patterns of UTF-8 are highly regular.
+ Therefore, there is a very high probability, but no guarantee,
+ that percent-encodings that can be interpreted as sequences of
+ UTF-8 octets actually originated from UTF-8. For a detailed
+ discussion, see [Duerst97].)
+
+ 3. The conversion may result in a character that is not appropriate
+ in an IRI. See sections 2.2, 4.1, and 6.1 for further details.
+
+ Conversion from a URI to an IRI is done by using the following steps
+ (or any other algorithm that produces the same result):
+
+ 1. Represent the URI as a sequence of octets in US-ASCII.
+
+ 2. Convert all percent-encodings ("%" followed by two hexadecimal
+ digits) to the corresponding octets, except those corresponding
+ to "%", characters in "reserved", and characters in US-ASCII not
+ allowed in URIs.
+
+ 3. Re-percent-encode any octet produced in step 2 that is not part
+ of a strictly legal UTF-8 octet sequence.
+
+ 4. Re-percent-encode all octets produced in step 3 that in UTF-8
+ represent characters that are not appropriate according to
+ sections 2.2, 4.1, and 6.1.
+
+ 5. Interpret the resulting octet sequence as a sequence of characters
+ encoded in UTF-8.
+
+ This procedure will convert as many percent-encoded characters as
+ possible to characters in an IRI. Because there are some choices
+ when step 4 is applied (see section 6.1), results may vary.
+
+ Conversions from URIs to IRIs MUST NOT use any character encoding
+ other than UTF-8 in steps 3 and 4, even if it might be possible to
+ guess from the context that another character encoding than UTF-8 was
+ used in the URI. For example, the URI
+ "http://www.example.org/r%E9sum%E9.html" might with some guessing be
+ interpreted to contain two e-acute characters encoded as iso-8859-1.
+ It must not be converted to an IRI containing these e-acute
+ characters. Otherwise, in the future the IRI will be mapped to
+ "http://www.example.org/r%C3%A9sum%C3%A9.html", which is a different
+ URI from "http://www.example.org/r%E9sum%E9.html".
+
+3.2.1. Examples
+
+ This section shows various examples of converting URIs to IRIs. Each
+ example shows the result after each of the steps 1 through 5 is
+ applied. XML Notation is used for the final result. Octets are
+ denoted by "<" followed by two hexadecimal digits followed by ">".
+
+ The following example contains the sequence "%C3%BC", which is a
+ strictly legal UTF-8 sequence, and which is converted into the actual
+ character U+00FC, LATIN SMALL LETTER U WITH DIAERESIS (also known as
+ u-umlaut).
+
+ 1. http://www.example.org/D%C3%BCrst
+
+ 2. http://www.example.org/D<c3><bc>rst
+
+ 3. http://www.example.org/D<c3><bc>rst
+
+ 4. http://www.example.org/D<c3><bc>rst
+
+ 5. http://www.example.org/D&#xFC;rst
+
+ The following example contains the sequence "%FC", which might
+ represent U+00FC, LATIN SMALL LETTER U WITH DIAERESIS, in the
+ iso-8859-1 character encoding. (It might represent other characters
+ in other character encodings. For example, the octet <fc> in
+ iso-8859-5 represents U+045C, CYRILLIC SMALL LETTER KJE.) Because
+ <fc> is not part of a strictly legal UTF-8 sequence, it is
+ re-percent-encoded in step 3.
+
+ 1. http://www.example.org/D%FCrst
+
+ 2. http://www.example.org/D<fc>rst
+
+ 3. http://www.example.org/D%FCrst
+
+ 4. http://www.example.org/D%FCrst
+
+ 5. http://www.example.org/D%FCrst
+
+ The following example contains "%e2%80%ae", which is the percent-
+ encoded UTF-8 character encoding of U+202E, RIGHT-TO-LEFT OVERRIDE.
+ Section 4.1 forbids the direct use of this character in an IRI.
+ Therefore, the corresponding octets are re-percent-encoded in step 4.
+ This example shows that the case (upper- or lowercase) of letters
+ used in percent-encodings may not be preserved. The example also
+ contains a punycode-encoded domain name label (xn--99zt52a), which is
+ not converted.
+
+ 1. http://xn--99zt52a.example.org/%e2%80%ae
+
+ 2. http://xn--99zt52a.example.org/<e2><80><ae>
+
+ 3. http://xn--99zt52a.example.org/<e2><80><ae>
+
+ 4. http://xn--99zt52a.example.org/%E2%80%AE
+
+ 5. http://xn--99zt52a.example.org/%E2%80%AE
+
+ Implementations with scheme-specific knowledge MAY convert
+ punycode-encoded domain name labels to the corresponding characters
+ by using the ToUnicode procedure. Thus, for the example above, the
+ label "xn--99zt52a" may be converted to U+7D0D U+8C46 (Japanese
+ Natto), leading to the overall IRI of
+ "http://&#x7D0D;&#x8C46;.example.org/%E2%80%AE".
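+
+ The five conversion steps of section 3.2 can be sketched as shown
+ below. This is an illustration under stated assumptions, not a
+ reference implementation: the name uri_to_iri and its helpers are
+ hypothetical, "appropriate" in step 4 is approximated as "in
+ 'ucschar' and not a bidirectional formatting character" (so private
+ use characters stay percent-encoded everywhere), and punycode
+ labels are left alone.
+

```python
import re

UNRESERVED = set("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~")
BIDI_FORMATTING = {0x200E, 0x200F, 0x202A, 0x202B, 0x202C, 0x202D, 0x202E}
TRIPLET = re.compile(r"%([0-9A-Fa-f]{2})")


def _is_ucschar(cp: int) -> bool:
    """Approximation of the 'ucschar' production of section 2.2."""
    if 0xA0 <= cp <= 0xD7FF or 0xF900 <= cp <= 0xFDCF or 0xFDF0 <= cp <= 0xFFEF:
        return True
    if 0x10000 <= cp <= 0xEFFFD:
        return (cp & 0xFFFF) <= 0xFFFD and not 0xE0000 <= cp <= 0xE0FFF
    return False


def _pct(octets: bytes) -> str:
    return "".join(f"%{b:02X}" for b in octets)


def _decode_run(octets: bytes) -> str:
    """Steps 3-5 for one run of decoded non-ASCII octets."""
    out, i = [], 0
    while i < len(octets):
        lead = octets[i]
        n = 2 if 0xC2 <= lead <= 0xDF else 3 if 0xE0 <= lead <= 0xEF else 4 if 0xF0 <= lead <= 0xF4 else 1
        chunk = octets[i:i + n]
        try:
            ch = chunk.decode("utf-8")  # strict: rejects overlongs, surrogates
        except UnicodeDecodeError:
            out.append(_pct(octets[i:i + 1]))  # step 3: not legal UTF-8
            i += 1
            continue
        cp = ord(ch)
        # Step 4: re-encode characters that are not appropriate in an IRI.
        out.append(ch if _is_ucschar(cp) and cp not in BIDI_FORMATTING else _pct(chunk))
        i += n
    return "".join(out)


def uri_to_iri(uri: str) -> str:
    uri.encode("ascii")  # step 1: a URI must be pure US-ASCII
    out, run, pos = [], bytearray(), 0

    def flush() -> None:
        if run:
            out.append(_decode_run(bytes(run)))
            run.clear()

    for m in TRIPLET.finditer(uri):
        if m.start() > pos:
            flush()
            out.append(uri[pos:m.start()])
        octet = int(m.group(1), 16)
        if octet >= 0x80:
            run.append(octet)  # step 2: candidate UTF-8 octet
        else:
            flush()
            # Step 2 exceptions: "%", reserved, and disallowed ASCII stay encoded.
            out.append(chr(octet) if chr(octet) in UNRESERVED else m.group(0))
        pos = m.end()
    flush()
    out.append(uri[pos:])
    return "".join(out)


print(uri_to_iri("http://xn--99zt52a.example.org/%e2%80%ae"))
# prints http://xn--99zt52a.example.org/%E2%80%AE
```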
+
+4. Bidirectional IRIs for Right-to-Left Languages
+
+ Some UCS characters, such as those used in the Arabic and Hebrew
+ scripts, have an inherent right-to-left (rtl) writing direction.
+ IRIs containing these characters (called bidirectional IRIs or Bidi
+ IRIs) require additional attention because of the non-trivial
+ relation between logical representation (used for digital
+ representation and for reading/spelling) and visual representation
+ (used for display/printing).
+
+ Because of the complex interaction between the logical
+ representation, the visual representation, and the syntax of a Bidi
+ IRI, a balance is needed between various requirements. The main
+ requirements are
+
+ 1. user-predictable conversion between visual and logical
+ representation;
+
+ 2. the ability to include a wide range of characters in various
+ parts of the IRI; and
+
+ 3. minor or no changes or restrictions for implementations.
+
+4.1. Logical Storage and Visual Presentation
+
+ When stored or transmitted in digital representation, bidirectional
+ IRIs MUST be in full logical order and MUST conform to the IRI syntax
+ rules (which includes the rules relevant to their scheme). This
+ ensures that bidirectional IRIs can be processed in the same way as
+ other IRIs.
+
+ Bidirectional IRIs MUST be rendered by using the Unicode
+ Bidirectional Algorithm [UNIV4], [UNI9]. Bidirectional IRIs MUST be
+ rendered in the same way as they would be if they were in a
+ left-to-right embedding; i.e., as if they were preceded by U+202A,
+ LEFT-TO-RIGHT EMBEDDING (LRE), and followed by U+202C, POP
+ DIRECTIONAL FORMATTING (PDF). Setting the embedding direction can
+ also be done in a higher-level protocol (e.g., the dir='ltr'
+ attribute in HTML).
+
+ There is no requirement to use the above embedding if the display is
+ still the same without the embedding. For example, a bidirectional
+ IRI in a text with left-to-right base directionality (such as used
+ for English or Cyrillic) that is preceded and followed by whitespace
+ and strong left-to-right characters does not need an embedding.
+ Also, a bidirectional relative IRI reference that only contains
+ strong right-to-left characters and weak characters and that starts
+ and ends with a strong right-to-left character and appears in a text
+ with right-to-left base directionality (such as used for Arabic or
+ Hebrew) and is preceded and followed by whitespace and strong
+ characters does not need an embedding.
+
+ In some other cases, using U+200E, LEFT-TO-RIGHT MARK (LRM), may be
+ sufficient to force the correct display behavior. However, the
+ details of the Unicode Bidirectional algorithm are not always easy to
+ understand. Implementers are strongly advised to err on the side of
+ caution and to use embedding in all cases where they are not
+ completely sure that the display behavior is unaffected without the
+ embedding.
+
+ The Unicode Bidirectional Algorithm ([UNI9], section 4.3) permits
+ higher-level protocols to influence bidirectional rendering. Such
+ changes by higher-level protocols MUST NOT be used if they change the
+ rendering of IRIs.
+
+ The bidirectional formatting characters that may be used before or
+ after the IRI to ensure correct display are not themselves part of
+ the IRI. IRIs MUST NOT contain bidirectional formatting characters
+ (LRM, RLM, LRE, RLE, LRO, RLO, and PDF). They affect the visual
+ rendering of the IRI but do not appear themselves. It would
+ therefore not be possible to input an IRI with such characters
+ correctly.
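+
+ A display layer might apply the two rules above roughly as in the
+ following sketch: reject IRIs that contain bidirectional formatting
+ characters, and wrap the IRI in LRE ... PDF only at rendering time.
+ The function names are hypothetical, and always adding the
+ embedding is the cautious choice recommended above, not a
+ requirement.
+

```python
LRE = "\u202A"  # LEFT-TO-RIGHT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING

# LRM, RLM, LRE, RLE, PDF, LRO, RLO: MUST NOT appear inside an IRI.
BIDI_FORMATTING = frozenset("\u200E\u200F\u202A\u202B\u202C\u202D\u202E")


def contains_bidi_formatting(iri: str) -> bool:
    return any(ch in BIDI_FORMATTING for ch in iri)


def embed_for_display(iri: str) -> str:
    """Return the string to hand to the renderer; the stored IRI is unchanged."""
    if contains_bidi_formatting(iri):
        raise ValueError("IRI contains a bidirectional formatting character")
    return LRE + iri + PDF
```

+ The embedding characters belong to the surrounding text, so they
+ have to be stripped again before the IRI is stored or transmitted.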
+
+4.2. Bidi IRI Structure
+
+ The Unicode Bidirectional Algorithm is designed mainly for running
+ text. To make sure that it does not affect the rendering of
+ bidirectional IRIs too much, some restrictions on bidirectional IRIs
+ are necessary. These restrictions are given in terms of delimiters
+ (structural characters, mostly punctuation such as "@", ".", ":", and
+ "/") and components (usually consisting mostly of letters and
+ digits).
+
+ The following syntax rules from section 2.2 correspond to components
+ for the purpose of Bidi behavior: iuserinfo, ireg-name, isegment,
+ isegment-nz, isegment-nz-nc, iquery, and ifragment.
+
+ Specifications that define the syntax of any of the above components
+ MAY divide them further and define smaller parts to be components
+ according to this document. As an example, the restrictions of
+ [RFC3490] on bidirectional domain names correspond to treating each
+ label of a domain name as a component for schemes with ireg-name as a
+ domain name. Even where the components are not defined formally, it
+ may be helpful to think about some syntax in terms of components and
+ to apply the relevant restrictions. For example, for the usual
+ name/value syntax in query parts, it is convenient to treat each name
+ and each value as a component. As another example, the extensions in
+ a resource name can be treated as separate components.
+
+ For each component, the following restrictions apply:
+
+ 1. A component SHOULD NOT use both right-to-left and left-to-right
+ characters.
+
+ 2. A component using right-to-left characters SHOULD start and end
+ with right-to-left characters.
+
+ The above restrictions are given as shoulds, rather than as musts.
+ For IRIs that are never presented visually, they are not relevant.
+ However, for IRIs in general, they are very important to ensure
+ consistent conversion between visual presentation and logical
+ representation, in both directions.
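+
+ The two restrictions can be checked mechanically for a single
+ component, as in the sketch below. It assumes that "right-to-left
+ characters" means Unicode bidirectional classes R and AL and that
+ "left-to-right characters" means class L; the function name
+ check_bidi_component is hypothetical, and splitting an IRI into
+ components is left to the caller.
+

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}


def check_bidi_component(component: str) -> list:
    """Return the section 4.2 restrictions that the component violates."""
    if not component:
        return []
    classes = [unicodedata.bidirectional(ch) for ch in component]
    has_rtl = any(c in RTL_CLASSES for c in classes)
    has_ltr = "L" in classes
    problems = []
    if has_rtl and has_ltr:
        problems.append("mixes right-to-left and left-to-right characters")
    if has_rtl and not (classes[0] in RTL_CLASSES and classes[-1] in RTL_CLASSES):
        problems.append("does not start and end with right-to-left characters")
    return problems


print(check_bidi_component("\u05d0\u05d1"))        # prints []
print(len(check_bidi_component("\u05d0\u05d11")))  # prints 1
```

+ The second call fails restriction 2 because the component ends in a
+ digit, which is the situation shown as "not allowed" in Example 8.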
+
+ Note: In some components, the above restrictions may actually be
+ strictly enforced. For example, [RFC3490] requires that these
+ restrictions apply to the labels of a host name for those schemes
+ where ireg-name is a host name. In some other components (for
+ example, path components) following these restrictions may not be
+ too difficult. For other components, such as parts of the query
+ part, it may be very difficult to enforce the restrictions because
+ the values of query parameters may be arbitrary character
+ sequences.
+
+ If the above restrictions cannot be satisfied otherwise, the affected
+ component can always be mapped to URI notation as described in
+ section 3.1. Please note that the whole component has to be mapped
+ (see also Example 9 below).
+
+4.3. Input of Bidi IRIs
+
+ Bidi input methods MUST generate Bidi IRIs in logical order while
+ rendering them according to section 4.1. During input, rendering
+ SHOULD be updated after every new character is input to avoid end-
+ user confusion.
+
+4.4. Examples
+
+ This section gives examples of bidirectional IRIs, in Bidi Notation.
+ It shows legal IRIs with the relationship between logical and visual
+ representation and explains how certain phenomena in this
+ relationship may look strange to somebody not familiar with
+ bidirectional behavior, but familiar to users of Arabic and Hebrew.
+ It also shows what happens if the restrictions given in section 4.2
+ are not followed. The examples below can be seen at [BidiEx], in
+ Arabic, Hebrew, and Bidi Notation variants.
+
+ To read the bidi text in the examples, read the visual representation
+ from left to right until you encounter a block of rtl text. Read the
+ rtl block (including slashes and other special characters) from right
+ to left, then continue at the next unread ltr character.
+
+ Example 1: A single component with rtl characters is inverted:
+ Logical representation: "http://ab.CDEFGH.ij/kl/mn/op.html"
+ Visual representation: "http://ab.HGFEDC.ij/kl/mn/op.html"
+ Components can be read one by one, and each component can be read in
+ its natural direction.
+
+ Example 2: More than one consecutive component with rtl characters is
+ inverted as a whole:
+ Logical representation: "http://ab.CDE.FGH/ij/kl/mn/op.html"
+ Visual representation: "http://ab.HGF.EDC/ij/kl/mn/op.html"
+ A sequence of rtl components is read rtl, in the same way as a
+ sequence of rtl words is read rtl in a bidi text.
+
+ Example 3: All components of an IRI (except for the scheme) are rtl.
+ All rtl components are inverted overall:
+ Logical representation: "http://AB.CD.EF/GH/IJ/KL?MN=OP;QR=ST#UV"
+ Visual representation: "http://VU#TS=RQ;PO=NM?LK/JI/HG/FE.DC.BA"
+ The whole IRI (except the scheme) is read rtl. Delimiters between
+ rtl components stay between the respective components; delimiters
+ between ltr and rtl components don't move.
+
+ Example 4: Each of several sequences of rtl components is inverted on
+ its own:
+ Logical representation: "http://AB.CD.ef/gh/IJ/KL.html"
+ Visual representation: "http://DC.BA.ef/gh/LK/JI.html"
+ Each sequence of rtl components is read rtl, in the same way as each
+ sequence of rtl words in an ltr text is read rtl.
+
+ Example 5: Example 2, applied to components of different kinds:
+ Logical representation: "http://ab.cd.EF/GH/ij/kl.html"
+ Visual representation: "http://ab.cd.HG/FE/ij/kl.html"
+ The inversion of the domain name label and the path component may be
+ unexpected, but it is consistent with other bidi behavior. For
+ reassurance that the domain component really is "ab.cd.EF", it may be
+ helpful to read aloud the visual representation following the bidi
+ algorithm. After "http://ab.cd." one reads the RTL block
+ "E-F-slash-G-H", which corresponds to the logical representation.
+
+ Example 6: Same as Example 5, with more rtl components:
+ Logical representation: "http://ab.CD.EF/GH/IJ/kl.html"
+ Visual representation: "http://ab.JI/HG/FE.DC/kl.html"
+ The inversion of the domain name labels and the path components may
+ be easier to identify because the delimiters also move.
+
+ Example 7: A single rtl component includes digits:
+ Logical representation: "http://ab.CDE123FGH.ij/kl/mn/op.html"
+ Visual representation: "http://ab.HGF123EDC.ij/kl/mn/op.html"
+ Numbers are written ltr in all cases but are treated as an additional
+ embedding inside a run of rtl characters. This is completely
+ consistent with usual bidirectional text.
+
+ Example 8 (not allowed): Numbers are at the start or end of an rtl
+ component:
+ Logical representation: "http://ab.cd.ef/GH1/2IJ/KL.html"
+ Visual representation: "http://ab.cd.ef/LK/JI1/2HG.html"
+ The sequence "1/2" is interpreted by the bidi algorithm as a
+ fraction, fragmenting the components and leading to confusion. There
+ are other characters that are interpreted in a special way close to
+ numbers; in particular, "+", "-", "#", "$", "%", ",", ".", and ":".
+
+ Example 9 (not allowed): The numbers in the previous example are
+ percent-encoded:
+ Logical representation: "http://ab.cd.ef/GH%31/%32IJ/KL.html"
+ Visual representation (Hebrew): "http://ab.cd.ef/%31HG/LK/JI%32.html"
+ Visual representation (Arabic): "http://ab.cd.ef/31%HG/%LK/JI32.html"
+ Depending on whether the uppercase letters represent Arabic or
+ Hebrew, the visual representation is different.
+
+ Example 10 (allowed but not recommended):
+ Logical representation: "http://ab.CDEFGH.123/kl/mn/op.html"
+ Visual representation: "http://ab.123.HGFEDC/kl/mn/op.html"
+ Components consisting of only numbers are allowed (it would be rather
+ difficult to prohibit them), but these may interact with adjacent RTL
+ components in ways that are not easy to predict.
+
+5. Normalization and Comparison
+
+ Note: The structure and much of the material for this section is
+ taken from section 6 of [RFC3986]; the differences are due to the
+ specifics of IRIs.
+
+ One of the most common operations on IRIs is simple comparison:
+ Determining whether two IRIs are equivalent without using the IRIs or
+ the mapped URIs to access their respective resource(s). A comparison
+ is performed whenever a response cache is accessed, a browser checks
+ its history to color a link, or an XML parser processes tags within a
+ namespace. Extensive normalization prior to comparison of IRIs may
+ be used by spiders and indexing engines to prune a search space or
+ reduce duplication of request actions and response storage.
+
+ IRI comparison is performed for some particular purpose. Protocols
+ or implementations that compare IRIs for different purposes will
+ often be subject to differing design trade-offs in regard to how
+ much effort should be spent in reducing aliased identifiers. This
+ section describes various methods that may be used to compare IRIs,
+ the trade-offs between them, and the types of applications that might
+ use them.
+
+5.1. Equivalence
+
+ Because IRIs exist to identify resources, presumably they should be
+ considered equivalent when they identify the same resource. However,
+ this definition of equivalence is not of much practical use, as there
+ is no way for an implementation to compare two resources unless it
+ has full knowledge or control of them. For this reason, determination
+ of equivalence or difference of IRIs is based on string comparison,
+ perhaps augmented by reference to additional rules provided by URI
+ scheme definitions. We use the terms "different" and "equivalent" to
+ describe the possible outcomes of such comparisons, but there are
+ many application-dependent versions of equivalence.
+
+ Even though it is possible to determine that two IRIs are equivalent,
+ IRI comparison is not sufficient to determine whether two IRIs
+ identify different resources. For example, an owner of two different
+ domain names could decide to serve the same resource from both,
+ resulting in two different IRIs. Therefore, comparison methods are
+ designed to minimize false negatives while strictly avoiding false
+ positives.
+
+ In testing for equivalence, applications should not directly compare
+ relative references; the references should be converted to their
+ respective target IRIs before comparison. When IRIs are compared to
+ select (or avoid) a network action, such as retrieval of a
+ representation, fragment components (if any) should be excluded from
+ the comparison.
+
+ Applications using IRIs as identity tokens with no relationship to a
+ protocol MUST use the Simple String Comparison (see section 5.3.1).
+ All other applications MUST select one of the comparison practices
+ from the Comparison Ladder (see section 5.3) or, after IRI-to-URI
+ conversion, select one of the comparison practices from the URI
+ comparison ladder in [RFC3986], section 6.2.
+
+5.2. Preparation for Comparison
+
+ Any kind of IRI comparison REQUIRES that all escapings or encodings
+ in the protocol or format that carries an IRI are resolved. This is
+ usually done when the protocol or format is parsed. Examples of such
+ escapings or encodings are entities and numeric character references
+ in [HTML4] and [XML1]. As an example,
+ "http://example.org/ros&eacute;" (in HTML),
+ "http://example.org/ros&#233;" (in HTML or XML), and
+ "http://example.org/ros&#xE9;" (in HTML or XML) are all resolved into
+ what is denoted in this document (see section 1.4) as
+ "http://example.org/ros&#xE9;" (the "&#xE9;" here standing for the
+ actual e-acute character, to compensate for the fact that this
+ document cannot contain non-ASCII characters).
+
+ Similar considerations apply to encodings such as Transfer Codings in
+ HTTP (see [RFC2616]) and Content Transfer Encodings in MIME
+ ([RFC2045]), although in these cases, the encoding is based not on
+ characters but on octets, and additional care is required to make
+ sure that characters, and not just arbitrary octets, are compared
+ (see section 5.3.1).
+
+5.3. Comparison Ladder
+
+ In practice, a variety of methods are used to test IRI equivalence.
+ These methods fall into a range distinguished by the amount of
+ processing required and the degree to which the probability of false
+ negatives is reduced. As noted above, false negatives cannot be
+ eliminated. In practice, their probability can be reduced, but this
+ reduction requires more processing and is not cost-effective for all
+ applications.
+
+ If this range of comparison practices is considered as a ladder, the
+ following discussion will climb the ladder, starting with practices
+ that are cheap but have a relatively higher chance of producing false
+ negatives, and proceeding to those that have higher computational
+ cost and lower risk of false negatives.
+
+5.3.1. Simple String Comparison
+
+ If two IRIs, when considered as character strings, are identical,
+ then it is safe to conclude that they are equivalent. This type of
+ equivalence test has very low computational cost and is in wide use
+ in a variety of applications, particularly in the domain of parsing.
+ It is also used when a definitive answer to the question of IRI
+ equivalence is needed that is independent of the scheme used and that
+ can be calculated quickly and without accessing a network. An
+ example of such a case is XML Namespaces ([XMLNamespace]).
+
+ Testing strings for equivalence requires some basic precautions. This
+ procedure is often referred to as "bit-for-bit" or "byte-for-byte"
+ comparison, which is potentially misleading. Testing strings for
+ equality is normally based on pair comparison of the characters that
+ make up the strings, starting from the first and proceeding until
+ both strings are exhausted and all characters are found to be equal,
+ until a pair of characters compares unequal, or until one of the
+ strings is exhausted before the other.
+
+ This character comparison requires that each pair of characters be
+ put in comparable encoding form. For example, should one IRI be
+ stored in a byte array in UTF-8 encoding form and the second in a
+ UTF-16 encoding form, bit-for-bit comparisons applied naively will
+ produce errors. It is better to speak of equality on a
+ character-for-character rather than on a byte-for-byte or bit-for-bit
+ basis. In practical terms, character-by-character comparisons should
+ be done codepoint by codepoint after conversion to a common character
+ encoding form. When comparing character by character, the comparison
+ function MUST NOT map IRIs to URIs, because such a mapping would
+ create additional spurious equivalences. It follows that an IRI
+ SHOULD NOT be modified when being transported if there is any chance
+ that this IRI might be used as an identifier.
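+
+ The difference between byte-for-byte and character-for-character
+ comparison can be seen in a short sketch. The helper name same_iri
+ is hypothetical; it assumes the caller knows the character encoding
+ of each stored copy.
+

```python
def same_iri(a: bytes, a_encoding: str, b: bytes, b_encoding: str) -> bool:
    """Simple String Comparison: decode both copies, then compare code points."""
    return a.decode(a_encoding) == b.decode(b_encoding)


iri = "http://example.org/ros\u00e9"
as_utf8 = iri.encode("utf-8")
as_utf16 = iri.encode("utf-16-le")

print(as_utf8 == as_utf16)                                  # prints False
print(same_iri(as_utf8, "utf-8", as_utf16, "utf-16-le"))    # prints True
# The mapped URI is a different character string, so it compares unequal.
print(iri == "http://example.org/ros%C3%A9")                # prints False
```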
+
+ False negatives are caused by the production and use of IRI aliases.
+ Unnecessary aliases can be reduced, regardless of the comparison
+ method, by consistently providing IRI references in an already
+ normalized form (i.e., a form identical to what would be produced
+ after normalization is applied, as described below). Protocols and
+ data formats often limit some IRI comparisons to simple string
+ comparison, based on the theory that people and implementations will,
+ in their own best interest, be consistent in providing IRI
+ references, or at least be consistent enough to negate any efficiency
+ that might be obtained from further normalization.
+
+5.3.2. Syntax-Based Normalization
+
+ Implementations may use logic based on the definitions provided by
+ this specification to reduce the probability of false negatives. This
+ processing is moderately higher in cost than character-for-character
+ string comparison. For example, an application using this approach
+ could reasonably consider the following two IRIs equivalent:
+
+ example://a/b/c/%7Bfoo%7D/ros&#xE9;
+ eXAMPLE://a/./b/../b/%63/%7bfoo%7d/ros%C3%A9
+
+ Web user agents, such as browsers, typically apply this type of IRI
+ normalization when determining whether a cached response is
+ available. Syntax-based normalization includes such techniques as
+ case normalization, character normalization, percent-encoding
+ normalization, and removal of dot-segments.
+
+
+
+
+
+Duerst & Suignard Standards Track [Page 24]
+
+RFC 3987 Internationalized Resource Identifiers January 2005
+
+
+5.3.2.1. Case Normalization
+
+ For all IRIs, the hexadecimal digits within a percent-encoding
+ triplet (e.g., "%3a" versus "%3A") are case-insensitive and therefore
+ should be normalized to use uppercase letters for the digits A - F.
+
+ When an IRI uses components of the generic syntax, the component
+ syntax equivalence rules always apply; namely, that the scheme and
+ US-ASCII only host are case insensitive and therefore should be
+ normalized to lowercase. For example, the URI
+ "HTTP://www.EXAMPLE.com/" is equivalent to "http://www.example.com/".
+ Case equivalence for non-ASCII characters in IRI components that are
+ IDNs is discussed in section 5.3.3. The other generic syntax
+ components are assumed to be case sensitive unless specifically
+ defined otherwise by the scheme.
+
+ Creating schemes that allow case-insensitive syntax components
+ containing non-ASCII characters should be avoided. Case normalization
+ of non-ASCII characters can be culturally dependent and is always a
+ complex operation. The only exception concerns non-ASCII host names
+ for which the character normalization includes a mapping step derived
+ from case folding.
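+
+ The case rules above can be sketched in Python. This is a minimal,
+ non-normative illustration; the function name normalize_case and
+ the simplified regular expression for "scheme://authority" are
+ assumptions of this sketch and do not come from this specification.
+ Non-ASCII hosts are deliberately left unchanged.

```python
import re

def normalize_case(iri: str) -> str:
    """Lowercase scheme and ASCII-only host; uppercase percent-encoding hex digits."""
    m = re.match(r"^([A-Za-z][A-Za-z0-9+.-]*)://([^/?#]*)(.*)$", iri, re.S)
    if m:
        scheme, authority, rest = m.groups()
        # Userinfo is kept as is; only the host (and numeric port) is lowercased.
        userinfo, at, hostport = authority.rpartition("@")
        if hostport.isascii():
            hostport = hostport.lower()
        iri = f"{scheme.lower()}://{userinfo}{at}{hostport}{rest}"
    # Uppercase the hex digits A-F in every percent-encoding triplet.
    return re.sub(r"%[0-9A-Fa-f]{2}", lambda t: t.group(0).upper(), iri)

print(normalize_case("HTTP://www.EXAMPLE.com/%7efoo"))
```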
+
+5.3.2.2. Character Normalization
+
+ The Unicode Standard [UNIV4] defines various equivalences between
+ sequences of characters for various purposes. Unicode Standard Annex
+ #15 [UTR15] defines various Normalization Forms for these
+ equivalences, in particular Normalization Form C (NFC, Canonical
+ Decomposition, followed by Canonical Composition) and Normalization
+ Form KC (NFKC, Compatibility Decomposition, followed by Canonical
+ Composition).
+
+ Equivalence of IRIs MUST rely on the assumption that IRIs are
+ appropriately pre-character-normalized rather than apply character
+ normalization when comparing two IRIs. The exceptions are conversion
+ from a non-digital form, and conversion from a non-UCS-based
+ character encoding to a UCS-based character encoding. In these cases,
+ NFC or a normalizing transcoder using NFC MUST be used for
+ interoperability. To avoid false negatives and problems with
+ transcoding, IRIs SHOULD be created by using NFC. Using NFKC may
+ avoid even more problems; for example, by choosing half-width Latin
+ letters instead of full-width ones, and full-width instead of
+ half-width Katakana.
+
+ As an example, "http://www.example.org/r&#xE9;sum&#xE9;.html" (in XML
+ Notation) is in NFC. On the other hand,
+ "http://www.example.org/re&#x301;sume&#x301;.html" is not in NFC.
+ The former uses precombined e-acute characters, and the latter uses
+ "e" characters followed by combining acute accents. Both usages are
+ defined as canonically equivalent in [UNIV4].
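+
+ The difference between the two forms can be checked with Python's
+ standard unicodedata module. This is an illustrative sketch only;
+ the helper name is_nfc is an assumption of this sketch.

```python
import unicodedata

# Precombined e-acute (U+00E9) versus "e" followed by combining acute (U+0301).
precomposed = "http://www.example.org/r\u00E9sum\u00E9.html"
decomposed = "http://www.example.org/re\u0301sume\u0301.html"

def is_nfc(text: str) -> bool:
    """Return True if the text is unchanged by normalization to NFC."""
    return unicodedata.normalize("NFC", text) == text

print(is_nfc(precomposed), is_nfc(decomposed))
```

+ The two strings differ codepoint by codepoint, which is why a
+ comparison without prior normalization treats them as different.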
+
+ Note: Because it is unknown how a particular sequence of characters
+ is being treated with respect to character normalization, it would
+ be inappropriate to allow third parties to normalize an IRI
+ arbitrarily. This does not contradict the recommendation that
+ when a resource is created, its IRI should be as character
+ normalized as possible (i.e., NFC or even NFKC). This is similar
+ to the uppercase/lowercase problems. Some parts of a URI are case
+ insensitive (domain name). For others, it is unclear whether they
+ are case sensitive, case insensitive, or something in between
+ (e.g., case sensitive, but with a multiple choice selection if the
+ wrong case is used, instead of a direct negative result). The
+ best recipe is that the creator use a reasonable capitalization
+ and, when transferring the URI, capitalization never be changed.
+
+ Various IRI schemes may allow the usage of Internationalized Domain
+ Names (IDN) [RFC3490] either in the ireg-name part or elsewhere.
+ Character Normalization also applies to IDNs, as discussed in section
+ 5.3.3.
+
+5.3.2.3. Percent-Encoding Normalization
+
+ The percent-encoding mechanism (section 2.1 of [RFC3986]) is a
+ frequent source of variance among otherwise identical IRIs. In
+ addition to the case normalization issue noted above, some IRI
+ producers percent-encode octets that do not require percent-encoding,
+ resulting in IRIs that are equivalent to their non-encoded
+ counterparts. These IRIs should be normalized by decoding any
+ percent-encoded octet sequence that corresponds to an unreserved
+ character, as described in section 2.3 of [RFC3986].
+
+ For actual resolution, differences in percent-encoding (except for
+ the percent-encoding of reserved characters) MUST always result in
+ the same resource. For example, "http://example.org/~user",
+ "http://example.org/%7euser", and "http://example.org/%7Euser", must
+ resolve to the same resource.
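+
+ A minimal sketch of this normalization in Python follows. The
+ function name normalize_percent_encoding is hypothetical; the set of
+ unreserved characters is taken from section 2.3 of [RFC3986]. The
+ result is meant for local comparison only.

```python
import re
import string

# Unreserved characters per RFC 3986, section 2.3.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

def normalize_percent_encoding(iri: str) -> str:
    """Decode triplets for unreserved characters; uppercase all other triplets."""
    def fix(match):
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else match.group(0).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", fix, iri)

forms = [
    "http://example.org/~user",
    "http://example.org/%7euser",
    "http://example.org/%7Euser",
]
print({normalize_percent_encoding(f) for f in forms})
```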
+
+ If this kind of equivalence is to be tested, the percent-encoding of
+ both IRIs to be compared has to be aligned; for example, by
+ converting both IRIs to URIs (see section 3.1), eliminating escape
+ differences in the resulting URIs, and making sure that the case of
+ the hexadecimal characters in the percent-encoding is always the same
+ (preferably uppercase). If the IRI is to be passed to another
+ application or used further in some other way, its original form MUST
+ be preserved. The conversion described here should be performed only
+ for local comparison.
+
+5.3.2.4. Path Segment Normalization
+
+ The complete path segments "." and ".." are intended only for use
+ within relative references (section 4.1 of [RFC3986]) and are removed
+ as part of the reference resolution process (section 5.2 of
+ [RFC3986]). However, some implementations may incorrectly assume
+ that reference resolution is not necessary when the reference is
+ already an IRI, and thus fail to remove dot-segments when they occur
+ in non-relative paths. IRI normalizers should remove dot-segments by
+ applying the remove_dot_segments algorithm to the path, as described
+ in section 5.2.4 of [RFC3986].
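+
+ For illustration, the remove_dot_segments algorithm referenced above
+ might be written in Python as follows. This sketch follows the steps
+ of section 5.2.4 of [RFC3986] but is not a reference
+ implementation.

```python
def remove_dot_segments(path: str) -> str:
    """Remove "." and ".." segments following RFC 3986, section 5.2.4."""
    output = []
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]            # "/./" becomes "/"
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]            # "/../" becomes "/"
            if output:
                output.pop()           # drop the previous segment
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with its leading "/", if any) to the output.
            start = 1 if path.startswith("/") else 0
            end = path.find("/", start)
            if end == -1:
                end = len(path)
            output.append(path[:end])
            path = path[end:]
    return "".join(output)

print(remove_dot_segments("/a/./b/../b/%63/"))
```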
+
+5.3.3. Scheme-Based Normalization
+
+ The syntax and semantics of IRIs vary from scheme to scheme, as
+ described by the defining specification for each scheme.
+ Implementations may use scheme-specific rules, at further processing
+ cost, to reduce the probability of false negatives. For example,
+ because the "http" scheme makes use of an authority component, has a
+ default port of "80", and defines an empty path to be equivalent to
+ "/", the following four IRIs are equivalent:
+
+ http://example.com
+ http://example.com/
+ http://example.com:/
+ http://example.com:80/
+
+ In general, an IRI that uses the generic syntax for authority with an
+ empty path should be normalized to a path of "/". Likewise, an
+ explicit ":port", for which the port is empty or the default for the
+ scheme, is equivalent to one where the port and its ":" delimiter are
+ elided and thus should be removed by scheme-based normalization. For
+ example, the second IRI above is the normal form for the "http"
+ scheme.
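+
+ As a non-normative sketch, the "http" rules above could be applied
+ in Python as shown below. The function name normalize_http is
+ hypothetical, the sketch handles only the lowercase "http" scheme,
+ and it assumes that case normalization has already been applied.

```python
import re

def normalize_http(iri: str) -> str:
    """Drop an empty or default (80) port and normalize an empty path to "/"."""
    m = re.match(r"^http://([^/?#]*)([^?#]*)(.*)$", iri, re.S)
    if not m:
        return iri                     # not an "http" IRI; leave untouched
    authority, path, rest = m.groups()
    host, colon, port = authority.rpartition(":")
    if colon and port in ("", "80"):
        authority = host               # remove ":" or ":80"
    if path == "":
        path = "/"
    return f"http://{authority}{path}{rest}"

for form in ("http://example.com", "http://example.com/",
             "http://example.com:/", "http://example.com:80/"):
    print(normalize_http(form))
```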
+
+ Another case where normalization varies by scheme is in the handling
+ of an empty authority component or empty host subcomponent. For many
+ scheme specifications, an empty authority or host is considered an
+ error; for others, it is considered equivalent to "localhost" or the
+ end-user's host. When a scheme defines a default for authority and
+ an IRI reference to that default is desired, the reference should be
+ normalized to an empty authority for the sake of uniformity, brevity,
+ and internationalization. If, however, either the userinfo or port
+ subcomponents are non-empty, then the host should be given explicitly
+ even if it matches the default.
+
+ Normalization should not remove delimiters when their associated
+ component is empty unless it is licensed to do so by the scheme
+ specification. For example, the IRI "http://example.com/?" cannot be
+ assumed to be equivalent to any of the examples above. Likewise, the
+ presence or absence of delimiters within a userinfo subcomponent is
+ usually significant to its interpretation. The fragment component is
+ not subject to any scheme-based normalization; thus, two IRIs that
+ differ only by the suffix "#" are considered different regardless of
+ the scheme.
+
+ Some IRI schemes may allow the usage of Internationalized Domain
+ Names (IDN) [RFC3490] either in their ireg-name part or elsewhere.
+ When in use in IRIs, those names SHOULD be validated by using the
+ ToASCII operation defined in [RFC3490], with the flags
+ "UseSTD3ASCIIRules" and "AllowUnassigned". An IRI containing an
+ invalid IDN cannot successfully be resolved. Validated IDN
+ components of IRIs SHOULD be character normalized by using the
+ Nameprep process [RFC3491]; however, for legibility purposes, they
+ SHOULD NOT be converted into ASCII Compatible Encoding (ACE).
+
+ Scheme-based normalization may also consider IDN components and their
+ conversions to punycode as equivalent. As an example,
+ "http://r&#xE9;sum&#xE9;.example.org" may be considered equivalent to
+ "http://xn--rsum-bpad.example.org".
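+
+ The equivalence in this example can be reproduced with the "idna"
+ codec in Python's standard library, which implements [RFC3490]. The
+ sketch below is illustrative only and compares host names rather
+ than complete IRIs.

```python
# Host names from the example above, in Unicode and in ACE form.
unicode_host = "r\u00E9sum\u00E9.example.org"
ace_host = "xn--rsum-bpad.example.org"

# ToASCII: convert each label to its ASCII Compatible Encoding.
converted = unicode_host.encode("idna").decode("ascii")
print(converted)

# ToUnicode: convert the ACE form back for display.
print(ace_host.encode("ascii").decode("idna"))
```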
+
+ Other scheme-specific normalizations are possible.
+
+5.3.4. Protocol-Based Normalization
+
+ Substantial effort to reduce the incidence of false negatives is
+ often cost-effective for web spiders. Consequently, they implement
+ even more aggressive techniques in IRI comparison. For example, if
+ they observe that an IRI such as
+
+ http://example.com/data
+
+ redirects to an IRI differing only in the trailing slash
+
+ http://example.com/data/
+
+ they will likely regard the two as equivalent in the future. This
+ kind of technique is only appropriate when equivalence is clearly
+ indicated by both the result of accessing the resources and the
+ common conventions of their scheme's dereference algorithm (in this
+ case, use of redirection by HTTP origin servers to avoid problems
+ with relative references).
+
+6. Use of IRIs
+
+6.1. Limitations on UCS Characters Allowed in IRIs
+
+ This section discusses limitations on characters and character
+ sequences usable for IRIs beyond those given in section 2.2 and
+ section 4.1. The considerations in this section are relevant when
+ IRIs are created and when URIs are converted to IRIs.
+
+ a. The repertoire of characters allowed in each IRI component is
+ limited by the definition of that component. For example, the
+ definition of the scheme component does not allow characters
+ beyond US-ASCII.
+
+ (Note: In accordance with URI practice, generic IRI software
+ cannot and should not check for such limitations.)
+
+ b. The UCS contains many areas of characters for which there are
+ strong visual look-alikes. Because of the likelihood of
+ transcription errors, these also should be avoided. This
+ includes the full-width equivalents of Latin characters,
+ half-width Katakana characters for Japanese, and many others. It
+ also includes many look-alikes of "space", "delims", and
+ "unwise", characters excluded in [RFC3491].
+
+ Additional information is available from [UNIXML]. [UNIXML] is
+ written in the context of running text rather than in that of
+ identifiers. Nevertheless, it discusses many of the categories of
+ characters not appropriate for IRIs.
+
+6.2. Software Interfaces and Protocols
+
+ Although an IRI is defined as a sequence of characters, software
+ interfaces for URIs typically function on sequences of octets or
+ other kinds of code units. Thus, software interfaces and protocols
+ MUST define which character encoding is used.
+
+ Intermediate software interfaces between IRI-capable components and
+ URI-only components MUST map the IRIs per section 3.1, when
+ transferring from IRI-capable to URI-only components. This mapping
+ SHOULD be applied as late as possible. It SHOULD NOT be applied
+ between components that are known to be able to handle IRIs.
+
+6.3. Format of URIs and IRIs in Documents and Protocols
+
+ Document formats that transport URIs may have to be upgraded to allow
+ the transport of IRIs. In cases where the document as a whole has a
+ native character encoding, IRIs MUST also be encoded in this
+ character encoding and converted accordingly by a parser or
+ interpreter. IRI characters not expressible in the native character
+ encoding SHOULD be escaped by using the escaping conventions of the
+ document format if such conventions are available. Alternatively,
+ they MAY be percent-encoded according to section 3.1. For example, in
+ HTML or XML, numeric character references SHOULD be used. If a
+ document as a whole has a native character encoding and that
+ character encoding is not UTF-8, then IRIs MUST NOT be placed into
+ the document in the UTF-8 character encoding.
+
+ Note: Some formats already accommodate IRIs, although they use
+ different terminology. HTML 4.0 [HTML4] defines the conversion from
+ IRIs to URIs as error-avoiding behavior. XML 1.0 [XML1], XLink
+ [XLink], XML Schema [XMLSchema], and specifications based upon them
+ allow IRIs. Also, it is expected that all relevant new W3C formats
+ and protocols will be required to handle IRIs [CharMod].
+
+6.4. Use of UTF-8 for Encoding Original Characters
+
+ This section discusses details and gives examples for point c) in
+ section 1.2. To be able to use IRIs, the URI corresponding to the
+ IRI in question has to encode original characters into octets by
+ using UTF-8. This can be specified for all URIs of a URI scheme or
+ can apply to individual URIs for schemes that do not specify how to
+ encode original characters. It can apply to the whole URI, or only
+ to some part. For background information on encoding characters into
+ URIs, see also section 2.5 of [RFC3986].
+
+ For new URI schemes, using UTF-8 is recommended in [RFC2718].
+ Examples where UTF-8 is already used are the URN syntax [RFC2141],
+ IMAP URLs [RFC2192], and POP URLs [RFC2384]. On the other hand,
+ because the HTTP URL scheme does not specify how to encode original
+ characters, only some HTTP URLs can have corresponding but different
+ IRIs.
+
+ For example, for a document with a URI of
+ "http://www.example.org/r%C3%A9sum%C3%A9.html", it is possible to
+ construct a corresponding IRI (in XML notation; see section 1.4):
+ "http://www.example.org/r&#xE9;sum&#xE9;.html" ("&#xE9;" stands for
+ the e-acute character, and "%C3%A9" is the UTF-8 encoded and
+ percent-encoded representation of that character). On the other
+ hand, for a document with a URI of
+ "http://www.example.org/r%E9sum%E9.html", the percent-encoding octets
+ cannot be converted to actual characters in an IRI, as the
+ percent-encoding is not based on UTF-8.
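+
+ Whether the percent-encoded octets of a URI form valid UTF-8 can be
+ tested mechanically. The Python sketch below is illustrative only;
+ the function name octets_are_utf8 is an assumption of this sketch,
+ and it is a validity check, not the full conversion of section 3.2.

```python
from urllib.parse import unquote_to_bytes

def octets_are_utf8(uri_part: str) -> bool:
    """Return True if the percent-decoded octets form a valid UTF-8 sequence."""
    try:
        unquote_to_bytes(uri_part).decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False

print(octets_are_utf8("r%C3%A9sum%C3%A9.html"))  # UTF-8 based
print(octets_are_utf8("r%E9sum%E9.html"))        # iso-8859-1 based
```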
+
+ This means that for most URI schemes, there is no need to upgrade
+ their scheme definition in order for them to work with IRIs. The
+ main case where upgrading makes sense is when a scheme definition, or
+ a particular component of a scheme, is strictly limited to the use of
+ US-ASCII characters with no provision to include non-ASCII
+ characters/octets via percent-encoding, or if a scheme definition
+ currently uses highly scheme-specific provisions for the encoding of
+ non-ASCII characters. An example of this is the mailto: scheme
+ [RFC2368].
+
+ This specification does not upgrade any scheme specifications in any
+ way; this has to be done separately. Also, note that there is no
+ such thing as an "IRI scheme"; all IRIs use URI schemes, and all URI
+ schemes can be used with IRIs, even though in some cases only by
+ using URIs directly as IRIs, without any conversion.
+
+ URI schemes can impose restrictions on the syntax of scheme-specific
+ URIs; i.e., URIs that are admissible under the generic URI syntax
+ [RFC3986] may not be admissible due to narrower syntactic constraints
+ imposed by a URI scheme specification. URI scheme definitions cannot
+ broaden the syntactic restrictions of the generic URI syntax;
+ otherwise, it would be possible to generate URIs that satisfied the
+ scheme-specific syntactic constraints without satisfying the
+ syntactic constraints of the generic URI syntax. However, additional
+ syntactic constraints imposed by URI scheme specifications are
+ applicable to IRIs, as the corresponding URI resulting from the
+ mapping defined in section 3.1 MUST be a valid URI under the
+ syntactic restrictions of generic URI syntax and any narrower
+ restrictions imposed by the corresponding URI scheme specification.
+
+ The requirement for the use of UTF-8 applies to all parts of a URI
+ (with the potential exception of the ireg-name part; see section
+ 3.1). However, it is possible that the capability of IRIs to
+ represent a wide range of characters directly is used just in some
+ parts of the IRI (or IRI reference). The other parts of the IRI may
+ only contain US-ASCII characters, or they may not be based on UTF-8.
+ They may be based on another character encoding, or they may directly
+ encode raw binary data (see also [RFC2397]).
+
+ For example, it is possible to have a URI reference of
+ "http://www.example.org/r%E9sum%E9.xml#r%C3%A9sum%C3%A9", where the
+ document name is encoded in iso-8859-1 based on server settings, but
+ where the fragment identifier is encoded in UTF-8 according to
+ [XPointer]. The IRI corresponding to the above URI would be (in XML
+ notation)
+ "http://www.example.org/r%E9sum%E9.xml#r&#xE9;sum&#xE9;".
+
+ Similar considerations apply to query parts. The functionality of
+ IRIs (namely, to be able to include non-ASCII characters) can only be
+ used if the query part is encoded in UTF-8.
+
+6.5. Relative IRI References
+
+ Processing of relative IRI references against a base is handled
+ straightforwardly; the algorithms of [RFC3986] can be applied
+ directly, treating the characters additionally allowed in IRI
+ references in the same way that unreserved characters are in URI
+ references.
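+
+ As an illustration, Python's urllib.parse.urljoin, which implements
+ the reference resolution of [RFC3986], can be applied to strings
+ containing non-ASCII characters without any prior conversion. The
+ base and the relative references below are made-up examples.

```python
from urllib.parse import urljoin

# Hypothetical base IRI containing non-ASCII characters.
base = "http://www.example.org/docs/r\u00E9sum\u00E9.html"

# A relative path reference and a fragment-only reference.
print(urljoin(base, "../images/caf\u00E9.png"))
print(urljoin(base, "#r\u00E9sum\u00E9"))
```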
+
+7. URI/IRI Processing Guidelines (Informative)
+
+ This informative section provides guidelines for supporting IRIs in
+ the same software components and operations that currently process
+ URIs: Software interfaces that handle URIs, software that allows
+ users to enter URIs, software that creates or generates URIs,
+ software that displays URIs, formats and protocols that transport
+ URIs, and software that interprets URIs. These may all require
+ modification before functioning properly with IRIs. The
+ considerations in this section also apply to URI references and IRI
+ references.
+
+7.1. URI/IRI Software Interfaces
+
+ Software interfaces that handle URIs, such as URI-handling APIs and
+ protocols transferring URIs, need interfaces and protocol elements
+ that are designed to carry IRIs.
+
+ In case the current handling in an API or protocol is based on
+ US-ASCII, UTF-8 is recommended as the character encoding for IRIs, as
+ it is compatible with US-ASCII, is in accordance with the
+ recommendations of [RFC2277], and makes converting to URIs easy. In
+ any case, the API or protocol definition must clearly define the
+ character encoding to be used.
+
+ The transfer from URI-only to IRI-capable components requires no
+ mapping, although the conversion described in section 3.2 above may
+ be performed. It is preferable not to perform this inverse
+ conversion when there is a chance that this cannot be done correctly.
+
+7.2. URI/IRI Entry
+
+ Some components allow users to enter URIs into the system by typing
+ or dictation, for example. This software must be updated to allow
+ for IRI entry.
+
+ A person viewing a visual representation of an IRI (as a sequence of
+ glyphs, in some order, in some visual display) or hearing an IRI will
+ use an entry method for characters in the user's language to input
+ the IRI. Depending on the script and the input method used, this may
+ be a more or less complicated process.
+
+ The process of IRI entry must ensure, as much as possible, that the
+ restrictions defined in section 2.2 are met. This may be done by
+ choosing appropriate input methods or variants/settings thereof, by
+ appropriately converting the characters being input, by eliminating
+ characters that cannot be converted, and/or by issuing a warning or
+ error message to the user.
+
+ As an example of variant settings, input method editors for East
+ Asian Languages usually allow the input of Latin letters and related
+ characters in full-width or half-width versions. For IRI input, the
+ input method editor should be set so that it produces half-width
+ Latin letters and punctuation and full-width Katakana.
+
+ An input field primarily or solely used for the input of URIs/IRIs
+ may allow the user to view an IRI as it is mapped to a URI. Places
+ where the input of IRIs is frequent may provide the possibility for
+ viewing an IRI as mapped to a URI. This will help users when some of
+ the software they use does not yet accept IRIs.
+
+ An IRI input component interfacing to components that handle URIs,
+ but not IRIs, must map the IRI to a URI before passing it to these
+ components.
+
+ For the input of IRIs with right-to-left characters, please see
+ section 4.3.
+
+7.3. URI/IRI Transfer between Applications
+
+ Many applications, particularly mail user agents, try to detect URIs
+ appearing in plain text. For this, they use some heuristics based on
+ URI syntax. They then allow the user to click on such URIs and
+ retrieve the corresponding resource in an appropriate (usually
+ scheme-dependent) application.
+
+ Such applications have to be upgraded to use the IRI syntax as a base
+ for heuristics. In particular, a non-ASCII character should not be
+ taken as the indication of the end of an IRI. Such applications also
+ have to make sure that they correctly convert the detected IRI from
+ the character encoding of the document or application where the IRI
+ appears to the character encoding used by the system-wide IRI
+ invocation mechanism, or to a URI (according to section 3.1) if the
+ system-wide invocation mechanism only accepts URIs.
+
+ The clipboard is another frequently used way to transfer URIs and
+ IRIs from one application to another. On most platforms, the
+ clipboard is able to store and transfer text in many languages and
+ scripts. Correctly used, the clipboard transfers characters, not
+ bytes, which will do the right thing with IRIs.
+
+7.4. URI/IRI Generation
+
+ Systems that offer resources through the Internet, where those
+ resources have logical names, sometimes automatically generate URIs
+ for the resources they offer. For example, some HTTP servers can
+ generate a directory listing for a file directory and then respond to
+ the generated URIs with the files.
+
+ Many legacy character encodings are in use in various file systems.
+ Many currently deployed systems do not transform the local character
+ representation of the underlying system before generating URIs.
+
+ For maximum interoperability, systems that generate resource
+ identifiers should make the appropriate transformations. For
+ example, if a file system contains a file named
+ "r&#xE9;sum&#xE9;.html", a server should expose this as
+ "r%C3%A9sum%C3%A9.html" in a URI, which allows use of
+ "r&#xE9;sum&#xE9;.html" in an IRI, even if locally the file name is
+ kept in a character encoding other than UTF-8.
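+
+ A server-side transformation of this kind might look as follows in
+ Python. This is a sketch under stated assumptions: the function name
+ uri_segment_from_local_name is hypothetical, and the local file
+ system is assumed to store names as iso-8859-1 octets.

```python
from urllib.parse import quote

def uri_segment_from_local_name(raw_name: bytes, local_encoding: str) -> str:
    """Transcode a locally encoded file name and percent-encode it as UTF-8."""
    name = raw_name.decode(local_encoding)   # octets -> characters
    return quote(name, safe="")              # characters -> UTF-8 -> %HH

# "r&#xE9;sum&#xE9;.html" as stored in iso-8859-1 on the local file system.
local_name = b"r\xe9sum\xe9.html"
print(uri_segment_from_local_name(local_name, "iso-8859-1"))
```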
+
+ This recommendation particularly applies to HTTP servers. For FTP
+ servers, similar considerations apply; see [RFC2640].
+
+7.5. URI/IRI Selection
+
+ In some cases, resource owners and publishers have control over the
+ IRIs used to identify their resources. This control is mostly
+ executed by controlling the resource names, such as file names,
+ directly.
+
+ In these cases, it is recommended to avoid choosing IRIs that are
+ easily confused. For example, for US-ASCII, the lower-case ell ("l")
+ is easily confused with the digit one ("1"), and the upper-case oh
+ ("O") is easily confused with the digit zero ("0"). Publishers
+ should avoid confusing users with "br0ken" or "1ame" identifiers.
+
+ Outside the US-ASCII repertoire, there are many more opportunities
+ for confusion; a complete set of guidelines is too lengthy to include
+ here. As long as names are limited to characters from a single
+ script, native writers of a given script or language will know best
+ when ambiguities can appear, and how they can be avoided. What may
+ look ambiguous to a stranger may be completely obvious to the average
+ native user. On the other hand, in some cases, the UCS contains
+ variants for compatibility reasons; for example, for typographic
+ purposes. These should be avoided wherever possible. Although there
+ may be exceptions, newly created resource names should generally be
+ in NFKC [UTR15] (which means that they are also in NFC).
+
+ As an example, the UCS contains the "fi" ligature at U+FB01 for
+ compatibility reasons. Wherever possible, IRIs should use the two
+ letters "f" and "i" rather than the "fi" ligature. An example where
+ the latter may be used is in the query part of an IRI for an explicit
+ search for a word written containing the "fi" ligature.
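+
+ The effect of NFKC on the ligature can be observed with Python's
+ unicodedata module. The sample name below is made up for this
+ illustration.

```python
import unicodedata

# Hypothetical resource name that begins with the "fi" ligature (U+FB01).
with_ligature = "\uFB01le.html"

# NFC keeps the compatibility character; NFKC replaces it with "f" + "i".
print(unicodedata.normalize("NFC", with_ligature) == with_ligature)
print(unicodedata.normalize("NFKC", with_ligature))
```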
+
+ In certain cases, there is a chance that characters from different
+ scripts look the same. The best known example is the similarity of
+ the Latin "A", the Greek "Alpha", and the Cyrillic "A". To avoid
+ such cases, IRIs should be created only where all the characters in a
+ single component are used together in a given language. This usually
+ means that all of these characters will be from the same script, but
+ there are languages that mix characters from different scripts (such
+ as Japanese). This is similar to the heuristics used to distinguish
+ between letters and numbers in the examples above. Also, for Latin,
+ Greek, and Cyrillic, using lowercase letters results in fewer
+ ambiguities than using uppercase letters would.
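+
+ A crude check for mixed scripts within a component can be sketched
+ in Python. The heuristic used here, taking the first word of each
+ letter's Unicode character name as its script, is an assumption of
+ this sketch and not a complete script classification; the function
+ name scripts_used is likewise hypothetical.

```python
import unicodedata

def scripts_used(component: str) -> set:
    """Approximate the scripts of the letters in a component by character name."""
    scripts = set()
    for char in component:
        if char.isalpha():
            name = unicodedata.name(char, "")
            if name:
                # e.g., "CYRILLIC CAPITAL LETTER A" -> "CYRILLIC"
                scripts.add(name.split()[0])
    return scripts

# Latin "A", Greek "Alpha" (U+0391), and Cyrillic "A" (U+0410) look alike.
print(sorted(scripts_used("A\u0391\u0410")))
```

+ A component that yields more than one script may deserve a closer
+ look, although languages such as Japanese legitimately mix scripts.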
+
+7.6. Display of URIs/IRIs
+
+ In situations where the rendering software is not expected to display
+ non-ASCII parts of the IRI correctly using the available layout and
+ font resources, these parts should be percent-encoded before being
+ displayed.
+
+ For display of Bidi IRIs, please see section 4.1.
+
+7.7. Interpretation of URIs and IRIs
+
+ Software that interprets IRIs as the names of local resources should
+ accept IRIs in multiple forms and convert and match them with the
+ appropriate local resource names.
+
+ First, multiple representations include both IRIs in the native
+ character encoding of the protocol and also their URI counterparts.
+
+ Second, it may include URIs constructed based on character encodings
+ other than UTF-8. These URIs may be produced by user agents that do
+ not conform to this specification and that use legacy character
+ encodings to convert non-ASCII characters to URIs. Whether this is
+ necessary, and what character encodings to cover, depends on a number
+ of factors, such as the legacy character encodings used locally and
+ the distribution of various versions of user agents. For example,
+ software for Japanese may accept URIs in Shift_JIS and/or EUC-JP in
+ addition to UTF-8.
+
+ Third, it may include additional mappings to be more user-friendly
+ and robust against transmission errors. These would be similar to
+ how some servers currently treat URIs as case insensitive or perform
+ additional matching to account for spelling errors. For characters
+ beyond the US-ASCII repertoire, this may, for example, include
+ ignoring the accents on received IRIs or resource names. Please note
+ that such mappings, including case mappings, are language dependent.
+
+ It can be difficult to identify a resource unambiguously if too many
+ mappings are taken into consideration. However, percent-encoded and
+ not percent-encoded parts of IRIs can always be clearly
+ distinguished. Also, the regularity of UTF-8 (see [Duerst97]) makes
+ the potential for collisions lower than it may seem at first.
+
+7.8. Upgrading Strategy
+
+ Where this recommendation places further constraints on software for
+ which many instances are already deployed, it is important to
+ introduce upgrades carefully and to be aware of the various
+ interdependencies.
+
+ If IRIs cannot be interpreted correctly, they should not be created,
+ generated, or transported. This suggests that upgrading URI
+ interpreting software to accept IRIs should have highest priority.
+
+ On the other hand, a single IRI is interpreted only by a single or
+ very few interpreters that are known in advance, although it may be
+ entered and transported very widely.
+
+ Therefore, IRIs benefit most from a broad upgrade of software to be
+ able to enter and transport IRIs. However, before an individual IRI
+ is published, care should be taken to upgrade the corresponding
+ interpreting software in order to cover the forms expected to be
+ received by various versions of entry and transport software.
+
+ The upgrade of generating software to generate IRIs instead of using
+ a local character encoding should happen only after the service is
+ upgraded to accept IRIs. Similarly, IRIs should only be generated
+ when the service accepts IRIs and the intervening infrastructure and
+ protocol are known to transport them safely.
+
+ Software converting from URIs to IRIs for display should be upgraded
+ only after upgraded entry software has been widely deployed to the
+ population that will see the displayed result.
+
+ Where there is a free choice of character encodings, it is often
+ possible to reduce the effort and dependencies for upgrading to IRIs
+ by using UTF-8 rather than another encoding. For example, when a new
+ file-based Web server is set up, using UTF-8 as the character
+ encoding for file names will make the transition to IRIs easier.
+ Likewise, when a new Web form is set up using UTF-8 as the character
+ encoding of the form page, the returned query URIs will use UTF-8 as
+ the character encoding (unless the user, for whatever reason, changes
+ the character encoding) and will therefore be compatible with IRIs.
+
+ These recommendations, when taken together, will allow for the
+ extension from URIs to IRIs in order to handle characters other than
+ US-ASCII while minimizing interoperability problems. For
+ considerations regarding the upgrade of URI scheme definitions, see
+ section 6.4.
+
+8. Security Considerations
+
+ The security considerations discussed in [RFC3986] also apply to
+ IRIs. In addition, the following issues require particular care for
+ IRIs.
+
+ Incorrect encoding or decoding can lead to security problems. In
+ particular, some UTF-8 decoders do not check against overlong byte
+ sequences. As an example, a "/" is encoded with the byte 0x2F both
+ in UTF-8 and in US-ASCII, but some UTF-8 decoders also wrongly
+ interpret the sequence 0xC0 0xAF as a "/". A sequence such as
+
+
+
+
+
+
+
+
+Duerst & Suignard Standards Track [Page 37]
+
+RFC 3987 Internationalized Resource Identifiers January 2005
+
+
+ "%C0%AF.." may pass some security tests and then be interpreted as
+ "/.." in a path if UTF-8 decoders are fault-tolerant, if conversion
+ and checking are not done in the right order, and/or if reserved
+ characters and unreserved characters are not clearly distinguished.
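
The ordering described above (percent-decode first, then decode strictly, then inspect the path) can be sketched in Python. This is an illustrative sketch, not part of RFC 3987: the function names `decode_path_strictly` and `is_safe` are hypothetical, and it relies on Python's built-in UTF-8 codec, which rejects overlong forms such as 0xC0 0xAF.

```python
from urllib.parse import unquote_to_bytes


def decode_path_strictly(percent_encoded: str) -> str:
    """Percent-decode, decode as strict UTF-8, then check the decoded path."""
    raw = unquote_to_bytes(percent_encoded)
    # A strict decoder raises UnicodeDecodeError on overlong sequences
    # instead of silently mapping them to "/".
    text = raw.decode("utf-8", errors="strict")
    # The traversal check runs on the fully decoded text, not on the
    # percent-encoded input, so "%2E%2E" cannot slip past it.
    if ".." in text.split("/"):
        raise ValueError("path traversal segment in decoded path")
    return text


def is_safe(percent_encoded: str) -> bool:
    """Return True only if the path decodes cleanly and has no '..' segment."""
    try:
        decode_path_strictly(percent_encoded)
    except (UnicodeDecodeError, ValueError):
        return False
    return True
```

With this ordering, `is_safe("%C0%AF..")` is rejected by the decoder itself, and `is_safe("docs/%2E%2E/secret")` is rejected by the check that runs after decoding.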
+
+ There are various ways in which "spoofing" can occur with IRIs.
+ "Spoofing" means that somebody may add a resource name that looks the
+ same or similar to the user, but that points to a different resource.
+ The added resource may pretend to be the real resource by looking
+ very similar but may contain all kinds of changes that may be
+ difficult to spot and that can cause all kinds of problems. Most
+ spoofing possibilities for IRIs are extensions of those for URIs.
+
+ Spoofing can occur for various reasons. First, a user's
+ normalization expectations or actual normalization when entering an
+ IRI or transcoding an IRI from a legacy character encoding do not
+ match the normalization used on the server side. Conceptually, this
+ is no different from the problems surrounding the use of
+ case-insensitive web servers. For example, a popular web page with a
+ mixed-case name ("http://big.example.com/PopularPage.html") might be
+ "spoofed" by someone who is able to create
+ "http://big.example.com/popularpage.html". However, the use of
+ unnormalized character sequences, and of additional mappings for user
+ convenience, may increase the chance for spoofing. Protocols and
+ servers that allow the creation of resources with names that are not
+ normalized are particularly vulnerable to such attacks. This is an
+ inherent security problem of the relevant protocol, server, or
+ resource and is not specific to IRIs, but it is mentioned here for
+ completeness.
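
The normalization mismatch described above can be illustrated with a short Python sketch. It is only an example under stated assumptions: the two path strings are made up, and NFC is used here as one possible normalization form; a given server may apply a different one or none at all.

```python
import unicodedata


def same_after_nfc(a: str, b: str) -> bool:
    """Compare two names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Both strings render identically, but the first uses precomposed
# U+00E9 and the second uses "e" followed by combining U+0301.
precomposed = "/r\u00e9sum\u00e9.html"
decomposed = "/re\u0301sume\u0301.html"

# Without normalization the two names are distinct resources.
names_differ = precomposed != decomposed
# After normalization they collide, which is what a server that
# normalizes on creation would detect.
names_collide = same_after_nfc(precomposed, decomposed)
```

A server that stores names without normalizing them would accept both strings as separate resources, which is the spoofing opportunity the paragraph describes.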
+
+ Spoofing can occur in various IRI components, such as the domain name
+ part or a path part. For considerations specific to the domain name
+ part, see [RFC3491]. For the path part, administrators of sites that
+ allow independent users to create resources in the same sub area may
+ have to be careful to check for spoofing.
+
+ Spoofing can occur because in the UCS many characters look very
+ similar. Details are discussed in Section 7.5. Again, this is very
+ similar to spoofing possibilities on US-ASCII, e.g., using "br0ken"
+ or "1ame" URIs.
+
+ Spoofing can occur when URIs with percent-encodings based on various
+ character encodings are accepted to deal with older user agents. In
+ some cases, particularly for Latin-based resource names, this is
+ usually easy to detect because UTF-8-encoded names, when interpreted
+ and viewed as legacy character encodings, produce mostly garbage.
+
+
+
+
+
+Duerst & Suignard Standards Track [Page 38]
+
+RFC 3987 Internationalized Resource Identifiers January 2005
+
+
+ When concurrently used character encodings have a similar structure
+ but there are no characters that have exactly the same encoding,
+ detection is more difficult.
+
+ Spoofing can occur with bidirectional IRIs, if the restrictions in
+ section 4.2 are not followed. The same visual representation may be
+ interpreted as different logical representations, and vice versa. It
+ is also very important that a correct Unicode bidirectional
+ implementation be used.
+
+9. Acknowledgements
+
+ We would like to thank Larry Masinter for his work as coauthor of
+ many earlier versions of this document (draft-masinter-url-i18n-xx).
+
+ The discussion on the issue addressed here started a long time ago.
+ There was a thread in the HTML working group in August 1995 (under
+ the topic of "Globalizing URIs") and in the www-international mailing
+ list in July 1996 (under the topic of "Internationalization and
+ URLs"), and there were ad-hoc meetings at the Unicode conferences in
+ September 1995 and September 1997.
+
+ Many thanks go to Francois Yergeau, Matitiahu Allouche, Roy Fielding,
+ Tim Berners-Lee, Mark Davis, M.T. Carrasco Benitez, James Clark, Tim
+ Bray, Chris Wendt, Yaron Goland, Andrea Vine, Misha Wolf, Leslie
+ Daigle, Ted Hardie, Bill Fenner, Margaret Wasserman, Russ Housley,
+ Makoto MURATA, Steven Atkin, Ryan Stansifer, Tex Texin, Graham Klyne,
+ Bjoern Hoehrmann, Chris Lilley, Ian Jacobs, Adam Costello, Dan
+ Oscarson, Elliotte Rusty Harold, Mike J. Brown, Roy Badami, Jonathan
+ Rosenne, Asmus Freytag, Simon Josefsson, Carlos Viegas Damasio, Chris
+ Haynes, Walter Underwood, and many others for help with understanding
+ the issues and possible solutions, and with getting the details
+ right.
+
+ This document is a product of the Internationalization Working Group
+ (I18N WG) of the World Wide Web Consortium (W3C). Thanks to the
+ members of the W3C I18N Working Group and Interest Group for their
+ contributions and their work on [CharMod]. Thanks also go to the
+ members of many other W3C Working Groups for adopting IRIs, and to
+ the members of the Montreal IAB Workshop on Internationalization and
+ Localization for their review.
+
+
+
+
+
+
+
+
+
+
+Duerst & Suignard Standards Track [Page 39]
+
+RFC 3987 Internationalized Resource Identifiers January 2005
+
+
+10. References
+
+10.1. Normative References
+
+ [ASCII] American National Standards Institute, "Coded
+ Character Set -- 7-bit American Standard Code for
+ Information Interchange", ANSI X3.4, 1986.
+
+ [ISO10646] International Organization for Standardization,
+ "ISO/IEC 10646:2003: Information Technology -
+ Universal Multiple-Octet Coded Character Set (UCS)",
+ ISO Standard 10646, December 2003.
+
+ [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
+ Requirement Levels", BCP 14, RFC 2119, March 1997.
+
+ [RFC2234] Crocker, D. and P. Overell, "Augmented BNF for Syntax
+ Specifications: ABNF", RFC 2234, November 1997.
+
+ [RFC3490] Faltstrom, P., Hoffman, P., and A. Costello,
+ "Internationalizing Domain Names in Applications
+ (IDNA)", RFC 3490, March 2003.
+
+ [RFC3491] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep
+ Profile for Internationalized Domain Names (IDN)", RFC
+ 3491, March 2003.
+
+ [RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO
+ 10646", STD 63, RFC 3629, November 2003.
+
+ [RFC3986] Berners-Lee, T., Fielding, R., and L. Masinter,
+ "Uniform Resource Identifier (URI): Generic Syntax",
+ STD 66, RFC 3986, January 2005.
+
+ [UNI9] Davis, M., "The Bidirectional Algorithm", Unicode
+ Standard Annex #9, March 2004,
+ <http://www.unicode.org/reports/tr9/tr9-13.html>.
+
+ [UNIV4] The Unicode Consortium, "The Unicode Standard, Version
+ 4.0.1, defined by: The Unicode Standard, Version 4.0
+ (Reading, MA, Addison-Wesley, 2003. ISBN
+ 0-321-18578-1), as amended by Unicode 4.0.1
+ (http://www.unicode.org/versions/Unicode4.0.1/)",
+ March 2004.
+
+
+
+
+
+
+
+Duerst & Suignard Standards Track [Page 40]
+
+RFC 3987 Internationalized Resource Identifiers January 2005
+
+
+ [UTR15] Davis, M. and M. Duerst, "Unicode Normalization
+ Forms", Unicode Standard Annex #15, April 2003,
+ <http://www.unicode.org/unicode/reports/
+ tr15/tr15-23.html>.
+
+10.2. Informative References
+
+ [BidiEx] "Examples of bidirectional IRIs",
+ <http://www.w3.org/International/iri-edit/
+ BidiExamples>.
+
+ [CharMod] Duerst, M., Yergeau, F., Ishida, R., Wolf, M., and T.
+ Texin, "Character Model for the World Wide Web:
+ Resource Identifiers", World Wide Web Consortium
+ Candidate Recommendation, November 2004,
+ <http://www.w3.org/TR/charmod-resid>.
+
+ [Duerst97] Duerst, M., "The Properties and Promises of UTF-8",
+ Proc. 11th International Unicode Conference, San Jose
+ , September 1997,
+ <http://www.ifi.unizh.ch/mml/mduerst/papers/
+ PDF/IUC11-UTF-8.pdf>.
+
+ [Gettys] Gettys, J., "URI Model Consequences",
+ <http://www.w3.org/DesignIssues/ModelConsequences>.
+
+ [HTML4] Raggett, D., Le Hors, A., and I. Jacobs, "HTML 4.01
+ Specification", World Wide Web Consortium
+ Recommendation, December 1999,
+ <http://www.w3.org/TR/html401/appendix/
+ notes.html#h-B.2>.
+
+ [RFC2045] Freed, N. and N. Borenstein, "Multipurpose Internet
+ Mail Extensions (MIME) Part One: Format of Internet
+ Message Bodies", RFC 2045, November 1996.
+
+ [RFC2130] Weider, C., Preston, C., Simonsen, K., Alvestrand, H.,
+ Atkinson, R., Crispin, M., and P. Svanberg, "The
+ Report of the IAB Character Set Workshop held 29
+ February - 1 March, 1996", RFC 2130, April 1997.
+
+ [RFC2141] Moats, R., "URN Syntax", RFC 2141, May 1997.
+
+ [RFC2192] Newman, C., "IMAP URL Scheme", RFC 2192, September
+ 1997.
+
+ [RFC2277] Alvestrand, H., "IETF Policy on Character Sets and
+ Languages", BCP 18, RFC 2277, January 1998.
+
+
+
+Duerst & Suignard Standards Track [Page 41]
+
+RFC 3987 Internationalized Resource Identifiers January 2005
+
+
+ [RFC2368] Hoffman, P., Masinter, L., and J. Zawinski, "The
+ mailto URL scheme", RFC 2368, July 1998.
+
+ [RFC2384] Gellens, R., "POP URL Scheme", RFC 2384, August 1998.
+
+ [RFC2396] Berners-Lee, T., Fielding, R., and L. Masinter,
+ "Uniform Resource Identifiers (URI): Generic Syntax",
+ RFC 2396, August 1998.
+
+ [RFC2397] Masinter, L., "The "data" URL scheme", RFC 2397,
+ August 1998.
+
+ [RFC2616] Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
+ Masinter, L., Leach, P., and T. Berners-Lee,
+ "Hypertext Transfer Protocol -- HTTP/1.1", RFC 2616,
+ June 1999.
+
+ [RFC2640] Curtin, B., "Internationalization of the File Transfer
+ Protocol", RFC 2640, July 1999.
+
+ [RFC2718] Masinter, L., Alvestrand, H., Zigmond, D., and R.
+ Petke, "Guidelines for new URL Schemes", RFC 2718,
+ November 1999.
+
+ [UNIXML] Duerst, M. and A. Freytag, "Unicode in XML and other
+ Markup Languages", Unicode Technical Report #20, World
+ Wide Web Consortium Note, June 2003,
+ <http://www.w3.org/TR/unicode-xml/>.
+
+ [XLink] DeRose, S., Maler, E., and D. Orchard, "XML Linking
+ Language (XLink) Version 1.0", World Wide Web
+ Consortium Recommendation, June 2001,
+ <http://www.w3.org/TR/xlink/#link-locators>.
+
+ [XML1] Bray, T., Paoli, J., Sperberg-McQueen, C., Maler, E.,
+ and F. Yergeau, "Extensible Markup Language (XML) 1.0
+ (Third Edition)", World Wide Web Consortium
+ Recommendation, February 2004,
+ <http://www.w3.org/TR/REC-xml#sec-external-ent>.
+
+ [XMLNamespace] Bray, T., Hollander, D., and A. Layman, "Namespaces in
+ XML", World Wide Web Consortium Recommendation,
+ January 1999, <http://www.w3.org/TR/REC-xml-names>.
+
+ [XMLSchema] Biron, P. and A. Malhotra, "XML Schema Part 2:
+ Datatypes", World Wide Web Consortium Recommendation,
+ May 2001, <http://www.w3.org/TR/xmlschema-2/#anyURI>.
+
+
+
+
+Duerst & Suignard Standards Track [Page 42]
+
+RFC 3987 Internationalized Resource Identifiers January 2005
+
+
+ [XPointer] Grosso, P., Maler, E., Marsh, J. and N. Walsh,
+ "XPointer Framework", World Wide Web Consortium
+ Recommendation, March 2003,
+ <http://www.w3.org/TR/xptr-framework/#escaping>.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Duerst & Suignard Standards Track [Page 43]
+
+RFC 3987 Internationalized Resource Identifiers January 2005
+
+
+Appendix A. Design Alternatives
+
+ This section shortly summarizes major design alternatives and the
+ reasons for why they were not chosen.
+
+Appendix A.1. New Scheme(s)
+
+ Introducing new schemes (for example, httpi:, ftpi:,...) or a new
+ metascheme (e.g., i:, leading to URI/IRI prefixes such as i:http:,
+ i:ftp:,...) was proposed to make IRI-to-URI conversion scheme
+ dependent or to distinguish between percent-encodings resulting from
+ IRI-to-URI conversion and percent-encodings from legacy character
+ encodings.
+
+ New schemes are not needed to distinguish URIs from true IRIs (i.e.,
+ IRIs that contain non-ASCII characters). The benefit of being able
+ to detect the origin of percent-encodings is marginal, as UTF-8 can
+ be detected with very high reliability. Deploying new schemes is
+ extremely hard, so not requiring new schemes for IRIs makes
+ deployment of IRIs vastly easier. Making conversion scheme dependent
+ is highly inadvisable and would be encouraged by separate schemes for
+ IRIs. Using a uniform convention for conversion from IRIs to URIs
+ makes IRI implementation orthogonal to the introduction of actual new
+ schemes.
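
The claim that UTF-8 can be detected reliably can be sketched as follows. This is a heuristic illustration only, with a hypothetical function name: it percent-decodes a string and reports whether the resulting octets form valid UTF-8. Pure US-ASCII input also passes, since US-ASCII is a subset of UTF-8.

```python
from urllib.parse import unquote_to_bytes


def looks_like_utf8(percent_encoded: str) -> bool:
    """Return True if the percent-decoded octets are valid UTF-8."""
    raw = unquote_to_bytes(percent_encoded)
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        # Typical for octets that came from a legacy encoding such as
        # iso-8859-1, where a lone 0xE9 is not a valid UTF-8 sequence.
        return False
    return True
```

For example, `looks_like_utf8("ros%C3%A9")` is true, while `looks_like_utf8("ros%E9")` (the iso-8859-1 encoding of the same character) is false.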
+
+Appendix A.2. Character Encodings Other Than UTF-8
+
+ At an early stage, UTF-7 was considered as an alternative to UTF-8
+ when IRIs are converted to URIs. UTF-7 would not have needed
+ percent-encoding and in most cases would have been shorter than
+ percent-encoded UTF-8.
+
+ Using UTF-8 avoids a double layering and overloading of the use of
+ the "+" character. UTF-8 is fully compatible with US-ASCII and has
+ therefore been recommended by the IETF, and is being used widely.
+
+ UTF-7 has never been used much and is now clearly being discouraged.
+ Requiring implementations to convert from UTF-8 to UTF-7 and back
+ would be an additional implementation burden.
+
+Appendix A.3. New Encoding Convention
+
+ Instead of using the existing percent-encoding convention of URIs,
+ which is based on octets, the idea was to create a new encoding
+ convention; for example, to use "%u" to introduce UCS code points.
+
+
+
+
+
+
+Duerst & Suignard Standards Track [Page 44]
+
+RFC 3987 Internationalized Resource Identifiers January 2005
+
+
+ Using the existing octet-based percent-encoding mechanism does not
+ need an upgrade of the URI syntax and does not need corresponding
+ server upgrades.
+
+Appendix A.4. Indicating Character Encodings in the URI/IRI
+
+ Some proposals suggested indicating the character encodings used in
+ an URI or IRI with some new syntactic convention in the URI itself,
+ similar to the "charset" parameter for e-mails and Web pages. As an
+ example, the label in square brackets in
+ "http://www.example.org/ros[iso-8859-1]&#xE9"; indicated that the
+ following "&#xE9"; had to be interpreted as iso-8859-1.
+
+ If UTF-8 is used exclusively, an upgrade to the URI syntax is not
+ needed. It avoids potentially multiple labels that have to be copied
+ correctly in all cases, even on the side of a bus or on a napkin,
+ leading to usability problems (and being prohibitively annoying).
+ Exclusively using UTF-8 also reduces transcoding errors and
+ confusion.
+
+Authors' Addresses
+
+ Martin Duerst (Note: Please write "Duerst" with u-umlaut wherever
+ possible, for example as "D&#252;rst" in XML and
+ HTML.)
+ World Wide Web Consortium
+ 5322 Endo
+ Fujisawa, Kanagawa 252-8520
+ Japan
+
+ Phone: +81 466 49 1170
+ Fax: +81 466 49 1171
+ EMail: duerst@w3.org
+ URI: http://www.w3.org/People/D%C3%BCrst/
+ (Note: This is the percent-encoded form of an IRI.)
+
+
+ Michel Suignard
+ Microsoft Corporation
+ One Microsoft Way
+ Redmond, WA 98052
+ U.S.A.
+
+ Phone: +1 425 882-8080
+ EMail: michelsu@microsoft.com
+ URI: http://www.suignard.com
+
+
+
+
+
+Duerst & Suignard Standards Track [Page 45]
+
+RFC 3987 Internationalized Resource Identifiers January 2005
+
+
+Full Copyright Statement
+
+ Copyright (C) The Internet Society (2005).
+
+ This document is subject to the rights, licenses and restrictions
+ contained in BCP 78, and except as set forth therein, the authors
+ retain all their rights.
+
+ This document and the information contained herein are provided on an
+ "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
+ OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
+ ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
+ INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
+ INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
+ WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
+
+Intellectual Property
+
+ The IETF takes no position regarding the validity or scope of any
+ Intellectual Property Rights or other rights that might be claimed to
+ pertain to the implementation or use of the technology described in
+ this document or the extent to which any license under such rights
+ might or might not be available; nor does it represent that it has
+ made any independent effort to identify any such rights. Information
+ on the IETF's procedures with respect to rights in IETF Documents can
+ be found in BCP 78 and BCP 79.
+
+ Copies of IPR disclosures made to the IETF Secretariat and any
+ assurances of licenses to be made available, or the result of an
+ attempt made to obtain a general license or permission for the use of
+ such proprietary rights by implementers or users of this
+ specification can be obtained from the IETF on-line IPR repository at
+ http://www.ietf.org/ipr.
+
+ The IETF invites any interested party to bring to its attention any
+ copyrights, patents or patent applications, or other proprietary
+ rights that may cover technology that may be required to implement
+ this standard. Please address the information to the IETF at ietf-
+ ipr@ietf.org.
+
+
+Acknowledgement
+
+ Funding for the RFC Editor function is currently provided by the
+ Internet Society.
+
+
+
+
+
+
+Duerst & Suignard Standards Track [Page 46]
+
diff --git a/trunk/txt/vfs-ideas.txt b/trunk/txt/vfs-ideas.txt
new file mode 100644
index 00000000..7b41a064
--- /dev/null
+++ b/trunk/txt/vfs-ideas.txt
@@ -0,0 +1,425 @@
+Subject: Plans for gnome-vfs replacement
+
+Recently there have been a lot of discussions about the platform and
+the correct stacking order and quality of the modules. Gnome-vfs
+is a clear problem in this discussion. Having spent the last 4 years
+as the gnome-vfs maintainer, and even longer as the primary gnome-vfs
+user (in Nautilus) I'm well aware of the problems it has. I think that
+we've reached a point where the problems in the gnome-vfs architecture
+and its position in the stack are now ranking as one of the most
+problematic aspects of the gnome platform, especially considering the
+enhancements and quality improvements seen in other parts of the
+platform.
+
+So, I think the time has come for a serious look at what gnome-vfs
+could be. I've spent much time last week thinking about the weaknesses
+and problems of the current gnome-vfs and possibilities inherent in a
+redesign, both having learnt from 7 years of gnome-vfs existence and
+the improvements in the platform (both Gnome and surrounding
+technologies) since 1999 when it was designed.
+
+As soon as you spend some time looking at this problem it is evident that
+to solve the platform ordering issues we really need a clean cut from
+the current gnome-vfs. I think the ideal level for a VFS would be in
+glib, in a separate library similar to gthread or gobject. That way
+gtk+ would be able to integrate with it and all gnome apps would have
+access to it, but it wouldn't affect small apps that just want to use
+the glib core. Furthermore, not being libglib lets us use GObjects in
+the vfs, which means we can make a more modern API. Of course, this
+places quite some limitations on the vfs, especially in terms of
+dependencies and how integration with a UI should work.
+
+Any thoughts on the design of a vfs to replace gnome-vfs must be based
+on a solid understanding of the problems with the current system, to
+avoid redoing old mistakes and to make sure that we solve all the
+known problems of the current system. So, I'm gonna start by
+describing what I see as the main architectural and "hard" problems in
+gnome-vfs.
+
+The first, and most often discussed problem is of course
+compatibility. The various desktops use different vfs implementations,
+so they can't read each other's files. Many applications use no VFS
+at all and have no way to access files on vfs shares. And the
+existence of multiple vfs implementations makes it unlikely that they
+will start using one.
+
+Gnome-vfs has no concept of Display Names, something which is very
+useful in a gui based system. In some places we auto-generate virtual
+desktop files to get this feature, but that is very hackish and needs
+support in all apps to understand them. This is also quite closely
+related to handling filename charset encoding, another very weak point
+in gnome-vfs. The ideal way to reference a file is the actual real
+identifier on disk, database, remote share or what have you, as that
+can be passed between implementations, used with other access methods,
+etc. But to display something useful to the user you really need a
+user understandable utf-8 encoded string and a way to map that to/from
+the filename.
+
+There is no support for icons at the level of gnome-vfs. This means
+that all users above the vfs must implement it themselves (e.g. in
+nautilus and in the file selector). Each implementation has its own
+bugs, maintenance load and risk for different behaviour. It also
+means that vfs backends cannot supply their own icons, something which
+might be very useful for e.g. a network share based on some new fancy
+web service.
+
+The abstraction that gnome-vfs uses is very similar to the posix model,
+which is problematic in several ways. The posix model matches poorly
+with the sort of highlevel operations that gnome applications want to
+do with the vfs. The vfs is not typically used for what I would call
+"implementation files", which are things like configuration files,
+data files shipped with apps, system files, etc, but rather for what
+I'd like to call "user document files". These are the kind of files
+you open, save, or download from the internet to look at. Applications
+that use these would like highlevel operations that match the kind
+of operations you use on them, like read-entire-file, save-file,
+copy-file, etc.
+
+The posix model is also a bad match when implementing gnome-vfs
+modules. It requires some features from the implementation that can be
+difficult or impossible to implement. For example, it's not really
+possible to support seeking when writing to a file on a webdav share,
+because a webdav put operation is essentially just streaming a copy of
+the new file contents. We currently work around this by locally
+caching all the file data being written and then sending it all in the
+close() call. This is clearly suboptimal for a lot of reasons, like
+applications not expecting close() to take a long time, and not
+checking its error conditions very closely.
+
+Another problem is that the posix model doesn't contain explicit
+operations for some of the things applications need to do. So instead
+applications rely on well known knowledge of the behaviour of posix
+to implement these operations. However, such behaviour might not be
+guaranteed on some gnome-vfs backends. A common example is the atomic
+save operation. A typical way to implement this on posix is to write
+to a temporary file in the same directory, and then rename over the
+target file, thus guaranteeing an atomic replacement of the file on
+disk, or a failure that didn't affect the original file. Of course,
+if a gnome-vfs application would use this and the backend was a
+webdav share to a subversion repository you would get some really
+weird versioning history in the repository for no good reason. If the
+backend had its own implementation of the save operation we could get
+both optimal behaviour on each backend, and an application API that
+doesn't require arcane knowledge of the atomicity of renames.
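
The posix idiom described above (write to a temporary file in the same directory, then rename over the target) can be sketched in Python. This is a minimal illustration of the local-filesystem technique only, with a hypothetical function name `atomic_save`; it is not gnome-vfs API and does not show how a backend-specific save operation would look.

```python
import os
import tempfile


def atomic_save(path: str, data: bytes) -> None:
    """Replace the file at 'path' with 'data', atomically on posix."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live in the same directory (and thus the
    # same filesystem) as the target, otherwise rename is not atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        # os.replace renames over an existing target in one step.
        os.replace(tmp_path, path)
    except BaseException:
        # On failure the original file is untouched; drop the temp file.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

The application has to know that rename is atomic and that the temporary file must share a directory with the target, which is exactly the kind of arcane knowledge a dedicated save operation would hide.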
+
+One of the most problematic aspects of gnome-vfs is its authentication
+framework. The way it works is that you register callbacks to handle
+the authentication dialog, and whenever any operation needs to do
+authentication these callbacks will be called. The idea is that a
+console application would register a set of callbacks that print
+prompts on the console, and a Gtk+ application would have a set of
+callbacks that displays dialogs. There is a set of standard dialog
+based callbacks in libgnomeui that you can install by calling
+gnome_authentication_manager_init(). From an initial look this seems
+like a reasonable approach, but it turns out that this creates a host
+of different problems.
+
+One problem is how you connect this to the application. A lot of
+people are unaware that you have to call gnome_authentication_manager_init() to
+get authentication dialogs, or don't want to depend on libgnomeui to
+do so. So a lot of applications don't work with authentication. Those
+who do call it generally have pretty poor integration with the
+authentication dialogs. For instance, the general authentication
+dialogs can't be marked as parents of whatever dialog caused the
+authentication (because they have no way of knowing what caused
+it), and all sorts of problems appear when there is a modal dialog
+displayed already.
+
+Another problem is the combination of blocking gnome-vfs calls and
+authentication. When calling a blocking operation like read() and it
+results in a password dialog we have to start up a recursive mainloop
+to display it. Not only is this unexpected for the application, it
+also brings with it all the type of reentrancy issues that we had in
+bonobo. Even worse, there is no way to make this threadsafe. To make
+it threadsafe the callback would have to take the gdk lock before
+doing any Gtk+ calls, but this would cause a deadlock if the
+application called it with the gdk lock held. If we don't take the
+gdk lock then you can't do blocking vfs calls on any thread but the
+mainloop, or you have to take the gdk lock on any gnome-vfs call.
+
+The authentication callbacks can appear at *any* gnome-vfs entry
+point, which makes it very hard to write gnome-vfs applications that
+don't accidentally trigger a lot of authentication dialogs. For
+instance, the tree sidebar in nautilus has to take particular care not
+to stat or otherwise look at the toplevel items until the user
+explicitly expands them, otherwise you'd get authentication dialogs
+every time you opened a window. It's also easy to get multiple
+authentication dialogs for the same entity.
+
+The way threads are used in gnome-vfs is problematic, both from the
+point of view of writing backends, and for users of the library. For
+users it forces the use of threading, even if the application doesn't
+use the asynchronous calls that use the threading. It also enforces
+the need for a gnome_vfs_init() function, as thread initialization
+must be done very early.
+
+For backend implementations the use of threads forces every
+backend to be threadsafe. Many of the backends are inherently single
+threaded, either because they use non-threadsafe libraries like the
+smb backend, or because the server being wrapped forces serialized
+access (like an ftp backend where you really only want one connection
+to the server).
+
+Backends run in context of the application using gnome-vfs, which can
+be a gtk+ app, but as well a console application, so they have no
+control or guarantee of their environment. For instance, they cannot
+rely on the existence of a mainloop, so there is no way to use
+e.g. timeouts to handle invalidation of caches. One way we have tried
+to solve this is to move some backends to the gnome-vfs daemon, where
+they can rely on the existence of the mainloop.
+
+Gnome-vfs uses something called "gnome vfs uris" to identify
+files. These are similar, but not entirely identical to the types of
+uri used in webbrowsers. For instance, we often make up our own types
+of URIs when there is no official standard for them (although such
+standards might appear later, with incompatible behaviour). We also
+have a "well defined" posix-like type of behaviour that isn't the same
+as for web uris. The most extreme example would be mailto:, but even
+things like ftp:// uris are different. The ftp uri rfc explains how
+ftp:///dir/file.txt refers to $(login_dir)/dir/file.txt, and that you
+have to use ftp:///%2fdir/file.txt to refer to the absolute path
+/dir/file.txt on the server. Clearly we can't have pathname handling
+semantics that vary depending on the backend (no app would get it
+right), so we ignore the rfcs on this.
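
The ftp rule mentioned above can be sketched in Python. This is an illustration under assumptions: the host name ftp.example.org and the function name `ftp_path_segments` are made up, and the sketch only shows how the url-path is split into segments and percent-decoded one segment at a time, which is why a leading %2F ends up inside the first segment rather than acting as a separator.

```python
from urllib.parse import unquote, urlsplit


def ftp_path_segments(url: str) -> list[str]:
    """Split an ftp URL path into percent-decoded segments."""
    path = urlsplit(url).path
    # The first "/" only separates the host from the url-path; it is
    # not part of the path that is sent to the server.
    if path.startswith("/"):
        path = path[1:]
    # Split on literal "/" first, then decode each segment, so an
    # encoded "%2F" survives as a "/" inside a segment.
    return [unquote(segment) for segment in path.split("/")]
```

Under this reading, "ftp://ftp.example.org/dir/file.txt" yields the segments "dir" and "file.txt" (relative to the login directory), while "ftp://ftp.example.org/%2Fdir/file.txt" yields "/dir" and "file.txt" (an absolute path).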
+
+Then there is the thing with escaping and unescaping uris. Although
+technically not very complex it is just very hard to get right all
+the time. Among the most common questions on the gnome-vfs list is
+what the various escape/unescape functions do, what arguments have to
+be escaped, and how to display uris "nicely" (i.e. without escapes,
+although that makes them invalid uris). This is made extra complicated
+due to the poor handling of filename encodings and display names, and
+the fact that only "less common" cases (like spaces in filenames)
+break if you get it wrong.
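
The escaping pitfalls described above can be illustrated with Python's standard library rather than the gnome-vfs functions themselves. The helper names `path_to_uri` and `uri_for_display` are hypothetical; the sketch shows escaping a path exactly once, unescaping for display, and what happens when an already escaped string is escaped again.

```python
from urllib.parse import quote, unquote


def path_to_uri(path: str) -> str:
    """Escape a local path exactly once and turn it into a file: uri."""
    return "file://" + quote(path)


def uri_for_display(uri: str) -> str:
    """Unescape a uri for showing to the user; the result is not a valid uri."""
    return unquote(uri)


escaped_once = path_to_uri("/home/user/my file.txt")
# Escaping an already escaped string turns "%20" into "%2520", the kind
# of bug that only shows up for "less common" names such as those
# containing spaces.
escaped_twice = "file://" + quote(quote("/home/user/my file.txt"))
```

Names without special characters pass through both the correct and the double-escaped path unchanged, which is why such bugs tend to go unnoticed.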
+
+Last but not least, the fact that gnome-vfs uses something called a
+"uri" gives people the wrong impression of what the library is
+designed for. It causes people to complain when it doesn't have some
+support for mailto: links, and it makes people want support for
+cookies, extra http headers and other things typically used by a web
+browser. This isn't really the kind of use that vfs is targeted at. A
+library specific to that sort of use would probably fit these apps much
+better.
+
+Most gnome-vfs state is tied to the application that uses it, which I
+think is quite unexpected by the user. For instance, when you log into
+a network share in nautilus and then click on a file to open it, the
+opening application will have to re-connect and re-authenticate to
+the share, much to the user's surprise. I really think most people
+expect a login like that is somehow session global. We do sometimes
+misuse gnome-keyring to "solve" the authentication issue, but even
+then we still have multiple connections to the network share, which
+can cause problems, for instance with ftp shares that use round-robin
+dns where the mirrors aren't fully synced up. Again, some backends
+(smb) are now in the daemon which solves this issue.
+
+gnome_vfs_xfer() is possibly the worst API call in the whole gnome
+platform. It's a single, buggy, do-it-all function with shitloads of
+combinations of flags and arguments to do all sorts of things, with
+little or no semantic specifications or testcases. It's also to a large
+extent unnecessary for most applications and could easily be part of
+the file manager instead of a generic library. I'm also not sure that
+the "first do preflight calculation, then execute operation" model it
+uses is right. It is inherently racy, since the target or source could
+easily change during the preflight, and it makes error reporting and
+handling much more complicated.
+
+The behaviour of symlink resolution in the UI has been discussed many
+times. Should clicking on a symlink "foo" in $dir go to $dir/foo or to
+the target directory? The Nautilus maintainers have decided that the
+best way to approach this is to have symlinks be used for "filesystem
+implementation" (like a symlink for /home -> /mnt/hdb2) and thus not
+be resolved on activation. However, we should (this hasn't been
+finished yet) support a different form of links (called "shortcuts" in
+the UI) that always resolve on activation. At the moment there is no
+support for anything like that in gnome-vfs, so we abuse desktop files
+for this. We even generate virtual in-memory desktop files in the smb
+backend to get this behaviour. Proper support for shortcuts in the
+vfs API would let apps automatically work without ugly desktop file
+hacks.
+
+Over the years gnome-vfs has accumulated a lot of cruft. It links to a
+lot of libraries, including openssl, gconf+ORBit2, avahi, dbus, popt,
+libxml, kerberos, libz and libresolv. Very few applications need all
+of these, yet every application that uses gnome-vfs links to all of
+them. Furthermore, some of the functionality in gnome-vfs, like the
+wrapper for dns-sd, resolving, network utilities, ls parsing
+functions, ssl support, pty handling are perhaps not best suited for a
+vfs library, nor do they always have great apis and quality
+implementations. We could definately clean this up and minimize the
+APIs.
+
+At some point in time gnome_vfs_uri_is_local() started detecting and
+returning TRUE for NFS mounts and other types of local network
+mounts. This is both slow and unexpected, and has led to problems and
+unnecessary changes in many places.
+
+The way the cancellation API for asynchronous operations is set up
+creates races and fragile code. The main issue is that if you call
+cancel before the operation callback has been called the callback will
+not be called. However, the callback typically wants to free some sort
+of user_data object passed to it, so that has to be handled also when
+you call cancellation. Couple this with the fact that there are no
+destroy notifies and you can't cancel after the operation callback has
+been called and you get an extremely tricky setup of combined
+callbacks. Furthermore, if threads are used there are some inherent
+races wrt detecting if the callback has been called when cancel is
+called, making it essentially impossible to get this right.
+
+There are also a bunch of issues with the current gnome-vfs that could
+technically be fixed, like the lack of support for hidden file flags
+and backend-extensible metadata, the lack of standard vfs dialogs like
+progress bars, etc.
+
+Last week I started thinking about a new design for a gnome-vfs
+replacement that would solve most of these issues, and at the same
+time give a correct ordering of the platform stack. I've come up with
+a highlevel architecture that I think will work, even though I haven't
+yet finished it in detail or gotten the API totally worked out. It's
+somewhat of a radical departure from gnome-vfs as it is today, so
+bear with me as I try to explain the model and the ideas behind it.
+
+The gnome-vfs model is what I would call stateless. You can at any
+time throw a URI at it and it will do everything required to access
+the location. There is no need to, nor is there a way to set up
+anything like a "session" with a remote share. Of course, in practice
+this is not the way network shares work, so all sorts of session
+initiation, caching and other magic happens under the covers to make
+it look stateless. This is the source of all the problems with the
+gnome-vfs authentication model.
+
+I'd like to propose using a stateful model, where you have to
+explicitly initiate a session ("mount" a share) before you can start
+accessing files. This will give a well specified time when all forms
+of authentication will happen, when applications expect it and when
+they can use a more expressive and suitable API for this kind of
+operation. The actual i/o operations will then never cause any sort of
+authentication issues, and can thus be purely non-graphical
+(i.e. glib-only apps can do i/o). I imagine all/most actual mounting
+of shares will happen in the file manager and the file selector, or at
+gnome-session startup, so applications don't really need to handle
+this themselves.
+
+Not only is the model stateful; I'd like all state to be session
+global. That is, all mounts and network connections are shared between
+all applications in the session. So, if you pass a file reference from
+one app to another there is no need to log in again or anything like
+that. I think this is what users expect.
+
+Having a global stateful model means all non-local vfs accesses go
+through the vfs daemon. This works pretty well with the smb backend in
+the current gnome-vfs, and smb is the backend most likely to have high
+bandwidth traffic, so this doesn't seem to be a large performance
+problem, although we do have to take the performance aspect into
+consideration when designing the daemon.
+
+In order to avoid all the problems with threading described above the
+vfs daemon will not use threads. In fact, I think the best approach is
+to let each active mountpoint be its own process. That way we get
+robustness (one mount can't crash the others) and simplify the backend
+creation greatly (each backend fully controls its context). It also
+will let us do concurrent access to e.g. two smb shares (like a copy
+from one to the other). We can't really do this atm since the thread
+lock in the smb backend serializes such access. But with two smb
+processes this is not a problem.
+
+There might be an issue with using separate processes for the
+mountpoints bloating up the desktop, but I don't think that it will be
+much of a problem. None of these processes will use the gui libraries
+that are the real sources of unshared dirty memory use. I tried a
+simple process that just used gobject and ran a mainloop. It only used
+78k of dirty memory. Also, each server need only link to and
+initialize the few libraries it needs, further keeping memory use down
+and avoiding bloat in all applications (e.g. apps need not link to
+openssl).
+
+As a consequence of the stateful model we don't need the stateless
+properties that URIs have as identifiers. To avoid all the problems
+coming from the use of URIs we use a much simpler form of
+identifier. Namely filenames, in a hierarchical tree with
+mountpoints. These filenames are from an extended set of strings that
+includes the set of normal filenames, but also includes some platform
+dependent extensions. On win32 the full set might be some form of
+stringified version of the ITEMIDLIST from the windows shell api, and
+on unix we would use some out of band prefix to mark a non-local
+filename.
+
+For example, we could use "//.network/" as a prefix for the vfs
+filename namespace. A smb share might then be accessed as
+"//.network/smb/computer:share/dir/file.txt", or a ftp share as
+"//.network/ftp/alex@ftp.gnome.org/dir/file.txt". With a setup like
+"//.network/$method/$mount_object/" it would be quite easy to find the
+process handling the mount. Just ask for a dbus named object like
+"org.glib.gvfs.smb.computer:share". It is also very easy to detect
+local filenames and short-circuit to in-process i/o.
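+
+A rough sketch of that lookup, in plain C without glib. The function
+names and the "org.glib.gvfs." bus-name prefix are only assumptions
+for illustration; nothing here is a decided naming scheme.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Prefix for non-local names, taken from the example in the text. */
#define VFS_PREFIX "//.network/"
/* Hypothetical bus-name prefix; the real scheme is not decided. */
#define BUS_PREFIX "org.glib.gvfs."

/* Returns 1 if the filename is local, i.e. lacks the vfs prefix. */
int vfs_name_is_local(const char *filename)
{
    return strncmp(filename, VFS_PREFIX, strlen(VFS_PREFIX)) != 0;
}

/*
 * Splits "//.network/$method/$mount_object/rest" and writes
 * BUS_PREFIX "$method.$mount_object" into bus_name. On success *path
 * points at the remainder inside filename (at the '/' after the mount
 * object, or at the terminating NUL when there is none).
 * Returns 0 on success, -1 if the name is local, malformed or too long.
 */
int vfs_name_to_bus_name(const char *filename,
                         char *bus_name, size_t bus_name_size,
                         const char **path)
{
    const char *method, *mount, *rest;
    int n;

    if (vfs_name_is_local(filename))
        return -1;

    method = filename + strlen(VFS_PREFIX);
    mount = strchr(method, '/');
    if (mount == NULL || mount == method)
        return -1;               /* missing method or mount object */
    mount++;                     /* skip the '/' after the method */
    rest = strchr(mount, '/');
    if (rest == NULL)
        rest = mount + strlen(mount);
    if (rest == mount)
        return -1;               /* empty mount object */

    n = snprintf(bus_name, bus_name_size, "%s%.*s.%.*s", BUS_PREFIX,
                 (int)(mount - 1 - method), method,
                 (int)(rest - mount), mount);
    if (n < 0 || (size_t)n >= bus_name_size)
        return -1;

    *path = rest;
    return 0;
}
```

+
+Note that a real dbus bus name can't contain characters like ':', so
+the mount object would need some form of escaping before being used
+this way.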
+
+These filenames would be the real identifier for the files, and as
+such not really presentable to the user as is. You'd need to ask for
+the display name via the vfs to get a user readable utf8-encoded
+string for display.
+
+The set of operations on files and folders would be both simplified
+and extended. We'd remove complicated things like read+write access to
+a file, and give less posix-like guarantees. We also make seek and
+truncate support optional in the backend. But then we will extend the
+set of operations possible to allow things like copy on the remote
+side (to avoid a download+upload operation on copy) and to have
+a set of highlevel operations that applications want, like "save" that
+implements the best way to save for each particular backend.
+
+We support metadata like display name, mimetypes, icon, and some
+general information like length and mtime. But we make support for
+getting the full "struct stat" buffer backend optional, as that isn't
+a good abstraction for most backends. Also, the API will be designed
+on the idea that network latency is expensive, so that there will only
+be one call to stat() or readdir() needed to read all the metadata
+requested by the application. (Whereas posix will have readdir return
+only the names and force you to stat each file in a separate
+roundtrip.)
+
+We likely don't want the full gnome/unix vfs implementation in
+glib. Instead, glib will only ship an implementation of the vfs API for
+local file access, and one that communicates with the vfs
+daemon(s). Then we ship the daemon and the implementations of the
+various backends externally.
+
+We will also write a single gnome-vfs backend that allows access to
+all the glib vfs shares by using a uri like gvfs:///XXX that just maps
+to //.network/XXX. We can also implement a similar backend for kio so
+that kde applications can read and write to the shares.
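+
+The mapping itself is just a prefix swap. A minimal sketch, assuming
+a hypothetical helper name and ignoring percent-escapes in the URI
+(a real backend would have to decode those):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define GVFS_URI_PREFIX "gvfs:///"
#define VFS_NAME_PREFIX "//.network/"

/*
 * Rewrites "gvfs:///XXX" as "//.network/XXX".
 * Returns 0 on success, -1 if uri uses another scheme or out is too
 * small. Percent-escapes in XXX are copied through untouched.
 */
int gvfs_uri_to_vfs_name(const char *uri, char *out, size_t out_size)
{
    size_t plen = strlen(GVFS_URI_PREFIX);
    int n;

    if (strncmp(uri, GVFS_URI_PREFIX, plen) != 0)
        return -1;
    n = snprintf(out, out_size, "%s%s", VFS_NAME_PREFIX, uri + plen);
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```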
+
+Furthermore, if FUSE is supported on the system we can write a FUSE
+filesystem so that we can access the files as $HOME/.network/XXX. This
+can be made extra nice if the application (e.g. acrobat) uses
+the gtk+ file selector but not the vfs: the file selector can
+detect a filename like this, reverse-map it into a vfs pathname
+and use the vfs for folder access.
+
+I've been doing some initial sketching of the glib API, and I've
+started by introducing base GInputStream and GOutputStream similar to
+the stream objects in Java and .Net. These have async i/o support and
+will make the API for reading and writing files nicer and more
+modern. There is also a GSeekable interface that streams can
+optionally implement if they support seeking.
+
+I've also introduced a GFile object that wraps a file path. This means
+you don't have to do tedious string operations on the pathnames to
+navigate the filesystem. It also means we can use the openat() class
+of file operations when traversing the filesystem tree, avoiding some
+forms of races when we do things like recursive copies.
+
+To support the stateful model and still have some form of caching we
+will also need to add some cache specific api so that you can trigger
+a reload of information from a directory. Otherwise a reload operation
+in the file manager wouldn't always get the latest state on something
+like a ftp share where we cache things aggressively.
+
+I have some initial code here for some of the basic APIs, but it's far
+from finished and I'd like to spend some more time working on it
+before I present it. However, I think the general architecture is
+pretty sound and in a state where it can be discussed.
+
+Hopefully this description of my plans is enough to make people
+understand some of my ideas and allow us to start a discussion about
+the future of gnome-vfs. Also, consider it a heads up that I and other
+people will likely be working on this in the future.
diff --git a/trunk/txt/vfs-names.txt b/trunk/txt/vfs-names.txt
new file mode 100644
index 00000000..25833d7c
--- /dev/null
+++ b/trunk/txt/vfs-names.txt
@@ -0,0 +1,142 @@
+Local filenames (in utf8 mode)
+1) standard: /etc/passwd
+2) utf8 and spaces: "/tmp/a åäö.txt" (encoding==utf8)
+3) latin-1 and spaces: "/tmp/a åäö.txt" (encoding==iso8859-1)
+4) filename without encoding: "/tmp/bad:\001\010\011\012\013" (as a C string)
+5) mountpoint: /mnt/cdrom (cd has title "CD Title")
+
+Ftp mount to ftp.gnome.org
+(where filenames are stored as utf8, this is detected by using
+ ftp protocol extensions (there is an rfc) or by having the user
+ specify the encoding at mount time)
+
+6) normal dir: /pub/sources
+7) valid utf8 name: /dir/a åäö.txt
+8) latin-1 name: /dir/a åäö.txt
+
+Ftp mount to ftp.gnome.org (with filenames in latin-1)
+9) latin-1 name: /dir/a åäö.txt
+
+Backend that stores the display name separately from the real name.
+Examples could be a flickr backend, a file backend that handles desktop
+files, or a virtual location like computer:// (which is implemented
+using virtual desktop files atm).
+
+10) /tmp/foo.desktop (with Name[en]="Display Name")
+
+special cases:
+ftp names relative to login dir
+
+Places where display filenames (i.e. utf-8 strings) are used:
+
+A) Absolute filename, for editing (nautilus text entry, file selector entry)
+B) Semi-Absolute filename, for display (nautilus window title)
+C) Relative file name, for display (in nautilus/file selector icon/list view)
+D) Relative file name, for editing (rename in nautilus)
+E) Relative file name, for creating absolute name (filename completion for A)
+ This needs to know the exact form of the parent (i.e. it differs for filename vs uri).
+ I won't list this below as it's always the same as A from the last slash to the end.
+
+This is how these work with gnome-vfs uris:
+
+    A                                                   B                           C                            D
+1)  file:///etc/passwd                                  passwd                      passwd                       passwd
+2)  file:///tmp/a%20%C3%A5%C3%A4%C3%B6.txt              a åäö.txt                   a åäö.txt                    a åäö.txt
+3)  file:///tmp/a%20%E5%E4%F6.txt                       a ???.txt                   a ???.txt (invalid unicode)  a ???.txt
+4)  file:///tmp/bad%3A%01%08%09%0A%0B                   bad:?????                   bad:????? (invalid unicode)  bad:?????
+5)  file:///mnt/cdrom                                   CD Title (cdrom)            CD Title (cdrom)             CD Title
+6)  ftp://ftp.gnome.org/pub/sources                     sources on ftp.gnome.org    sources                      sources
+7)  ftp://ftp.gnome.org/dir/a%20%C3%A5%C3%A4%C3%B6.txt  a åäö.txt on ftp.gnome.org  a åäö.txt                    a åäö.txt
+8)  ftp://ftp.gnome.org/dir/a%20%E5%E4%F6.txt           a ???.txt on ftp.gnome.org  a ???.txt (invalid unicode)  a ???.txt
+9)  ftp://ftp.gnome.org/dir/a%20%E5%E4%F6.txt           a åäö.txt on ftp.gnome.org  a åäö.txt                    a åäö.txt
+10) file:///tmp/foo.desktop                             Display Name                Display Name                 Display Name
+
+The stuff in column A is pretty insane. It works fine as an identifier
+for the computer to use, but nobody would want to have to type that in
+or look at that all the time. That is why Nautilus also allows
+entering some filenames as absolute unix pathnames, although not all
+filenames can be specified this way. If used when possible the column
+looks like this:
+
+ A
+1) /etc/passwd
+2) /tmp/a åäö.txt
+3) file:///tmp/a%20%E5%E4%F6.txt
+4) file:///tmp/bad%3A%01%08%09%0A%0B
+5) /mnt/cdrom
+6) ftp://ftp.gnome.org/pub/sources
+7) ftp://ftp.gnome.org/dir/a%20%C3%A5%C3%A4%C3%B6.txt
+8) ftp://ftp.gnome.org/dir/a%20%E5%E4%F6.txt
+9) ftp://ftp.gnome.org/dir/a%20%E5%E4%F6.txt
+10)/tmp/foo.desktop
+
+As we see this helps for most normal local paths, but it becomes
+problematic when the filenames are in the wrong encoding. For
+non-local files it doesn't help at all. We still have to look at these
+horrible escapes, even when we know the encoding of the filename.
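+
+To see where those escapes come from, here is a small sketch of
+percent-encoding the raw bytes of a path. The function names are made
+up for illustration, and the set of characters left unescaped (the
+RFC 3986 "unreserved" set plus '/') is a simplification, not what
+gnome-vfs necessarily does. The same displayed name gives different
+escapes depending on the on-disk encoding, which is exactly the
+problem.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* RFC 3986 unreserved characters, plus '/' so separators survive. */
int is_path_safe(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
           (c >= '0' && c <= '9') ||
           c == '-' || c == '.' || c == '_' || c == '~' || c == '/';
}

/*
 * Percent-encodes the bytes of path into out. The encoding of the
 * filename is not consulted: every byte outside the safe set becomes
 * %XX. Returns 0 on success, -1 if out is too small.
 */
int escape_path(const char *path, char *out, size_t out_size)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t o = 0;

    for (; *path != '\0'; path++) {
        unsigned char c = (unsigned char)*path;

        if (is_path_safe(c)) {
            if (o + 1 >= out_size)
                return -1;
            out[o++] = (char)c;
        } else {
            if (o + 3 >= out_size)
                return -1;
            out[o++] = '%';
            out[o++] = hex[c >> 4];
            out[o++] = hex[c & 0x0F];
        }
    }
    if (o >= out_size)
        return -1;
    out[o] = '\0';
    return 0;
}
```

+
+Prepending "file://" to the result gives the column A form for the
+local examples above.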
+
+The examples 7-9 in this version show the problem with URIs. Suppose
+we allowed an invalid URI like "ftp://ftp.gnome.org/dir/a åäö.txt"
+(utf8-encoded string). Given the state inherent in the mountpoint we
+know what encoding is used for the ftp server, so if someone types it
+in we know which file they mean. However, suppose someone pastes a URI
+like that into firefox, or mails it to someone, now we can't
+reconstruct the real valid URI anymore. If you drag and drop it
+however, the code can send the real valid uri so that firefox can load
+it correctly.
+
+So, this introduces two kinds of URIs that are "mostly similar" but
+break in many nonobvious cases. This is very unfortunate, and imho not
+acceptable. I think it's ok to accept a URI typed in like
+"ftp://ftp.gnome.org/dir/a åäö.txt" and convert it to the right uri,
+but it's not right to display such a uri in the nautilus location bar,
+as that can result in that invalid uri getting into other places.
+
+Since I dislike showing invalid URIs in the UI I think it makes sense
+to create a new absolute pathname display and entry format. Ideally
+such a system should allow any ascii or utf8 local filename to be
+represented as itself. Furthermore it would allow input of URIs, but
+immediately convert them to the display format (similar to how
+inputting a file:// uri in nautilus displays as a normal filename).
+
+One solution would be to use some other prefix than / for
+non-local files, and to use some form of escaping only for non-utf8
+chars and non-printables. Here is an example:
+
+ A
+1) /etc/passwd
+2) /tmp/a åäö.txt
+3) /tmp/a \xE5\xE4\xF6.txt
+4) /tmp/bad:\x01\x08\x09\x0A\x0B
+5) /mnt/cdrom
+6) :ftp:ftp.gnome.org/pub/sources
+7) :ftp:ftp.gnome.org/dir/a åäö.txt
+8) :ftp:ftp.gnome.org/dir/a \xE5\xE4\xF6.txt
+9) :ftp:ftp.gnome.org/dir/a åäö.txt
+10)/tmp/foo.desktop
+
+Under the hood this would use proper, valid escaped URIs. However, we
+would display things in the UI that made some sense to users, only
+falling back to escaping in the last possible case.
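+
+A minimal sketch of the escaping step for a name whose bytes are
+supposed to be utf8: valid utf8 sequences pass through untouched,
+while invalid bytes and ascii control characters become \xNN. The
+function names are hypothetical. Two things are left open here that a
+real implementation would have to decide: a literal backslash in the
+filename is not escaped, so the result is not unambiguously
+reversible, and non-ascii control characters are not filtered.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Length of the valid UTF-8 sequence starting at s, or 0 if the bytes
 * there are not valid UTF-8. Rejects overlong forms, surrogates and
 * values above U+10FFFF. s must be NUL-terminated.
 */
size_t utf8_seq_len(const unsigned char *s)
{
    size_t len, i;
    unsigned long cp, min;

    if (s[0] < 0x80)
        return 1;
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; cp = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; cp = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; cp = s[0] & 0x07; min = 0x10000; }
    else
        return 0;

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;            /* also stops at the terminating NUL */
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    return len;
}

/*
 * Writes a display form of name into out, escaping invalid bytes and
 * ASCII control characters as \xNN. Returns 0, or -1 if out is too small.
 */
int filename_to_display(const char *name, char *out, size_t out_size)
{
    const unsigned char *s = (const unsigned char *)name;
    size_t o = 0;

    while (*s != '\0') {
        size_t len = utf8_seq_len(s);
        int printable = len > 1 || (len == 1 && *s >= 0x20 && *s != 0x7F);

        if (printable) {
            if (o + len >= out_size)
                return -1;
            memcpy(out + o, s, len);
            o += len;
            s += len;
        } else {
            if (o + 4 >= out_size)
                return -1;
            snprintf(out + o, 5, "\\x%02X", (unsigned)*s);
            o += 4;
            s += 1;
        }
    }
    if (o >= out_size)
        return -1;
    out[o] = '\0';
    return 0;
}
```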
+
+The API could look something like:
+
+GFile *g_file_new_from_filename (char *filename);
+GFile *g_file_new_from_uri (char *uri);
+GFile *g_file_parse_display_name (char *display_name);
+
+Another approach (mentioned by Jürg Billeter on irc yesterday) is to
+move from a pure textual representation of the full uri to a more
+structured UI. For example the ftp://ftp.gnome.org/ part of the URI
+could be converted to a single item in the entry looking like
+[#ftp.gnome.org] (where # is an ftp icon). Then the rest of the entry
+would edit just the path on the ftp server, as a local filename. The
+disadvantage here is that it's a bit harder to know how to type in a
+full pathname including what method to use and what server (you'd type
+in a URI). This isn't necessarily a huge problem if you rarely type in
+remote URIs (instead you can follow links, browse the network, add
+favourites, etc).
+
+I don't know how hard this is to do from a Gtk+ perspective
+though. It's somewhat similar to what the evolution address entry does.
+