Tagged for release 1.2.0 (tag: GVFS_1_2_0)

svn path=/tags/GVFS_1_2_0/; revision=2331
author: Alexander Larsson <alexl@src.gnome.org> 2009-03-16 11:43:23 +0000
committer: Alexander Larsson <alexl@src.gnome.org> 2009-03-16 11:43:23 +0000
commit: 4ad537c5c3e17e1efe289020d7dc6cd0efae42c5
tree: 891f2ec720f5ae321762965a00d352ad0a1592a2 /trunk/txt
parent: 4c59b80ab2b0e942bd45ff12f238038293d21821
download: gvfs-82d3197d52d9a1f8a1a1b928e2550444138d088b.tar.gz
6 files changed, 6770 insertions, 0 deletions
diff --git a/trunk/txt/gvfs_dbus.txt b/trunk/txt/gvfs_dbus.txt
new file mode 100644
index 00000000..86ac19ac
--- /dev/null
+++ b/trunk/txt/gvfs_dbus.txt
@@ -0,0 +1,65 @@
+how to chain to simple stuff
+
+how to parse uris (i.e. map to mounts)
+
+what connections do we have:
+shared dbus connection
+connection to main daemon
+connection to each mount daemon
+
+"fast ops" (uri->gfile) vs blocking ops (read, open etc) and how to avoid slow ops blocking fast ones
+
+
+each thread has, on demand:
+connection to main daemon
+connection to some mount daemons
+
+global state:
+cache of previously used mountpoints
+
+
+how to mount
+
+how to store/restore permanent mounts with the session => store as drives (mountpoints), not volumes!
+
+Don't always want to log in to all mounts on login? (mountpoints!)
+
+computer:// handled in main daemon?
+
+No volume monitor in public API, only computer:// ?
+Problems:
+* mounted (desktop/computer:, trash dir)
+* unmounted/pre_unmount (desktop/computer:, close windows on unmounted volumes, trash dir)
+* map path to volume (close windows on unmounted volumes, check for readonly mount, get volume name)
+* get all drives/volumes (detecting where to show eject, mount, unmount menu items,
+			  tree view, places sidebar, display volume icon in pathbar)
+* eject/unmount ops
+* needs eject
+
+unmounted URI => return a mountpoint object?
+
+GMountOperation, async mount operation object
+signals => passwd, question, keyring?
+
+GFile mountpoint => GMountOperation
+
+What process calls gnome-keyring? 
+
+
+
+--------------------
+
+GFile creation => decompose URI, no i/o
+
+on i/o:
+ * figure out mountpoint (for now, always toplevel uri location)
+ * if we have a local dbus connection to that, use it, otherwise:
+   + create (if needed) local session dbus connection
+   + ask for mount daemon for new session
+     - If not existing, error on i/o, return mountpoint type on get_info
+   + set up new local connection with the mount daemon
+ * send dbus message
+ * receive answer; if it has the magic flag, it is followed by an fd sent via sendmsg() (created by socketpair())
+
+ 
+ 
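The last step of the i/o sequence in gvfs_dbus.txt (a reply followed by a file descriptor sent with sendmsg()) can be sketched with plain POSIX calls. This is a minimal illustration, assuming a Unix-domain socket between client and daemon and the SCM_RIGHTS ancillary-message mechanism; the names send_fd and recv_fd are hypothetical and do not come from the notes or from gvfs itself.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: send one open file descriptor over a Unix-domain
 * socket as SCM_RIGHTS ancillary data. Returns 0 on success, -1 on error. */
int send_fd(int sock, int fd_to_send)
{
    char payload = 'F';   /* one byte of ordinary data accompanies the fd */
    struct iovec iov = { .iov_base = &payload, .iov_len = 1 };
    union {               /* union guarantees cmsghdr alignment */
        struct cmsghdr align;
        char buf[CMSG_SPACE(sizeof(int))];
    } control;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&control, 0, sizeof control);
    memset(&msg, 0, sizeof msg);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control.buf;
    msg.msg_controllen = sizeof control.buf;

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one file descriptor sent by send_fd().
 * Returns the new descriptor, or -1 on error. */
int recv_fd(int sock)
{
    char payload;
    struct iovec iov = { .iov_base = &payload, .iov_len = 1 };
    union {
        struct cmsghdr align;
        char buf[CMSG_SPACE(sizeof(int))];
    } control;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd = -1;

    memset(&control, 0, sizeof control);
    memset(&msg, 0, sizeof msg);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control.buf;
    msg.msg_controllen = sizeof control.buf;

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    if (msg.msg_flags & MSG_CTRUNC)      /* ancillary data was cut off */
        return -1;

    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

In this sketch the daemon side would call send_fd() on its end of the per-mount connection after the reply carrying the "magic flag", and the client would call recv_fd() only when it sees that flag. The descriptor being passed could be one end of a socketpair(), as the notes suggest.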
diff --git a/trunk/txt/ops.txt b/trunk/txt/ops.txt
new file mode 100644
index 00000000..c1f04c34
--- /dev/null
+++ b/trunk/txt/ops.txt
@@ -0,0 +1,140 @@
+type: File, Folder, Symlink, Shortcut, Mountable, special (fifo, socket, chardev, blockdev)
+flags: hidden, 	
+GFileInfo {
+   type get_type()
+   char *get_name()
+   char *get_display_name()
+   char *get_icon() /* string? what about win32, remote icons etc */
+   
+   gint64 get_file_size()
+   char *get_mime_type()
+   char *get_link_target()
+   can_read()/write()/delete()/rename()/maybe: move()/copy()
+   flags get_flags()
+   time_t get_modification_time()
+   gboolean get_unix_stat ()
+   char *get_attribute()
+   char **get_attributes(char *namespace)
+   char **get_all_attributes() /* form namespace:attrname -> string */
+}	
+
+GFSInfo {
+   char *get_fs_type()
+   gint64 get_free_space()
+   gint64 get_total_space()
+   char * get_hal_uid()
+   can_unmount()
+   can_eject()
+   must_eject()
+   
+}
+
+GFile *g_file_for_path (char *path)
+GFile *g_file_for_uri (char *uri)
+GFile *g_file_parse_display_name (char *display_name)
+
+GFile {
+ char *get_path()
+ is_native() => is_file:///
+ 
+ char *get_uri ();
+ char *get_absolute_display_name ()
+ 
+ set_keep_open(boolean keep_open)
+   
+ GFile *get_parent ()
+ GFile *get_child (char *name)
+ GFileEnumerator *enumerate_children(flags, attributes... "*", "vfs:*;dav:*;foo:bar")
+ GFileInfo *get_info (flags, attributes...)
+ 
+ void reload()
+ GInputStream *read()
+
+ GOutputStream *append_to() /* optional (not on webdav) */
+ GOutputStream *create()
+ GSaveStream *replace(mtime, backup_name, )
+/* permissions are all set minus umask, except replace which
+   saves old permissions */
+
+/* ?? */
+ GFile *resolve_symlink(char *symlink_target);
+   
+/* output ops */
+ write/save
+ rename
+ move
+ copy
+ delete
+ mkdir
+ rmdir
+ display name -> filename (for new files)
+ set attrs
+
+ /* other ops: */
+ monitor(flags) + signals
+ mount/unmount
+ list volumes
+
+ Maybe:
+ GFile *new_from_uri(path, flags) (file:/// uris)
+
+}
+
+
+names:
+  URIs == raw filename (no encoding), all escaped
+  We generate display absolute paths as filenames if possible, otherwise
+  as IRIs. This means we can display nice URIs for native utf8 backends
+  and filenames. However, URIs for non-utf8 shares will look bad. If we know
+  the encoding we can still get nice non-absolute display names though.
+  
+  In client we store names as mountpoint + non-escaped no-encoding string.
+  Non-uri display name handling done in daemon
+  
+GStatable iface for fstat() support
+GSaveStream, with get_final_file_info()
+
+open for writing:
+
+append vs truncate
+fail on existing or replace
+
+mtime match
+mtime return
+backup (suffix+prefix)
+create filename from display name
+unique name
+keep inode or be atomic?
+
+filename_for_display_name()
+write_append() /* optional (not on webdav) */
+write_new()
+write_replace()
+
+
+
+
+ftp supports:
+ overwrite
+ append
+ generate unique name
+ 
+http+webdav supports:
+  overwrite
+  append in recent versions
+  get mtime, length, mimetype, atime on read open
+  
+
+
+
+
+async thread work:
+
+function to run in thread
+data to pass to thread
+cancel identifier
+pass in cancel func + data
+way for function to communicate with mainloop (of specific context)
+does mainloop notifiers block on ack?
+
+  
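The "open for writing" questions in ops.txt (mtime match, backup name, keeping old permissions, atomic replacement) can be made concrete with a local-file sketch. This is only an illustration for a POSIX filesystem, assuming the "be atomic" choice rather than "keep inode"; the function name replace_file, its parameters, and its return codes are hypothetical and are not the API proposed in the notes.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical sketch of write_replace() for a local file.
 *   expected_mtime: if nonzero, refuse to replace when the existing file's
 *                   mtime differs (someone else changed it).
 *   backup_suffix:  if non-NULL, keep the old contents as path + suffix.
 * Returns 0 on success, -2 on mtime mismatch, -1 on system error (errno set). */
int replace_file(const char *path, const void *data, size_t len,
                 time_t expected_mtime, const char *backup_suffix)
{
    struct stat old;
    char tmp[4096], bak[4096];
    const char *p = data;
    size_t left = len;
    int fd, n, saved, have_old;

    have_old = (stat(path, &old) == 0);
    if (!have_old && errno != ENOENT)
        return -1;
    if (have_old && expected_mtime != 0 && old.st_mtime != expected_mtime)
        return -2;

    /* Write the new contents to a temporary file in the same directory. */
    n = snprintf(tmp, sizeof tmp, "%s.XXXXXX", path);
    if (n < 0 || (size_t)n >= sizeof tmp) { errno = ENAMETOOLONG; return -1; }
    fd = mkstemp(tmp);
    if (fd < 0)
        return -1;

    while (left > 0) {
        ssize_t w = write(fd, p, left);
        if (w < 0) {
            if (errno == EINTR) continue;
            goto fail;
        }
        p += w;
        left -= (size_t)w;
    }

    /* Replace keeps the old permissions. A new file keeps mkstemp's 0600
     * here; a fuller version would apply the default mode minus umask. */
    if (have_old && fchmod(fd, old.st_mode & 07777) != 0) goto fail;
    if (fsync(fd) != 0) goto fail;
    if (close(fd) != 0) { fd = -1; goto fail; }
    fd = -1;

    /* Hard-link the old file to the backup name so that path never
     * disappears; this assumes the filesystem supports hard links. */
    if (have_old && backup_suffix != NULL) {
        n = snprintf(bak, sizeof bak, "%s%s", path, backup_suffix);
        if (n < 0 || (size_t)n >= sizeof bak) { errno = ENAMETOOLONG; goto fail; }
        unlink(bak);
        if (link(path, bak) != 0) goto fail;
    }

    /* rename() swaps the new contents in atomically; the inode changes. */
    if (rename(tmp, path) != 0) goto fail;
    return 0;

fail:
    saved = errno;
    if (fd >= 0) close(fd);
    unlink(tmp);
    errno = saved;
    return -1;
}
```

The trade-off named in the notes shows up directly: the rename() approach is atomic but gives the file a new inode, so hard links to the old file and some extended metadata are not carried over. A "keep inode" variant would instead truncate and rewrite in place and give up atomicity.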
diff --git a/trunk/txt/rfc3986.txt b/trunk/txt/rfc3986.txt
new file mode 100644
index 00000000..c56ed4eb
--- /dev/null
+++ b/trunk/txt/rfc3986.txt
@@ -0,0 +1,3419 @@
+
+
+
+
+
+
+Network Working Group                                     T. Berners-Lee
+Request for Comments: 3986                                       W3C/MIT
+STD: 66                                                      R. Fielding
+Updates: 1738                                               Day Software
+Obsoletes: 2732, 2396, 1808                                  L. Masinter
+Category: Standards Track                                  Adobe Systems
+                                                            January 2005
+
+
+           Uniform Resource Identifier (URI): Generic Syntax
+
+Status of This Memo
+
+   This document specifies an Internet standards track protocol for the
+   Internet community, and requests discussion and suggestions for
+   improvements.  Please refer to the current edition of the "Internet
+   Official Protocol Standards" (STD 1) for the standardization state
+   and status of this protocol.  Distribution of this memo is unlimited.
+
+Copyright Notice
+
+   Copyright (C) The Internet Society (2005).
+
+Abstract
+
+   A Uniform Resource Identifier (URI) is a compact sequence of
+   characters that identifies an abstract or physical resource.  This
+   specification defines the generic URI syntax and a process for
+   resolving URI references that might be in relative form, along with
+   guidelines and security considerations for the use of URIs on the
+   Internet.  The URI syntax defines a grammar that is a superset of all
+   valid URIs, allowing an implementation to parse the common components
+   of a URI reference without knowing the scheme-specific requirements
+   of every possible identifier.  This specification does not define a
+   generative grammar for URIs; that task is performed by the individual
+   specifications of each URI scheme.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Berners-Lee, et al.         Standards Track                     [Page 1]
+
+RFC 3986                   URI Generic Syntax               January 2005
+
+
+Table of Contents
+
+   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  4
+       1.1.  Overview of URIs . . . . . . . . . . . . . . . . . . . .  4
+             1.1.1.  Generic Syntax . . . . . . . . . . . . . . . . .  6
+             1.1.2.  Examples . . . . . . . . . . . . . . . . . . . .  7
+             1.1.3.  URI, URL, and URN  . . . . . . . . . . . . . . .  7
+       1.2.  Design Considerations  . . . . . . . . . . . . . . . . .  8
+             1.2.1.  Transcription  . . . . . . . . . . . . . . . . .  8
+             1.2.2.  Separating Identification from Interaction . . .  9
+             1.2.3.  Hierarchical Identifiers . . . . . . . . . . . . 10
+       1.3.  Syntax Notation  . . . . . . . . . . . . . . . . . . . . 11
+   2.  Characters . . . . . . . . . . . . . . . . . . . . . . . . . . 11
+       2.1.  Percent-Encoding . . . . . . . . . . . . . . . . . . . . 12
+       2.2.  Reserved Characters  . . . . . . . . . . . . . . . . . . 12
+       2.3.  Unreserved Characters  . . . . . . . . . . . . . . . . . 13
+       2.4.  When to Encode or Decode . . . . . . . . . . . . . . . . 14
+       2.5.  Identifying Data . . . . . . . . . . . . . . . . . . . . 14
+   3.  Syntax Components  . . . . . . . . . . . . . . . . . . . . . . 16
+       3.1.  Scheme . . . . . . . . . . . . . . . . . . . . . . . . . 17
+       3.2.  Authority  . . . . . . . . . . . . . . . . . . . . . . . 17
+             3.2.1.  User Information . . . . . . . . . . . . . . . . 18
+             3.2.2.  Host . . . . . . . . . . . . . . . . . . . . . . 18
+             3.2.3.  Port . . . . . . . . . . . . . . . . . . . . . . 22
+       3.3.  Path . . . . . . . . . . . . . . . . . . . . . . . . . . 22
+       3.4.  Query  . . . . . . . . . . . . . . . . . . . . . . . . . 23
+       3.5.  Fragment . . . . . . . . . . . . . . . . . . . . . . . . 24
+   4.  Usage  . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
+       4.1.  URI Reference  . . . . . . . . . . . . . . . . . . . . . 25
+       4.2.  Relative Reference . . . . . . . . . . . . . . . . . . . 26
+       4.3.  Absolute URI . . . . . . . . . . . . . . . . . . . . . . 27
+       4.4.  Same-Document Reference  . . . . . . . . . . . . . . . . 27
+       4.5.  Suffix Reference . . . . . . . . . . . . . . . . . . . . 27
+   5.  Reference Resolution . . . . . . . . . . . . . . . . . . . . . 28
+       5.1.  Establishing a Base URI  . . . . . . . . . . . . . . . . 28
+             5.1.1.  Base URI Embedded in Content . . . . . . . . . . 29
+             5.1.2.  Base URI from the Encapsulating Entity . . . . . 29
+             5.1.3.  Base URI from the Retrieval URI  . . . . . . . . 30
+             5.1.4.  Default Base URI . . . . . . . . . . . . . . . . 30
+       5.2.  Relative Resolution  . . . . . . . . . . . . . . . . . . 30
+             5.2.1.  Pre-parse the Base URI . . . . . . . . . . . . . 31
+             5.2.2.  Transform References . . . . . . . . . . . . . . 31
+             5.2.3.  Merge Paths  . . . . . . . . . . . . . . . . . . 32
+             5.2.4.  Remove Dot Segments  . . . . . . . . . . . . . . 33
+       5.3.  Component Recomposition  . . . . . . . . . . . . . . . . 35
+       5.4.  Reference Resolution Examples  . . . . . . . . . . . . . 35
+             5.4.1.  Normal Examples  . . . . . . . . . . . . . . . . 36
+             5.4.2.  Abnormal Examples  . . . . . . . . . . . . . . . 36
+
+
+
+Berners-Lee, et al.         Standards Track                     [Page 2]
+
+RFC 3986                   URI Generic Syntax               January 2005
+
+
+   6.  Normalization and Comparison . . . . . . . . . . . . . . . . . 38
+       6.1.  Equivalence  . . . . . . . . . . . . . . . . . . . . . . 38
+       6.2.  Comparison Ladder  . . . . . . . . . . . . . . . . . . . 39
+             6.2.1.  Simple String Comparison . . . . . . . . . . . . 39
+             6.2.2.  Syntax-Based Normalization . . . . . . . . . . . 40
+             6.2.3.  Scheme-Based Normalization . . . . . . . . . . . 41
+             6.2.4.  Protocol-Based Normalization . . . . . . . . . . 42
+   7.  Security Considerations  . . . . . . . . . . . . . . . . . . . 43
+       7.1.  Reliability and Consistency  . . . . . . . . . . . . . . 43
+       7.2.  Malicious Construction . . . . . . . . . . . . . . . . . 43
+       7.3.  Back-End Transcoding . . . . . . . . . . . . . . . . . . 44
+       7.4.  Rare IP Address Formats  . . . . . . . . . . . . . . . . 45
+       7.5.  Sensitive Information  . . . . . . . . . . . . . . . . . 45
+       7.6.  Semantic Attacks . . . . . . . . . . . . . . . . . . . . 45
+   8.  IANA Considerations  . . . . . . . . . . . . . . . . . . . . . 46
+   9.  Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 46
+   10. References . . . . . . . . . . . . . . . . . . . . . . . . . . 46
+       10.1. Normative References . . . . . . . . . . . . . . . . . . 46
+       10.2. Informative References . . . . . . . . . . . . . . . . . 47
+   A.  Collected ABNF for URI . . . . . . . . . . . . . . . . . . . . 49
+   B.  Parsing a URI Reference with a Regular Expression  . . . . . . 50
+   C.  Delimiting a URI in Context  . . . . . . . . . . . . . . . . . 51
+   D.  Changes from RFC 2396  . . . . . . . . . . . . . . . . . . . . 53
+       D.1.  Additions  . . . . . . . . . . . . . . . . . . . . . . . 53
+       D.2.  Modifications  . . . . . . . . . . . . . . . . . . . . . 53
+   Index  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
+   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 60
+   Full Copyright Statement . . . . . . . . . . . . . . . . . . . . . 61
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Berners-Lee, et al.         Standards Track                     [Page 3]
+
+RFC 3986                   URI Generic Syntax               January 2005
+
+
+1.  Introduction
+
+   A Uniform Resource Identifier (URI) provides a simple and extensible
+   means for identifying a resource.  This specification of URI syntax
+   and semantics is derived from concepts introduced by the World Wide
+   Web global information initiative, whose use of these identifiers
+   dates from 1990 and is described in "Universal Resource Identifiers
+   in WWW" [RFC1630].  The syntax is designed to meet the
+   recommendations laid out in "Functional Recommendations for Internet
+   Resource Locators" [RFC1736] and "Functional Requirements for Uniform
+   Resource Names" [RFC1737].
+
+   This document obsoletes [RFC2396], which merged "Uniform Resource
+   Locators" [RFC1738] and "Relative Uniform Resource Locators"
+   [RFC1808] in order to define a single, generic syntax for all URIs.
+   It obsoletes [RFC2732], which introduced syntax for an IPv6 address.
+   It excludes portions of RFC 1738 that defined the specific syntax of
+   individual URI schemes; those portions will be updated as separate
+   documents.  The process for registration of new URI schemes is
+   defined separately by [BCP35].  Advice for designers of new URI
+   schemes can be found in [RFC2718].  All significant changes from RFC
+   2396 are noted in Appendix D.
+
+   This specification uses the terms "character" and "coded character
+   set" in accordance with the definitions provided in [BCP19], and
+   "character encoding" in place of what [BCP19] refers to as a
+   "charset".
+
+1.1.  Overview of URIs
+
+   URIs are characterized as follows:
+
+   Uniform
+
+      Uniformity provides several benefits.  It allows different types
+      of resource identifiers to be used in the same context, even when
+      the mechanisms used to access those resources may differ.  It
+      allows uniform semantic interpretation of common syntactic
+      conventions across different types of resource identifiers.  It
+      allows introduction of new types of resource identifiers without
+      interfering with the way that existing identifiers are used.  It
+      allows the identifiers to be reused in many different contexts,
+      thus permitting new applications or protocols to leverage a pre-
+      existing, large, and widely used set of resource identifiers.
+
+
+
+
+
+
+
+Berners-Lee, et al.         Standards Track                     [Page 4]
+
+RFC 3986                   URI Generic Syntax               January 2005
+
+
+   Resource
+
+      This specification does not limit the scope of what might be a
+      resource; rather, the term "resource" is used in a general sense
+      for whatever might be identified by a URI.  Familiar examples
+      include an electronic document, an image, a source of information
+      with a consistent purpose (e.g., "today's weather report for Los
+      Angeles"), a service (e.g., an HTTP-to-SMS gateway), and a
+      collection of other resources.  A resource is not necessarily
+      accessible via the Internet; e.g., human beings, corporations, and
+      bound books in a library can also be resources.  Likewise,
+      abstract concepts can be resources, such as the operators and
+      operands of a mathematical equation, the types of a relationship
+      (e.g., "parent" or "employee"), or numeric values (e.g., zero,
+      one, and infinity).
+
+   Identifier
+
+      An identifier embodies the information required to distinguish
+      what is being identified from all other things within its scope of
+      identification.  Our use of the terms "identify" and "identifying"
+      refer to this purpose of distinguishing one resource from all
+      other resources, regardless of how that purpose is accomplished
+      (e.g., by name, address, or context).  These terms should not be
+      mistaken as an assumption that an identifier defines or embodies
+      the identity of what is referenced, though that may be the case
+      for some identifiers.  Nor should it be assumed that a system
+      using URIs will access the resource identified: in many cases,
+      URIs are used to denote resources without any intention that they
+      be accessed.  Likewise, the "one" resource identified might not be
+      singular in nature (e.g., a resource might be a named set or a
+      mapping that varies over time).
+
+   A URI is an identifier consisting of a sequence of characters
+   matching the syntax rule named <URI> in Section 3.  It enables
+   uniform identification of resources via a separately defined
+   extensible set of naming schemes (Section 3.1).  How that
+   identification is accomplished, assigned, or enabled is delegated to
+   each scheme specification.
+
+   This specification does not place any limits on the nature of a
+   resource, the reasons why an application might seek to refer to a
+   resource, or the kinds of systems that might use URIs for the sake of
+   identifying resources.  This specification does not require that a
+   URI persists in identifying the same resource over time, though that
+   is a common goal of all URI schemes.  Nevertheless, nothing in this
+
+
+
+
+
+Berners-Lee, et al.         Standards Track                     [Page 5]
+
+RFC 3986                   URI Generic Syntax               January 2005
+
+
+   specification prevents an application from limiting itself to
+   particular types of resources, or to a subset of URIs that maintains
+   characteristics desired by that application.
+
+   URIs have a global scope and are interpreted consistently regardless
+   of context, though the result of that interpretation may be in
+   relation to the end-user's context.  For example, "http://localhost/"
+   has the same interpretation for every user of that reference, even
+   though the network interface corresponding to "localhost" may be
+   different for each end-user: interpretation is independent of access.
+   However, an action made on the basis of that reference will take
+   place in relation to the end-user's context, which implies that an
+   action intended to refer to a globally unique thing must use a URI
+   that distinguishes that resource from all other things.  URIs that
+   identify in relation to the end-user's local context should only be
+   used when the context itself is a defining aspect of the resource,
+   such as when an on-line help manual refers to a file on the end-
+   user's file system (e.g., "file:///etc/hosts").
+
+1.1.1.  Generic Syntax
+
+   Each URI begins with a scheme name, as defined in Section 3.1, that
+   refers to a specification for assigning identifiers within that
+   scheme.  As such, the URI syntax is a federated and extensible naming
+   system wherein each scheme's specification may further restrict the
+   syntax and semantics of identifiers using that scheme.
+
+   This specification defines those elements of the URI syntax that are
+   required of all URI schemes or are common to many URI schemes.  It
+   thus defines the syntax and semantics needed to implement a scheme-
+   independent parsing mechanism for URI references, by which the
+   scheme-dependent handling of a URI can be postponed until the
+   scheme-dependent semantics are needed.  Likewise, protocols and data
+   formats that make use of URI references can refer to this
+   specification as a definition for the range of syntax allowed for all
+   URIs, including those schemes that have yet to be defined.  This
+   decouples the evolution of identification schemes from the evolution
+   of protocols, data formats, and implementations that make use of
+   URIs.
+
+   A parser of the generic URI syntax can parse any URI reference into
+   its major components.  Once the scheme is determined, further
+   scheme-specific parsing can be performed on the components.  In other
+   words, the URI generic syntax is a superset of the syntax of all URI
+   schemes.
+
+
+
+
+
+
+Berners-Lee, et al.         Standards Track                     [Page 6]
+
+RFC 3986                   URI Generic Syntax               January 2005
+
+
+1.1.2.  Examples
+
+   The following example URIs illustrate several URI schemes and
+   variations in their common syntax components:
+
+      ftp://ftp.is.co.za/rfc/rfc1808.txt
+
+      http://www.ietf.org/rfc/rfc2396.txt
+
+      ldap://[2001:db8::7]/c=GB?objectClass?one
+
+      mailto:John.Doe@example.com
+
+      news:comp.infosystems.www.servers.unix
+
+      tel:+1-816-555-1212
+
+      telnet://192.0.2.16:80/
+
+      urn:oasis:names:specification:docbook:dtd:xml:4.1.2
+
+
+1.1.3.  URI, URL, and URN
+
+   A URI can be further classified as a locator, a name, or both.  The
+   term "Uniform Resource Locator" (URL) refers to the subset of URIs
+   that, in addition to identifying a resource, provide a means of
+   locating the resource by describing its primary access mechanism
+   (e.g., its network "location").  The term "Uniform Resource Name"
+   (URN) has been used historically to refer to both URIs under the
+   "urn" scheme [RFC2141], which are required to remain globally unique
+   and persistent even when the resource ceases to exist or becomes
+   unavailable, and to any other URI with the properties of a name.
+
+   An individual scheme does not have to be classified as being just one
+   of "name" or "locator".  Instances of URIs from any given scheme may
+   have the characteristics of names or locators or both, often
+   depending on the persistence and care in the assignment of
+   identifiers by the naming authority, rather than on any quality of
+   the scheme.  Future specifications and related documentation should
+   use the general term "URI" rather than the more restrictive terms
+   "URL" and "URN" [RFC3305].
+
+
+
+
+
+
+
+
+
+Berners-Lee, et al.         Standards Track                     [Page 7]
+
+RFC 3986                   URI Generic Syntax               January 2005
+
+
+1.2.  Design Considerations
+
+1.2.1.  Transcription
+
+   The URI syntax has been designed with global transcription as one of
+   its main considerations.  A URI is a sequence of characters from a
+   very limited set: the letters of the basic Latin alphabet, digits,
+   and a few special characters.  A URI may be represented in a variety
+   of ways; e.g., ink on paper, pixels on a screen, or a sequence of
+   character encoding octets.  The interpretation of a URI depends only
+   on the characters used and not on how those characters are
+   represented in a network protocol.
+
+   The goal of transcription can be described by a simple scenario.
+   Imagine two colleagues, Sam and Kim, sitting in a pub at an
+   international conference and exchanging research ideas.  Sam asks Kim
+   for a location to get more information, so Kim writes the URI for the
+   research site on a napkin.  Upon returning home, Sam takes out the
+   napkin and types the URI into a computer, which then retrieves the
+   information to which Kim referred.
+
+   There are several design considerations revealed by the scenario:
+
+   o  A URI is a sequence of characters that is not always represented
+      as a sequence of octets.
+
+   o  A URI might be transcribed from a non-network source and thus
+      should consist of characters that are most likely able to be
+      entered into a computer, within the constraints imposed by
+      keyboards (and related input devices) across languages and
+      locales.
+
+   o  A URI often has to be remembered by people, and it is easier for
+      people to remember a URI when it consists of meaningful or
+      familiar components.
+
+   These design considerations are not always in alignment.  For
+   example, it is often the case that the most meaningful name for a URI
+   component would require characters that cannot be typed into some
+   systems.  The ability to transcribe a resource identifier from one
+   medium to another has been considered more important than having a
+   URI consist of the most meaningful of components.
+
+   In local or regional contexts and with improving technology, users
+   might benefit from being able to use a wider range of characters;
+   such use is not defined by this specification.  Percent-encoded
+   octets (Section 2.1) may be used within a URI to represent characters
+   outside the range of the US-ASCII coded character set if this
+
+
+
+Berners-Lee, et al.         Standards Track                     [Page 8]
+
+RFC 3986                   URI Generic Syntax               January 2005
+
+
+   representation is allowed by the scheme or by the protocol element in
+   which the URI is referenced.  Such a definition should specify the
+   character encoding used to map those characters to octets prior to
+   being percent-encoded for the URI.
+
+1.2.2.  Separating Identification from Interaction
+
+   A common misunderstanding of URIs is that they are only used to refer
+   to accessible resources.  The URI itself only provides
+   identification; access to the resource is neither guaranteed nor
+   implied by the presence of a URI.  Instead, any operation associated
+   with a URI reference is defined by the protocol element, data format
+   attribute, or natural language text in which it appears.
+
+   Given a URI, a system may attempt to perform a variety of operations
+   on the resource, as might be characterized by words such as "access",
+   "update", "replace", or "find attributes".  Such operations are
+   defined by the protocols that make use of URIs, not by this
+   specification.  However, we do use a few general terms for describing
+   common operations on URIs.  URI "resolution" is the process of
+   determining an access mechanism and the appropriate parameters
+   necessary to dereference a URI; this resolution may require several
+   iterations.  To use that access mechanism to perform an action on the
+   URI's resource is to "dereference" the URI.
+
+   When URIs are used within information retrieval systems to identify
+   sources of information, the most common form of URI dereference is
+   "retrieval": making use of a URI in order to retrieve a
+   representation of its associated resource.  A "representation" is a
+   sequence of octets, along with representation metadata describing
+   those octets, that constitutes a record of the state of the resource
+   at the time when the representation is generated.  Retrieval is
+   achieved by a process that might include using the URI as a cache key
+   to check for a locally cached representation, resolution of the URI
+   to determine an appropriate access mechanism (if any), and
+   dereference of the URI for the sake of applying a retrieval
+   operation.  Depending on the protocols used to perform the retrieval,
+   additional information might be supplied about the resource (resource
+   metadata) and its relation to other resources.
+
+   URI references in information retrieval systems are designed to be
+   late-binding: the result of an access is generally determined when it
+   is accessed and may vary over time or due to other aspects of the
+   interaction.  These references are created in order to be used in the
+   future: what is being identified is not some specific result that was
+   obtained in the past, but rather some characteristic that is expected
+   to be true for future results.  In such cases, the resource referred
+   to by the URI is actually a sameness of characteristics as observed
+
+
+
+Berners-Lee, et al.         Standards Track                     [Page 9]
+
+RFC 3986                   URI Generic Syntax               January 2005
+
+
+   over time, perhaps elucidated by additional comments or assertions
+   made by the resource provider.
+
+   Although many URI schemes are named after protocols, this does not
+   imply that use of these URIs will result in access to the resource
+   via the named protocol.  URIs are often used simply for the sake of
+   identification.  Even when a URI is used to retrieve a representation
+   of a resource, that access might be through gateways, proxies,
+   caches, and name resolution services that are independent of the
+   protocol associated with the scheme name.  The resolution of some
+   URIs may require the use of more than one protocol (e.g., both DNS
+   and HTTP are typically used to access an "http" URI's origin server
+   when a representation isn't found in a local cache).
+
+1.2.3.  Hierarchical Identifiers
+
+   The URI syntax is organized hierarchically, with components listed in
+   order of decreasing significance from left to right.  For some URI
+   schemes, the visible hierarchy is limited to the scheme itself:
+   everything after the scheme component delimiter (":") is considered
+   opaque to URI processing.  Other URI schemes make the hierarchy
+   explicit and visible to generic parsing algorithms.
+
+   The generic syntax uses the slash ("/"), question mark ("?"), and
+   number sign ("#") characters to delimit components that are
+   significant to the generic parser's hierarchical interpretation of an
+   identifier.  In addition to aiding the readability of such
+   identifiers through the consistent use of familiar syntax, this
+   uniform representation of hierarchy across naming schemes allows
+   scheme-independent references to be made relative to that hierarchy.
+
+   It is often the case that a group or "tree" of documents has been
+   constructed to serve a common purpose, wherein the vast majority of
+   URI references in these documents point to resources within the tree
+   rather than outside it.  Similarly, documents located at a particular
+   site are much more likely to refer to other resources at that site
+   than to resources at remote sites.  Relative referencing of URIs
+   allows document trees to be partially independent of their location
+   and access scheme.  For instance, it is possible for a single set of
+   hypertext documents to be simultaneously accessible and traversable
+   via each of the "file", "http", and "ftp" schemes if the documents
+   refer to each other with relative references.  Furthermore, such
+   document trees can be moved, as a whole, without changing any of the
+   relative references.
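+
+   As an illustration only (not part of the specification text), the
+   sketch below resolves one relative reference against three base URIs
+   that differ in scheme.  The base locations and the reference are
+   hypothetical, and Python's urllib.parse.urljoin stands in for the
+   resolution algorithm of Section 5.
+

```python
from urllib.parse import urljoin

# Hypothetical locations of the same document tree under three schemes.
BASES = [
    "file:///srv/docs/guide/index.html",
    "http://example.com/docs/guide/index.html",
    "ftp://example.com/docs/guide/index.html",
]

# A relative reference as it might appear inside index.html.
RELATIVE_REF = "../api/intro.html"


def resolve_all(bases, ref):
    """Resolve the same relative reference against each base URI."""
    return [urljoin(base, ref) for base in bases]


for target in resolve_all(BASES, RELATIVE_REF):
    print(target)
```
+
+   The reference text never changes; only the base URI does, which is
+   why the tree can be moved or served through another scheme.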
+
+   A relative reference (Section 4.2) refers to a resource by describing
+   the difference within a hierarchical name space between the reference
+   context and the target URI.  The reference resolution algorithm,
+
+
+
+Berners-Lee, et al.         Standards Track                    [Page 10]
+
+RFC 3986                   URI Generic Syntax               January 2005
+
+
+   presented in Section 5, defines how such a reference is transformed
+   to the target URI.  As relative references can only be used within
+   the context of a hierarchical URI, designers of new URI schemes
+   should use a syntax consistent with the generic syntax's hierarchical
+   components unless there are compelling reasons to forbid relative
+   referencing within that scheme.
+
+      NOTE: Previous specifications used the terms "partial URI" and
+      "relative URI" to denote a relative reference to a URI.  As some
+      readers misunderstood those terms to mean that relative URIs are a
+      subset of URIs rather than a method of referencing URIs, this
+      specification simply refers to them as relative references.
+
+   All URI references are parsed by generic syntax parsers when used.
+   However, because hierarchical processing has no effect on an absolute
+   URI used in a reference unless it contains one or more dot-segments
+   (complete path segments of "." or "..", as described in Section 3.3),
+   URI scheme specifications can define opaque identifiers by
+   disallowing use of slash characters, question mark characters, and
+   the URIs "scheme:." and "scheme:..".
+
+1.3.  Syntax Notation
+
+   This specification uses the Augmented Backus-Naur Form (ABNF)
+   notation of [RFC2234], including the following core ABNF syntax rules
+   defined by that specification: ALPHA (letters), CR (carriage return),
+   DIGIT (decimal digits), DQUOTE (double quote), HEXDIG (hexadecimal
+   digits), LF (line feed), and SP (space).  The complete URI syntax is
+   collected in Appendix A.
+
+2.  Characters
+
+   The URI syntax provides a method of encoding data, presumably for the
+   sake of identifying a resource, as a sequence of characters.  The URI
+   characters are, in turn, frequently encoded as octets for transport
+   or presentation.  This specification does not mandate any particular
+   character encoding for mapping between URI characters and the octets
+   used to store or transmit those characters.  When a URI appears in a
+   protocol element, the character encoding is defined by that protocol;
+   without such a definition, a URI is assumed to be in the same
+   character encoding as the surrounding text.
+
+   The ABNF notation defines its terminal values to be non-negative
+   integers (codepoints) based on the US-ASCII coded character set
+   [ASCII].  Because a URI is a sequence of characters, we must invert
+   that relation in order to understand the URI syntax.  Therefore, the
+   integer values used by the ABNF must be mapped back to their
+   corresponding characters via US-ASCII in order to complete the syntax
+   rules.
+
+   A URI is composed from a limited set of characters consisting of
+   digits, letters, and a few graphic symbols.  A reserved subset of
+   those characters may be used to delimit syntax components within a
+   URI while the remaining characters, including both the unreserved set
+   and those reserved characters not acting as delimiters, define each
+   component's identifying data.
+
+2.1.  Percent-Encoding
+
+   A percent-encoding mechanism is used to represent a data octet in a
+   component when that octet's corresponding character is outside the
+   allowed set or is being used as a delimiter of, or within, the
+   component.  A percent-encoded octet is encoded as a character
+   triplet, consisting of the percent character "%" followed by the two
+   hexadecimal digits representing that octet's numeric value.  For
+   example, "%20" is the percent-encoding for the binary octet
+   "00100000" (ABNF: %x20), which in US-ASCII corresponds to the space
+   character (SP).  Section 2.4 describes when percent-encoding and
+   decoding is applied.
+
+      pct-encoded = "%" HEXDIG HEXDIG
+
+   The uppercase hexadecimal digits 'A' through 'F' are equivalent to
+   the lowercase digits 'a' through 'f', respectively.  If two URIs
+   differ only in the case of hexadecimal digits used in percent-encoded
+   octets, they are equivalent.  For consistency, URI producers and
+   normalizers should use uppercase hexadecimal digits for all percent-
+   encodings.
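+
+   A minimal sketch of the triplet form described above.  The helper
+   names pct_encode_octet and same_pct_triplet are hypothetical and
+   chosen only for this illustration.
+

```python
def pct_encode_octet(octet: int) -> str:
    """Return the percent-encoded triplet for one octet, uppercase hex."""
    if not 0 <= octet <= 0xFF:
        raise ValueError("an octet must be in the range 0..255")
    return f"%{octet:02X}"


def same_pct_triplet(a: str, b: str) -> bool:
    """Compare two triplets, treating 'a'-'f' and 'A'-'F' as equivalent."""
    return a.upper() == b.upper()


print(pct_encode_octet(0x20))          # the space character (SP)
print(same_pct_triplet("%7e", "%7E"))  # hex digit case is not significant
```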
+
+2.2.  Reserved Characters
+
+   URIs include components and subcomponents that are delimited by
+   characters in the "reserved" set.  These characters are called
+   "reserved" because they may (or may not) be defined as delimiters by
+   the generic syntax, by each scheme-specific syntax, or by the
+   implementation-specific syntax of a URI's dereferencing algorithm.
+   If data for a URI component would conflict with a reserved
+   character's purpose as a delimiter, then the conflicting data must be
+   percent-encoded before the URI is formed.
+
+      reserved    = gen-delims / sub-delims
+
+      gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"
+
+      sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
+                  / "*" / "+" / "," / ";" / "="
+
+   The purpose of reserved characters is to provide a set of delimiting
+   characters that are distinguishable from other data within a URI.
+   URIs that differ in the replacement of a reserved character with its
+   corresponding percent-encoded octet are not equivalent.  Percent-
+   encoding a reserved character, or decoding a percent-encoded octet
+   that corresponds to a reserved character, will change how the URI is
+   interpreted by most applications.  Thus, characters in the reserved
+   set are protected from normalization and are therefore safe to be
+   used by scheme-specific and producer-specific algorithms for
+   delimiting data subcomponents within a URI.
+
+   A subset of the reserved characters (gen-delims) is used as
+   delimiters of the generic URI components described in Section 3.  A
+   component's ABNF syntax rule will not use the reserved or gen-delims
+   rule names directly; instead, each syntax rule lists the characters
+   allowed within that component (i.e., not delimiting it), and any of
+   those characters that are also in the reserved set are "reserved" for
+   use as subcomponent delimiters within the component.  Only the most
+   common subcomponents are defined by this specification; other
+   subcomponents may be defined by a URI scheme's specification, or by
+   the implementation-specific syntax of a URI's dereferencing
+   algorithm, provided that such subcomponents are delimited by
+   characters in the reserved set allowed within that component.
+
+   URI producing applications should percent-encode data octets that
+   correspond to characters in the reserved set unless these characters
+   are specifically allowed by the URI scheme to represent data in that
+   component.  If a reserved character is found in a URI component and
+   no delimiting role is known for that character, then it must be
+   interpreted as representing the data octet corresponding to that
+   character's encoding in US-ASCII.
+
+2.3.  Unreserved Characters
+
+   Characters that are allowed in a URI but do not have a reserved
+   purpose are called unreserved.  These include uppercase and lowercase
+   letters, decimal digits, hyphen, period, underscore, and tilde.
+
+      unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
+
+   URIs that differ in the replacement of an unreserved character with
+   its corresponding percent-encoded US-ASCII octet are equivalent: they
+   identify the same resource.  However, URI comparison implementations
+   do not always perform normalization prior to comparison (see Section
+   6).  For consistency, percent-encoded octets in the ranges of ALPHA
+   (%41-%5A and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E),
+   underscore (%5F), or tilde (%7E) should not be created by URI
+   producers and, when found in a URI, should be decoded to their
+   corresponding unreserved characters by URI normalizers.
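+
+   The normalization recommended above can be sketched as follows.
+   This is an illustrative helper (normalize_pct is a hypothetical
+   name), not a complete normalizer: it only decodes percent-encoded
+   unreserved characters and uppercases the hex digits of the
+   remaining triplets.
+

```python
import re
import string

# unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_pct(uri: str) -> str:
    """Decode unreserved triplets; uppercase the hex of all others."""
    def fix(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char
        return "%" + match.group(1).upper()

    return _PCT.sub(fix, uri)


print(normalize_pct("http://example.com/%7euser/%41%2fb"))
```
+
+   Note that "%2f" stays encoded: the slash is reserved, so decoding
+   it could change how the URI is interpreted.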
+
+2.4.  When to Encode or Decode
+
+   Under normal circumstances, the only time when octets within a URI
+   are percent-encoded is during the process of producing the URI from
+   its component parts.  This is when an implementation determines which
+   of the reserved characters are to be used as subcomponent delimiters
+   and which can be safely used as data.  Once produced, a URI is always
+   in its percent-encoded form.
+
+   When a URI is dereferenced, the components and subcomponents
+   significant to the scheme-specific dereferencing process (if any)
+   must be parsed and separated before the percent-encoded octets within
+   those components can be safely decoded, as otherwise the data may be
+   mistaken for component delimiters.  The only exception is for
+   percent-encoded octets corresponding to characters in the unreserved
+   set, which can be decoded at any time.  For example, the octet
+   corresponding to the tilde ("~") character is often encoded as "%7E"
+   by older URI processing implementations; the "%7E" can be replaced by
+   "~" without changing its interpretation.
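+
+   To illustrate why components must be separated before decoding,
+   the sketch below splits a hypothetical path into segments in both
+   orders.  The path value and the function names are assumptions
+   made for this example.
+

```python
from urllib.parse import unquote


def split_then_decode(path: str):
    """Separate segments first, then decode each one (the safe order)."""
    return [unquote(segment) for segment in path.split("/")]


def decode_then_split(path: str):
    """Decode first: an encoded "/" is now mistaken for a delimiter."""
    return unquote(path).split("/")


path = "reports/2009%2F03/summary"
print(split_then_decode(path))
print(decode_then_split(path))
```
+
+   Decoding first yields four segments instead of three, because the
+   data octet "%2F" has turned into a path delimiter.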
+
+   Because the percent ("%") character serves as the indicator for
+   percent-encoded octets, it must be percent-encoded as "%25" for that
+   octet to be used as data within a URI.  Implementations must not
+   percent-encode or decode the same string more than once, as decoding
+   an already decoded string might lead to misinterpreting a percent
+   data octet as the beginning of a percent-encoding, or vice versa in
+   the case of percent-encoding an already percent-encoded string.
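+
+   A short sketch of the hazard described above, using Python's
+   urllib.parse.quote and unquote as stand-ins for an encoder and a
+   decoder.  The sample strings are hypothetical.
+

```python
from urllib.parse import quote, unquote

# Literal data that happens to look like a percent-encoding.
data = "%7E"

encoded = quote(data)            # the "%" itself becomes "%25"
decoded_once = unquote(encoded)  # recovers the original data
decoded_twice = unquote(decoded_once)  # misreads the data as an encoding

print(encoded, decoded_once, decoded_twice)

# The reverse mistake: encoding an already encoded string.
print(quote(quote("a b")))
```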
+
+2.5.  Identifying Data
+
+   URI characters provide identifying data for each of the URI
+   components, serving as an external interface for identification
+   between systems.  Although the presence and nature of the URI
+   production interface is hidden from clients that use its URIs (and is
+   thus beyond the scope of the interoperability requirements defined by
+   this specification), it is a frequent source of confusion and errors
+   in the interpretation of URI character issues.  Implementers have to
+   be aware that there are multiple character encodings involved in the
+   production and transmission of URIs: local name and data encoding,
+   public interface encoding, URI character encoding, data format
+   encoding, and protocol encoding.
+
+   Local names, such as file system names, are stored with a local
+   character encoding.  URI producing applications (e.g., origin
+   servers) will typically use the local encoding as the basis for
+   producing meaningful names.  The URI producer will transform the
+   local encoding to one that is suitable for a public interface and
+   then transform the public interface encoding into the restricted set
+   of URI characters (reserved, unreserved, and percent-encodings).
+   Those characters are, in turn, encoded as octets to be used as a
+   reference within a data format (e.g., a document charset), and such
+   data formats are often subsequently encoded for transmission over
+   Internet protocols.
+
+   For most systems, an unreserved character appearing within a URI
+   component is interpreted as representing the data octet corresponding
+   to that character's encoding in US-ASCII.  Consumers of URIs assume
+   that the letter "X" corresponds to the octet "01011000", and even
+   when that assumption is incorrect, there is no harm in making it.  A
+   system that internally provides identifiers in the form of a
+   different character encoding, such as EBCDIC, will generally perform
+   character translation of textual identifiers to UTF-8 [STD63] (or
+   some other superset of the US-ASCII character encoding) at an
+   internal interface, thereby providing more meaningful identifiers
+   than those resulting from simply percent-encoding the original
+   octets.
+
+   For example, consider an information service that provides data,
+   stored locally using an EBCDIC-based file system, to clients on the
+   Internet through an HTTP server.  When an author creates a file with
+   the name "Laguna Beach" on that file system, the "http" URI
+   corresponding to that resource is expected to contain the meaningful
+   string "Laguna%20Beach".  If, however, that server produces URIs by
+   using an overly simplistic raw octet mapping, then the result would
+   be a URI containing "%D3%81%87%A4%95%81@%C2%85%81%83%88".  An
+   internal transcoding interface fixes this problem by transcoding the
+   local name to a superset of US-ASCII prior to producing the URI.
+   Naturally, proper interpretation of an incoming URI on such an
+   interface requires that percent-encoded octets be decoded (e.g.,
+   "%20" to SP) before the reverse transcoding is applied to obtain the
+   local name.
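+
+   The "Laguna Beach" example can be reproduced with a short sketch.
+   It assumes that the local EBCDIC encoding is code page 037
+   (Python's "cp037" codec) and that the public interface encoding is
+   UTF-8; both are assumptions for illustration, since the text names
+   no particular code page.
+

```python
from urllib.parse import quote, unquote_to_bytes

LOCAL_ENCODING = "cp037"  # assumed EBCDIC code page
name = "Laguna Beach"

# Overly simplistic mapping: percent-encode the raw local octets.
raw_mapping = quote(name.encode(LOCAL_ENCODING), safe="@")

# Transcode to a superset of US-ASCII first, then percent-encode.
transcoded = quote(name.encode("utf-8"))


def to_local_name(uri_text: str) -> bytes:
    """Incoming direction: decode the triplets, then transcode back."""
    return unquote_to_bytes(uri_text).decode("utf-8").encode(LOCAL_ENCODING)


print(raw_mapping)
print(transcoded)
```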
+
+   In some cases, the internal interface between a URI component and the
+   identifying data that it has been crafted to represent is much less
+   direct than a character encoding translation.  For example, portions
+   of a URI might reflect a query on non-ASCII data, or numeric
+   coordinates on a map.  Likewise, a URI scheme may define components
+   with additional encoding requirements that are applied prior to
+   forming the component and producing the URI.
+
+   When a new URI scheme defines a component that represents textual
+   data consisting of characters from the Universal Character Set [UCS],
+   the data should first be encoded as octets according to the UTF-8
+   character encoding [STD63]; then only those octets that do not
+   correspond to characters in the unreserved set should be percent-
+   encoded.  For example, the character A would be represented as "A",
+   the character LATIN CAPITAL LETTER A WITH GRAVE would be represented
+   as "%C3%80", and the character KATAKANA LETTER A would be represented
+   as "%E3%82%A2".
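+
+   The three examples above can be checked with a small sketch.  The
+   helper name encode_component is hypothetical; it encodes text as
+   UTF-8 and percent-encodes every octet outside the unreserved set.
+

```python
from urllib.parse import quote


def encode_component(text: str) -> str:
    """UTF-8 encode, then percent-encode all non-unreserved octets."""
    return quote(text.encode("utf-8"), safe="")


print(encode_component("A"))       # LATIN CAPITAL LETTER A
print(encode_component("\u00c0"))  # LATIN CAPITAL LETTER A WITH GRAVE
print(encode_component("\u30a2"))  # KATAKANA LETTER A
```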
+
+3.  Syntax Components
+
+   The generic URI syntax consists of a hierarchical sequence of
+   components referred to as the scheme, authority, path, query, and
+   fragment.
+
+      URI         = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
+
+      hier-part   = "//" authority path-abempty
+                  / path-absolute
+                  / path-rootless
+                  / path-empty
+
+   The scheme and path components are required, though the path may be
+   empty (no characters).  When authority is present, the path must
+   either be empty or begin with a slash ("/") character.  When
+   authority is not present, the path cannot begin with two slash
+   characters ("//").  These restrictions result in five different ABNF
+   rules for a path (Section 3.3), only one of which will match any
+   given URI reference.
+
+   The following are two example URIs and their component parts:
+
+         foo://example.com:8042/over/there?name=ferret#nose
+         \_/   \______________/\_________/ \_________/ \__/
+          |           |            |            |        |
+       scheme     authority       path        query   fragment
+          |   _____________________|__
+         / \ /                        \
+         urn:example:animal:ferret:nose
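+
+   As a sketch of how a non-validating parser might separate the two
+   example URIs into components, the code below uses the regular
+   expression given in Appendix B of this specification.  The function
+   name split_uri and the dictionary layout are assumptions of this
+   example.
+

```python
import re

# Regular expression from Appendix B of RFC 3986.
URI_RE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_uri(uri: str) -> dict:
    """Split a URI reference into its five generic components."""
    match = URI_RE.match(uri)
    return {
        "scheme": match.group(2),
        "authority": match.group(4),
        "path": match.group(5),
        "query": match.group(7),
        "fragment": match.group(9),
    }


print(split_uri("foo://example.com:8042/over/there?name=ferret#nose"))
print(split_uri("urn:example:animal:ferret:nose"))
```
+
+   A component that is absent is reported as None, which keeps it
+   distinct from a component that is present but empty.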
+
+3.1.  Scheme
+
+   Each URI begins with a scheme name that refers to a specification for
+   assigning identifiers within that scheme.  As such, the URI syntax is
+   a federated and extensible naming system wherein each scheme's
+   specification may further restrict the syntax and semantics of
+   identifiers using that scheme.
+
+   Scheme names consist of a sequence of characters beginning with a
+   letter and followed by any combination of letters, digits, plus
+   ("+"), period ("."), or hyphen ("-").  Although schemes are case-
+   insensitive, the canonical form is lowercase and documents that
+   specify schemes must do so with lowercase letters.  An implementation
+   should accept uppercase letters as equivalent to lowercase in scheme
+   names (e.g., allow "HTTP" as well as "http") for the sake of
+   robustness but should only produce lowercase scheme names for
+   consistency.
+
+      scheme      = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
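+
+   A minimal sketch of the scheme rule as a regular expression, with
+   the recommended behaviour of accepting uppercase input but
+   producing lowercase output.  The name normalize_scheme is
+   hypothetical.
+

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")


def normalize_scheme(scheme: str) -> str:
    """Validate a scheme name and return its canonical lowercase form."""
    if not SCHEME_RE.fullmatch(scheme):
        raise ValueError(f"not a valid scheme name: {scheme!r}")
    return scheme.lower()


print(normalize_scheme("HTTP"))
print(normalize_scheme("svn+ssh"))
```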
+
+   Individual schemes are not specified by this document.  The process
+   for registration of new URI schemes is defined separately by [BCP35].
+   The scheme registry maintains the mapping between scheme names and
+   their specifications.  Advice for designers of new URI schemes can be
+   found in [RFC2718].  URI scheme specifications must define their own
+   syntax so that all strings matching their scheme-specific syntax will
+   also match the <absolute-URI> grammar, as described in Section 4.3.
+
+   When presented with a URI that violates one or more scheme-specific
+   restrictions, the scheme-specific resolution process should flag the
+   reference as an error rather than ignore the unused parts; doing so
+   reduces the number of equivalent URIs and helps detect abuses of the
+   generic syntax, which might indicate that the URI has been
+   constructed to mislead the user (Section 7.6).
+
+3.2.  Authority
+
+   Many URI schemes include a hierarchical element for a naming
+   authority so that governance of the name space defined by the
+   remainder of the URI is delegated to that authority (which may, in
+   turn, delegate it further).  The generic syntax provides a common
+   means for distinguishing an authority based on a registered name or
+   server address, along with optional port and user information.
+
+   The authority component is preceded by a double slash ("//") and is
+   terminated by the next slash ("/"), question mark ("?"), or number
+   sign ("#") character, or by the end of the URI.
+
+      authority   = [ userinfo "@" ] host [ ":" port ]
+
+   URI producers and normalizers should omit the ":" delimiter that
+   separates host from port if the port component is empty.  Some
+   schemes do not allow the userinfo and/or port subcomponents.
+
+   If a URI contains an authority component, then the path component
+   must either be empty or begin with a slash ("/") character.  Non-
+   validating parsers (those that merely separate a URI reference into
+   its major components) will often ignore the subcomponent structure of
+   authority, treating it as an opaque string from the double-slash to
+   the first terminating delimiter, until such time as the URI is
+   dereferenced.
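+
+   The subcomponent structure of authority can be sketched as below.
+   This is an illustrative splitter, not a validator: split_authority
+   is a hypothetical name, the host is not checked against the host
+   rule, and only the port is verified to consist of decimal digits.
+

```python
def split_authority(authority: str):
    """Return (userinfo, host, port); absent parts are None."""
    userinfo, at_sign, hostport = authority.rpartition("@")
    if not at_sign:
        userinfo = None

    if hostport.startswith("["):
        # IP-literal: colons inside the brackets belong to the host.
        end = hostport.find("]")
        if end == -1:
            raise ValueError("unterminated IP literal")
        host, rest = hostport[: end + 1], hostport[end + 1:]
        if rest and not rest.startswith(":"):
            raise ValueError("unexpected text after IP literal")
        port = rest[1:] if rest else None
    else:
        host, colon, port = hostport.partition(":")
        if not colon:
            port = None

    if port is not None and any(c not in "0123456789" for c in port):
        raise ValueError("port must be *DIGIT")
    return userinfo, host, port


print(split_authority("example.com:8042"))
print(split_authority("user@[2001:db8::7]:80"))
```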
+
+3.2.1.  User Information
+
+   The userinfo subcomponent may consist of a user name and, optionally,
+   scheme-specific information about how to gain authorization to access
+   the resource.  The user information, if present, is followed by a
+   commercial at-sign ("@") that delimits it from the host.
+
+      userinfo    = *( unreserved / pct-encoded / sub-delims / ":" )
+
+   Use of the format "user:password" in the userinfo field is
+   deprecated.  Applications should not render as clear text any data
+   after the first colon (":") character found within a userinfo
+   subcomponent unless the data after the colon is the empty string
+   (indicating no password).  Applications may choose to ignore or
+   reject such data when it is received as part of a reference and
+   should reject the storage of such data in unencrypted form.  The
+   passing of authentication information in clear text has proven to be
+   a security risk in almost every case where it has been used.
+
+   Applications that render a URI for the sake of user feedback, such as
+   in graphical hypertext browsing, should render userinfo in a way that
+   is distinguished from the rest of a URI, when feasible.  Such
+   rendering will assist the user in cases where the userinfo has been
+   misleadingly crafted to look like a trusted domain name
+   (Section 7.6).
+
+3.2.2.  Host
+
+   The host subcomponent of authority is identified by an IP literal
+   encapsulated within square brackets, an IPv4 address in dotted-
+   decimal form, or a registered name.  The host subcomponent is case-
+   insensitive.  The presence of a host subcomponent within a URI does
+   not imply that the scheme requires access to the given host on the
+   Internet.  In many cases, the host syntax is used only for the sake
+   of reusing the existing registration process created and deployed for
+   DNS, thus obtaining a globally unique name without the cost of
+   deploying another registry.  However, such use comes with its own
+   costs: domain name ownership may change over time for reasons not
+   anticipated by the URI producer.  In other cases, the data within the
+   host component identifies a registered name that has nothing to do
+   with an Internet host.  We use the name "host" for the ABNF rule
+   because that is its most common purpose, not its only purpose.
+
+      host        = IP-literal / IPv4address / reg-name
+
+   The syntax rule for host is ambiguous because it does not completely
+   distinguish between an IPv4address and a reg-name.  In order to
+   disambiguate the syntax, we apply the "first-match-wins" algorithm:
+   If host matches the rule for IPv4address, then it should be
+   considered an IPv4 address literal and not a reg-name.  Although host
+   is case-insensitive, producers and normalizers should use lowercase
+   for registered names and hexadecimal addresses for the sake of
+   uniformity, while only using uppercase letters for percent-encodings.
+
+   A host identified by an Internet Protocol literal address, version 6
+   [RFC3513] or later, is distinguished by enclosing the IP literal
+   within square brackets ("[" and "]").  This is the only place where
+   square bracket characters are allowed in the URI syntax.  In
+   anticipation of future, as-yet-undefined IP literal address formats,
+   an implementation may use an optional version flag to indicate such a
+   format explicitly rather than rely on heuristic determination.
+
+      IP-literal = "[" ( IPv6address / IPvFuture  ) "]"
+
+      IPvFuture  = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )
+
+   The version flag does not indicate the IP version; rather, it
+   indicates future versions of the literal format.  As such,
+   implementations must not provide the version flag for the existing
+   IPv4 and IPv6 literal address forms described below.  If a URI
+   containing an IP-literal that starts with "v" (case-insensitive),
+   indicating that the version flag is present, is dereferenced by an
+   application that does not know the meaning of that version flag, then
+   the application should return an appropriate error for "address
+   mechanism not supported".
+
+   A host identified by an IPv6 literal address is represented inside
+   the square brackets without a preceding version flag.  The ABNF
+   provided here is a translation of the text definition of an IPv6
+   literal address provided in [RFC3513].  This syntax does not support
+   IPv6 scoped addressing zone identifiers.
+
+   A 128-bit IPv6 address is divided into eight 16-bit pieces.  Each
+   piece is represented numerically in case-insensitive hexadecimal,
+   using one to four hexadecimal digits (leading zeroes are permitted).
+   The eight encoded pieces are given most-significant first, separated
+   by colon characters.  Optionally, the least-significant two pieces
+   may instead be represented in IPv4 address textual format.  A
+   sequence of one or more consecutive zero-valued 16-bit pieces within
+   the address may be elided, omitting all their digits and leaving
+   exactly two consecutive colons in their place to mark the elision.
+
+      IPv6address =                            6( h16 ":" ) ls32
+                  /                       "::" 5( h16 ":" ) ls32
+                  / [               h16 ] "::" 4( h16 ":" ) ls32
+                  / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
+                  / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
+                  / [ *3( h16 ":" ) h16 ] "::"    h16 ":"   ls32
+                  / [ *4( h16 ":" ) h16 ] "::"              ls32
+                  / [ *5( h16 ":" ) h16 ] "::"              h16
+                  / [ *6( h16 ":" ) h16 ] "::"
+
+      ls32        = ( h16 ":" h16 ) / IPv4address
+                  ; least-significant 32 bits of address
+
+      h16         = 1*4HEXDIG
+                  ; 16 bits of address represented in hexadecimal
+
+   A host identified by an IPv4 literal address is represented in
+   dotted-decimal notation (a sequence of four decimal numbers in the
+   range 0 to 255, separated by "."), as described in [RFC1123] by
+   reference to [RFC0952].  Note that other forms of dotted notation may
+   be interpreted on some platforms, as described in Section 7.4, but
+   only the dotted-decimal form of four octets is allowed by this
+   grammar.
+
+      IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
+
+      dec-octet   = DIGIT                 ; 0-9
+                  / %x31-39 DIGIT         ; 10-99
+                  / "1" 2DIGIT            ; 100-199
+                  / "2" %x30-34 DIGIT     ; 200-249
+                  / "25" %x30-35          ; 250-255
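+
+   The IPv4address and dec-octet rules translate directly into a
+   regular expression.  The sketch below is one such translation; the
+   name is_ipv4address is hypothetical.
+

```python
import re

# dec-octet, with the longest alternatives listed first.
DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"
IPV4_RE = re.compile(rf"{DEC_OCTET}(?:\.{DEC_OCTET}){{3}}")


def is_ipv4address(host: str) -> bool:
    """True if host matches the IPv4address rule exactly."""
    return IPV4_RE.fullmatch(host) is not None


print(is_ipv4address("192.168.0.1"))
print(is_ipv4address("256.0.0.1"))   # 256 is not a dec-octet
print(is_ipv4address("01.2.3.4"))    # leading zero is not allowed
```
+
+   A host that fails this test is not necessarily invalid: under the
+   "first-match-wins" rule it may still be a reg-name.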
+
+   A host identified by a registered name is a sequence of characters
+   usually intended for lookup within a locally defined host or service
+   name registry, though the URI's scheme-specific semantics may require
+   that a specific registry (or fixed name table) be used instead.  The
+   most common name registry mechanism is the Domain Name System (DNS).
+   A registered name intended for lookup in the DNS uses the syntax
+   defined in Section 3.5 of [RFC1034] and Section 2.1 of [RFC1123].
+   Such a name consists of a sequence of domain labels separated by ".",
+   each domain label starting and ending with an alphanumeric character
+   and possibly also containing "-" characters.  The rightmost domain
+   label of a fully qualified domain name in DNS may be followed by a
+   single "." and should be if it is necessary to distinguish between
+   the complete domain name and some local domain.
+
+      reg-name    = *( unreserved / pct-encoded / sub-delims )
+
+   If the URI scheme defines a default for host, then that default
+   applies when the host subcomponent is undefined or when the
+   registered name is empty (zero length).  For example, the "file" URI
+   scheme is defined so that no authority, an empty host, and
+   "localhost" all mean the end-user's machine, whereas the "http"
+   scheme considers a missing authority or empty host invalid.
+
+   This specification does not mandate a particular registered name
+   lookup technology and therefore does not restrict the syntax of reg-
+   name beyond what is necessary for interoperability.  Instead, it
+   delegates the issue of registered name syntax conformance to the
+   operating system of each application performing URI resolution, and
+   that operating system decides what it will allow for the purpose of
+   host identification.  A URI resolution implementation might use DNS,
+   host tables, yellow pages, NetInfo, WINS, or any other system for
+   lookup of registered names.  However, a globally scoped naming
+   system, such as DNS fully qualified domain names, is necessary for
+   URIs intended to have global scope.  URI producers should use names
+   that conform to the DNS syntax, even when use of DNS is not
+   immediately apparent, and should limit these names to no more than
+   255 characters in length.
+
+   The reg-name syntax allows percent-encoded octets in order to
+   represent non-ASCII registered names in a uniform way that is
+   independent of the underlying name resolution technology.  Non-ASCII
+   characters must first be encoded according to UTF-8 [STD63], and then
+   each octet of the corresponding UTF-8 sequence must be percent-
+   encoded to be represented as URI characters.  URI producing
+   applications must not use percent-encoding in host unless it is used
+   to represent a UTF-8 character sequence.  When a non-ASCII registered
+   name represents an internationalized domain name intended for
+   resolution via the DNS, the name must be transformed to the IDNA
+   encoding [RFC3490] prior to name lookup.  URI producers should
+   provide these registered names in the IDNA encoding, rather than a
+   percent-encoding, if they wish to maximize interoperability with
+   legacy URI resolvers.
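+
+   The two representations of a non-ASCII registered name can be
+   compared with a short sketch.  The host name is hypothetical, and
+   Python's built-in "idna" codec (which follows RFC 3490) is used as
+   a stand-in for the IDNA transformation.
+

```python
from urllib.parse import quote

name = "b\u00fccher.example"  # hypothetical registered name

# Percent-encoding of the UTF-8 octets, as allowed by reg-name.
pct_form = quote(name.encode("utf-8"), safe="")

# IDNA encoding, as required before lookup in the DNS.
idna_form = name.encode("idna").decode("ascii")

print(pct_form)
print(idna_form)
```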
+
+3.2.3.  Port
+
+   The port subcomponent of authority is designated by an optional port
+   number in decimal following the host and delimited from it by a
+   single colon (":") character.
+
+      port        = *DIGIT
+
+   A scheme may define a default port.  For example, the "http" scheme
+   defines a default port of "80", corresponding to its reserved TCP
+   port number.  The type of port designated by the port number (e.g.,
+   TCP, UDP, SCTP) is defined by the URI scheme.  URI producers and
+   normalizers should omit the port component and its ":" delimiter if
+   port is empty or if its value would be the same as that of the
+   scheme's default.
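+
+   A minimal sketch of the port normalization recommended above.  The
+   DEFAULT_PORTS table is an assumption of this example and lists only
+   the "http" default mentioned in the text; normalize_hostport is a
+   hypothetical name.
+

```python
# Only the default named in the text; other schemes are not assumed.
DEFAULT_PORTS = {"http": 80}


def normalize_hostport(scheme: str, host: str, port) -> str:
    """Drop an empty port or one equal to the scheme's default."""
    if not port:  # None (absent) or "" (empty)
        return host
    if not (port.isascii() and port.isdigit()):
        raise ValueError("port must be *DIGIT")
    if DEFAULT_PORTS.get(scheme.lower()) == int(port):
        return host
    return f"{host}:{port}"


print(normalize_hostport("http", "example.com", "80"))
print(normalize_hostport("http", "example.com", "8042"))
```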
+
+3.3.  Path
+
+   The path component contains data, usually organized in hierarchical
+   form, that, along with data in the non-hierarchical query component
+   (Section 3.4), serves to identify a resource within the scope of the
+   URI's scheme and naming authority (if any).  The path is terminated
+   by the first question mark ("?") or number sign ("#") character, or
+   by the end of the URI.
+
+   If a URI contains an authority component, then the path component
+   must either be empty or begin with a slash ("/") character.  If a URI
+   does not contain an authority component, then the path cannot begin
+   with two slash characters ("//").  In addition, a URI reference
+   (Section 4.1) may be a relative-path reference, in which case the
+   first path segment cannot contain a colon (":") character.  The ABNF
+   requires five separate rules to disambiguate these cases, only one of
+   which will match the path substring within a given URI reference.  We
+   use the generic term "path component" to describe the URI substring
+   matched by the parser to one of these rules.
+
+      path          = path-abempty    ; begins with "/" or is empty
+                    / path-absolute   ; begins with "/" but not "//"
+                    / path-noscheme   ; begins with a non-colon segment
+                    / path-rootless   ; begins with a segment
+                    / path-empty      ; zero characters
+
+      path-abempty  = *( "/" segment )
+      path-absolute = "/" [ segment-nz *( "/" segment ) ]
+      path-noscheme = segment-nz-nc *( "/" segment )
+      path-rootless = segment-nz *( "/" segment )
+      path-empty    = 0<pchar>
+
+      segment       = *pchar
+      segment-nz    = 1*pchar
+      segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
+                    ; non-zero-length segment without any colon ":"
+
+      pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
+
+   A path consists of a sequence of path segments separated by a slash
+   ("/") character.  A path is always defined for a URI, though the
+   defined path may be empty (zero length).  Use of the slash character
+   to indicate hierarchy is only required when a URI will be used as the
+   context for relative references.  For example, the URI
+   <mailto:fred@example.com> has a path of "fred@example.com", whereas
+   the URI <foo://info.example.com?fred> has an empty path.
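+
The two example URIs above can be checked with Python's standard `urllib.parse.urlsplit`. This is only a sketch: `urlsplit` is a non-validating splitter and reports an undefined component as an empty string, so it cannot distinguish "undefined" from "empty". The `path_of` helper name is an assumption for illustration.

```python
from urllib.parse import urlsplit


def path_of(uri: str) -> str:
    """Return the path component of a URI (possibly the empty string)."""
    return urlsplit(uri).path


print(repr(path_of("mailto:fred@example.com")))      # 'fred@example.com'
print(repr(path_of("foo://info.example.com?fred")))  # ''
```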
+
+   The path segments "." and "..", also known as dot-segments, are
+   defined for relative reference within the path name hierarchy.  They
+   are intended for use at the beginning of a relative-path reference
+   (Section 4.2) to indicate relative position within the hierarchical
+   tree of names.  This is similar to their role within some operating
+   systems' file directory structures to indicate the current directory
+   and parent directory, respectively.  However, unlike in a file
+   system, these dot-segments are only interpreted within the URI path
+   hierarchy and are removed as part of the resolution process (Section
+   5.2).
+
+   Aside from dot-segments in hierarchical paths, a path segment is
+   considered opaque by the generic syntax.  URI producing applications
+   often use the reserved characters allowed in a segment to delimit
+   scheme-specific or dereference-handler-specific subcomponents.  For
+   example, the semicolon (";") and equals ("=") reserved characters are
+   often used to delimit parameters and parameter values applicable to
+   that segment.  The comma (",") reserved character is often used for
+   similar purposes.  For example, one URI producer might use a segment
+   such as "name;v=1.1" to indicate a reference to version 1.1 of
+   "name", whereas another might use a segment such as "name,1.1" to
+   indicate the same.  Parameter types may be defined by scheme-specific
+   semantics, but in most cases the syntax of a parameter is specific to
+   the implementation of the URI's dereferencing algorithm.
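+
As a sketch of one such producer-specific convention, the following Python helper splits a segment like "name;v=1.1" into a name and its parameters. The generic syntax treats the segment as opaque, so the `split_segment_params` function and the ";key=value" layout it assumes are purely illustrative.

```python
def split_segment_params(segment: str) -> tuple[str, dict[str, str]]:
    """Split "name;k=v;..." into the name and a dict of parameters.

    This is one producer's convention, not something the generic syntax defines.
    """
    name, *params = segment.split(";")
    # partition("=") yields (key, "=", value); [::2] keeps key and value.
    return name, dict(p.partition("=")[::2] for p in params)


print(split_segment_params("name;v=1.1"))  # ('name', {'v': '1.1'})
```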
+
+3.4.  Query
+
+   The query component contains non-hierarchical data that, along with
+   data in the path component (Section 3.3), serves to identify a
+   resource within the scope of the URI's scheme and naming authority
+   (if any).  The query component is indicated by the first question
+   mark ("?") character and terminated by a number sign ("#") character
+   or by the end of the URI.
+
+      query       = *( pchar / "/" / "?" )
+
+   The characters slash ("/") and question mark ("?") may represent data
+   within the query component.  Beware that some older, erroneous
+   implementations may not handle such data correctly when it is used as
+   the base URI for relative references (Section 5.1), apparently
+   because they fail to distinguish query data from path data when
+   looking for hierarchical separators.  However, as query components
+   are often used to carry identifying information in the form of
+   "key=value" pairs and one frequently used value is a reference to
+   another URI, it is sometimes better for usability to avoid percent-
+   encoding those characters.
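+
A short Python sketch of a query that carries another URI without percent-encoding its "/" and "?" characters. The example URI is made up for illustration; only the first "?" ends the path, so everything after it belongs to the query.

```python
from urllib.parse import urlsplit

# Hypothetical URI whose query value is itself a URI.
uri = "http://example.com/search?next=http://example.org/a/b?x=1"
parts = urlsplit(uri)

print(parts.path)   # /search
print(parts.query)  # next=http://example.org/a/b?x=1
```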
+
+3.5.  Fragment
+
+   The fragment identifier component of a URI allows indirect
+   identification of a secondary resource by reference to a primary
+   resource and additional identifying information.  The identified
+   secondary resource may be some portion or subset of the primary
+   resource, some view on representations of the primary resource, or
+   some other resource defined or described by those representations.  A
+   fragment identifier component is indicated by the presence of a
+   number sign ("#") character and terminated by the end of the URI.
+
+      fragment    = *( pchar / "/" / "?" )
+
+   The semantics of a fragment identifier are defined by the set of
+   representations that might result from a retrieval action on the
+   primary resource.  The fragment's format and resolution are therefore
+   dependent on the media type [RFC2046] of a potentially retrieved
+   representation, even though such a retrieval is only performed if the
+   URI is dereferenced.  If no such representation exists, then the
+   semantics of the fragment are considered unknown and are effectively
+   unconstrained.  Fragment identifier semantics are independent of the
+   URI scheme and thus cannot be redefined by scheme specifications.
+
+   Individual media types may define their own restrictions on or
+   structures within the fragment identifier syntax for specifying
+   different types of subsets, views, or external references that are
+   identifiable as secondary resources by that media type.  If the
+   primary resource has multiple representations, as is often the case
+   for resources whose representation is selected based on attributes of
+   the retrieval request (a.k.a., content negotiation), then whatever is
+   identified by the fragment should be consistent across all of those
+   representations.  Each representation should either define the
+   fragment so that it corresponds to the same secondary resource,
+   regardless of how it is represented, or should leave the fragment
+   undefined (i.e., not found).
+
+   As with any URI, use of a fragment identifier component does not
+   imply that a retrieval action will take place.  A URI with a fragment
+   identifier may be used to refer to the secondary resource without any
+   implication that the primary resource is accessible or will ever be
+   accessed.
+
+   Fragment identifiers have a special role in information retrieval
+   systems as the primary form of client-side indirect referencing,
+   allowing an author to specifically identify aspects of an existing
+   resource that are only indirectly provided by the resource owner.  As
+   such, the fragment identifier is not used in the scheme-specific
+   processing of a URI; instead, the fragment identifier is separated
+   from the rest of the URI prior to a dereference, and thus the
+   identifying information within the fragment itself is dereferenced
+   solely by the user agent, regardless of the URI scheme.  Although
+   this separate handling is often perceived to be a loss of
+   information, particularly for accurate redirection of references as
+   resources move over time, it also serves to prevent information
+   providers from denying reference authors the right to refer to
+   information within a resource selectively.  Indirect referencing also
+   provides additional flexibility and extensibility to systems that use
+   URIs, as new media types are easier to define and deploy than new
+   schemes of identification.
+
+   The characters slash ("/") and question mark ("?") are allowed to
+   represent data within the fragment identifier.  Beware that some
+   older, erroneous implementations may not handle this data correctly
+   when it is used as the base URI for relative references (Section
+   5.1).
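+
The separation of the fragment from the rest of the URI prior to a dereference can be sketched with Python's standard `urllib.parse.urldefrag`. The URI below is a made-up example; note that the "/" and "?" inside the fragment stay part of the fragment.

```python
from urllib.parse import urldefrag

# Hypothetical URI with "/" and "?" used as data inside the fragment.
result = urldefrag("http://example.com/doc/guide?rev=2#section/3?view")

print(result.url)       # http://example.com/doc/guide?rev=2
print(result.fragment)  # section/3?view
```

Only `result.url` would be handed to the scheme-specific retrieval; the fragment is kept by the user agent and applied to whatever representation comes back.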
+
+4.  Usage
+
+   When applications make reference to a URI, they do not always use the
+   full form of reference defined by the "URI" syntax rule.  To save
+   space and take advantage of hierarchical locality, many Internet
+   protocol elements and media type formats allow an abbreviation of a
+   URI, whereas others restrict the syntax to a particular form of URI.
+   We define the most common forms of reference syntax in this
+   specification because they impact and depend upon the design of the
+   generic syntax, requiring a uniform parsing algorithm in order to be
+   interpreted consistently.
+
+4.1.  URI Reference
+
+   URI-reference is used to denote the most common usage of a resource
+   identifier.
+
+      URI-reference = URI / relative-ref
+
+   A URI-reference is either a URI or a relative reference.  If the
+   URI-reference's prefix does not match the syntax of a scheme followed
+   by its colon separator, then the URI-reference is a relative
+   reference.
+
+   A URI-reference is typically parsed first into the five URI
+   components, in order to determine what components are present and
+   whether the reference is relative.  Then, each component is parsed
+   for its subparts and their validation.  The ABNF of URI-reference,
+   along with the "first-match-wins" disambiguation rule, is sufficient
+   to define a validating parser for the generic syntax.  Readers
+   familiar with regular expressions should see Appendix B for an
+   example of a non-validating URI-reference parser that will take any
+   given string and extract the URI components.
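+
As a sketch of the non-validating approach, the regular expression given in Appendix B of this RFC can be used from Python as follows. The `split_reference` name and the dictionary layout are assumptions for the example. A component whose delimiter is absent comes back as `None` (undefined), while a present but empty component comes back as an empty string.

```python
import re

# Regular expression from RFC 3986, Appendix B.
URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_reference(ref: str) -> dict:
    """Split any string into the five generic components (no validation)."""
    m = URI_REFERENCE.match(ref)  # every group is optional, so this always matches
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


print(split_reference("http://www.ics.uci.edu/pub/ietf/uri/#Related"))
```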
+
+4.2.  Relative Reference
+
+   A relative reference takes advantage of the hierarchical syntax
+   (Section 1.2.3) to express a URI reference relative to the name space
+   of another hierarchical URI.
+
+      relative-ref  = relative-part [ "?" query ] [ "#" fragment ]
+
+      relative-part = "//" authority path-abempty
+                    / path-absolute
+                    / path-noscheme
+                    / path-empty
+
+   The URI referred to by a relative reference, also known as the target
+   URI, is obtained by applying the reference resolution algorithm of
+   Section 5.
+
+   A relative reference that begins with two slash characters is termed
+   a network-path reference; such references are rarely used.  A
+   relative reference that begins with a single slash character is
+   termed an absolute-path reference.  A relative reference that does
+   not begin with a slash character is termed a relative-path reference.
+
+   A path segment that contains a colon character (e.g., "this:that")
+   cannot be used as the first segment of a relative-path reference, as
+   it would be mistaken for a scheme name.  Such a segment must be
+   preceded by a dot-segment (e.g., "./this:that") to make a relative-
+   path reference.
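+
The naming of reference forms above can be sketched as a small Python classifier. The `classify` function is an illustrative assumption; the scheme pattern follows the scheme rule of Section 3.1 (a letter followed by letters, digits, "+", "-", or ".").

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ), followed by ":"
_SCHEME_PREFIX = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:")


def classify(ref: str) -> str:
    """Name the form of a URI reference based on how it begins."""
    if _SCHEME_PREFIX.match(ref):
        return "URI"
    if ref.startswith("//"):
        return "network-path reference"
    if ref.startswith("/"):
        return "absolute-path reference"
    return "relative-path reference"


print(classify("this:that"))    # URI ("this" is taken for a scheme name)
print(classify("./this:that"))  # relative-path reference
```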
+
+4.3.  Absolute URI
+
+   Some protocol elements allow only the absolute form of a URI without
+   a fragment identifier.  For example, defining a base URI for later
+   use by relative references calls for an absolute-URI syntax rule that
+   does not allow a fragment.
+
+      absolute-URI  = scheme ":" hier-part [ "?" query ]
+
+   URI scheme specifications must define their own syntax so that all
+   strings matching their scheme-specific syntax will also match the
+   <absolute-URI> grammar.  Scheme specifications will not define
+   fragment identifier syntax or usage, regardless of its applicability
+   to resources identifiable via that scheme, as fragment identification
+   is orthogonal to scheme definition.  However, scheme specifications
+   are encouraged to include a wide range of examples, including
+   examples that show use of the scheme's URIs with fragment identifiers
+   when such usage is appropriate.
+
+4.4.  Same-Document Reference
+
+   When a URI reference refers to a URI that is, aside from its fragment
+   component (if any), identical to the base URI (Section 5.1), that
+   reference is called a "same-document" reference.  The most frequent
+   examples of same-document references are relative references that are
+   empty or include only the number sign ("#") separator followed by a
+   fragment identifier.
+
+   When a same-document reference is dereferenced for a retrieval
+   action, the target of that reference is defined to be within the same
+   entity (representation, document, or message) as the reference;
+   therefore, a dereference should not result in a new retrieval action.
+
+   Normalization of the base and target URIs prior to their comparison,
+   as described in Sections 6.2.2 and 6.2.3, is allowed but rarely
+   performed in practice.  Normalization may increase the set of same-
+   document references, which may be of benefit to some caching
+   applications.  As such, reference authors should not assume that a
+   slightly different, though equivalent, reference URI will (or will
+   not) be interpreted as a same-document reference by any given
+   application.
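+
A same-document check can be sketched in Python by resolving the reference and comparing the result with the base, ignoring fragments. This is an assumption-laden illustration: it relies on the standard library's `urljoin` for resolution, performs no normalization, and the `is_same_document` name is made up for the example.

```python
from urllib.parse import urldefrag, urljoin


def is_same_document(base: str, ref: str) -> bool:
    """True if ref resolves to the base URI, aside from any fragment."""
    target = urljoin(base, ref)
    return urldefrag(target).url == urldefrag(base).url


base = "http://a/b/c/d;p?q"
print(is_same_document(base, ""))    # True
print(is_same_document(base, "#s"))  # True
print(is_same_document(base, "g"))   # False
```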
+
+4.5.  Suffix Reference
+
+   The URI syntax is designed for unambiguous reference to resources and
+   extensibility via the URI scheme.  However, as URI identification and
+   usage have become commonplace, traditional media (television, radio,
+   newspapers, billboards, etc.) have increasingly used a suffix of the
+   URI as a reference, consisting of only the authority and path
+   portions of the URI, such as
+
+      www.w3.org/Addressing/
+
+   or simply a DNS registered name on its own.  Such references are
+   primarily intended for human interpretation rather than for machines,
+   with the assumption that context-based heuristics are sufficient to
+   complete the URI (e.g., most registered names beginning with "www"
+   are likely to have a URI prefix of "http://").  Although there is no
+   standard set of heuristics for disambiguating a URI suffix, many
+   client implementations allow them to be entered by the user and
+   heuristically resolved.
+
+   Although this practice of using suffix references is common, it
+   should be avoided whenever possible and should never be used in
+   situations where long-term references are expected.  The heuristics
+   noted above will change over time, particularly when a new URI scheme
+   becomes popular, and are often incorrect when used out of context.
+   Furthermore, they can lead to security issues along the lines of
+   those described in [RFC1535].
+
+   As a URI suffix has the same syntax as a relative-path reference, a
+   suffix reference cannot be used in contexts where a relative
+   reference is expected.  As a result, suffix references are limited to
+   places where there is no defined base URI, such as dialog boxes and
+   off-line advertisements.
+
+5.  Reference Resolution
+
+   This section defines the process of resolving a URI reference within
+   a context that allows relative references so that the result is a
+   string matching the <URI> syntax rule of Section 3.
+
+5.1.  Establishing a Base URI
+
+   The term "relative" implies that a "base URI" exists against which
+   the relative reference is applied.  Aside from fragment-only
+   references (Section 4.4), relative references are only usable when a
+   base URI is known.  A base URI must be established by the parser
+   prior to parsing URI references that might be relative.  A base URI
+   must conform to the <absolute-URI> syntax rule (Section 4.3).  If the
+   base URI is obtained from a URI reference, then that reference must
+   be converted to absolute form and stripped of any fragment component
+   prior to its use as a base URI.
+
+   The base URI of a reference can be established in one of four ways,
+   discussed below in order of precedence.  The order of precedence can
+   be thought of in terms of layers, where the innermost defined base
+   URI has the highest precedence.  This can be visualized graphically
+   as follows:
+
+         .----------------------------------------------------------.
+         |  .----------------------------------------------------.  |
+         |  |  .----------------------------------------------.  |  |
+         |  |  |  .----------------------------------------.  |  |  |
+         |  |  |  |  .----------------------------------.  |  |  |  |
+         |  |  |  |  |       <relative-reference>       |  |  |  |  |
+         |  |  |  |  `----------------------------------'  |  |  |  |
+         |  |  |  | (5.1.1) Base URI embedded in content   |  |  |  |
+         |  |  |  `----------------------------------------'  |  |  |
+         |  |  | (5.1.2) Base URI of the encapsulating entity |  |  |
+         |  |  |         (message, representation, or none)   |  |  |
+         |  |  `----------------------------------------------'  |  |
+         |  | (5.1.3) URI used to retrieve the entity            |  |
+         |  `----------------------------------------------------'  |
+         | (5.1.4) Default Base URI (application-dependent)         |
+         `----------------------------------------------------------'
+
+5.1.1.  Base URI Embedded in Content
+
+   Within certain media types, a base URI for relative references can be
+   embedded within the content itself so that it can be readily obtained
+   by a parser.  This can be useful for descriptive documents, such as
+   tables of contents, which may be transmitted to others through
+   protocols other than their usual retrieval context (e.g., email or
+   USENET news).
+
+   It is beyond the scope of this specification to specify how, for each
+   media type, a base URI can be embedded.  The appropriate syntax, when
+   available, is described by the data format specification associated
+   with each media type.
+
+5.1.2.  Base URI from the Encapsulating Entity
+
+   If no base URI is embedded, the base URI is defined by the
+   representation's retrieval context.  For a document that is enclosed
+   within another entity, such as a message or archive, the retrieval
+   context is that entity.  Thus, the default base URI of a
+   representation is the base URI of the entity in which the
+   representation is encapsulated.
+
+   A mechanism for embedding a base URI within MIME container types
+   (e.g., the message and multipart types) is defined by MHTML
+   [RFC2557].  Protocols that do not use the MIME message header syntax,
+   but that do allow some form of tagged metadata to be included within
+   messages, may define their own syntax for defining a base URI as part
+   of a message.
+
+5.1.3.  Base URI from the Retrieval URI
+
+   If no base URI is embedded and the representation is not encapsulated
+   within some other entity, then, if a URI was used to retrieve the
+   representation, that URI shall be considered the base URI.  Note that
+   if the retrieval was the result of a redirected request, the last URI
+   used (i.e., the URI that resulted in the actual retrieval of the
+   representation) is the base URI.
+
+5.1.4.  Default Base URI
+
+   If none of the conditions described above apply, then the base URI is
+   defined by the context of the application.  As this definition is
+   necessarily application-dependent, failing to define a base URI by
+   using one of the other methods may result in the same content being
+   interpreted differently by different types of applications.
+
+   A sender of a representation containing relative references is
+   responsible for ensuring that a base URI for those references can be
+   established.  Aside from fragment-only references, relative
+   references can only be used reliably in situations where the base URI
+   is well defined.
+
+5.2.  Relative Resolution
+
+   This section describes an algorithm for converting a URI reference
+   that might be relative to a given base URI into the parsed components
+   of the reference's target.  The components can then be recomposed, as
+   described in Section 5.3, to form the target URI.  This algorithm
+   provides definitive results that can be used to test the output of
+   other implementations.  Applications may implement relative reference
+   resolution by using some other algorithm, provided that the results
+   match what would be given by this one.
+
+5.2.1.  Pre-parse the Base URI
+
+   The base URI (Base) is established according to the procedure of
+   Section 5.1 and parsed into the five main components described in
+   Section 3.  Note that only the scheme component is required to be
+   present in a base URI; the other components may be empty or
+   undefined.  A component is undefined if its associated delimiter does
+   not appear in the URI reference; the path component is never
+   undefined, though it may be empty.
+
+   Normalization of the base URI, as described in Sections 6.2.2 and
+   6.2.3, is optional.  A URI reference must be transformed to its
+   target URI before it can be normalized.
+
+5.2.2.  Transform References
+
+   For each URI reference (R), the following pseudocode describes an
+   algorithm for transforming R into its target URI (T):
+
+      -- The URI reference is parsed into the five URI components
+      --
+      (R.scheme, R.authority, R.path, R.query, R.fragment) = parse(R);
+
+      -- A non-strict parser may ignore a scheme in the reference
+      -- if it is identical to the base URI's scheme.
+      --
+      if ((not strict) and (R.scheme == Base.scheme)) then
+         undefine(R.scheme);
+      endif;
+
+      if defined(R.scheme) then
+         T.scheme    = R.scheme;
+         T.authority = R.authority;
+         T.path      = remove_dot_segments(R.path);
+         T.query     = R.query;
+      else
+         if defined(R.authority) then
+            T.authority = R.authority;
+            T.path      = remove_dot_segments(R.path);
+            T.query     = R.query;
+         else
+            if (R.path == "") then
+               T.path = Base.path;
+               if defined(R.query) then
+                  T.query = R.query;
+               else
+                  T.query = Base.query;
+               endif;
+            else
+               if (R.path starts-with "/") then
+                  T.path = remove_dot_segments(R.path);
+               else
+                  T.path = merge(Base.path, R.path);
+                  T.path = remove_dot_segments(T.path);
+               endif;
+               T.query = R.query;
+            endif;
+            T.authority = Base.authority;
+         endif;
+         T.scheme = Base.scheme;
+      endif;
+
+      T.fragment = R.fragment;
+
+5.2.3.  Merge Paths
+
+   The pseudocode above refers to a "merge" routine for merging a
+   relative-path reference with the path of the base URI.  This is
+   accomplished as follows:
+
+   o  If the base URI has a defined authority component and an empty
+      path, then return a string consisting of "/" concatenated with the
+      reference's path; otherwise,
+
+   o  return a string consisting of the reference's path component
+      appended to all but the last segment of the base URI's path (i.e.,
+      excluding any characters after the right-most "/" in the base URI
+      path, or excluding the entire base URI path if it does not contain
+      any "/" characters).
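+
The two merge cases above can be sketched in Python. The function signature is an assumption for illustration: `base_authority` is `None` when the base URI has no authority component.

```python
from typing import Optional


def merge(base_authority: Optional[str], base_path: str, ref_path: str) -> str:
    """Merge a relative-path reference with the path of the base URI."""
    if base_authority is not None and base_path == "":
        return "/" + ref_path
    # Keep everything up to and including the right-most "/" of the base path.
    # rfind returns -1 when there is no "/", so the slice is then empty and
    # the entire base path is excluded.
    return base_path[: base_path.rfind("/") + 1] + ref_path


print(merge("a", "/b/c/d;p", "g"))  # /b/c/g
print(merge("a", "", "g"))          # /g
```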
+
+5.2.4.  Remove Dot Segments
+
+   The pseudocode also refers to a "remove_dot_segments" routine for
+   interpreting and removing the special "." and ".." complete path
+   segments from a referenced path.  This is done after the path is
+   extracted from a reference, whether or not the path was relative, in
+   order to remove any invalid or extraneous dot-segments prior to
+   forming the target URI.  Although there are many ways to accomplish
+   this removal process, we describe a simple method using two string
+   buffers.
+
+   1.  The input buffer is initialized with the now-appended path
+       components and the output buffer is initialized to the empty
+       string.
+
+   2.  While the input buffer is not empty, loop as follows:
+
+       A.  If the input buffer begins with a prefix of "../" or "./",
+           then remove that prefix from the input buffer; otherwise,
+
+       B.  if the input buffer begins with a prefix of "/./" or "/.",
+           where "." is a complete path segment, then replace that
+           prefix with "/" in the input buffer; otherwise,
+
+       C.  if the input buffer begins with a prefix of "/../" or "/..",
+           where ".." is a complete path segment, then replace that
+           prefix with "/" in the input buffer and remove the last
+           segment and its preceding "/" (if any) from the output
+           buffer; otherwise,
+
+       D.  if the input buffer consists only of "." or "..", then remove
+           that from the input buffer; otherwise,
+
+       E.  move the first path segment in the input buffer to the end of
+           the output buffer, including the initial "/" character (if
+           any) and any subsequent characters up to, but not including,
+           the next "/" character or the end of the input buffer.
+
+   3.  Finally, the output buffer is returned as the result of
+       remove_dot_segments.
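+
Steps 1 through 3 can be sketched in Python with two string buffers, mirroring rules A to E. This is one possible rendering of the method, not the only valid one.

```python
def remove_dot_segments(path: str) -> str:
    """Remove "." and ".." complete path segments using two string buffers."""
    inp, out = path, ""                               # step 1
    while inp:                                        # step 2
        if inp.startswith("../"):                     # A
            inp = inp[3:]
        elif inp.startswith("./"):                    # A
            inp = inp[2:]
        elif inp.startswith("/./"):                   # B: "/./" becomes "/"
            inp = inp[2:]
        elif inp == "/.":                             # B: "/." becomes "/"
            inp = "/"
        elif inp.startswith("/../") or inp == "/..":  # C
            inp = "/" + inp[4:]
            # Drop the last output segment and its preceding "/" (if any).
            out = out[: out.rfind("/")] if "/" in out else ""
        elif inp in (".", ".."):                      # D
            inp = ""
        else:                                         # E: move one segment over
            end = inp.find("/", 1)
            end = len(inp) if end == -1 else end
            out, inp = out + inp[:end], inp[end:]
    return out                                        # step 3


print(remove_dot_segments("/a/b/c/./../../g"))    # /a/g
print(remove_dot_segments("mid/content=5/../6"))  # mid/6
```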
+
+   Note that dot-segments are intended for use in URI references to
+   express an identifier relative to the hierarchy of names in the base
+   URI.  The remove_dot_segments algorithm respects that hierarchy by
+   removing extra dot-segments rather than treating them as an error or
+   leaving them to be misinterpreted by dereference implementations.
+
+   The following illustrates how the above steps are applied for two
+   examples of merged paths, showing the state of the two buffers after
+   each step.
+
+      STEP   OUTPUT BUFFER         INPUT BUFFER
+
+       1 :                         /a/b/c/./../../g
+       2E:   /a                    /b/c/./../../g
+       2E:   /a/b                  /c/./../../g
+       2E:   /a/b/c                /./../../g
+       2B:   /a/b/c                /../../g
+       2C:   /a/b                  /../g
+       2C:   /a                    /g
+       2E:   /a/g
+
+      STEP   OUTPUT BUFFER         INPUT BUFFER
+
+       1 :                         mid/content=5/../6
+       2E:   mid                   /content=5/../6
+       2E:   mid/content=5         /../6
+       2C:   mid                   /6
+       2E:   mid/6
+
+   Some applications may find it more efficient to implement the
+   remove_dot_segments algorithm by using two segment stacks rather than
+   strings.
+
+      Note: Beware that some older, erroneous implementations will fail
+      to separate a reference's query component from its path component
+      prior to merging the base and reference paths, resulting in an
+      interoperability failure if the query component contains the
+      strings "/../" or "/./".
+
+5.3.  Component Recomposition
+
+   Parsed URI components can be recomposed to obtain the corresponding
+   URI reference string.  Using pseudocode, this would be:
+
+      result = ""
+
+      if defined(scheme) then
+         append scheme to result;
+         append ":" to result;
+      endif;
+
+      if defined(authority) then
+         append "//" to result;
+         append authority to result;
+      endif;
+
+      append path to result;
+
+      if defined(query) then
+         append "?" to result;
+         append query to result;
+      endif;
+
+      if defined(fragment) then
+         append "#" to result;
+         append fragment to result;
+      endif;
+
+      return result;
+
+   Note that we are careful to preserve the distinction between a
+   component that is undefined, meaning that its separator was not
+   present in the reference, and a component that is empty, meaning that
+   the separator was present and was immediately followed by the next
+   component separator or the end of the reference.
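+
The recomposition pseudocode translates almost directly into Python. In this sketch, `None` stands for an undefined component and the empty string for an empty one; that encoding is an assumption of the example.

```python
from typing import Optional


def recompose(
    scheme: Optional[str],
    authority: Optional[str],
    path: str,
    query: Optional[str],
    fragment: Optional[str],
) -> str:
    """Rebuild a URI reference string from its parsed components."""
    result = ""
    if scheme is not None:
        result += scheme + ":"
    if authority is not None:
        result += "//" + authority
    result += path
    if query is not None:
        result += "?" + query
    if fragment is not None:
        result += "#" + fragment
    return result


print(recompose("http", "a", "/b", None, None))  # http://a/b
print(recompose("http", "a", "/b", "", None))    # http://a/b?
```

The second call shows why the distinction matters: an empty query keeps its "?" separator, whereas an undefined query contributes nothing.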
+
+5.4.  Reference Resolution Examples
+
+   Within a representation with a well defined base URI of
+
+      http://a/b/c/d;p?q
+
+   a relative reference is transformed to its target URI as follows.
+
+5.4.1.  Normal Examples
+
+      "g:h"           =  "g:h"
+      "g"             =  "http://a/b/c/g"
+      "./g"           =  "http://a/b/c/g"
+      "g/"            =  "http://a/b/c/g/"
+      "/g"            =  "http://a/g"
+      "//g"           =  "http://g"
+      "?y"            =  "http://a/b/c/d;p?y"
+      "g?y"           =  "http://a/b/c/g?y"
+      "#s"            =  "http://a/b/c/d;p?q#s"
+      "g#s"           =  "http://a/b/c/g#s"
+      "g?y#s"         =  "http://a/b/c/g?y#s"
+      ";x"            =  "http://a/b/c/;x"
+      "g;x"           =  "http://a/b/c/g;x"
+      "g;x?y#s"       =  "http://a/b/c/g;x?y#s"
+      ""              =  "http://a/b/c/d;p?q"
+      "."             =  "http://a/b/c/"
+      "./"            =  "http://a/b/c/"
+      ".."            =  "http://a/b/"
+      "../"           =  "http://a/b/"
+      "../g"          =  "http://a/b/g"
+      "../.."         =  "http://a/"
+      "../../"        =  "http://a/"
+      "../../g"       =  "http://a/g"
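+
The examples above can be reproduced with a compact Python sketch of the strict algorithm of Section 5.2, combining the Appendix B regular expression with merge, remove_dot_segments, and recomposition. All function names are illustrative assumptions, and `None` stands for an undefined component.

```python
import re

_REF = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def parse(ref):
    """Return (scheme, authority, path, query, fragment); None means undefined."""
    m = _REF.match(ref)
    return m.group(2), m.group(4), m.group(5), m.group(7), m.group(9)


def remove_dot_segments(path):
    inp, out = path, ""
    while inp:
        if inp.startswith("../"):
            inp = inp[3:]
        elif inp.startswith("./") or inp.startswith("/./"):
            inp = inp[2:]
        elif inp == "/.":
            inp = "/"
        elif inp.startswith("/../") or inp == "/..":
            inp = "/" + inp[4:]
            out = out[: out.rfind("/")] if "/" in out else ""
        elif inp in (".", ".."):
            inp = ""
        else:
            end = inp.find("/", 1)
            end = len(inp) if end == -1 else end
            out, inp = out + inp[:end], inp[end:]
    return out


def merge(base_authority, base_path, ref_path):
    if base_authority is not None and base_path == "":
        return "/" + ref_path
    return base_path[: base_path.rfind("/") + 1] + ref_path


def resolve(base, ref):
    """Strictly resolve ref against base and return the target URI string."""
    b_scheme, b_auth, b_path, b_query, _ = parse(base)
    scheme, auth, path, query, fragment = parse(ref)
    if scheme is not None:
        path = remove_dot_segments(path)
    else:
        scheme = b_scheme
        if auth is not None:
            path = remove_dot_segments(path)
        else:
            auth = b_auth
            if path == "":
                path = b_path
                if query is None:
                    query = b_query
            elif path.startswith("/"):
                path = remove_dot_segments(path)
            else:
                path = remove_dot_segments(merge(b_auth, b_path, path))
    result = scheme + ":"
    if auth is not None:
        result += "//" + auth
    result += path
    if query is not None:
        result += "?" + query
    if fragment is not None:
        result += "#" + fragment
    return result


BASE = "http://a/b/c/d;p?q"
for ref in ("g", "../g", "?y", "#s", "//g", "g:h"):
    print(f"{ref!r:8} = {resolve(BASE, ref)!r}")
```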
+
+5.4.2.  Abnormal Examples
+
+   Although the following abnormal examples are unlikely to occur in
+   normal practice, all URI parsers should be capable of resolving them
+   consistently.  Each example uses the same base as that above.
+
+   Parsers must be careful in handling cases where there are more ".."
+   segments in a relative-path reference than there are hierarchical
+   levels in the base URI's path.  Note that the ".." syntax cannot be
+   used to change the authority component of a URI.
+
+      "../../../g"    =  "http://a/g"
+      "../../../../g" =  "http://a/g"
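+
+   As a non-normative illustration, the examples above can be checked
+   mechanically.  The sketch below assumes Python 3.5 or later, whose
+   urllib.parse.urljoin follows these resolution semantics; the names
+   BASE, CASES, resolve, and mismatches are introduced here for
+   illustration only and are not part of this specification.
+

```python
from urllib.parse import urljoin

BASE = "http://a/b/c/d;p?q"

# (relative reference, expected target URI) pairs copied from the examples above.
CASES = [
    ("g", "http://a/b/c/g"),
    ("../g", "http://a/b/g"),
    ("../../g", "http://a/g"),
    ("../../../g", "http://a/g"),      # more ".." segments than path levels
    ("../../../../g", "http://a/g"),   # the authority is never changed
]


def resolve(reference: str, base: str = BASE) -> str:
    """Resolve a relative reference against a base URI."""
    return urljoin(base, reference)


def mismatches(cases=CASES):
    """Return (reference, actual, expected) for every case that disagrees."""
    return [(ref, resolve(ref), want) for ref, want in cases if resolve(ref) != want]
```
+
+   An empty list from mismatches() indicates that the resolver agrees
+   with every listed example.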
+
+   Similarly, parsers must remove the dot-segments "." and ".." when
+   they are complete components of a path, but not when they are only
+   part of a segment.
+
+      "/./g"          =  "http://a/g"
+      "/../g"         =  "http://a/g"
+      "g."            =  "http://a/b/c/g."
+      ".g"            =  "http://a/b/c/.g"
+      "g.."           =  "http://a/b/c/g.."
+      "..g"           =  "http://a/b/c/..g"
+
+   Less likely are cases where the relative reference uses unnecessary
+   or nonsensical forms of the "." and ".." complete path segments.
+
+      "./../g"        =  "http://a/b/g"
+      "./g/."         =  "http://a/b/c/g/"
+      "g/./h"         =  "http://a/b/c/g/h"
+      "g/../h"        =  "http://a/b/c/h"
+      "g;x=1/./y"     =  "http://a/b/c/g;x=1/y"
+      "g;x=1/../y"    =  "http://a/b/c/y"
+
+   Some applications fail to separate the reference's query and/or
+   fragment components from the path component before merging it with
+   the base path and removing dot-segments.  This error is rarely
+   noticed, as typical usage of a fragment never includes the hierarchy
+   ("/") character and the query component is not normally used within
+   relative references.
+
+      "g?y/./x"       =  "http://a/b/c/g?y/./x"
+      "g?y/../x"      =  "http://a/b/c/g?y/../x"
+      "g#s/./x"       =  "http://a/b/c/g#s/./x"
+      "g#s/../x"      =  "http://a/b/c/g#s/../x"
+
+   Some parsers allow the scheme name to be present in a relative
+   reference if it is the same as the base URI scheme.  This is
+   considered to be a loophole in prior specifications of partial URI
+   [RFC1630].  Its use should be avoided but is allowed for backward
+   compatibility.
+
+      "http:g"        =  "http:g"         ; for strict parsers
+                      /  "http://a/b/c/g" ; for backward compatibility
+
+6.  Normalization and Comparison
+
+   One of the most common operations on URIs is simple comparison:
+   determining whether two URIs are equivalent without using the URIs to
+   access their respective resource(s).  A comparison is performed every
+   time a response cache is accessed, a browser checks its history to
+   color a link, or an XML parser processes tags within a namespace.
+   Extensive normalization prior to comparison of URIs is often used by
+   spiders and indexing engines to prune a search space or to reduce
+   duplication of request actions and response storage.
+
+   URI comparison is performed for some particular purpose.  Protocols
+   or implementations that compare URIs for different purposes will
+   often be subject to differing design trade-offs with regard to how
+   much effort should be spent in reducing aliased identifiers.  This
+   section describes various methods that may be used to compare URIs,
+   the trade-offs between them, and the types of applications that might
+   use them.
+
+6.1.  Equivalence
+
+   Because URIs exist to identify resources, presumably they should be
+   considered equivalent when they identify the same resource.  However,
+   this definition of equivalence is not of much practical use, as there
+   is no way for an implementation to compare two resources unless it
+   has full knowledge or control of them.  For this reason,
+   determination of equivalence or difference of URIs is based on string
+   comparison, perhaps augmented by reference to additional rules
+   provided by URI scheme definitions.  We use the terms "different" and
+   "equivalent" to describe the possible outcomes of such comparisons,
+   but there are many application-dependent versions of equivalence.
+
+   Even though it is possible to determine that two URIs are equivalent,
+   URI comparison is not sufficient to determine whether two URIs
+   identify different resources.  For example, an owner of two different
+   domain names could decide to serve the same resource from both,
+   resulting in two different URIs.  Therefore, comparison methods are
+   designed to minimize false negatives while strictly avoiding false
+   positives.
+
+   In testing for equivalence, applications should not directly compare
+   relative references; the references should be converted to their
+   respective target URIs before comparison.  When URIs are compared to
+   select (or avoid) a network action, such as retrieval of a
+   representation, fragment components (if any) should be excluded from
+   the comparison.
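+
+   The two precautions in the preceding paragraph can be sketched as a
+   single helper.  This is only an illustration, assuming Python's
+   urllib.parse module; retrieval_key is a hypothetical name.
+

```python
from urllib.parse import urldefrag, urljoin


def retrieval_key(base: str, reference: str) -> str:
    """Target URI of `reference` with any fragment removed.

    The reference is first converted to its target URI, then the
    fragment is dropped, so the result is suitable for deciding
    whether a network action (such as a retrieval) can be shared.
    """
    target = urljoin(base, reference)
    return urldefrag(target).url
```
+
+   Two references may then be compared by comparing their keys rather
+   than the reference strings themselves.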
+
+6.2.  Comparison Ladder
+
+   A variety of methods are used in practice to test URI equivalence.
+   These methods fall into a range, distinguished by the amount of
+   processing required and the degree to which the probability of false
+   negatives is reduced.  As noted above, false negatives cannot be
+   eliminated.  In practice, their probability can be reduced, but this
+   reduction requires more processing and is not cost-effective for all
+   applications.
+
+   If this range of comparison practices is considered as a ladder, the
+   following discussion will climb the ladder, starting with practices
+   that are cheap but have a relatively higher chance of producing false
+   negatives, and proceeding to those that have higher computational
+   cost and lower risk of false negatives.
+
+6.2.1.  Simple String Comparison
+
+   If two URIs, when considered as character strings, are identical,
+   then it is safe to conclude that they are equivalent.  This type of
+   equivalence test has very low computational cost and is in wide use
+   in a variety of applications, particularly in the domain of parsing.
+
+   Testing strings for equivalence requires some basic precautions.
+   This procedure is often referred to as "bit-for-bit" or
+   "byte-for-byte" comparison, which is potentially misleading.  Testing
+   strings for equality is normally based on pair comparison of the
+   characters that make up the strings, starting from the first and
+   proceeding until both strings are exhausted and all characters are
+   found to be equal, until a pair of characters compares unequal, or
+   until one of the strings is exhausted before the other.
+
+   This character comparison requires that each pair of characters be
+   put in comparable form.  For example, should one URI be stored in a
+   byte array in EBCDIC encoding and the second in a Java String object
+   (UTF-16), bit-for-bit comparisons applied naively will produce
+   errors.  It is better to speak of equality on a character-for-
+   character basis rather than on a byte-for-byte or bit-for-bit basis.
+   In practical terms, character-by-character comparisons should be done
+   codepoint-by-codepoint after conversion to a common character
+   encoding.
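+
+   A minimal sketch of that precaution follows, assuming Python, where
+   str values compare codepoint-by-codepoint and "cp037" is one of the
+   EBCDIC codecs shipped with the standard library.  The function and
+   variable names are illustrative only.
+

```python
def identical_characters(raw: bytes, raw_encoding: str, text: str) -> bool:
    """Compare a stored byte array with a string, character for character.

    The bytes are first converted to the same representation as `text`;
    comparing the raw bytes directly would depend on the encodings.
    """
    return raw.decode(raw_encoding) == text


uri = "http://example.com/a"
ebcdic_copy = uri.encode("cp037")    # same characters, EBCDIC octets
utf16_copy = uri.encode("utf-16")    # same characters, UTF-16 octets
```
+
+   Here ebcdic_copy and utf16_copy differ byte-for-byte even though
+   both represent the same sequence of characters.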
+
+   False negatives are caused by the production and use of URI aliases.
+   Unnecessary aliases can be reduced, regardless of the comparison
+   method, by consistently providing URI references in an already-
+   normalized form (i.e., a form identical to what would be produced
+   after normalization is applied, as described below).
+
+   Protocols and data formats often limit some URI comparisons to simple
+   string comparison, based on the theory that people and
+   implementations will, in their own best interest, be consistent in
+   providing URI references, or at least consistent enough to negate any
+   efficiency that might be obtained from further normalization.
+
+6.2.2.  Syntax-Based Normalization
+
+   Implementations may use logic based on the definitions provided by
+   this specification to reduce the probability of false negatives.
+   This processing is moderately higher in cost than character-for-
+   character string comparison.  For example, an application using this
+   approach could reasonably consider the following two URIs equivalent:
+
+      example://a/b/c/%7Bfoo%7D
+      eXAMPLE://a/./b/../b/%63/%7bfoo%7d
+
+   Web user agents, such as browsers, typically apply this type of URI
+   normalization when determining whether a cached response is
+   available.  Syntax-based normalization includes such techniques as
+   case normalization, percent-encoding normalization, and removal of
+   dot-segments.
+
+6.2.2.1.  Case Normalization
+
+   For all URIs, the hexadecimal digits within a percent-encoding
+   triplet (e.g., "%3a" versus "%3A") are case-insensitive and therefore
+   should be normalized to use uppercase letters for the digits A-F.
+
+   When a URI uses components of the generic syntax, the component
+   syntax equivalence rules always apply; namely, that the scheme and
+   host are case-insensitive and therefore should be normalized to
+   lowercase.  For example, the URI <HTTP://www.EXAMPLE.com/> is
+   equivalent to <http://www.example.com/>.  The other generic syntax
+   components are assumed to be case-sensitive unless specifically
+   defined otherwise by the scheme (see Section 6.2.3).
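+
+   The two rules above might be applied as in the following sketch.
+   It assumes Python and splits the URI with the regular expression
+   given in Appendix B; normalize_case and the constant names are
+   hypothetical.  Only the scheme, the host (and port), and the
+   hexadecimal digits of percent-encodings are touched.
+

```python
import re

# Component-splitting expression, as given in Appendix B.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")
PCT_TRIPLET = re.compile(r"%[0-9A-Fa-f]{2}")


def normalize_case(uri: str) -> str:
    """Lowercase scheme and host; uppercase hex digits in %XX triplets."""
    m = URI_RE.match(uri)
    # Groups 1, 3, 5, 6, 8 include their delimiters, so joining them
    # reproduces the input exactly (empty delimiters are preserved).
    scheme, authority, path, query, fragment = (
        m.group(i) or "" for i in (1, 3, 5, 6, 8)
    )
    # Userinfo is case-sensitive; only the part after the last "@" is lowered.
    userinfo, at, hostport = authority.rpartition("@")
    lowered = scheme.lower() + userinfo + at + hostport.lower() + path + query + fragment
    return PCT_TRIPLET.sub(lambda t: t.group(0).upper(), lowered)
```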
+
+6.2.2.2.  Percent-Encoding Normalization
+
+   The percent-encoding mechanism (Section 2.1) is a frequent source of
+   variance among otherwise identical URIs.  In addition to the case
+   normalization issue noted above, some URI producers percent-encode
+   octets that do not require percent-encoding, resulting in URIs that
+   are equivalent to their non-encoded counterparts.  These URIs should
+   be normalized by decoding any percent-encoded octet that corresponds
+   to an unreserved character, as described in Section 2.3.
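+
+   As an illustration only, the decoding step could look like the
+   sketch below (Python; decode_unreserved is a hypothetical name).
+   Triplets that stand for any character outside the unreserved set
+   are left exactly as they were.
+

```python
import re
import string

# unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def decode_unreserved(uri: str) -> str:
    """Decode percent-encoded octets that correspond to unreserved characters."""

    def replace(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else match.group(0)

    return re.sub(r"%([0-9A-Fa-f]{2})", replace, uri)
```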
+
+6.2.2.3.  Path Segment Normalization
+
+   The complete path segments "." and ".." are intended only for use
+   within relative references (Section 4.1) and are removed as part of
+   the reference resolution process (Section 5.2).  However, some
+   deployed implementations incorrectly assume that reference resolution
+   is not necessary when the reference is already a URI and thus fail to
+   remove dot-segments when they occur in non-relative paths.  URI
+   normalizers should remove dot-segments by applying the
+   remove_dot_segments algorithm to the path, as described in
+   Section 5.2.4.
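+
+   For reference, one possible rendering of that algorithm is sketched
+   below in Python.  It follows the input-buffer/output-buffer steps of
+   Section 5.2.4 and is meant as an illustration rather than as a
+   normative implementation.
+

```python
def remove_dot_segments(path: str) -> str:
    """Remove "." and ".." complete path segments from a path."""
    output = []          # output buffer, one segment (with leading "/") per item
    remaining = path     # input buffer
    while remaining:
        if remaining.startswith("../"):
            remaining = remaining[3:]
        elif remaining.startswith("./"):
            remaining = remaining[2:]
        elif remaining.startswith("/./"):
            remaining = remaining[2:]            # "/./" -> "/"
        elif remaining == "/.":
            remaining = "/"
        elif remaining.startswith("/../"):
            remaining = remaining[3:]            # "/../" -> "/"
            if output:
                output.pop()                     # drop the previous segment
        elif remaining == "/..":
            remaining = "/"
            if output:
                output.pop()
        elif remaining in (".", ".."):
            remaining = ""
        else:
            # Move the first segment (including any leading "/") to the output.
            cut = remaining.find("/", 1)
            if cut == -1:
                output.append(remaining)
                remaining = ""
            else:
                output.append(remaining[:cut])
                remaining = remaining[cut:]
    return "".join(output)
```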
+
+6.2.3.  Scheme-Based Normalization
+
+   The syntax and semantics of URIs vary from scheme to scheme, as
+   described by the defining specification for each scheme.
+   Implementations may use scheme-specific rules, at further processing
+   cost, to reduce the probability of false negatives.  For example,
+   because the "http" scheme makes use of an authority component, has a
+   default port of "80", and defines an empty path to be equivalent to
+   "/", the following four URIs are equivalent:
+
+      http://example.com
+      http://example.com/
+      http://example.com:/
+      http://example.com:80/
+
+   In general, a URI that uses the generic syntax for authority with an
+   empty path should be normalized to a path of "/".  Likewise, an
+   explicit ":port", for which the port is empty or the default for the
+   scheme, is equivalent to one where the port and its ":" delimiter are
+   elided and thus should be removed by scheme-based normalization.  For
+   example, the second URI above is the normal form for the "http"
+   scheme.
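+
+   The port and empty-path rules above can be sketched as follows.
+   The code assumes Python and the Appendix B regular expression;
+   DEFAULT_PORTS is a hypothetical table that lists only the "http"
+   default stated in this section, and the function name is likewise
+   an assumption.
+

```python
import re

# Component-splitting expression, as given in Appendix B.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")
# Everything before an optional trailing ":port".
AUTHORITY_RE = re.compile(r"(.*?)(?::([0-9]*))?", re.S)
DEFAULT_PORTS = {"http": "80"}   # hypothetical table; extend per scheme


def normalize_authority_form(uri: str) -> str:
    """Drop an empty or default port and turn an empty path into "/"."""
    m = URI_RE.match(uri)
    if m.group(3) is None:       # no authority component: nothing to do
        return uri
    scheme = (m.group(2) or "").lower()
    host, port = AUTHORITY_RE.fullmatch(m.group(4)).groups()
    if port is not None and port not in ("", DEFAULT_PORTS.get(scheme)):
        host = host + ":" + port  # keep a non-default port
    path = m.group(5) or "/"
    # Query and fragment delimiters are kept even when the component is empty.
    return (m.group(1) or "") + "//" + host + path + (m.group(6) or "") + (m.group(8) or "")
```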
+
+   Another case where normalization varies by scheme is in the handling
+   of an empty authority component or empty host subcomponent.  For many
+   scheme specifications, an empty authority or host is considered an
+   error; for others, it is considered equivalent to "localhost" or the
+   end-user's host.  When a scheme defines a default for authority and a
+   URI reference to that default is desired, the reference should be
+   normalized to an empty authority for the sake of uniformity, brevity,
+   and internationalization.  If, however, either the userinfo or port
+   subcomponents are non-empty, then the host should be given explicitly
+   even if it matches the default.
+
+   Normalization should not remove delimiters when their associated
+   component is empty unless licensed to do so by the scheme
+   specification.  For example, the URI "http://example.com/?" cannot be
+   assumed to be equivalent to any of the examples above.  Likewise, the
+   presence or absence of delimiters within a userinfo subcomponent is
+   usually significant to its interpretation.  The fragment component is
+   not subject to any scheme-based normalization; thus, two URIs that
+   differ only by the suffix "#" are considered different regardless of
+   the scheme.
+
+   Some schemes define additional subcomponents that consist of case-
+   insensitive data, giving an implicit license to normalizers to
+   convert this data to a common case (e.g., all lowercase).  For
+   example, URI schemes that define a subcomponent of path to contain an
+   Internet hostname, such as the "mailto" URI scheme, cause that
+   subcomponent to be case-insensitive and thus subject to case
+   normalization (e.g., "mailto:Joe@Example.COM" is equivalent to
+   "mailto:Joe@example.com", even though the generic syntax considers
+   the path component to be case-sensitive).
+
+   Other scheme-specific normalizations are possible.
+
+6.2.4.  Protocol-Based Normalization
+
+   Substantial effort to reduce the incidence of false negatives is
+   often cost-effective for web spiders.  Therefore, they implement even
+   more aggressive techniques in URI comparison.  For example, if they
+   observe that a URI such as
+
+      http://example.com/data
+
+   redirects to a URI differing only in the trailing slash
+
+      http://example.com/data/
+
+   they will likely regard the two as equivalent in the future.  This
+   kind of technique is only appropriate when equivalence is clearly
+   indicated by both the result of accessing the resources and the
+   common conventions of their scheme's dereference algorithm (in this
+   case, use of redirection by HTTP origin servers to avoid problems
+   with relative references).
+
+7.  Security Considerations
+
+   A URI does not in itself pose a security threat.  However, as URIs
+   are often used to provide a compact set of instructions for access to
+   network resources, care must be taken to properly interpret the data
+   within a URI, to prevent that data from causing unintended access,
+   and to avoid including data that should not be revealed in plain
+   text.
+
+7.1.  Reliability and Consistency
+
+   There is no guarantee that once a URI has been used to retrieve
+   information, the same information will be retrievable by that URI in
+   the future.  Nor is there any guarantee that the information
+   retrievable via that URI in the future will be observably similar to
+   that retrieved in the past.  The URI syntax does not constrain how a
+   given scheme or authority apportions its namespace or maintains it
+   over time.  Such guarantees can only be obtained from the person(s)
+   controlling that namespace and the resource in question.  A specific
+   URI scheme may define additional semantics, such as name persistence,
+   if those semantics are required of all naming authorities for that
+   scheme.
+
+7.2.  Malicious Construction
+
+   It is sometimes possible to construct a URI so that an attempt to
+   perform a seemingly harmless, idempotent operation, such as the
+   retrieval of a representation, will in fact cause a possibly damaging
+   remote operation.  The unsafe URI is typically constructed by
+   specifying a port number other than that reserved for the network
+   protocol in question.  The client unwittingly contacts a site running
+   a different protocol service, and data within the URI contains
+   instructions that, when interpreted according to this other protocol,
+   cause an unexpected operation.  A frequent example of such abuse has
+   been the use of a protocol-based scheme with a port component of
+   "25", thereby fooling user agent software into sending an unintended
+   or impersonating message via an SMTP server.
+
+   Applications should prevent dereference of a URI that specifies a TCP
+   port number within the "well-known port" range (0 - 1023) unless the
+   protocol being used to dereference that URI is compatible with the
+   protocol expected on that well-known port.  Although IANA maintains a
+   registry of well-known ports, applications should make such
+   restrictions user-configurable to avoid preventing the deployment of
+   new services.
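+
+   One way an application might apply this check is sketched below,
+   assuming Python's urllib.parse.  COMPATIBLE_PORTS stands in for the
+   user-configurable restriction list; its single entry and the name
+   blocks_dereference are assumptions made for the example.
+

```python
from urllib.parse import urlsplit

# Hypothetical, user-configurable table: scheme -> well-known ports
# on which a compatible protocol is expected.
COMPATIBLE_PORTS = {"http": {80}}


def blocks_dereference(uri: str, compatible=COMPATIBLE_PORTS) -> bool:
    """True if the URI names a well-known port that does not suit its scheme."""
    parts = urlsplit(uri)
    port = parts.port            # None when no port is given
    if port is None or port > 1023:
        return False             # default port, or outside the well-known range
    return port not in compatible.get(parts.scheme, set())
```
+
+   With this table, a reference such as "http://example.com:25/" is
+   refused, while an unprivileged port such as 8080 is not affected.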
+
+   When a URI contains percent-encoded octets that match the delimiters
+   for a given resolution or dereference protocol (for example, CR and
+   LF characters for the TELNET protocol), these percent-encodings must
+   not be decoded before transmission across that protocol.  Transfer of
+   the percent-encoding, which might violate the protocol, is less
+   harmful than allowing decoded octets to be interpreted as additional
+   operations or parameters, perhaps triggering an unexpected and
+   possibly harmful remote operation.
+
+7.3.  Back-End Transcoding
+
+   When a URI is dereferenced, the data within it is often parsed by
+   both the user agent and one or more servers.  In HTTP, for example, a
+   typical user agent will parse a URI into its five major components,
+   access the authority's server, and send it the data within the
+   authority, path, and query components.  A typical server will take
+   that information, parse the path into segments and the query into
+   key/value pairs, and then invoke implementation-specific handlers to
+   respond to the request.  As a result, a common security concern for
+   server implementations that handle a URI, either as a whole or split
+   into separate components, is proper interpretation of the octet data
+   represented by the characters and percent-encodings within that URI.
+
+   Percent-encoded octets must be decoded at some point during the
+   dereference process.  Applications must split the URI into its
+   components and subcomponents prior to decoding the octets, as
+   otherwise the decoded octets might be mistaken for delimiters.
+   Security checks of the data within a URI should be applied after
+   decoding the octets.  Note, however, that the "%00" percent-encoding
+   (NUL) may require special handling and should be rejected if the
+   application is not expecting to receive raw data within a component.
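+
+   The ordering requirement can be illustrated with a short sketch,
+   assuming Python's urllib.parse.unquote; decoded_segments is a
+   hypothetical helper.  The path is split on "/" first and each
+   segment is decoded afterwards, so an encoded "%2F" stays inside its
+   segment; a decoded NUL is rejected.
+

```python
from urllib.parse import unquote


def decoded_segments(path: str) -> list:
    """Split a path into segments, then percent-decode each segment."""
    segments = [unquote(segment) for segment in path.split("/")]
    if any("\x00" in segment for segment in segments):
        raise ValueError("NUL (%00) is not expected in a path segment")
    return segments


# Decoding before splitting turns the encoded "/" into a delimiter.
wrong_order = unquote("/docs/a%2Fb").split("/")
right_order = decoded_segments("/docs/a%2Fb")
```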
+
+   Special care should be taken when the URI path interpretation process
+   involves the use of a back-end file system or related system
+   functions.  File systems typically assign an operational meaning to
+   special characters, such as the "/", "\", ":", "[", and "]"
+   characters, and to special device names like ".", "..", "...", "aux",
+   "lpt", etc.  In some cases, merely testing for the existence of such
+   a name will cause the operating system to pause or invoke unrelated
+   system calls, leading to significant security concerns regarding
+   denial of service and unintended data transfer.  It would be
+   impossible for this specification to list all such significant
+   characters and device names.  Implementers should research the
+   reserved names and characters for the types of storage device that
+   may be attached to their applications and restrict the use of data
+   obtained from URI components accordingly.
+
+7.4.  Rare IP Address Formats
+
+   Although the URI syntax for IPv4address only allows the common
+   dotted-decimal form of IPv4 address literal, many implementations
+   that process URIs make use of platform-dependent system routines,
+   such as gethostbyname() and inet_aton(), to translate the string
+   literal to an actual IP address.  Unfortunately, such system routines
+   often allow and process a much larger set of formats than those
+   described in Section 3.2.2.
+
+   For example, many implementations allow dotted forms of three
+   numbers, wherein the last part is interpreted as a 16-bit quantity
+   and placed in the right-most two bytes of the network address (e.g.,
+   a Class B network).  Likewise, a dotted form of two numbers means
+   that the last part is interpreted as a 24-bit quantity and placed in
+   the right-most three bytes of the network address (Class A), and a
+   single number (without dots) is interpreted as a 32-bit quantity and
+   stored directly in the network address.  Adding further to the
+   confusion, some implementations allow each dotted part to be
+   interpreted as decimal, octal, or hexadecimal, as specified in the C
+   language (i.e., a leading 0x or 0X implies hexadecimal; a leading 0
+   implies octal; otherwise, the number is interpreted as decimal).
+
+   These additional IP address formats are not allowed in the URI syntax
+   due to differences between platform implementations.  However, they
+   can become a security concern if an application attempts to filter
+   access to resources based on the IP address in string literal format.
+   If this filtering is performed, literals should be converted to
+   numeric form and filtered based on the numeric value, and not on a
+   prefix or suffix of the string form.
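+
+   The following sketch shows numeric filtering under the rules
+   described in this section.  It is written in Python and does not
+   call any platform routine; the parser and its names are assumptions
+   that model the one- to four-part forms and the C-style radix
+   prefixes described above, not the behavior of any particular
+   system library.
+

```python
import ipaddress
import re

PART_RE = re.compile(r"0[xX][0-9A-Fa-f]+|[0-9]+")


def parse_part(part: str) -> int:
    """Parse one dotted part as hexadecimal (0x), octal (leading 0), or decimal."""
    if not PART_RE.fullmatch(part):
        raise ValueError("not a number: %r" % part)
    if part[:2] in ("0x", "0X"):
        return int(part[2:], 16)
    if len(part) > 1 and part[0] == "0":
        return int(part, 8)
    return int(part, 10)


def legacy_ipv4_to_int(literal: str) -> int:
    """Convert a one- to four-part literal to its 32-bit numeric value."""
    parts = [parse_part(p) for p in literal.split(".")]
    if len(parts) > 4:
        raise ValueError("too many parts")
    *leading, last = parts
    # Each leading part is one byte; the last part fills the remaining bytes.
    if any(n > 0xFF for n in leading) or last >= 1 << (8 * (4 - len(leading))):
        raise ValueError("part out of range")
    value = last
    for index, n in enumerate(leading):
        value |= n << (24 - 8 * index)
    return value


def is_loopback(literal: str) -> bool:
    """Filter on the numeric value, not on a prefix of the string form."""
    return ipaddress.IPv4Address(legacy_ipv4_to_int(literal)).is_loopback
```
+
+   A string-prefix test for "127." would miss literals such as
+   "0x7f.1" or "2130706433", which the numeric test treats the same as
+   "127.0.0.1".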
+
+7.5.  Sensitive Information
+
+   URI producers should not provide a URI that contains a username or
+   password that is intended to be secret.  URIs are frequently
+   displayed by browsers, stored in clear text bookmarks, and logged by
+   user agent history and intermediary applications (proxies).  A
+   password appearing within the userinfo component is deprecated and
+   should be considered an error (or simply ignored) except in those
+   rare cases where the 'password' parameter is intended to be public.
+
+7.6.  Semantic Attacks
+
+   Because the userinfo subcomponent is rarely used and appears before
+   the host in the authority component, it can be used to construct a
+   URI intended to mislead a human user by appearing to identify one
+   (trusted) naming authority while actually identifying a different
+   authority hidden behind the noise.  For example
+
+      ftp://cnn.example.com&story=breaking_news@10.0.0.1/top_story.htm
+
+   might lead a human user to assume that the host is 'cnn.example.com',
+   whereas it is actually '10.0.0.1'.  Note that a misleading userinfo
+   subcomponent could be much longer than the example above.
+
+   A misleading URI, such as that above, is an attack on the user's
+   preconceived notions about the meaning of a URI rather than an attack
+   on the software itself.  User agents may be able to reduce the impact
+   of such attacks by distinguishing the various components of the URI
+   when they are rendered, such as by using a different color or tone to
+   render userinfo if any is present, though there is no panacea.  More
+   information on URI-based semantic attacks can be found in [Siedzik].
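+
+   A user agent that wishes to render the components distinctly first
+   has to separate them.  A minimal sketch, assuming Python's
+   urllib.parse (authority_parts is a hypothetical name):
+

```python
from urllib.parse import urlsplit


def authority_parts(uri: str):
    """Return (userinfo or None, host) so each can be displayed separately."""
    parts = urlsplit(uri)
    # Userinfo is everything before the last "@" in the authority.
    userinfo, at, _hostport = parts.netloc.rpartition("@")
    return (userinfo if at else None), parts.hostname


misleading = "ftp://cnn.example.com&story=breaking_news@10.0.0.1/top_story.htm"
userinfo, host = authority_parts(misleading)
```
+
+   For the example above, the host that would actually be contacted
+   is "10.0.0.1"; the text resembling a hostname is only userinfo.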
+
+8.  IANA Considerations
+
+   URI scheme names, as defined by <scheme> in Section 3.1, form a
+   registered namespace that is managed by IANA according to the
+   procedures defined in [BCP35].  No IANA actions are required by this
+   document.
+
+9.  Acknowledgements
+
+   This specification is derived from RFC 2396 [RFC2396], RFC 1808
+   [RFC1808], and RFC 1738 [RFC1738]; the acknowledgements in those
+   documents still apply.  It also incorporates the update (with
+   corrections) for IPv6 literals in the host syntax, as defined by
+   Robert M. Hinden, Brian E. Carpenter, and Larry Masinter in
+   [RFC2732].  In addition, contributions by Gisle Aas, Reese Anschultz,
+   Daniel Barclay, Tim Bray, Mike Brown, Rob Cameron, Jeremy Carroll,
+   Dan Connolly, Adam M. Costello, John Cowan, Jason Diamond, Martin
+   Duerst, Stefan Eissing, Clive D.W. Feather, Al Gilman, Tony Hammond,
+   Elliotte Harold, Pat Hayes, Henry Holtzman, Ian B. Jacobs, Michael
+   Kay, John C. Klensin, Graham Klyne, Dan Kohn, Bruce Lilly, Andrew
+   Main, Dave McAlpin, Ira McDonald, Michael Mealling, Ray Merkert,
+   Stephen Pollei, Julian Reschke, Tomas Rokicki, Miles Sabin, Kai
+   Schaetzl, Mark Thomson, Ronald Tschalaer, Norm Walsh, Marc Warne,
+   Stuart Williams, and Henry Zongaro are gratefully acknowledged.
+
+10.  References
+
+10.1.  Normative References
+
+   [ASCII]    American National Standards Institute, "Coded Character
+              Set -- 7-bit American Standard Code for Information
+              Interchange", ANSI X3.4, 1986.
+
+   [RFC2234]  Crocker, D. and P. Overell, "Augmented BNF for Syntax
+              Specifications: ABNF", RFC 2234, November 1997.
+
+   [STD63]    Yergeau, F., "UTF-8, a transformation format of
+              ISO 10646", STD 63, RFC 3629, November 2003.
+
+   [UCS]      International Organization for Standardization,
+              "Information Technology - Universal Multiple-Octet Coded
+              Character Set (UCS)", ISO/IEC 10646:2003, December 2003.
+
+10.2.  Informative References
+
+   [BCP19]    Freed, N. and J. Postel, "IANA Charset Registration
+              Procedures", BCP 19, RFC 2978, October 2000.
+
+   [BCP35]    Petke, R. and I. King, "Registration Procedures for URL
+              Scheme Names", BCP 35, RFC 2717, November 1999.
+
+   [RFC0952]  Harrenstien, K., Stahl, M., and E. Feinler, "DoD Internet
+              host table specification", RFC 952, October 1985.
+
+   [RFC1034]  Mockapetris, P., "Domain names - concepts and facilities",
+              STD 13, RFC 1034, November 1987.
+
+   [RFC1123]  Braden, R., "Requirements for Internet Hosts - Application
+              and Support", STD 3, RFC 1123, October 1989.
+
+   [RFC1535]  Gavron, E., "A Security Problem and Proposed Correction
+              With Widely Deployed DNS Software", RFC 1535,
+              October 1993.
+
+   [RFC1630]  Berners-Lee, T., "Universal Resource Identifiers in WWW: A
+              Unifying Syntax for the Expression of Names and Addresses
+              of Objects on the Network as used in the World-Wide Web",
+              RFC 1630, June 1994.
+
+   [RFC1736]  Kunze, J., "Functional Recommendations for Internet
+              Resource Locators", RFC 1736, February 1995.
+
+   [RFC1737]  Sollins, K. and L. Masinter, "Functional Requirements for
+              Uniform Resource Names", RFC 1737, December 1994.
+
+   [RFC1738]  Berners-Lee, T., Masinter, L., and M. McCahill, "Uniform
+              Resource Locators (URL)", RFC 1738, December 1994.
+
+   [RFC1808]  Fielding, R., "Relative Uniform Resource Locators",
+              RFC 1808, June 1995.
+
+   [RFC2046]  Freed, N. and N. Borenstein, "Multipurpose Internet Mail
+              Extensions (MIME) Part Two: Media Types", RFC 2046,
+              November 1996.
+
+   [RFC2141]  Moats, R., "URN Syntax", RFC 2141, May 1997.
+
+   [RFC2396]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
+              Resource Identifiers (URI): Generic Syntax", RFC 2396,
+              August 1998.
+
+   [RFC2518]  Goland, Y., Whitehead, E., Faizi, A., Carter, S., and D.
+              Jensen, "HTTP Extensions for Distributed Authoring --
+              WEBDAV", RFC 2518, February 1999.
+
+   [RFC2557]  Palme, J., Hopmann, A., and N. Shelness, "MIME
+              Encapsulation of Aggregate Documents, such as HTML
+              (MHTML)", RFC 2557, March 1999.
+
+   [RFC2718]  Masinter, L., Alvestrand, H., Zigmond, D., and R. Petke,
+              "Guidelines for new URL Schemes", RFC 2718, November 1999.
+
+   [RFC2732]  Hinden, R., Carpenter, B., and L. Masinter, "Format for
+              Literal IPv6 Addresses in URL's", RFC 2732, December 1999.
+
+   [RFC3305]  Mealling, M. and R. Denenberg, "Report from the Joint
+              W3C/IETF URI Planning Interest Group: Uniform Resource
+              Identifiers (URIs), URLs, and Uniform Resource Names
+              (URNs): Clarifications and Recommendations", RFC 3305,
+              August 2002.
+
+   [RFC3490]  Faltstrom, P., Hoffman, P., and A. Costello,
+              "Internationalizing Domain Names in Applications (IDNA)",
+              RFC 3490, March 2003.
+
+   [RFC3513]  Hinden, R. and S. Deering, "Internet Protocol Version 6
+              (IPv6) Addressing Architecture", RFC 3513, April 2003.
+
+   [Siedzik]  Siedzik, R., "Semantic Attacks: What's in a URL?",
+              April 2001, <http://www.giac.org/practical/gsec/
+              Richard_Siedzik_GSEC.pdf>.
+
+Appendix A.  Collected ABNF for URI
+
+   URI           = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
+
+   hier-part     = "//" authority path-abempty
+                 / path-absolute
+                 / path-rootless
+                 / path-empty
+
+   URI-reference = URI / relative-ref
+
+   absolute-URI  = scheme ":" hier-part [ "?" query ]
+
+   relative-ref  = relative-part [ "?" query ] [ "#" fragment ]
+
+   relative-part = "//" authority path-abempty
+                 / path-absolute
+                 / path-noscheme
+                 / path-empty
+
+   scheme        = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
+
+   authority     = [ userinfo "@" ] host [ ":" port ]
+   userinfo      = *( unreserved / pct-encoded / sub-delims / ":" )
+   host          = IP-literal / IPv4address / reg-name
+   port          = *DIGIT
+
+   IP-literal    = "[" ( IPv6address / IPvFuture  ) "]"
+
+   IPvFuture     = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )
+
+   IPv6address   =                            6( h16 ":" ) ls32
+                 /                       "::" 5( h16 ":" ) ls32
+                 / [               h16 ] "::" 4( h16 ":" ) ls32
+                 / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
+                 / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
+                 / [ *3( h16 ":" ) h16 ] "::"    h16 ":"   ls32
+                 / [ *4( h16 ":" ) h16 ] "::"              ls32
+                 / [ *5( h16 ":" ) h16 ] "::"              h16
+                 / [ *6( h16 ":" ) h16 ] "::"
+
+   h16           = 1*4HEXDIG
+   ls32          = ( h16 ":" h16 ) / IPv4address
+   IPv4address   = dec-octet "." dec-octet "." dec-octet "." dec-octet
+
+   dec-octet     = DIGIT                 ; 0-9
+                 / %x31-39 DIGIT         ; 10-99
+                 / "1" 2DIGIT            ; 100-199
+                 / "2" %x30-34 DIGIT     ; 200-249
+                 / "25" %x30-35          ; 250-255
+
+   reg-name      = *( unreserved / pct-encoded / sub-delims )
+
+   path          = path-abempty    ; begins with "/" or is empty
+                 / path-absolute   ; begins with "/" but not "//"
+                 / path-noscheme   ; begins with a non-colon segment
+                 / path-rootless   ; begins with a segment
+                 / path-empty      ; zero characters
+
+   path-abempty  = *( "/" segment )
+   path-absolute = "/" [ segment-nz *( "/" segment ) ]
+   path-noscheme = segment-nz-nc *( "/" segment )
+   path-rootless = segment-nz *( "/" segment )
+   path-empty    = 0<pchar>
+
+   segment       = *pchar
+   segment-nz    = 1*pchar
+   segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
+                 ; non-zero-length segment without any colon ":"
+
+   pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
+
+   query         = *( pchar / "/" / "?" )
+
+   fragment      = *( pchar / "/" / "?" )
+
+   pct-encoded   = "%" HEXDIG HEXDIG
+
+   unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
+   reserved      = gen-delims / sub-delims
+   gen-delims    = ":" / "/" / "?" / "#" / "[" / "]" / "@"
+   sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
+                 / "*" / "+" / "," / ";" / "="
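+
+   As an illustration only, the dec-octet and IPv4address rules can be
+   transcribed into a regular expression.  The sketch below is in
+   Python; the names DEC_OCTET, IPV4ADDRESS, and is_ipv4address are
+   assumptions made for this example and do not appear in the grammar.
+
```python
import re

# One alternative per line of the dec-octet rule, highest range first.
DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"

# IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
IPV4ADDRESS = re.compile(r"(?:%s\.){3}%s" % (DEC_OCTET, DEC_OCTET))


def is_ipv4address(text):
    """Return True if the whole of text matches the IPv4address rule."""
    return IPV4ADDRESS.fullmatch(text) is not None
```
+
+   With this sketch, "192.0.2.235" is accepted, while "256.0.0.1" and
+   "01.2.3.4" are rejected because no dec-octet alternative matches
+   "256" or "01".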
+
+Appendix B.  Parsing a URI Reference with a Regular Expression
+
+   As the "first-match-wins" algorithm is identical to the "greedy"
+   disambiguation method used by POSIX regular expressions, it is
+   natural and commonplace to use a regular expression for parsing the
+   potential five components of a URI reference.
+
+   The following line is the regular expression for breaking down a
+   well-formed URI reference into its components.
+
+      ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
+       12            3  4          5       6  7        8 9
+
+   The numbers in the second line above are only to assist readability;
+   they indicate the reference points for each subexpression (i.e., each
+   paired parenthesis).  We refer to the value matched for subexpression
+   <n> as $<n>.  For example, matching the above expression to
+
+      http://www.ics.uci.edu/pub/ietf/uri/#Related
+
+   results in the following subexpression matches:
+
+      $1 = http:
+      $2 = http
+      $3 = //www.ics.uci.edu
+      $4 = www.ics.uci.edu
+      $5 = /pub/ietf/uri/
+      $6 = <undefined>
+      $7 = <undefined>
+      $8 = #Related
+      $9 = Related
+
+   where <undefined> indicates that the component is not present, as is
+   the case for the query component in the above example.  Therefore, we
+   can determine the value of the five components as
+
+      scheme    = $2
+      authority = $4
+      path      = $5
+      query     = $7
+      fragment  = $9
+
+   Going in the opposite direction, we can recreate a URI reference from
+   its components by using the algorithm of Section 5.3.
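+
+   The component split described in this appendix can be sketched in
+   Python, whose re module accepts the expression unchanged.  The names
+   URI_REFERENCE and split_uri_reference are assumptions made for this
+   example.  A subexpression that takes no part in the match is
+   reported as None, which plays the role of <undefined>.
+
```python
import re

# The expression from this appendix, copied without modification.
URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_uri_reference(reference):
    """Map subexpressions $2, $4, $5, $7, and $9 to the five components."""
    # Every part of the expression is optional, so the match always succeeds.
    match = URI_REFERENCE.match(reference)
    return {
        "scheme": match.group(2),
        "authority": match.group(4),
        "path": match.group(5),
        "query": match.group(7),
        "fragment": match.group(9),
    }
```
+
+   Applied to http://www.ics.uci.edu/pub/ietf/uri/#Related, the sketch
+   yields the scheme "http", the authority "www.ics.uci.edu", the path
+   "/pub/ietf/uri/", no query (None), and the fragment "Related".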
+
+Appendix C.  Delimiting a URI in Context
+
+   URIs are often transmitted through formats that do not provide a
+   clear context for their interpretation.  For example, there are many
+   occasions when a URI is included in plain text; examples include text
+   sent in email, USENET news, and on printed paper.  In such cases, it
+   is important to be able to delimit the URI from the rest of the text,
+   and in particular from punctuation marks that might be mistaken for
+   part of the URI.
+
+   In practice, URIs are delimited in a variety of ways, but usually
+   within double-quotes "http://example.com/", angle brackets
+   <http://example.com/>, or just by using whitespace:
+
+      http://example.com/
+
+   These wrappers do not form part of the URI.
+
+   In some cases, extra whitespace (spaces, line-breaks, tabs, etc.) may
+   have to be added to break a long URI across lines.  The whitespace
+   should be ignored when the URI is extracted.
+
+   No whitespace should be introduced after a hyphen ("-") character.
+   Because some typesetters and printers may (erroneously) introduce a
+   hyphen at the end of line when breaking it, the interpreter of a URI
+   containing a line break immediately after a hyphen should ignore all
+   whitespace around the line break and should be aware that the hyphen
+   may or may not actually be part of the URI.
+
+   Using <> angle brackets around each URI is especially recommended as
+   a delimiting style for a reference that contains embedded whitespace.
+
+   The prefix "URL:" (with or without a trailing space) was formerly
+   recommended as a way to help distinguish a URI from other bracketed
+   designators, though it is not commonly used in practice and is no
+   longer recommended.
+
+   For robustness, software that accepts user-typed URIs should attempt
+   to recognize and strip both delimiters and embedded whitespace.
+
+   For example, the text
+
+      Yes, Jim, I found it under "http://www.w3.org/Addressing/",
+      but you can probably pick it up from <ftp://foo.example.
+      com/rfc/>.  Note the warning in <http://www.ics.uci.edu/pub/
+      ietf/uri/historical.html#WARNING>.
+
+   contains the URI references
+
+      http://www.w3.org/Addressing/
+      ftp://foo.example.com/rfc/
+      http://www.ics.uci.edu/pub/ietf/uri/historical.html#WARNING
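+
+   A minimal Python sketch of this kind of extraction follows.  It is
+   deliberately naive: it treats every span inside angle brackets or
+   double-quotes as a candidate, does not look for URIs delimited only
+   by whitespace, and makes no attempt at the hyphenation case.  The
+   names DELIMITED and extract_delimited_uris are assumptions made for
+   this example.
+
```python
import re

# A span enclosed in angle brackets or in double-quotes.
DELIMITED = re.compile(r'<([^<>]*)>|"([^"]*)"')


def extract_delimited_uris(text):
    """Return delimited spans with wrappers and whitespace removed."""
    found = []
    for match in DELIMITED.finditer(text):
        inner = match.group(1) if match.group(1) is not None else match.group(2)
        # Whitespace added to break a long URI across lines is ignored.
        inner = re.sub(r"\s+", "", inner)
        # The formerly recommended "URL:" prefix is not part of the URI.
        if inner.startswith("URL:"):
            inner = inner[4:]
        if inner:
            found.append(inner)
    return found
```
+
+   Run over the example text above, the sketch returns the same three
+   URI references, in the order in which they appear.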
+
+
+
+
+
+
+
+
+
+
+
+
+
+Berners-Lee, et al.         Standards Track                    [Page 52]
+
+RFC 3986                   URI Generic Syntax               January 2005
+
+
+Appendix D.  Changes from RFC 2396
+
+D.1.  Additions
+
+   An ABNF rule for URI has been introduced to correspond to one common
+   usage of the term: an absolute URI with optional fragment.
+
+   IPv6 (and later) literals have been added to the list of possible
+   identifiers for the host portion of an authority component, as
+   described by [RFC2732], with the addition of "[" and "]" to the
+   reserved set and a version flag to anticipate future versions of IP
+   literals.  Square brackets are now specified as reserved within the
+   authority component and are not allowed outside their use as
+   delimiters for an IP literal within host.  In order to make this
+   change without changing the technical definition of the path, query,
+   and fragment components, those rules were redefined to directly
+   specify the characters allowed.
+
+   As [RFC2732] defers to [RFC3513] for definition of an IPv6 literal
+   address, which, unfortunately, lacks an ABNF description of
+   IPv6address, we created a new ABNF rule for IPv6address that matches
+   the text representations defined by Section 2.2 of [RFC3513].
+   Likewise, the definition of IPv4address has been improved in order to
+   limit each decimal octet to the range 0-255.
+
+   Section 6, on URI normalization and comparison, has been completely
+   rewritten and extended by using input from Tim Bray and discussion
+   within the W3C Technical Architecture Group.
+
+D.2.  Modifications
+
+   The ad-hoc BNF syntax of RFC 2396 has been replaced with the ABNF of
+   [RFC2234].  This change required all rule names that formerly
+   included underscore characters to be renamed with a dash instead.  In
+   addition, a number of syntax rules have been eliminated or simplified
+   to make the overall grammar more comprehensible.  Specifications that
+   refer to the obsolete grammar rules may be understood by replacing
+   those rules according to the following table:
+
+   +----------------+--------------------------------------------------+
+   | obsolete rule  | translation                                      |
+   +----------------+--------------------------------------------------+
+   | absoluteURI    | absolute-URI                                     |
+   | relativeURI    | relative-part [ "?" query ]                      |
+   | hier_part      | ( "//" authority path-abempty /                  |
+   |                | path-absolute ) [ "?" query ]                    |
+   |                |                                                  |
+   | opaque_part    | path-rootless [ "?" query ]                      |
+   | net_path       | "//" authority path-abempty                      |
+   | abs_path       | path-absolute                                    |
+   | rel_path       | path-rootless                                    |
+   | rel_segment    | segment-nz-nc                                    |
+   | reg_name       | reg-name                                         |
+   | server         | authority                                        |
+   | hostport       | host [ ":" port ]                                |
+   | hostname       | reg-name                                         |
+   | path_segments  | path-abempty                                     |
+   | param          | *<pchar excluding ";">                           |
+   |                |                                                  |
+   | uric           | unreserved / pct-encoded / ";" / "?" / ":"       |
+   |                |  / "@" / "&" / "=" / "+" / "$" / "," / "/"       |
+   |                |                                                  |
+   | uric_no_slash  | unreserved / pct-encoded / ";" / "?" / ":"       |
+   |                |  / "@" / "&" / "=" / "+" / "$" / ","             |
+   |                |                                                  |
+   | mark           | "-" / "_" / "." / "!" / "~" / "*" / "'"          |
+   |                |  / "(" / ")"                                     |
+   |                |                                                  |
+   | escaped        | pct-encoded                                      |
+   | hex            | HEXDIG                                           |
+   | alphanum       | ALPHA / DIGIT                                    |
+   +----------------+--------------------------------------------------+
+
+   Use of the above obsolete rules for the definition of scheme-specific
+   syntax is deprecated.
+
+   Section 2, on characters, has been rewritten to explain what
+   characters are reserved, when they are reserved, and why they are
+   reserved, even when they are not used as delimiters by the generic
+   syntax.  The mark characters that are typically unsafe to decode,
+   including the exclamation mark ("!"), asterisk ("*"), single-quote
+   ("'"), and open and close parentheses ("(" and ")"), have been moved
+   to the reserved set in order to clarify the distinction between
+   reserved and unreserved and, hopefully, to answer the most common
+   question of scheme designers.  Likewise, the section on
+   percent-encoded characters has been rewritten, and URI normalizers
+   are now given license to decode any percent-encoded octets
+   corresponding to unreserved characters.  In general, the terms
+   "escaped" and "unescaped" have been replaced with "percent-encoded"
+   and "decoded", respectively, to reduce confusion with other forms of
+   escape mechanisms.
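+
+   The license to decode percent-encoded octets that correspond to
+   unreserved characters can be illustrated with a short Python sketch.
+   The names UNRESERVED, PCT_ENCODED, and decode_unreserved are
+   assumptions made for this example; every other percent-encoded
+   octet is left exactly as it was found.
+
```python
import re

# unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

# pct-encoded = "%" HEXDIG HEXDIG
PCT_ENCODED = re.compile(r"%([0-9A-Fa-f]{2})")


def decode_unreserved(uri):
    """Decode only those triplets that stand for unreserved characters."""

    def replace(match):
        char = chr(int(match.group(1), 16))
        # Reserved and non-ASCII octets keep their percent-encoded form.
        return char if char in UNRESERVED else match.group(0)

    return PCT_ENCODED.sub(replace, uri)
```
+
+   For instance, "%7E" and "%41" decode to "~" and "A", whereas "%2F"
+   stays encoded because "/" belongs to the reserved set.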
+
+   The ABNF for URI and URI-reference has been redesigned to make them
+   more friendly to LALR parsers and to reduce complexity.  As a result,
+   the layout form of syntax description has been removed, along with
+   the uric, uric_no_slash, opaque_part, net_path, abs_path, rel_path,
+   path_segments, rel_segment, and mark rules.  All references to
+   "opaque" URIs have been replaced with a better description of how the
+   path component may be opaque to hierarchy.  The relativeURI rule has
+   been replaced with relative-ref to avoid unnecessary confusion over
+   whether they are a subset of URI.  The ambiguity regarding the
+   parsing of URI-reference as a URI or a relative-ref with a colon in
+   the first segment has been eliminated through the use of five
+   separate path matching rules.
+
+   The fragment identifier has been moved back into the section on
+   generic syntax components and within the URI and relative-ref rules,
+   though it remains excluded from absolute-URI.  The number sign ("#")
+   character has been moved back to the reserved set as a result of
+   reintegrating the fragment syntax.
+
+   The ABNF has been corrected to allow the path component to be empty.
+   This also allows an absolute-URI to consist of nothing after the
+   "scheme:", as is present in practice with the "dav:" namespace
+   [RFC2518] and with the "about:" scheme used internally by many WWW
+   browser implementations.  The ambiguity regarding the boundary
+   between authority and path has been eliminated through the use of
+   five separate path matching rules.
+
+   Registry-based naming authorities that use the generic syntax are now
+   defined within the host rule.  This change allows current
+   implementations, where whatever name provided is simply fed to the
+   local name resolution mechanism, to be consistent with the
+   specification.  It also removes the need to re-specify DNS name
+   formats here.  Furthermore, it allows the host component to contain
+   percent-encoded octets, which is necessary to enable
+   internationalized domain names to be provided in URIs, processed in
+   their native character encodings at the application layers above URI
+   processing, and passed to an IDNA library as a registered name in the
+   UTF-8 character encoding.  The server, hostport, hostname,
+   domainlabel, toplabel, and alphanum rules have been removed.
+
+   The resolving relative references algorithm of [RFC2396] has been
+   rewritten with pseudocode for this revision to improve clarity and
+   fix the following issues:
+
+   o  [RFC2396] section 5.2, step 6a, failed to account for a base URI
+      with no path.
+
+   o  Restored the behavior of [RFC1808] where, if the reference
+      contains an empty path and a defined query component, the target
+      URI inherits the base URI's path component.
+
+   o  The determination of whether a URI reference is a same-document
+      reference has been decoupled from the URI parser, simplifying the
+      URI processing interface within applications in a way consistent
+      with the internal architecture of deployed URI processing
+      implementations.  The determination is now based on comparison to
+      the base URI after transforming a reference to absolute form,
+      rather than on the format of the reference itself.  This change
+      may result in more references being considered "same-document"
+      under this specification than there would be under the rules given
+      in RFC 2396, especially when normalization is used to reduce
+      aliases.  However, it does not change the status of existing
+      same-document references.
+
+   o  Separated the path merge routine into two routines: merge, for
+      describing combination of the base URI path with a relative-path
+      reference, and remove_dot_segments, for describing how to remove
+      the special "." and ".." segments from a composed path.  The
+      remove_dot_segments algorithm is now applied to all URI reference
+      paths in order to match common implementations and to improve the
+      normalization of URIs in practice.  This change only impacts the
+      parsing of abnormal references and same-scheme references wherein
+      the base URI has a non-hierarchical path.
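+
+   The remove_dot_segments routine named above is specified in Section
+   5.2.4.  A rough Python rendering of that routine is sketched below
+   as an aid to reading, not as a reference implementation.  Keeping
+   the output buffer as a list of segments is a choice made for this
+   example only.
+
```python
def remove_dot_segments(path):
    """Remove "." and ".." segments from a path, one step at a time."""
    output = []  # each entry is one segment, with its leading "/" if any
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]          # "/./" is replaced by "/"
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]          # "/../" is replaced by "/"
            if output:
                output.pop()         # drop the last segment and its "/"
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (up to the next "/") to the output.
            cut = path.find("/", 1)
            if cut == -1:
                cut = len(path)
            output.append(path[:cut])
            path = path[cut:]
    return "".join(output)
```
+
+   Under this sketch, "/a/b/c/./../../g" becomes "/a/g" and
+   "mid/content=5/../6" becomes "mid/6".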
+
+Index
+
+   A
+      ABNF  11
+      absolute  27
+      absolute-path  26
+      absolute-URI  27
+      access  9
+      authority  17, 18
+
+   B
+      base URI  28
+
+   C
+      character encoding  4
+      character  4
+      characters  8, 11
+      coded character set  4
+
+
+
+Berners-Lee, et al.         Standards Track                    [Page 56]
+
+RFC 3986                   URI Generic Syntax               January 2005
+
+
+   D
+      dec-octet  20
+      dereference  9
+      dot-segments  23
+
+   F
+      fragment  16, 24
+
+   G
+      gen-delims  13
+      generic syntax  6
+
+   H
+      h16  20
+      hier-part  16
+      hierarchical  10
+      host  18
+
+   I
+      identifier  5
+      IP-literal  19
+      IPv4  20
+      IPv4address  19, 20
+      IPv6  19
+      IPv6address  19, 20
+      IPvFuture  19
+
+   L
+      locator  7
+      ls32  20
+
+   M
+      merge  32
+
+   N
+      name  7
+      network-path  26
+
+   P
+      path  16, 22, 26
+         path-abempty  22
+         path-absolute  22
+         path-empty  22
+         path-noscheme  22
+         path-rootless  22
+      path-abempty  16, 22, 26
+      path-absolute  16, 22, 26
+      path-empty  16, 22, 26
+      path-rootless  16, 22
+      pchar  23
+      pct-encoded  12
+      percent-encoding  12
+      port  22
+
+   Q
+      query  16, 23
+
+   R
+      reg-name  21
+      registered name  20
+      relative  10, 28
+      relative-path  26
+      relative-ref  26
+      remove_dot_segments  33
+      representation  9
+      reserved  12
+      resolution  9, 28
+      resource  5
+      retrieval  9
+
+   S
+      same-document  27
+      sameness  9
+      scheme  16, 17
+      segment  22, 23
+         segment-nz  23
+         segment-nz-nc  23
+      sub-delims  13
+      suffix  27
+
+   T
+      transcription  8
+
+   U
+      uniform  4
+      unreserved  13
+      URI grammar
+         absolute-URI  27
+         ALPHA  11
+         authority  18
+         CR  11
+         dec-octet  20
+         DIGIT  11
+         DQUOTE  11
+         fragment  24
+         gen-delims  13
+         h16  20
+         HEXDIG  11
+         hier-part  16
+         host  19
+         IP-literal  19
+         IPv4address  20
+         IPv6address  20
+         IPvFuture  19
+         LF  11
+         ls32  20
+         OCTET  11
+         path  22
+         path-abempty  22
+         path-absolute  22
+         path-empty  22
+         path-noscheme  22
+         path-rootless  22
+         pchar  23
+         pct-encoded  12
+         port  22
+         query  24
+         reg-name  21
+         relative-ref  26
+         reserved  13
+         scheme  17
+         segment  23
+         segment-nz  23
+         segment-nz-nc  23
+         SP  11
+         sub-delims  13
+         unreserved  13
+         URI  16
+         URI-reference  25
+         userinfo  18
+      URI  16
+      URI-reference  25
+      URL  7
+      URN  7
+      userinfo  18
+
+
+
+
+
+
+
+
+
+
+
+
+Berners-Lee, et al.         Standards Track                    [Page 59]
+
+RFC 3986                   URI Generic Syntax               January 2005
+
+
+Authors' Addresses
+
+   Tim Berners-Lee
+   World Wide Web Consortium
+   Massachusetts Institute of Technology
+   77 Massachusetts Avenue
+   Cambridge, MA  02139
+   USA
+
+   Phone: +1-617-253-5702
+   Fax:   +1-617-258-5999
+   EMail: timbl@w3.org
+   URI:   http://www.w3.org/People/Berners-Lee/
+
+
+   Roy T. Fielding
+   Day Software
+   5251 California Ave., Suite 110
+   Irvine, CA  92617
+   USA
+
+   Phone: +1-949-679-2960
+   Fax:   +1-949-679-2972
+   EMail: fielding@gbiv.com
+   URI:   http://roy.gbiv.com/
+
+
+   Larry Masinter
+   Adobe Systems Incorporated
+   345 Park Ave
+   San Jose, CA  95110
+   USA
+
+   Phone: +1-408-536-3024
+   EMail: LMM@acm.org
+   URI:   http://larry.masinter.net/
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Berners-Lee, et al.         Standards Track                    [Page 60]
+
+RFC 3986                   URI Generic Syntax               January 2005
+
+
+Full Copyright Statement
+
+   Copyright (C) The Internet Society (2005).
+
+   This document is subject to the rights, licenses and restrictions
+   contained in BCP 78, and except as set forth therein, the authors
+   retain all their rights.
+
+   This document and the information contained herein are provided on an
+   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
+   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
+   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
+   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
+   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
+   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
+
+Intellectual Property
+
+   The IETF takes no position regarding the validity or scope of any
+   Intellectual Property Rights or other rights that might be claimed to
+   pertain to the implementation or use of the technology described in
+   this document or the extent to which any license under such rights
+   might or might not be available; nor does it represent that it has
+   made any independent effort to identify any such rights.  Information
+   on the IETF's procedures with respect to rights in IETF Documents can
+   be found in BCP 78 and BCP 79.
+
+   Copies of IPR disclosures made to the IETF Secretariat and any
+   assurances of licenses to be made available, or the result of an
+   attempt made to obtain a general license or permission for the use of
+   such proprietary rights by implementers or users of this
+   specification can be obtained from the IETF on-line IPR repository at
+   http://www.ietf.org/ipr.
+
+   The IETF invites any interested party to bring to its attention any
+   copyrights, patents or patent applications, or other proprietary
+   rights that may cover technology that may be required to implement
+   this standard.  Please address the information to the IETF at ietf-
+   ipr@ietf.org.
+
+
+Acknowledgement
+
+   Funding for the RFC Editor function is currently provided by the
+   Internet Society.
+
+
+
+
+
+
+Berners-Lee, et al.         Standards Track                    [Page 61]
+
diff --git a/trunk/txt/rfc3987.txt b/trunk/txt/rfc3987.txt
new file mode 100644
index 00000000..f0b1513b
--- /dev/null
+++ b/trunk/txt/rfc3987.txt
@@ -0,0 +1,2579 @@
+
+
+
+
+
+
+Network Working Group                                          M. Duerst
+Request for Comments: 3987                                           W3C
+Category: Standards Track                                    M. Suignard
+                                                   Microsoft Corporation
+                                                            January 2005
+
+
+             Internationalized Resource Identifiers (IRIs)
+
+Status of This Memo
+
+   This document specifies an Internet standards track protocol for the
+   Internet community, and requests discussion and suggestions for
+   improvements.  Please refer to the current edition of the "Internet
+   Official Protocol Standards" (STD 1) for the standardization state
+   and status of this protocol.  Distribution of this memo is unlimited.
+
+Copyright Notice
+
+   Copyright (C) The Internet Society (2005).
+
+Abstract
+
+   This document defines a new protocol element, the Internationalized
+   Resource Identifier (IRI), as a complement to the Uniform Resource
+   Identifier (URI).  An IRI is a sequence of characters from the
+   Universal Character Set (Unicode/ISO 10646).  A mapping from IRIs to
+   URIs is defined, which means that IRIs can be used instead of URIs,
+   where appropriate, to identify resources.
+
+   The approach of defining a new protocol element was chosen instead of
+   extending or changing the definition of URIs.  This was done in order
+   to allow a clear distinction and to avoid incompatibilities with
+   existing software.  Guidelines are provided for the use and
+   deployment of IRIs in various protocols, formats, and software
+   components that currently deal with URIs.
+
+Table of Contents
+
+   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  3
+       1.1.  Overview and Motivation  . . . . . . . . . . . . . . . .  3
+       1.2.  Applicability  . . . . . . . . . . . . . . . . . . . . .  3
+       1.3.  Definitions  . . . . . . . . . . . . . . . . . . . . . .  4
+       1.4.  Notation . . . . . . . . . . . . . . . . . . . . . . . .  5
+   2.  IRI Syntax . . . . . . . . . . . . . . . . . . . . . . . . . .  6
+       2.1.  Summary of IRI Syntax  . . . . . . . . . . . . . . . . .  6
+       2.2.  ABNF for IRI References and IRIs . . . . . . . . . . . .  7
+
+
+
+
+Duerst & Suignard           Standards Track                     [Page 1]
+
+RFC 3987         Internationalized Resource Identifiers     January 2005
+
+
+   3.  Relationship between IRIs and URIs . . . . . . . . . . . . . . 10
+       3.1.  Mapping of IRIs to URIs  . . . . . . . . . . . . . . . . 10
+       3.2.  Converting URIs to IRIs  . . . . . . . . . . . . . . . . 14
+             3.2.1.  Examples . . . . . . . . . . . . . . . . . . . . 15
+   4.  Bidirectional IRIs for Right-to-Left Languages . . . . . . . . 16
+       4.1.  Logical Storage and Visual Presentation  . . . . . . . . 17
+       4.2.  Bidi IRI Structure . . . . . . . . . . . . . . . . . . . 18
+       4.3.  Input of Bidi IRIs . . . . . . . . . . . . . . . . . . . 19
+       4.4.  Examples . . . . . . . . . . . . . . . . . . . . . . . . 19
+   5.  Normalization and Comparison . . . . . . . . . . . . . . . . . 21
+       5.1.  Equivalence  . . . . . . . . . . . . . . . . . . . . . . 22
+       5.2.  Preparation for Comparison . . . . . . . . . . . . . . . 22
+       5.3.  Comparison Ladder  . . . . . . . . . . . . . . . . . . . 23
+             5.3.1.  Simple String Comparison . . . . . . . . . . . . 23
+             5.3.2.  Syntax-Based Normalization . . . . . . . . . . . 24
+             5.3.3.  Scheme-Based Normalization . . . . . . . . . . . 27
+             5.3.4.  Protocol-Based Normalization . . . . . . . . . . 28
+   6.  Use of IRIs  . . . . . . . . . . . . . . . . . . . . . . . . . 29
+       6.1.  Limitations on UCS Characters Allowed in IRIs  . . . . . 29
+       6.2.  Software Interfaces and Protocols  . . . . . . . . . . . 29
+       6.3.  Format of URIs and IRIs in Documents and Protocols . . . 30
+       6.4.  Use of UTF-8 for Encoding Original Characters  . . . . . 30
+       6.5.  Relative IRI References  . . . . . . . . . . . . . . . . 32
+   7.  URI/IRI Processing Guidelines (informative)  . . . . . . . . . 32
+       7.1.  URI/IRI Software Interfaces  . . . . . . . . . . . . . . 32
+       7.2.  URI/IRI Entry  . . . . . . . . . . . . . . . . . . . . . 33
+       7.3.  URI/IRI Transfer between Applications  . . . . . . . . . 33
+       7.4.  URI/IRI Generation . . . . . . . . . . . . . . . . . . . 34
+       7.5.  URI/IRI Selection  . . . . . . . . . . . . . . . . . . . 34
+       7.6.  Display of URIs/IRIs . . . . . . . . . . . . . . . . . . 35
+       7.7.  Interpretation of URIs and IRIs  . . . . . . . . . . . . 36
+       7.8.  Upgrading Strategy . . . . . . . . . . . . . . . . . . . 36
+   8.  Security Considerations  . . . . . . . . . . . . . . . . . . . 37
+   9.  Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 39
+   10. References . . . . . . . . . . . . . . . . . . . . . . . . . . 40
+       10.1. Normative References . . . . . . . . . . . . . . . . . . 40
+       10.2. Informative References . . . . . . . . . . . . . . . . . 41
+   A.  Design Alternatives  . . . . . . . . . . . . . . . . . . . . . 44
+       A.1.  New Scheme(s)  . . . . . . . . . . . . . . . . . . . . . 44
+       A.2.  Character Encodings Other Than UTF-8 . . . . . . . . . . 44
+       A.3.  New Encoding Convention  . . . . . . . . . . . . . . . . 44
+       A.4.  Indicating Character Encodings in the URI/IRI  . . . . . 45
+   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 45
+   Full Copyright Statement . . . . . . . . . . . . . . . . . . . . . 46
+
+
+
+
+
+
+
+Duerst & Suignard           Standards Track                     [Page 2]
+
+RFC 3987         Internationalized Resource Identifiers     January 2005
+
+
+1.  Introduction
+
+1.1.  Overview and Motivation
+
+   A Uniform Resource Identifier (URI) is defined in [RFC3986] as a
+   sequence of characters chosen from a limited subset of the repertoire
+   of US-ASCII [ASCII] characters.
+
+   The characters in URIs are frequently used for representing words of
+   natural languages.  This usage has many advantages: Such URIs are
+   easier to memorize, easier to interpret, easier to transcribe, easier
+   to create, and easier to guess.  For most languages other than
+   English, however, the natural script uses characters other than
+   A - Z.  For many people, handling Latin characters is as difficult as
+   handling the characters of other scripts is for those who use only
+   the Latin alphabet.  Many languages with non-Latin scripts are
+   transcribed with Latin letters.  These transcriptions are now often
+   used in URIs, but they introduce additional ambiguities.
+
+   The infrastructure for the appropriate handling of characters from
+   local scripts is now widely deployed in local versions of operating
+   system and application software.  Software that can handle a wide
+   variety of scripts and languages at the same time is increasingly
+   common.  Also, increasing numbers of protocols and formats can carry
+   a wide range of characters.
+
+   This document defines a new protocol element called Internationalized
+   Resource Identifier (IRI) by extending the syntax of URIs to a much
+   wider repertoire of characters.  It also defines "internationalized"
+   versions corresponding to other constructs from [RFC3986], such as
+   URI references.  The syntax of IRIs is defined in section 2, and the
+   relationship between IRIs and URIs in section 3.
+
+   Using characters outside of A - Z in IRIs brings some difficulties.
+   Section 4 discusses the special case of bidirectional IRIs, section 5
+   various forms of equivalence between IRIs, and section 6 the use of
+   IRIs in different situations.  Section 7 gives additional informative
+   guidelines, and section 8 security considerations.
+
+1.2.  Applicability
+
+   IRIs are designed to be compatible with recommendations for new URI
+   schemes [RFC2718].  The compatibility is provided by specifying a
+   well-defined and deterministic mapping from the IRI character
+   sequence to the functionally equivalent URI character sequence.
+   Practical use of IRIs (or IRI references) in place of URIs (or URI
+   references) depends on the following conditions being met:
+
+   a.  A protocol or format element should be explicitly designated to
+       be able to carry IRIs.  The intent is not to introduce IRIs into
+       contexts that are not defined to accept them.  For example, XML
+       schema [XMLSchema] has an explicit type "anyURI" that includes
+       IRIs and IRI references. Therefore, IRIs and IRI references can
+       be in attributes and elements of type "anyURI".  On the other
+       hand, in the HTTP protocol [RFC2616], the Request URI is defined
+       as a URI, which means that direct use of IRIs is not allowed in
+       HTTP requests.
+
+   b.  The protocol or format carrying the IRIs should have a mechanism
+       to represent the wide range of characters used in IRIs, either
+       natively or by some protocol- or format-specific escaping
+       mechanism (for example, numeric character references in [XML1]).
+
+   c.  The URI corresponding to the IRI in question has to encode
+       original characters into octets using UTF-8.  For new URI
+       schemes, this is recommended in [RFC2718].  It can apply to a
+       whole scheme (e.g., IMAP URLs [RFC2192] and POP URLs [RFC2384],
+       or the URN syntax [RFC2141]).  It can apply to a specific part of
+       a URI, such as the fragment identifier (e.g., [XPointer]).  It
+       can apply to a specific URI or part(s) thereof.  For details,
+       please see section 6.4.
+
+1.3.  Definitions
+
+   The following definitions are used in this document; they follow the
+   terms in [RFC2130], [RFC2277], and [ISO10646].
+
+   character: A member of a set of elements used for the organization,
+      control, or representation of data.  For example, "LATIN CAPITAL
+      LETTER A" names a character.
+
+   octet: An ordered sequence of eight bits considered as a unit.
+
+   character repertoire: A set of characters (in the mathematical
+      sense).
+
+   sequence of characters: A sequence of characters (one after another).
+
+   sequence of octets: A sequence of octets (one after another).
+
+   character encoding: A method of representing a sequence of characters
+      as a sequence of octets (maybe with variants).  Also, a method of
+      (unambiguously) converting a sequence of octets into a sequence of
+      characters.
+
+   charset: The name of a parameter or attribute used to identify a
+      character encoding.
+
+   UCS: Universal Character Set. The coded character set defined by
+      ISO/IEC 10646 [ISO10646] and the Unicode Standard [UNIV4].
+
+   IRI reference: Denotes the common usage of an Internationalized
+      Resource Identifier.  An IRI reference may be absolute or
+      relative.  However, the "IRI" that results from such a reference
+      only includes absolute IRIs; any relative IRI references are
+      resolved to their absolute form.  Note that in [RFC2396] URIs did
+      not include fragment identifiers, but in [RFC3986] fragment
+      identifiers are part of URIs.
+
+   running text: Human text (paragraphs, sentences, phrases) with syntax
+      according to orthographic conventions of a natural language, as
+      opposed to syntax defined for ease of processing by machines
+      (e.g., markup, programming languages).
+
+   protocol element: Any portion of a message that affects processing of
+      that message by the protocol in question.
+
+   presentation element: A presentation form corresponding to a protocol
+      element; for example, using a wider range of characters.
+
+   create (a URI or IRI): With respect to URIs and IRIs, the term is
+      used for the initial creation.  This may be the initial creation
+      of a resource with a certain identifier, or the initial exposition
+      of a resource under a particular identifier.
+
+   generate (a URI or IRI): With respect to URIs and IRIs, the term is
+      used when the IRI is generated by derivation from other
+      information.
+
+1.4.  Notation
+
+   RFCs and Internet Drafts currently do not allow any characters
+   outside the US-ASCII repertoire.  Therefore, this document uses
+   various special notations to denote such characters in examples.
+
+   In text, characters outside US-ASCII are sometimes referenced by
+   using a prefix of 'U+', followed by four to six hexadecimal digits.
+
+   To represent characters outside US-ASCII in examples, this document
+   uses two notations: 'XML Notation' and 'Bidi Notation'.
+
+   XML Notation uses a leading '&#x', a trailing ';', and the
+   hexadecimal number of the character in the UCS in between.  For
+   example, &#x44F; stands for CYRILLIC SMALL LETTER YA.  In this
+   notation, an actual '&' is denoted by '&amp;'.
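+
+   As an illustrative sketch only (not part of the specification), XML
+   Notation can be expanded into actual characters in Python roughly as
+   follows.  The function name from_xml_notation is an assumption made
+   for this example.
+
```python
import re

def from_xml_notation(text):
    """Expand '&#xHHHH;' references and '&amp;' into actual characters."""
    def expand(match):
        if match.group(0) == "&amp;":
            return "&"                       # an actual ampersand
        return chr(int(match.group(1), 16))  # hexadecimal UCS number
    return re.sub(r"&amp;|&#x([0-9A-Fa-f]+);", expand, text)
```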
+
+   Bidi Notation is used for bidirectional examples: Lowercase letters
+   stand for Latin letters or other letters that are written left to
+   right, whereas uppercase letters represent Arabic or Hebrew letters
+   that are written right to left.
+
+   To denote actual octets in examples (as opposed to percent-encoded
+   octets), the two hex digits denoting the octet are enclosed in "<"
+   and ">".  For example, the octet often denoted as 0xc9 is denoted
+   here as <c9>.
+
+   In this document, the key words "MUST", "MUST NOT", "REQUIRED",
+   "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED",  "MAY",
+   and "OPTIONAL" are to be interpreted as described in [RFC2119].
+
+2.  IRI Syntax
+
+   This section defines the syntax of Internationalized Resource
+   Identifiers (IRIs).
+
+   As with URIs, an IRI is defined as a sequence of characters, not as a
+   sequence of octets.  This definition accommodates the fact that IRIs
+   may be written on paper or read over the radio as well as stored or
+   transmitted digitally.  The same IRI may be represented as different
+   sequences of octets in different protocols or documents if these
+   protocols or documents use different character encodings (and/or
+   transfer encodings).  Using the same character encoding as the
+   containing protocol or document ensures that the characters in the
+   IRI can be handled (e.g., searched, converted, displayed) in the same
+   way as the rest of the protocol or document.
+
+2.1.  Summary of IRI Syntax
+
+   IRIs are defined similarly to URIs in [RFC3986], but the class of
+   unreserved characters is extended by adding the characters of the UCS
+   (Universal Character Set, [ISO10646]) beyond U+007F, subject to the
+   limitations given in the syntax rules below and in section 6.1.
+
+   Otherwise, the syntax and use of components and reserved characters
+   is the same as that in [RFC3986].  All the operations defined in
+   [RFC3986], such as the resolution of relative references, can be
+   applied to IRIs by IRI-processing software in exactly the same way as
+   they are for URIs by URI-processing software.
+
+   Characters outside the US-ASCII repertoire are not reserved and
+   therefore MUST NOT be used for syntactical purposes, such as to
+   delimit components in newly defined schemes.  For example, U+00A2,
+   CENT SIGN, is not allowed as a delimiter in IRIs, because it is in
+   the 'iunreserved' category. This is similar to the fact that it is
+   not possible to use '-' as a delimiter in URIs, because it is in the
+   'unreserved' category.
+
+2.2.  ABNF for IRI References and IRIs
+
+   Although it might be possible to define IRI references and IRIs
+   merely by their transformation to URI references and URIs, they can
+   also be accepted and processed directly.  Therefore, an ABNF
+   definition for IRI references (which are the most general concept and
+   the start of the grammar) and IRIs is given here.  The syntax of this
+   ABNF is described in [RFC2234].  Character numbers are taken from the
+   UCS, without implying any actual binary encoding.  Terminals in the
+   ABNF are characters, not bytes.
+
+   The following grammar closely follows the URI grammar in [RFC3986],
+   except that the range of unreserved characters is expanded to include
+   UCS characters, with the restriction that private UCS characters can
+   occur only in query parts.  The grammar is split into two parts:
+   Rules that differ from [RFC3986] because of the above-mentioned
+   expansion, and rules that are the same as those in [RFC3986].  For
+   rules that are different than those in [RFC3986], the names of the
+   non-terminals have been changed as follows.  If the non-terminal
+   contains 'URI', this has been changed to 'IRI'.  Otherwise, an 'i'
+   has been prefixed.
+
+   The following rules are different from those in [RFC3986]:
+
+   IRI            = scheme ":" ihier-part [ "?" iquery ]
+                         [ "#" ifragment ]
+
+   ihier-part     = "//" iauthority ipath-abempty
+                  / ipath-absolute
+                  / ipath-rootless
+                  / ipath-empty
+
+   IRI-reference  = IRI / irelative-ref
+
+   absolute-IRI   = scheme ":" ihier-part [ "?" iquery ]
+
+   irelative-ref  = irelative-part [ "?" iquery ] [ "#" ifragment ]
+
+   irelative-part = "//" iauthority ipath-abempty
+                  / ipath-absolute
+                  / ipath-noscheme
+                  / ipath-empty
+
+   iauthority     = [ iuserinfo "@" ] ihost [ ":" port ]
+   iuserinfo      = *( iunreserved / pct-encoded / sub-delims / ":" )
+   ihost          = IP-literal / IPv4address / ireg-name
+
+   ireg-name      = *( iunreserved / pct-encoded / sub-delims )
+
+   ipath          = ipath-abempty   ; begins with "/" or is empty
+                  / ipath-absolute  ; begins with "/" but not "//"
+                  / ipath-noscheme  ; begins with a non-colon segment
+                  / ipath-rootless  ; begins with a segment
+                  / ipath-empty     ; zero characters
+
+   ipath-abempty  = *( "/" isegment )
+   ipath-absolute = "/" [ isegment-nz *( "/" isegment ) ]
+   ipath-noscheme = isegment-nz-nc *( "/" isegment )
+   ipath-rootless = isegment-nz *( "/" isegment )
+   ipath-empty    = 0<ipchar>
+
+   isegment       = *ipchar
+   isegment-nz    = 1*ipchar
+   isegment-nz-nc = 1*( iunreserved / pct-encoded / sub-delims
+                        / "@" )
+                  ; non-zero-length segment without any colon ":"
+
+   ipchar         = iunreserved / pct-encoded / sub-delims / ":"
+                  / "@"
+
+   iquery         = *( ipchar / iprivate / "/" / "?" )
+
+   ifragment      = *( ipchar / "/" / "?" )
+
+   iunreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar
+
+   ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
+                  / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
+                  / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
+                  / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
+                  / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
+                  / %xD0000-DFFFD / %xE1000-EFFFD
+
+   iprivate       = %xE000-F8FF / %xF0000-FFFFD / %x100000-10FFFD
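+
+   The 'ucschar' and 'iprivate' ranges above can be tested with a small
+   table lookup.  The following Python sketch is illustrative only; the
+   names is_ucschar and is_iprivate are assumptions of this example and
+   do not appear in the grammar.
+
```python
# Ranges copied from the 'ucschar' rule.
UCSCHAR = [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
# Planes 1 through 13: %x10000-1FFFD up to %xD0000-DFFFD.
UCSCHAR += [(p << 16, (p << 16) | 0xFFFD) for p in range(1, 14)]
UCSCHAR.append((0xE1000, 0xEFFFD))

# Ranges copied from the 'iprivate' rule (allowed only in 'iquery').
IPRIVATE = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]

def _in_ranges(code_point, ranges):
    return any(low <= code_point <= high for low, high in ranges)

def is_ucschar(ch):
    """True if the single character ch matches 'ucschar'."""
    return _in_ranges(ord(ch), UCSCHAR)

def is_iprivate(ch):
    """True if the single character ch matches 'iprivate'."""
    return _in_ranges(ord(ch), IPRIVATE)
```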
+
+   Some productions are ambiguous.  The "first-match-wins" (a.k.a.
+   "greedy") algorithm applies.  For details, see [RFC3986].
+
+   The following rules are the same as those in [RFC3986]:
+
+   scheme         = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
+
+   port           = *DIGIT
+
+   IP-literal     = "[" ( IPv6address / IPvFuture  ) "]"
+
+   IPvFuture      = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )
+
+   IPv6address    =                            6( h16 ":" ) ls32
+                  /                       "::" 5( h16 ":" ) ls32
+                  / [               h16 ] "::" 4( h16 ":" ) ls32
+                  / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
+                  / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
+                  / [ *3( h16 ":" ) h16 ] "::"    h16 ":"   ls32
+                  / [ *4( h16 ":" ) h16 ] "::"              ls32
+                  / [ *5( h16 ":" ) h16 ] "::"              h16
+                  / [ *6( h16 ":" ) h16 ] "::"
+
+   h16            = 1*4HEXDIG
+   ls32           = ( h16 ":" h16 ) / IPv4address
+
+   IPv4address    = dec-octet "." dec-octet "." dec-octet "." dec-octet
+
+   dec-octet      = DIGIT                 ; 0-9
+                  / %x31-39 DIGIT         ; 10-99
+                  / "1" 2DIGIT            ; 100-199
+                  / "2" %x30-34 DIGIT     ; 200-249
+                  / "25" %x30-35          ; 250-255
+
+   pct-encoded    = "%" HEXDIG HEXDIG
+
+   unreserved     = ALPHA / DIGIT / "-" / "." / "_" / "~"
+   reserved       = gen-delims / sub-delims
+   gen-delims     = ":" / "/" / "?" / "#" / "[" / "]" / "@"
+   sub-delims     = "!" / "$" / "&" / "'" / "(" / ")"
+                  / "*" / "+" / "," / ";" / "="
+
+   This syntax does not support IPv6 scoped addressing zone identifiers.
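+
+   As a hedged illustration of the 'dec-octet' and 'IPv4address' rules,
+   the following Python sketch expresses them as a regular expression.
+   The names DEC_OCTET, IPV4ADDRESS, and is_ipv4address are assumptions
+   of this example.  Note that the sketch matches the whole string and
+   relies on regular-expression backtracking rather than on the
+   "first-match-wins" algorithm.
+
```python
import re

# dec-octet: 250-255 / 200-249 / 100-199 / 10-99 / 0-9
DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"

# IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
IPV4ADDRESS = re.compile(r"{0}\.{0}\.{0}\.{0}".format(DEC_OCTET))

def is_ipv4address(text):
    """True if text as a whole matches the 'IPv4address' rule."""
    return IPV4ADDRESS.fullmatch(text) is not None
```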
+
+3.  Relationship between IRIs and URIs
+
+   IRIs are meant to replace URIs in identifying resources for
+   protocols, formats, and software components that use a UCS-based
+   character repertoire.  These protocols and components may never need
+   to use URIs directly, especially when the resource identifier is used
+   simply for identification purposes.  However, when the resource
+   identifier is used for resource retrieval, it is in many cases
+   necessary to determine the associated URI, because currently most
+   retrieval mechanisms are only defined for URIs.  In this case, IRIs
+   can serve as presentation elements for URI protocol elements.  An
+   example would be an address bar in a Web user agent.  (Additional
+   rationale is given in section 3.1.)
+
+3.1.  Mapping of IRIs to URIs
+
+   This section defines how to map an IRI to a URI.  Everything in this
+   section also applies to IRI references and URI references, as well as
+   to components thereof (for example, fragment identifiers).
+
+   This mapping has two purposes:
+
+   Syntactical. Many URI schemes and components define additional
+      syntactical restrictions not captured in section 2.2.
+      Scheme-specific restrictions are applied to IRIs by converting
+      IRIs to URIs and checking the URIs against the scheme-specific
+      restrictions.
+
+   Interpretational. URIs identify resources in various ways.  IRIs also
+      identify resources.  When the IRI is used solely for
+      identification purposes, it is not necessary to map the IRI to a
+      URI (see section 5).  However, when an IRI is used for resource
+      retrieval, the resource that the IRI locates is the same as the
+      one located by the URI obtained after converting the IRI according
+      to the procedure defined here.  This means that there is no need
+      to define resolution separately on the IRI level.
+
+   Applications MUST map IRIs to URIs by using the following two steps.
+
+   Step 1.  Generate a UCS character sequence from the original IRI
+            format.  This step has the following three variants,
+            depending on the form of the input:
+
+            a. If the IRI is written on paper, read aloud, or otherwise
+               represented as a sequence of characters independent of
+               any character encoding, represent the IRI as a sequence
+               of characters from the UCS normalized according to
+               Normalization Form C (NFC, [UTR15]).
+
+            b. If the IRI is in some digital representation (e.g., an
+               octet stream) in some known non-Unicode character
+               encoding, convert the IRI to a sequence of characters
+               from the UCS normalized according to NFC.
+
+            c. If the IRI is in a Unicode-based character encoding (for
+               example, UTF-8 or UTF-16), do not normalize (see section
+               5.3.2.2 for details).  Apply step 2 directly to the
+               encoded Unicode character sequence.
+
+   Step 2.  For each character in 'ucschar' or 'iprivate', apply steps
+            2.1 through 2.3 below.
+
+       2.1.  Convert the character to a sequence of one or more octets
+             using UTF-8 [RFC3629].
+
+       2.2.  Convert each octet to %HH, where HH is the hexadecimal
+             notation of the octet value.  Note that this is identical
+             to the percent-encoding mechanism in section 2.1 of
+             [RFC3986].  To reduce variability, the hexadecimal notation
+             SHOULD use uppercase letters.
+
+       2.3.  Replace the original character with the resulting character
+             sequence (i.e., a sequence of %HH triplets).
+
+   The above mapping from IRIs to URIs produces URIs fully conforming to
+   [RFC3986].  The mapping is also an identity transformation for URIs
+   and is idempotent; applying the mapping a second time will not
+   change anything.  Every URI is by definition an IRI.
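+
+   Step 2 can be sketched in Python as shown here.  This is a minimal
+   illustration, not a complete implementation: it assumes that step 1
+   has already produced a Python string of UCS characters, it does not
+   perform the optional ToASCII handling of ireg-name, and the name
+   iri_to_uri is an assumption of this example.
+
```python
# 'ucschar' and 'iprivate' ranges from section 2.2.
_RANGES = [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
_RANGES += [(p << 16, (p << 16) | 0xFFFD) for p in range(1, 14)]
_RANGES += [(0xE1000, 0xEFFFD)]
_RANGES += [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]

def iri_to_uri(iri):
    """Percent-encode every 'ucschar' or 'iprivate' character."""
    out = []
    for ch in iri:
        if any(low <= ord(ch) <= high for low, high in _RANGES):
            # Step 2.1: UTF-8 octets; step 2.2: uppercase %HH triplets.
            out.append("".join("%%%02X" % b for b in ch.encode("utf-8")))
        else:
            # All other characters, including existing percent-encoded
            # sequences, are left untouched.
            out.append(ch)
    return "".join(out)
```
+
+   Because characters already allowed in URIs pass through unchanged,
+   applying the function to its own output returns the same string.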
+
+   Systems accepting IRIs MAY convert the ireg-name component of an IRI
+   as follows (before step 2 above) for schemes known to use domain
+   names in ireg-name, if the scheme definition does not allow
+   percent-encoding for ireg-name:
+
+   Replace the ireg-name part of the IRI by the part converted using the
+   ToASCII operation specified in section 4.1 of [RFC3490] on each
+   dot-separated label, and by using U+002E (FULL STOP) as a label
+   separator, with the flag UseSTD3ASCIIRules set to TRUE, and with the
+   flag AllowUnassigned set to FALSE for creating IRIs and set to TRUE
+   otherwise.
+
+   The ToASCII operation may fail, but this would mean that the IRI
+   cannot be resolved.  This conversion SHOULD be used when the goal is
+   to maximize interoperability with legacy URI resolvers.  For example,
+   the IRI
+
+   "http://r&#xE9;sum&#xE9;.example.org"
+
+   may be converted to
+
+   "http://xn--rsum-bpad.example.org"
+
+   instead of
+
+   "http://r%C3%A9sum%C3%A9.example.org".
+
+   An IRI with a scheme that is known to use domain names in ireg-name,
+   but where the scheme definition does not allow percent-encoding for
+   ireg-name, meets scheme-specific restrictions if either the
+   straightforward conversion or the conversion using the ToASCII
+   operation on ireg-name results in a URI that meets the scheme-
+   specific restrictions.
+
+   Such an IRI resolves to the URI obtained by converting the IRI with
+   the ToASCII operation applied to ireg-name.  Implementations do not
+   have to do this conversion as long as they produce the same result.
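+
+   As an illustration, Python's standard "idna" codec implements the
+   ToASCII operation of [RFC3490] and applies it to each dot-separated
+   label.  The sketch below is an assumption-laden example: the codec
+   takes no parameters for the UseSTD3ASCIIRules and AllowUnassigned
+   flags, so the sketch does not control them, and the function name
+   ireg_name_to_ascii is hypothetical.
+
```python
def ireg_name_to_ascii(ireg_name):
    """Convert each label of a domain name with ToASCII (RFC 3490)."""
    # The "idna" codec raises UnicodeError if ToASCII fails, in which
    # case the IRI cannot be resolved.
    return ireg_name.encode("idna").decode("ascii")

uri = "http://" + ireg_name_to_ascii("r\u00e9sum\u00e9.example.org")
```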
+
+   Note: The difference between variants b and c in step 1 (using
+      normalization with NFC, versus not using any normalization)
+      accounts for the fact that in many non-Unicode character
+      encodings, some text cannot be represented directly. For example,
+      the word "Vietnam" is natively written "Vi&#x1EC7;t Nam"
+      (containing a LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW)
+      in NFC, but a direct transcoding from the windows-1258 character
+      encoding leads to "Vi&#xEA;&#x323;t Nam" (containing a LATIN SMALL
+      LETTER E WITH CIRCUMFLEX followed by a COMBINING DOT BELOW).
+      Direct transcoding of other 8-bit encodings of Vietnamese may lead
+      to other representations.
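+
+   The effect of NFC on the example above can be reproduced with the
+   Python standard library.  This is only an illustration; the
+   variable names are assumptions of this example.
+
```python
import unicodedata

# Result of a direct transcoding: E WITH CIRCUMFLEX + COMBINING DOT BELOW.
transcoded = "Vi\u00ea\u0323t Nam"

# Variant b of step 1: normalize according to NFC.
normalized = unicodedata.normalize("NFC", transcoded)
```
+
+   After normalization the two characters are composed into the single
+   character U+1EC7.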
+
+   Note: The uniform treatment of the whole IRI in step 2 is important
+      to make processing independent of URI scheme.  See [Gettys] for an
+      in-depth discussion.
+
+   Note: In practice, whether the general mapping (steps 1 and 2) or the
+      ToASCII operation of [RFC3490] is used for ireg-name will not be
+      noticed if mapping from IRI to URI and resolution are tightly
+      integrated (e.g., carried out in the same user agent).  But
+      conversion using [RFC3490] may be able to better deal with
+      backwards compatibility issues in case mapping and resolution are
+      separated, as in the case of using an HTTP proxy.
+
+   Note: Internationalized Domain Names may be contained in parts of an
+      IRI other than the ireg-name part.  It is the responsibility of
+      scheme-specific implementations (if the Internationalized Domain
+      Name is part of the scheme syntax) or of server-side
+      implementations (if the Internationalized Domain Name is part of
+      'iquery') to apply the necessary conversions at the appropriate
+      point.  Example: Trying to validate the Web page at
+      http://r&#xE9;sum&#xE9;.example.org would lead to an IRI of
+      http://validator.w3.org/check?uri=http%3A%2F%2Fr&#xE9;sum&#xE9;.
+      example.org, which would convert to a URI of
+      http://validator.w3.org/check?uri=http%3A%2F%2Fr%C3%A9sum%C3%A9.
+      example.org.  The server side implementation would be responsible
+      for making the necessary conversions to be able to retrieve the
+      Web page.
+
+   Systems accepting IRIs MAY also deal with the printable characters in
+   US-ASCII that are not allowed in URIs, namely "<", ">", '"', space,
+   "{", "}", "|", "\", "^", and "`", in step 2 above.  If these
+   characters are found but are not converted, then the conversion
+   SHOULD fail.  Please note that the number sign ("#"), the percent
+   sign ("%"), and the square bracket characters ("[", "]") are not part
+   of the above list and MUST NOT be converted.  Protocols and formats
+   that have used earlier definitions of IRIs including these characters
+   MAY require percent-encoding of these characters as a preprocessing
+   step to extract the actual IRI from a given field.  This
+   preprocessing MAY also be used by applications allowing the user to
+   enter an IRI.
+
+   Note: In this process (in step 2.3), characters allowed in URI
+      references and existing percent-encoded sequences are not encoded
+      further.  (This mapping is similar to, but different from, the
+      encoding applied when arbitrary content is included in some part
+      of a URI.)  For example, an IRI of
+      "http://www.example.org/red%09ros&#xE9;#red" (in XML notation) is
+      converted to
+      "http://www.example.org/red%09ros%C3%A9#red", not to something
+      like
+      "http%3A%2F%2Fwww.example.org%2Fred%2509ros%C3%A9%23red".
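+
+   The "something like" form above is what results when the whole IRI
+   is treated as arbitrary content.  As an illustration only, Python's
+   urllib.parse.quote with an empty 'safe' set performs that kind of
+   encoding, which is therefore NOT the IRI-to-URI mapping:
+
```python
from urllib.parse import quote

iri = "http://www.example.org/red%09ros\u00e9#red"

# Encodes reserved characters and "%" as well, unlike step 2.
fully_encoded = quote(iri, safe="")
```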
+
+   Note: Some older software transcoding to UTF-8 may produce illegal
+      output for some input, in particular for characters outside the
+      BMP (Basic Multilingual Plane).  As an example, for the IRI with
+      non-BMP characters (in XML Notation):
+      "http://example.com/&#x10300;&#x10301;&#x10302;"
+      which contains the first three letters of the Old Italic alphabet,
+      the correct conversion to a URI is
+      "http://example.com/%F0%90%8C%80%F0%90%8C%81%F0%90%8C%82"
+
+3.2.  Converting URIs to IRIs
+
+   In some situations, converting a URI into an equivalent IRI may be
+   desirable.  This section gives a procedure for this conversion.  The
+   conversion described in this section will always result in an IRI
+   that maps back to the URI used as an input for the conversion (except
+   for potential case differences in percent-encoding and for potential
+   percent-encoded unreserved characters).  However, the IRI resulting
+   from this conversion may not be exactly the same as the original IRI
+   (if there ever was one).
+
+   URI-to-IRI conversion removes percent-encodings, but not all
+   percent-encodings can be eliminated.  There are several reasons for
+   this:
+
+   1.  Some percent-encodings are necessary to distinguish percent-
+       encoded and unencoded uses of reserved characters.
+
+   2.  Some percent-encodings cannot be interpreted as sequences of
+       UTF-8 octets.
+
+       (Note: The octet patterns of UTF-8 are highly regular.
+       Therefore, there is a very high probability, but no guarantee,
+       that percent-encodings that can be interpreted as sequences of
+       UTF-8 octets actually originated from UTF-8.  For a detailed
+       discussion, see [Duerst97].)
+
+   3.  The conversion may result in a character that is not appropriate
+       in an IRI.  See sections 2.2, 4.1, and 6.1 for further details.
+
+   Conversion from a URI to an IRI is done by using the following steps
+   (or any other algorithm that produces the same result):
+
+   1.  Represent the URI as a sequence of octets in US-ASCII.
+
+   2.  Convert all percent-encodings ("%" followed by two hexadecimal
+       digits) to the corresponding octets, except those corresponding
+       to "%", characters in "reserved", and characters in US-ASCII not
+       allowed in URIs.
+
+   3.  Re-percent-encode any octet produced in step 2 that is not part
+       of a strictly legal UTF-8 octet sequence.
+
+   4.  Re-percent-encode all octets produced in step 3 that in UTF-8
+       represent characters that are not appropriate according to
+       sections 2.2, 4.1, and 6.1.
+
+   5.  Interpret the resulting octet sequence as a sequence of
+       characters encoded in UTF-8.
+
+   This procedure will convert as many percent-encoded characters as
+   possible to characters in an IRI.  Because there are some choices
+   when step 4 is applied (see section 6.1), results may vary.
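+
+   The five steps can be sketched in Python as follows.  This is a
+   simplified illustration under stated assumptions: in step 4 it only
+   rejects characters outside 'ucschar' (so 'iprivate' characters stay
+   percent-encoded) and the bidirectional formatting characters of
+   section 4.1; the choices of section 6.1 are not modelled.  The name
+   uri_to_iri and the helper names are assumptions of this example.
+
```python
import re

UCSCHAR = [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
UCSCHAR += [(p << 16, (p << 16) | 0xFFFD) for p in range(1, 14)]
UCSCHAR.append((0xE1000, 0xEFFFD))
# LRM, RLM, LRE, RLE, PDF, LRO, RLO (forbidden by section 4.1).
BIDI_FORMATTING = {0x200E, 0x200F, 0x202A, 0x202B, 0x202C, 0x202D, 0x202E}
UNRESERVED = set("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
                 "0123456789-._~")

def _utf8_char_at(raw, i):
    """Return (char, length) for a strictly legal UTF-8 sequence that
    starts at raw[i], or (None, 1) if there is none."""
    for n in (1, 2, 3, 4):
        try:
            return raw[i:i + n].decode("utf-8"), n
        except UnicodeDecodeError:
            continue
    return None, 1

def _allowed(ch):
    cp = ord(ch)
    return (cp not in BIDI_FORMATTING
            and any(low <= cp <= high for low, high in UCSCHAR))

def _convert_run(run):
    """Convert one run of consecutive %HH triplets."""
    raw = bytes.fromhex(run.replace("%", ""))      # step 2 (tentative)
    out, i = [], 0
    while i < len(raw):
        ch, n = _utf8_char_at(raw, i)
        if ch is None:
            out.append("%%%02X" % raw[i])          # step 3
        elif n == 1:
            # US-ASCII: only unreserved characters are decoded; "%",
            # reserved and disallowed characters keep their encoding.
            out.append(ch if ch in UNRESERVED else run[3 * i:3 * i + 3])
        elif _allowed(ch):
            out.append(ch)                         # step 5
        else:
            out.append("".join("%%%02X" % b for b in raw[i:i + n]))  # step 4
        i += n
    return "".join(out)

def uri_to_iri(uri):
    uri.encode("ascii")   # step 1: raises UnicodeEncodeError otherwise
    return re.sub(r"(?:%[0-9A-Fa-f]{2})+",
                  lambda m: _convert_run(m.group(0)), uri)
```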
+
+   Conversions from URIs to IRIs MUST NOT use any character encoding
+   other than UTF-8 in steps 3 and 4, even if it might be possible to
+   guess from the context that another character encoding than UTF-8 was
+   used in the URI.  For example, the URI
+   "http://www.example.org/r%E9sum%E9.html" might with some guessing be
+   interpreted to contain two e-acute characters encoded as iso-8859-1.
+   It must not be converted to an IRI containing these e-acute
+   characters.  Otherwise, in the future the IRI will be mapped to
+   "http://www.example.org/r%C3%A9sum%C3%A9.html", which is a different
+   URI from "http://www.example.org/r%E9sum%E9.html".
+
+3.2.1.  Examples
+
+   This section shows various examples of converting URIs to IRIs.  Each
+   example shows the result after each of the steps 1 through 5 is
+   applied.  XML Notation is used for the final result.  Octets are
+   denoted by "<" followed by two hexadecimal digits followed by ">".
+
+   The following example contains the sequence "%C3%BC", which is a
+   strictly legal UTF-8 sequence, and which is converted into the actual
+   character U+00FC, LATIN SMALL LETTER U WITH DIAERESIS (also known as
+   u-umlaut).
+
+   1.  http://www.example.org/D%C3%BCrst
+
+   2.  http://www.example.org/D<c3><bc>rst
+
+   3.  http://www.example.org/D<c3><bc>rst
+
+   4.  http://www.example.org/D<c3><bc>rst
+
+   5.  http://www.example.org/D&#xFC;rst
+
+   The following example contains the sequence "%FC", which might
+   represent U+00FC, LATIN SMALL LETTER U WITH DIAERESIS, in the
+   iso-8859-1 character encoding.  (It might represent other characters
+   in other character encodings.  For example, the octet <fc> in
+   iso-8859-5 represents U+045C, CYRILLIC SMALL LETTER KJE.)  Because
+   <fc> is not part of a strictly legal UTF-8 sequence, it is
+   re-percent-encoded in step 3.
+
+   1.  http://www.example.org/D%FCrst
+
+   2.  http://www.example.org/D<fc>rst
+
+   3.  http://www.example.org/D%FCrst
+
+   4.  http://www.example.org/D%FCrst
+
+   5.  http://www.example.org/D%FCrst
+
+   The following example contains "%e2%80%ae", which is the percent-
+   encoded UTF-8 character encoding of U+202E, RIGHT-TO-LEFT OVERRIDE.
+   Section 4.1 forbids the direct use of this character in an IRI.
+   Therefore, the corresponding octets are re-percent-encoded in step 4.
+   This example shows that the case (upper- or lowercase) of letters
+   used in percent-encodings may not be preserved.  The example also
+   contains a punycode-encoded domain name label (xn--99zt52a), which is
+   not converted.
+
+   1.  http://xn--99zt52a.example.org/%e2%80%ae
+
+   2.  http://xn--99zt52a.example.org/<e2><80><ae>
+
+   3.  http://xn--99zt52a.example.org/<e2><80><ae>
+
+   4.  http://xn--99zt52a.example.org/%E2%80%AE
+
+   5.  http://xn--99zt52a.example.org/%E2%80%AE
+
+   Implementations with scheme-specific knowledge MAY convert
+   punycode-encoded domain name labels to the corresponding characters
+   by using the ToUnicode procedure.  Thus, for the example above, the
+   label "xn--99zt52a" may be converted to U+7D0D U+8C46 (Japanese
+   Natto), leading to the overall IRI of
+   "http://&#x7D0D;&#x8C46;.example.org/%E2%80%AE".
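+
+   As an illustration, the decoding direction of Python's standard
+   "idna" codec can be used to sketch the ToUnicode conversion of a
+   single label.  The function name label_to_unicode is an assumption
+   of this example.
+
```python
def label_to_unicode(label):
    """Convert a punycode-encoded domain name label to characters."""
    return label.encode("ascii").decode("idna")

natto = label_to_unicode("xn--99zt52a")
```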
+
+4.  Bidirectional IRIs for Right-to-Left Languages
+
+   Some UCS characters, such as those used in the Arabic and Hebrew
+   scripts, have an inherent right-to-left (rtl) writing direction.
+   IRIs containing these characters (called bidirectional IRIs or Bidi
+   IRIs) require additional attention because of the non-trivial
+   relation between logical representation (used for digital
+   representation and for reading/spelling) and visual representation
+   (used for display/printing).
+
+   Because of the complex interaction between the logical
+   representation, the visual representation, and the syntax of a Bidi
+   IRI, a balance is needed between various requirements.  The main
+   requirements are
+
+   1.  user-predictable conversion between visual and logical
+       representation;
+
+   2.  the ability to include a wide range of characters in various
+       parts of the IRI; and
+
+   3.  minor or no changes or restrictions for implementations.
+
+4.1.  Logical Storage and Visual Presentation
+
+   When stored or transmitted in digital representation, bidirectional
+   IRIs MUST be in full logical order and MUST conform to the IRI syntax
+   rules (which include the rules relevant to their scheme).  This
+   ensures that bidirectional IRIs can be processed in the same way as
+   other IRIs.
+
+   Bidirectional IRIs MUST be rendered by using the Unicode
+   Bidirectional Algorithm [UNIV4], [UNI9].  Bidirectional IRIs MUST be
+   rendered in the same way as they would be if they were in a
+   left-to-right embedding; i.e., as if they were preceded by U+202A,
+   LEFT-TO-RIGHT EMBEDDING (LRE), and followed by U+202C, POP
+   DIRECTIONAL FORMATTING (PDF).  Setting the embedding direction can
+   also be done in a higher-level protocol (e.g., the dir='ltr'
+   attribute in HTML).
+
+   There is no requirement to use the above embedding if the display is
+   still the same without the embedding.  For example, a bidirectional
+   IRI in a text with left-to-right base directionality (such as used
+   for English or Cyrillic) that is preceded and followed by whitespace
+   and strong left-to-right characters does not need an embedding.
+   Also, a bidirectional relative IRI reference that only contains
+   strong right-to-left characters and weak characters and that starts
+   and ends with a strong right-to-left character and appears in a text
+   with right-to-left base directionality (such as used for Arabic or
+   Hebrew) and is preceded and followed by whitespace and strong
+   characters does not need an embedding.
+
+   In some other cases, using U+200E, LEFT-TO-RIGHT MARK (LRM), may be
+   sufficient to force the correct display behavior.  However, the
+   details of the Unicode Bidirectional algorithm are not always easy to
+   understand.  Implementers are strongly advised to err on the side of
+   caution and to use embedding in all cases where they are not
+   completely sure that the display behavior is unaffected without the
+   embedding.
+
+   The Unicode Bidirectional Algorithm ([UNI9], section 4.3) permits
+   higher-level protocols to influence bidirectional rendering.  Such
+   changes by higher-level protocols MUST NOT be used if they change the
+   rendering of IRIs.
+
+   The bidirectional formatting characters that may be used before or
+   after the IRI to ensure correct display are not themselves part of
+   the IRI.  IRIs MUST NOT contain bidirectional formatting characters
+   (LRM, RLM, LRE, RLE, LRO, RLO, and PDF).  They affect the visual
+   rendering of the IRI but do not appear themselves.  It would
+   therefore not be possible to input an IRI with such characters
+   correctly.
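+
+   As a minimal sketch of the two rules above (the function names are
+   hypothetical and not part of this specification), the following
+   Python fragment rejects formatting characters inside an IRI and adds
+   the LRE ... PDF embedding only for display:
+

```python
# Bidirectional formatting characters that must not occur inside an IRI.
BIDI_FORMATTING = {
    "\u200E": "LRM", "\u200F": "RLM",
    "\u202A": "LRE", "\u202B": "RLE",
    "\u202D": "LRO", "\u202E": "RLO",
    "\u202C": "PDF",
}


def find_bidi_formatting(iri):
    """Return (index, name) pairs for each forbidden character found."""
    return [(i, BIDI_FORMATTING[c]) for i, c in enumerate(iri)
            if c in BIDI_FORMATTING]


def embed_for_display(iri):
    """Wrap an IRI in LRE ... PDF for rendering only.

    The returned string is display text, not an IRI: the two added
    characters are not part of the identifier.
    """
    if find_bidi_formatting(iri):
        raise ValueError("IRI contains bidirectional formatting characters")
    return "\u202A" + iri + "\u202C"
```

+
+   The stored or transmitted form stays the unwrapped string; only the
+   text handed to the renderer carries the embedding.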
+
+4.2.  Bidi IRI Structure
+
+   The Unicode Bidirectional Algorithm is designed mainly for running
+   text.  To make sure that it does not affect the rendering of
+   bidirectional IRIs too much, some restrictions on bidirectional IRIs
+   are necessary.  These restrictions are given in terms of delimiters
+   (structural characters, mostly punctuation such as "@", ".", ":", and
+   "/") and components (usually consisting mostly of letters and
+   digits).
+
+   The following syntax rules from section 2.2 correspond to components
+   for the purpose of Bidi behavior: iuserinfo, ireg-name, isegment,
+   isegment-nz, isegment-nz-nc, iquery, and ifragment.
+
+   Specifications that define the syntax of any of the above components
+   MAY divide them further and define smaller parts to be components
+   according to this document.  As an example, the restrictions of
+   [RFC3490] on bidirectional domain names correspond to treating each
+   label of a domain name as a component for schemes with ireg-name as a
+   domain name.  Even where the components are not defined formally, it
+   may be helpful to think about some syntax in terms of components and
+   to apply the relevant restrictions.  For example, for the usual
+   name/value syntax in query parts, it is convenient to treat each name
+   and each value as a component.  As another example, the extensions in
+   a resource name can be treated as separate components.
+
+   For each component, the following restrictions apply:
+
+   1.  A component SHOULD NOT use both right-to-left and left-to-right
+       characters.
+
+   2.  A component using right-to-left characters SHOULD start and end
+       with right-to-left characters.
+
+   The above restrictions are given as shoulds, rather than as musts.
+   For IRIs that are never presented visually, they are not relevant.
+   However, for IRIs in general, they are very important to ensure
+   consistent conversion between visual presentation and logical
+   representation, in both directions.
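+
+   The two restrictions can be checked mechanically.  The sketch below
+   is an illustration only: it assumes that "right-to-left characters"
+   means the strong Bidi classes R and AL and that "left-to-right
+   characters" means class L, and the function name is hypothetical.
+

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}   # strong right-to-left (e.g., Hebrew, Arabic)
LTR_CLASSES = {"L"}         # strong left-to-right


def check_bidi_component(component):
    """Return the numbers of the restrictions that a component breaks."""
    classes = [unicodedata.bidirectional(c) for c in component]
    has_rtl = any(c in RTL_CLASSES for c in classes)
    has_ltr = any(c in LTR_CLASSES for c in classes)
    violations = []
    if has_rtl and has_ltr:
        violations.append(1)            # mixes both directions
    if has_rtl and not (classes[0] in RTL_CLASSES
                        and classes[-1] in RTL_CLASSES):
        violations.append(2)            # does not start and end with rtl
    return violations
```

+
+   A component such as two Hebrew letters followed by a digit breaks
+   only restriction 2, because digits are neither strong rtl nor
+   strong ltr.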
+
+   Note: In some components, the above restrictions may actually be
+      strictly enforced.  For example, [RFC3490] requires that these
+      restrictions apply to the labels of a host name for those schemes
+      where ireg-name is a host name.  In some other components (for
+      example, path components) following these restrictions may not be
+      too difficult.  For other components, such as parts of the query
+      part, it may be very difficult to enforce the restrictions because
+      the values of query parameters may be arbitrary character
+      sequences.
+
+   If the above restrictions cannot be satisfied otherwise, the affected
+   component can always be mapped to URI notation as described in
+   section 3.1.  Please note that the whole component has to be mapped
+   (see also Example 9 below).
+
+4.3.  Input of Bidi IRIs
+
+   Bidi input methods MUST generate Bidi IRIs in logical order while
+   rendering them according to section 4.1.  During input, rendering
+   SHOULD be updated after every new character is input to avoid end-
+   user confusion.
+
+4.4.  Examples
+
+   This section gives examples of bidirectional IRIs, in Bidi Notation.
+   It shows legal IRIs with the relationship between logical and visual
+   representation and explains how certain phenomena in this
+   relationship may look strange to somebody not familiar with
+   bidirectional behavior, but familiar to users of Arabic and Hebrew.
+   It also shows what happens if the restrictions given in section 4.2
+   are not followed.  The examples below can be seen at [BidiEx], in
+   Arabic, Hebrew, and Bidi Notation variants.
+
+   To read the bidi text in the examples, read the visual representation
+   from left to right until you encounter a block of rtl text.  Read the
+   rtl block (including slashes and other special characters) from right
+   to left, then continue at the next unread ltr character.
+
+   Example 1: A single component with rtl characters is inverted:
+   Logical representation: "http://ab.CDEFGH.ij/kl/mn/op.html"
+   Visual representation: "http://ab.HGFEDC.ij/kl/mn/op.html"
+   Components can be read one by one, and each component can be read in
+   its natural direction.
+
+   Example 2: More than one consecutive component with rtl characters is
+   inverted as a whole:
+   Logical representation: "http://ab.CDE.FGH/ij/kl/mn/op.html"
+   Visual representation: "http://ab.HGF.EDC/ij/kl/mn/op.html"
+   A sequence of rtl components is read rtl, in the same way as a
+   sequence of rtl words is read rtl in a bidi text.
+
+   Example 3: All components of an IRI (except for the scheme) are rtl.
+   All rtl components are inverted overall:
+   Logical representation: "http://AB.CD.EF/GH/IJ/KL?MN=OP;QR=ST#UV"
+   Visual representation: "http://VU#TS=RQ;PO=NM?LK/JI/HG/FE.DC.BA"
+   The whole IRI (except the scheme) is read rtl.  Delimiters between
+   rtl components stay between the respective components; delimiters
+   between ltr and rtl components don't move.
+
+   Example 4: Each of several sequences of rtl components is inverted on
+   its own:
+   Logical representation: "http://AB.CD.ef/gh/IJ/KL.html"
+   Visual representation: "http://DC.BA.ef/gh/LK/JI.html"
+   Each sequence of rtl components is read rtl, in the same way as each
+   sequence of rtl words in an ltr text is read rtl.
+
+   Example 5: Example 2, applied to components of different kinds:
+   Logical representation: "http://ab.cd.EF/GH/ij/kl.html"
+   Visual representation: "http://ab.cd.HG/FE/ij/kl.html"
+   The inversion of the domain name label and the path component may be
+   unexpected, but it is consistent with other bidi behavior.  For
+   reassurance that the domain component really is "ab.cd.EF", it may be
+   helpful to read aloud the visual representation following the bidi
+   algorithm.  After "http://ab.cd." one reads the RTL block
+   "E-F-slash-G-H", which corresponds to the logical representation.
+
+   Example 6: Same as Example 5, with more rtl components:
+   Logical representation: "http://ab.CD.EF/GH/IJ/kl.html"
+   Visual representation: "http://ab.JI/HG/FE.DC/kl.html"
+   The inversion of the domain name labels and the path components may
+   be easier to identify because the delimiters also move.
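+
+   Examples 1 through 6 can be reproduced with a deliberately
+   simplified model of Bidi Notation, in which uppercase letters stand
+   for rtl characters.  This is not the Unicode Bidirectional
+   Algorithm: it only reverses each run that starts and ends with an
+   uppercase letter and otherwise contains uppercase letters and
+   delimiters, and it makes no attempt to handle the digits of
+   Examples 7 through 10.  The function name is hypothetical.
+

```python
import re

# One rtl run: an uppercase letter, followed by any number of groups of
# (optional delimiters + uppercase letter).  Lowercase letters and
# digits end the run.
_RTL_RUN = re.compile(r"[A-Z](?:[^A-Za-z0-9]*[A-Z])*")


def visual_from_logical(iri):
    """Reverse every rtl run of an IRI written in Bidi Notation."""
    return _RTL_RUN.sub(lambda m: m.group(0)[::-1], iri)


EXAMPLES = {
    "http://ab.CDEFGH.ij/kl/mn/op.html": "http://ab.HGFEDC.ij/kl/mn/op.html",
    "http://ab.CDE.FGH/ij/kl/mn/op.html": "http://ab.HGF.EDC/ij/kl/mn/op.html",
    "http://AB.CD.EF/GH/IJ/KL?MN=OP;QR=ST#UV":
        "http://VU#TS=RQ;PO=NM?LK/JI/HG/FE.DC.BA",
    "http://AB.CD.ef/gh/IJ/KL.html": "http://DC.BA.ef/gh/LK/JI.html",
    "http://ab.cd.EF/GH/ij/kl.html": "http://ab.cd.HG/FE/ij/kl.html",
    "http://ab.CD.EF/GH/IJ/kl.html": "http://ab.JI/HG/FE.DC/kl.html",
}
```

+
+   Delimiters between two rtl components fall inside the run and move
+   with it, while delimiters next to an ltr component stay in place.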
+
+   Example 7: A single rtl component includes digits:
+   Logical representation: "http://ab.CDE123FGH.ij/kl/mn/op.html"
+   Visual representation: "http://ab.HGF123EDC.ij/kl/mn/op.html"
+   Numbers are written ltr in all cases but are treated as an additional
+   embedding inside a run of rtl characters.  This is completely
+   consistent with usual bidirectional text.
+
+   Example 8 (not allowed): Numbers are at the start or end of an rtl
+   component:
+   Logical representation: "http://ab.cd.ef/GH1/2IJ/KL.html"
+   Visual representation: "http://ab.cd.ef/LK/JI1/2HG.html"
+   The sequence "1/2" is interpreted by the bidi algorithm as a
+   fraction, fragmenting the components and leading to confusion.  There
+   are other characters that are interpreted in a special way close to
+   numbers; in particular, "+", "-", "#", "$", "%", ",", ".", and ":".
+
+   Example 9 (not allowed): The numbers in the previous example are
+   percent-encoded:
+   Logical representation: "http://ab.cd.ef/GH%31/%32IJ/KL.html"
+   Visual representation (Hebrew): "http://ab.cd.ef/%31HG/LK/JI%32.html"
+   Visual representation (Arabic): "http://ab.cd.ef/31%HG/%LK/JI32.html"
+   Depending on whether the uppercase letters represent Arabic or
+   Hebrew, the visual representation is different.
+
+   Example 10 (allowed but not recommended):
+   Logical representation: "http://ab.CDEFGH.123/kl/mn/op.html"
+   Visual representation: "http://ab.123.HGFEDC/kl/mn/op.html"
+   Components consisting of only numbers are allowed (it would be rather
+   difficult to prohibit them), but these may interact with adjacent RTL
+   components in ways that are not easy to predict.
+
+5.  Normalization and Comparison
+
+      Note: The structure and much of the material for this section is
+      taken from section 6 of [RFC3986]; the differences are due to the
+      specifics of IRIs.
+
+   One of the most common operations on IRIs is simple comparison:
+   Determining whether two IRIs are equivalent without using the IRIs or
+   the mapped URIs to access their respective resource(s).  A comparison
+   is performed whenever a response cache is accessed, a browser checks
+   its history to color a link, or an XML parser processes tags within a
+   namespace.  Extensive normalization prior to comparison of IRIs may
+   be used by spiders and indexing engines to prune a search space or
+   reduce duplication of request actions and response storage.
+
+   IRI comparison is performed for some particular purpose.  Protocols
+   or implementations that compare IRIs for different purposes will
+   often be subject to differing design trade-offs with regard to how
+   much effort should be spent in reducing aliased identifiers.  This
+   section describes various methods that may be used to compare IRIs,
+   the trade-offs between them, and the types of applications that might
+   use them.
+
+5.1.  Equivalence
+
+   Because IRIs exist to identify resources, presumably they should be
+   considered equivalent when they identify the same resource.  However,
+   this definition of equivalence is not of much practical use, as there
+   is no way for an implementation to compare two resources unless it
+   has full knowledge or control of them. For this reason, determination
+   of equivalence or difference of IRIs is based on string comparison,
+   perhaps augmented by reference to additional rules provided by URI
+   scheme definitions.  We use the terms "different" and "equivalent" to
+   describe the possible outcomes of such comparisons, but there are
+   many application-dependent versions of equivalence.
+
+   Even though it is possible to determine that two IRIs are equivalent,
+   IRI comparison is not sufficient to determine whether two IRIs
+   identify different resources.  For example, an owner of two different
+   domain names could decide to serve the same resource from both,
+   resulting in two different IRIs.  Therefore, comparison methods are
+   designed to minimize false negatives while strictly avoiding false
+   positives.
+
+   In testing for equivalence, applications should not directly compare
+   relative references; the references should be converted to their
+   respective target IRIs before comparison.  When IRIs are compared to
+   select (or avoid) a network action, such as retrieval of a
+   representation, fragment components (if any) should be excluded from
+   the comparison.
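+
+   A small sketch of this preparation, using the reference resolution
+   in Python's urllib.parse (the helper name comparison_key is an
+   assumption made for illustration):
+

```python
from urllib.parse import urldefrag, urljoin


def comparison_key(reference, base):
    """Resolve a reference against its base and drop the fragment.

    Suitable only when the comparison selects a network action, where
    the fragment plays no role.
    """
    target = urljoin(base, reference)
    return urldefrag(target).url
```

+
+   Two references are then compared through their keys rather than
+   through their original, possibly relative, spellings.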
+
+   Applications using IRIs as identity tokens with no relationship to a
+   protocol MUST use the Simple String Comparison (see section 5.3.1).
+   All other applications MUST select one of the comparison practices
+   from the Comparison Ladder (see section 5.3) or, after IRI-to-URI
+   conversion, select one of the comparison practices from the URI
+   comparison ladder in [RFC3986], section 6.2.
+
+5.2.  Preparation for Comparison
+
+   Any kind of IRI comparison REQUIRES that all escapings or encodings
+   in the protocol or format that carries an IRI are resolved.  This is
+   usually done when the protocol or format is parsed.  Examples of such
+   escapings or encodings are entities and numeric character references
+   in [HTML4] and [XML1].  As an example,
+   "http://example.org/ros&eacute;" (in HTML),
+   "http://example.org/ros&#233;" (in HTML or XML), and
+   "http://example.org/ros&#xE9;" (in HTML or XML) are all resolved into
+   what is denoted in this document (see section 1.4) as
+   "http://example.org/ros&#xE9;" (the "&#xE9;" here standing for the
+   actual e-acute character, to compensate for the fact that this
+   document cannot contain non-ASCII characters).
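+
+   For HTML, this resolution step can be illustrated with the
+   html.unescape function of the Python standard library.  The sketch
+   assumes that the attribute value has already been extracted from
+   the markup:
+

```python
import html

# Three spellings of the same IRI as they might appear in HTML source.
carried = [
    "http://example.org/ros&eacute;",
    "http://example.org/ros&#233;",
    "http://example.org/ros&#xE9;",
]

# Resolving the character references yields a single IRI.
resolved = {html.unescape(text) for text in carried}
```

+
+   Only after this step do the three spellings compare equal as
+   character strings.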
+
+   Similar considerations apply to encodings such as Transfer Codings in
+   HTTP (see [RFC2616]) and Content Transfer Encodings in MIME
+   ([RFC2045]), although in these cases, the encoding is based not on
+   characters but on octets, and additional care is required to make
+   sure that characters, and not just arbitrary octets, are compared
+   (see section 5.3.1).
+
+5.3.  Comparison Ladder
+
+   In practice, a variety of methods are used to test IRI equivalence.
+   These methods fall into a range distinguished by the amount of
+   processing required and the degree to which the probability of false
+   negatives is reduced.  As noted above, false negatives cannot be
+   eliminated.  In practice, their probability can be reduced, but this
+   reduction requires more processing and is not cost-effective for all
+   applications.
+
+   If this range of comparison practices is considered as a ladder, the
+   following discussion will climb the ladder, starting with practices
+   that are cheap but have a relatively higher chance of producing false
+   negatives, and proceeding to those that have higher computational
+   cost and lower risk of false negatives.
+
+5.3.1.  Simple String Comparison
+
+   If two IRIs, when considered as character strings, are identical,
+   then it is safe to conclude that they are equivalent.  This type of
+   equivalence test has very low computational cost and is in wide use
+   in a variety of applications, particularly in the domain of parsing.
+   It is also used when a definitive answer to the question of IRI
+   equivalence is needed that is independent of the scheme used and that
+   can be calculated quickly and without accessing a network.  An
+   example of such a case is XML Namespaces ([XMLNamespace]).
+
+   Testing strings for equivalence requires some basic precautions. This
+   procedure is often referred to as "bit-for-bit" or "byte-for-byte"
+   comparison, which is potentially misleading.  Testing strings for
+   equality is normally based on pair comparison of the characters that
+   make up the strings, starting from the first and proceeding until
+   both strings are exhausted and all characters are found to be equal,
+   until a pair of characters compares unequal, or until one of the
+   strings is exhausted before the other.
+
+   This character comparison requires that each pair of characters be
+   put in comparable encoding form.  For example, should one IRI be
+   stored in a byte array in UTF-8 encoding form and the second in a
+   UTF-16 encoding form, bit-for-bit comparisons applied naively will
+   produce errors.  It is better to speak of equality on a
+   character-for-character rather than on a byte-for-byte or bit-for-bit
+   basis.  In practical terms, character-by-character comparisons should
+   be done codepoint by codepoint after conversion to a common character
+   encoding form.  When comparing character by character, the comparison
+   function MUST NOT map IRIs to URIs, because such a mapping would
+   create additional spurious equivalences.  It follows that an IRI
+   SHOULD NOT be modified when being transported if there is any chance
+   that this IRI might be used as an identifier.
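+
+   The difference between octet and character comparison can be shown
+   in a few lines.  This is a sketch; the function name is
+   hypothetical, and both inputs are assumed to be correctly labeled
+   with their encodings:
+

```python
def same_characters(octets_a, encoding_a, octets_b, encoding_b):
    """Compare two stored IRIs code point by code point."""
    return octets_a.decode(encoding_a) == octets_b.decode(encoding_b)


iri = "http://example.org/ros\u00e9"
as_utf8 = iri.encode("utf-8")
as_utf16 = iri.encode("utf-16-le")

# The percent-encoded URI form is a different character string; simple
# string comparison does not map the IRI to it.
mapped_uri = b"http://example.org/ros%C3%A9"
```

+
+   The two encoded forms differ as octet sequences but are equal as
+   characters, whereas the IRI and its mapped URI compare as
+   different.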
+
+   False negatives are caused by the production and use of IRI aliases.
+   Unnecessary aliases can be reduced, regardless of the comparison
+   method, by consistently providing IRI references in an already
+   normalized form (i.e., a form identical to what would be produced
+   after normalization is applied, as described below). Protocols and
+   data formats often limit some IRI comparisons to simple string
+   comparison, based on the theory that people and implementations will,
+   in their own best interest, be consistent in providing IRI
+   references, or at least be consistent enough to negate any efficiency
+   that might be obtained from further normalization.
+
+5.3.2.  Syntax-Based Normalization
+
+   Implementations may use logic based on the definitions provided by
+   this specification to reduce the probability of false negatives. This
+   processing is moderately higher in cost than character-for-character
+   string comparison.  For example, an application using this approach
+   could reasonably consider the following two IRIs equivalent:
+
+      example://a/b/c/%7Bfoo%7D/ros&#xE9;
+      eXAMPLE://a/./b/../b/%63/%7bfoo%7d/ros%C3%A9
+
+   Web user agents, such as browsers, typically apply this type of IRI
+   normalization when determining whether a cached response is
+   available.  Syntax-based normalization includes such techniques as
+   case normalization, character normalization, percent-encoding
+   normalization, and removal of dot-segments.
+
+5.3.2.1.  Case Normalization
+
+   For all IRIs, the hexadecimal digits within a percent-encoding
+   triplet (e.g., "%3a" versus "%3A") are case-insensitive and therefore
+   should be normalized to use uppercase letters for the digits A - F.
+
+   When an IRI uses components of the generic syntax, the component
+   syntax equivalence rules always apply; namely, that the scheme and
+   US-ASCII only host are case insensitive and therefore should be
+   normalized to lowercase.  For example, the URI
+   "HTTP://www.EXAMPLE.com/" is equivalent to "http://www.example.com/".
+   Case equivalence for non-ASCII characters in IRI components that are
+   IDNs is discussed in section 5.3.3.  The other generic syntax
+   components are assumed to be case sensitive unless specifically
+   defined otherwise by the scheme.
+
+   Creating schemes that allow case-insensitive syntax components
+   containing non-ASCII characters should be avoided. Case normalization
+   of non-ASCII characters can be culturally dependent and is always a
+   complex operation.  The only exception concerns non-ASCII host names
+   for which the character normalization includes a mapping step derived
+   from case folding.
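+
+   A sketch of case normalization under these rules follows.  It
+   assumes an IRI in the generic syntax, splits it with a simplified
+   regular expression rather than a full parser, and leaves non-ASCII
+   characters untouched; all names are hypothetical.
+

```python
import re

_PCT = re.compile(r"%[0-9A-Fa-f]{2}")
# scheme, authority, and everything after the authority
_PARTS = re.compile(r"^(?:([^:/?#]+):)?(?://([^/?#]*))?(.*)$", re.DOTALL)


def _lower_ascii(text):
    """Lowercase A-Z only; non-ASCII characters are not case-mapped."""
    return "".join(c.lower() if "A" <= c <= "Z" else c for c in text)


def case_normalize(iri):
    scheme, authority, rest = _PARTS.match(iri).groups()
    out = ""
    if scheme is not None:
        out += _lower_ascii(scheme) + ":"
    if authority is not None:
        # Userinfo is case sensitive; only host and port are lowercased.
        userinfo, at, hostport = authority.rpartition("@")
        out += "//" + userinfo + at + _lower_ascii(hostport)
    out += rest
    # Hexadecimal digits of percent-encoding triplets become uppercase.
    return _PCT.sub(lambda m: m.group(0).upper(), out)
```

+
+   Path, query, and fragment keep their case, and empty delimiters
+   such as a trailing "?" are preserved.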
+
+5.3.2.2.  Character Normalization
+
+   The Unicode Standard [UNIV4] defines various equivalences between
+   sequences of characters for various purposes.  Unicode Standard Annex
+   #15 [UTR15] defines various Normalization Forms for these
+   equivalences, in particular Normalization Form C (NFC, Canonical
+   Decomposition, followed by Canonical Composition) and Normalization
+   Form KC (NFKC, Compatibility Decomposition, followed by Canonical
+   Composition).
+
+   Equivalence of IRIs MUST rely on the assumption that IRIs are
+   appropriately pre-character-normalized rather than apply character
+   normalization when comparing two IRIs.  The exceptions are conversion
+   from a non-digital form, and conversion from a non-UCS-based
+   character encoding to a UCS-based character encoding. In these cases,
+   NFC or a normalizing transcoder using NFC MUST be used for
+   interoperability.  To avoid false negatives and problems with
+   transcoding, IRIs SHOULD be created by using NFC.  Using NFKC may
+   avoid even more problems; for example, by choosing half-width Latin
+   letters instead of full-width ones, and full-width instead of
+   half-width Katakana.
+
+   As an example, "http://www.example.org/r&#xE9;sum&#xE9;.html" (in XML
+   Notation) is in NFC.  On the other hand,
+   "http://www.example.org/re&#x301;sume&#x301;.html" is not in NFC.
+
+   The former uses precombined e-acute characters, and the latter uses
+   "e" characters followed by combining acute accents.  Both usages are
+   defined as canonically equivalent in [UNIV4].
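+
+   The example can be checked with the unicodedata module of the
+   Python standard library.  The sketch is meant for the point where
+   an IRI is created; it is not meant to be applied during comparison,
+   and the helper name is hypothetical.
+

```python
import unicodedata

precomposed = "http://www.example.org/r\u00E9sum\u00E9.html"
decomposed = "http://www.example.org/re\u0301sume\u0301.html"


def is_nfc(iri):
    """True if the string is already in Normalization Form C."""
    return unicodedata.normalize("NFC", iri) == iri
```

+
+   The two strings differ as character sequences, yet normalizing the
+   second one to NFC produces the first.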
+
+   Note: Because it is unknown how a particular sequence of characters
+      is being treated with respect to character normalization, it would
+      be inappropriate to allow third parties to normalize an IRI
+      arbitrarily.  This does not contradict the recommendation that
+      when a resource is created, its IRI should be as character
+      normalized as possible (i.e., NFC or even NFKC).  This is similar
+      to the uppercase/lowercase problems.  Some parts of a URI are case
+      insensitive (domain name).  For others, it is unclear whether they
+      are case sensitive, case insensitive, or something in between
+      (e.g., case sensitive, but with a multiple choice selection if the
+      wrong case is used, instead of a direct negative result).  The
+      best recipe is that the creator use a reasonable capitalization
+      and, when transferring the URI, capitalization never be changed.
+
+   Various IRI schemes may allow the usage of Internationalized Domain
+   Names (IDN) [RFC3490] either in the ireg-name part or elsewhere.
+   Character Normalization also applies to IDNs, as discussed in section
+   5.3.3.
+
+5.3.2.3.  Percent-Encoding Normalization
+
+   The percent-encoding mechanism (section 2.1 of [RFC3986]) is a
+   frequent source of variance among otherwise identical IRIs.  In
+   addition to the case normalization issue noted above, some IRI
+   producers percent-encode octets that do not require percent-encoding,
+   resulting in IRIs that are equivalent to their non-encoded
+   counterparts.  These IRIs should be normalized by decoding any
+   percent-encoded octet sequence that corresponds to an unreserved
+   character, as described in section 2.3 of [RFC3986].
+
+   For actual resolution, differences in percent-encoding (except for
+   the percent-encoding of reserved characters) MUST always result in
+   the same resource.  For example, "http://example.org/~user",
+   "http://example.org/%7euser", and "http://example.org/%7Euser", must
+   resolve to the same resource.
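+
+   The decoding of unnecessarily encoded unreserved characters can be
+   sketched as follows.  The unreserved set is the one from section
+   2.3 of [RFC3986] (letters, digits, "-", ".", "_", and "~"); the
+   function names are hypothetical.
+

```python
import re
import string

UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def _fix_triplet(match):
    char = chr(int(match.group(1), 16))
    if char in UNRESERVED:
        return char                   # decode unreserved characters
    return match.group(0).upper()     # otherwise only fix the hex case


def normalize_percent_encoding(iri):
    """Single pass over the string, so "%2541" is not decoded twice."""
    return _PCT.sub(_fix_triplet, iri)
```

+
+   Triplets for reserved characters and for non-ASCII octets stay
+   encoded; only the case of their hexadecimal digits changes.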
+
+   If this kind of equivalence is to be tested, the percent-encoding of
+   both IRIs to be compared has to be aligned; for example, by
+   converting both IRIs to URIs (see section 3.1), eliminating escape
+   differences in the resulting URIs, and making sure that the case of
+   the hexadecimal characters in the percent-encoding is always the same
+   (preferably uppercase).  If the IRI is to be passed to another
+   application or used further in some other way, its original form MUST
+   be preserved.  The conversion described here should be performed only
+   for local comparison.
+
+5.3.2.4.  Path Segment Normalization
+
+   The complete path segments "." and ".." are intended only for use
+   within relative references (section 4.1 of [RFC3986]) and are removed
+   as part of the reference resolution process (section 5.2 of
+   [RFC3986]).  However, some implementations may incorrectly assume
+   that reference resolution is not necessary when the reference is
+   already an IRI, and thus fail to remove dot-segments when they occur
+   in non-relative paths.  IRI normalizers should remove dot-segments by
+   applying the remove_dot_segments algorithm to the path, as described
+   in section 5.2.4 of [RFC3986].
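+
+   The following is one possible transcription of the
+   remove_dot_segments steps of [RFC3986], section 5.2.4, offered as a
+   sketch rather than as a reference implementation:
+

```python
def remove_dot_segments(path):
    output = []                       # completed segments, in order
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]           # "/./" becomes "/"
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]           # "/../" becomes "/"
            if output:
                output.pop()          # and the previous segment goes
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment, with its leading "/", to output.
            cut = path.find("/", 1)
            if cut == -1:
                output.append(path)
                path = ""
            else:
                output.append(path[:cut])
                path = path[cut:]
    return "".join(output)
```

+
+   The function takes only the path component; scheme, authority,
+   query, and fragment have to be split off beforehand.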
+
+5.3.3.  Scheme-Based Normalization
+
+   The syntax and semantics of IRIs vary from scheme to scheme, as
+   described by the defining specification for each scheme.
+   Implementations may use scheme-specific rules, at further processing
+   cost, to reduce the probability of false negatives.  For example,
+   because the "http" scheme makes use of an authority component, has a
+   default port of "80", and defines an empty path to be equivalent to
+   "/", the following four IRIs are equivalent:
+
+      http://example.com
+      http://example.com/
+      http://example.com:/
+      http://example.com:80/
+
+   In general, an IRI that uses the generic syntax for authority with an
+   empty path should be normalized to a path of "/".  Likewise, an
+   explicit ":port", for which the port is empty or the default for the
+   scheme, is equivalent to one where the port and its ":" delimiter are
+   elided and thus should be removed by scheme-based normalization.  For
+   example, the second IRI above is the normal form for the "http"
+   scheme.
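+
+   For the "http" scheme, these two rules can be sketched as shown
+   below.  The sketch assumes that the scheme has already been
+   lowercased and uses a simplified regular expression instead of a
+   full parser; the function name is hypothetical.
+

```python
import re

# authority, path, and the remainder (query and fragment, if any)
_HTTP = re.compile(r"^http://([^/?#]*)([^?#]*)(.*)$", re.DOTALL)


def normalize_http(iri):
    match = _HTTP.match(iri)
    if match is None:
        return iri                    # not an "http" IRI; left alone
    authority, path, rest = match.groups()
    host, colon, port = authority.rpartition(":")
    if colon and port in ("", "80"):
        authority = host              # drop empty or default port
    if path == "":
        path = "/"                    # empty path is equivalent to "/"
    return "http://" + authority + path + rest
```

+
+   All four IRIs above map to "http://example.com/", while
+   "http://example.com/?" keeps its "?" delimiter and therefore stays
+   distinct.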
+
+   Another case where normalization varies by scheme is in the handling
+   of an empty authority component or empty host subcomponent.  For many
+   scheme specifications, an empty authority or host is considered an
+   error; for others, it is considered equivalent to "localhost" or the
+   end-user's host.  When a scheme defines a default for authority and
+   an IRI reference to that default is desired, the reference should be
+   normalized to an empty authority for the sake of uniformity, brevity,
+   and internationalization.  If, however, either the userinfo or port
+   subcomponents are non-empty, then the host should be given explicitly
+   even if it matches the default.
+
+   Normalization should not remove delimiters when their associated
+   component is empty unless it is licensed to do so by the scheme
+   specification.  For example, the IRI "http://example.com/?" cannot be
+   assumed to be equivalent to any of the examples above.  Likewise, the
+   presence or absence of delimiters within a userinfo subcomponent is
+   usually significant to its interpretation.  The fragment component is
+   not subject to any scheme-based normalization; thus, two IRIs that
+   differ only by the suffix "#" are considered different regardless of
+   the scheme.
+
+   Some IRI schemes may allow the usage of Internationalized Domain
+   Names (IDN) [RFC3490] either in their ireg-name part or elsewhere.
+   When in use in IRIs, those names SHOULD be validated by using the
+   ToASCII operation defined in [RFC3490], with the flags
+   "UseSTD3ASCIIRules" and "AllowUnassigned".  An IRI containing an
+   invalid IDN cannot successfully be resolved.  Validated IDN
+   components of IRIs SHOULD be character normalized by using the
+   Nameprep process [RFC3491]; however, for legibility purposes, they
+   SHOULD NOT be converted into ASCII Compatible Encoding (ACE).
+
+   Scheme-based normalization may also consider IDN components and their
+   conversions to punycode as equivalent.  As an example,
+   "http://r&#xE9;sum&#xE9;.example.org" may be considered equivalent to
+   "http://xn--rsum-bpad.example.org".
+
+   Other scheme-specific normalizations are possible.
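+
+   The equivalence between an IDN and its punycode form can be tried
+   out with the "idna" codec of the Python standard library, which
+   implements [RFC3490].  The sketch works on the host name alone; the
+   helper names are hypothetical, and the codec offers no control over
+   the flags mentioned above.
+

```python
def host_to_ace(host):
    """Convert a host name to its ASCII Compatible Encoding."""
    return host.encode("idna").decode("ascii")


def same_host(a, b):
    """Treat an IDN and its ACE form as the same host."""
    return host_to_ace(a) == host_to_ace(b)
```

+
+   The conversion is used here only as a comparison key; the IRI
+   itself keeps its legible, non-ACE form.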
+
+5.3.4.  Protocol-Based Normalization
+
+   Substantial effort to reduce the incidence of false negatives is
+   often cost-effective for web spiders. Consequently, they implement
+   even more aggressive techniques in IRI comparison.  For example, if
+   they observe that an IRI such as
+
+      http://example.com/data
+
+   redirects to an IRI differing only in the trailing slash
+
+      http://example.com/data/
+
+   they will likely regard the two as equivalent in the future.  This
+   kind of technique is only appropriate when equivalence is clearly
+   indicated by both the result of accessing the resources and the
+   common conventions of their scheme's dereference algorithm (in this
+   case, use of redirection by HTTP origin servers to avoid problems
+   with relative references).
+
+6.  Use of IRIs
+
+6.1.  Limitations on UCS Characters Allowed in IRIs
+
+   This section discusses limitations on characters and character
+   sequences usable for IRIs beyond those given in section 2.2 and
+   section 4.1.  The considerations in this section are relevant when
+   IRIs are created and when URIs are converted to IRIs.
+
+   a.  The repertoire of characters allowed in each IRI component is
+       limited by the definition of that component.  For example, the
+       definition of the scheme component does not allow characters
+       beyond US-ASCII.
+
+       (Note: In accordance with URI practice, generic IRI software
+       cannot and should not check for such limitations.)
+
+   b.  The UCS contains many areas of characters for which there are
+       strong visual look-alikes.  Because of the likelihood of
+       transcription errors, these also should be avoided.  This
+       includes the full-width equivalents of Latin characters,
+       half-width Katakana characters for Japanese, and many others.  It
+       also includes many look-alikes of "space", "delims", and
+       "unwise", characters excluded in [RFC3491].
+
+   Additional information is available from [UNIXML].  [UNIXML] is
+   written in the context of running text rather than in that of
+   identifiers.  Nevertheless, it discusses many of the categories of
+   characters not appropriate for IRIs.
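+
+   As an illustration of item b, the following Python sketch flags
+   full-width and half-width compatibility characters.  It assumes
+   that a "<wide>" or "<narrow>" decomposition tag in the Unicode
+   Character Database is an adequate signal.  The function name
+   width_variants is hypothetical, and other kinds of look-alikes are
+   not covered.
+

```python
import unicodedata


def width_variants(iri: str) -> list:
    """Return (index, character) pairs for full-width or half-width
    compatibility characters found in the given string."""
    flagged = []
    for index, ch in enumerate(iri):
        # decomposition() returns e.g. "<wide> 0041" for U+FF21, or ""
        # when the character has no decomposition mapping.
        tag = unicodedata.decomposition(ch).split(" ", 1)[0]
        if tag in ("<wide>", "<narrow>"):
            flagged.append((index, ch))
    return flagged
```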
+
+6.2.  Software Interfaces and Protocols
+
+   Although an IRI is defined as a sequence of characters, software
+   interfaces for URIs typically function on sequences of octets or
+   other kinds of code units.  Thus, software interfaces and protocols
+   MUST define which character encoding is used.
+
+   Intermediate software interfaces between IRI-capable components and
+   URI-only components MUST map the IRIs per section 3.1, when
+   transferring from IRI-capable to URI-only components.  This mapping
+   SHOULD be applied as late as possible.  It SHOULD NOT be applied
+   between components that are known to be able to handle IRIs.
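+
+   A minimal Python sketch of such a boundary mapping follows.  It is
+   a simplification of section 3.1: it percent-encodes the UTF-8
+   octets of every non-ASCII character and omits the normalization
+   step and any special handling of the ireg-name part.  The name
+   iri_to_uri is hypothetical.
+

```python
def iri_to_uri(iri: str) -> str:
    """Percent-encode the UTF-8 octets of every non-ASCII character."""
    out = []
    for ch in iri:
        if ord(ch) < 0x80:
            # ASCII characters, including existing %XX escapes, are kept.
            out.append(ch)
        else:
            out.extend("%{:02X}".format(octet) for octet in ch.encode("utf-8"))
    return "".join(out)
```

+   Calling such a function only at the point of handoff to a URI-only
+   component keeps the IRI intact for as long as possible.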
+
+6.3.  Format of URIs and IRIs in Documents and Protocols
+
+   Document formats that transport URIs may have to be upgraded to allow
+   the transport of IRIs.  In cases where the document as a whole has a
+   native character encoding, IRIs MUST also be encoded in this
+   character encoding and converted accordingly by a parser or
+   interpreter.  IRI characters not expressible in the native character
+   encoding SHOULD be escaped by using the escaping conventions of the
+   document format if such conventions are available. Alternatively,
+   they MAY be percent-encoded according to section 3.1. For example, in
+   HTML or XML, numeric character references SHOULD be used.  If a
+   document as a whole has a native character encoding and that
+   character encoding is not UTF-8, then IRIs MUST NOT be placed into
+   the document in the UTF-8 character encoding.
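+
+   The following Python sketch shows one way to place an IRI into an
+   HTML or XML document whose native character encoding is not UTF-8.
+   Characters that the document encoding cannot express become
+   numeric character references.  The name embed_iri is hypothetical,
+   and escaping of markup-significant characters such as "&" is left
+   out.
+

```python
def embed_iri(iri: str, document_encoding: str) -> bytes:
    """Encode an IRI in the document's native encoding, falling back to
    decimal numeric character references where that encoding has no
    representation for a character."""
    return iri.encode(document_encoding, errors="xmlcharrefreplace")
```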
+
+   Note: Some formats already accommodate IRIs, although they use
+   different terminology.  HTML 4.0 [HTML4] defines the conversion from
+   IRIs to URIs as error-avoiding behavior.  XML 1.0 [XML1], XLink
+   [XLink], XML Schema [XMLSchema], and specifications based upon them
+   allow IRIs.  Also, it is expected that all relevant new W3C formats
+   and protocols will be required to handle IRIs [CharMod].
+
+6.4.  Use of UTF-8 for Encoding Original Characters
+
+   This section discusses details and gives examples for point c) in
+   section 1.2.  To be able to use IRIs, the URI corresponding to the
+   IRI in question has to encode original characters into octets by
+   using UTF-8.  This can be specified for all URIs of a URI scheme or
+   can apply to individual URIs for schemes that do not specify how to
+   encode original characters.  It can apply to the whole URI, or only
+   to some part.  For background information on encoding characters into
+   URIs, see also section 2.5 of [RFC3986].
+
+   For new URI schemes, using UTF-8 is recommended in [RFC2718].
+   Examples where UTF-8 is already used are the URN syntax [RFC2141],
+   IMAP URLs [RFC2192], and POP URLs [RFC2384].  On the other hand,
+   because the HTTP URL scheme does not specify how to encode original
+   characters, only some HTTP URLs can have corresponding but different
+   IRIs.
+
+   For example, for a document with a URI of
+   "http://www.example.org/r%C3%A9sum%C3%A9.html", it is possible to
+   construct a corresponding IRI (in XML notation; see section 1.4):
+   "http://www.example.org/r&#xE9;sum&#xE9;.html" ("&#xE9;" stands for
+   the e-acute character, and "%C3%A9" is the UTF-8 encoded and
+   percent-encoded representation of that character).  On the other
+   hand, for a document with a URI of
+   "http://www.example.org/r%E9sum%E9.html", the percent-encoding octets
+   cannot be converted to actual characters in an IRI, as the
+   percent-encoding is not based on UTF-8.
+
+   This means that for most URI schemes, there is no need to upgrade
+   their scheme definition in order for them to work with IRIs.  The
+   main case where upgrading makes sense is when a scheme definition, or
+   a particular component of a scheme, is strictly limited to the use of
+   US-ASCII characters with no provision to include non-ASCII
+   characters/octets via percent-encoding, or if a scheme definition
+   currently uses highly scheme-specific provisions for the encoding of
+   non-ASCII characters.  An example of this is the mailto: scheme
+   [RFC2368].
+
+   This specification does not upgrade any scheme specifications in any
+   way; this has to be done separately.  Also, note that there is no
+   such thing as an "IRI scheme"; all IRIs use URI schemes, and all URI
+   schemes can be used with IRIs, even though in some cases only by
+   using URIs directly as IRIs, without any conversion.
+
+   URI schemes can impose restrictions on the syntax of scheme-specific
+   URIs; i.e., URIs that are admissible under the generic URI syntax
+   [RFC3986] may not be admissible due to narrower syntactic constraints
+   imposed by a URI scheme specification.  URI scheme definitions cannot
+   broaden the syntactic restrictions of the generic URI syntax;
+   otherwise, it would be possible to generate URIs that satisfied the
+   scheme-specific syntactic constraints without satisfying the
+   syntactic constraints of the generic URI syntax.  However, additional
+   syntactic constraints imposed by URI scheme specifications are
+   applicable to IRI, as the corresponding URI resulting from the
+   mapping defined in section 3.1 MUST be a valid URI under the
+   syntactic restrictions of generic URI syntax and any narrower
+   restrictions imposed by the corresponding URI scheme specification.
+
+   The requirement for the use of UTF-8 applies to all parts of a URI
+   (with the potential exception of the ireg-name part; see section
+   3.1).  However, it is possible that the capability of IRIs to
+   represent a wide range of characters directly is used just in some
+   parts of the IRI (or IRI reference).  The other parts of the IRI may
+   only contain US-ASCII characters, or they may not be based on UTF-8.
+   They may be based on another character encoding, or they may directly
+   encode raw binary data (see also [RFC2397]).
+
+   For example, it is possible to have a URI reference of
+   "http://www.example.org/r%E9sum%E9.xml#r%C3%A9sum%C3%A9", where the
+   document name is encoded in iso-8859-1 based on server settings, but
+   where the fragment identifier is encoded in UTF-8 according to
+   [XPointer].  The IRI corresponding to the above URI would be (in XML
+   notation)
+   "http://www.example.org/r%E9sum%E9.xml#r&#xE9;sum&#xE9;".
+
+   Similar considerations apply to query parts.  The functionality of
+   IRIs (namely, to be able to include non-ASCII characters) can only be
+   used if the query part is encoded in UTF-8.
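+
+   The examples in this section can be reproduced with the following
+   Python sketch.  It converts a run of percent-encoded non-ASCII
+   octets to characters only when the whole run is valid UTF-8, and
+   leaves every other run untouched.  This is a simplification of the
+   conversion in section 3.2 (it does not, for example, exclude
+   characters that are not allowed in IRIs), and the name uri_to_iri
+   is hypothetical.
+

```python
import re

# One or more consecutive escapes whose octet value is 0x80 or above.
_ESCAPED_RUN = re.compile(r"(?:%[89A-Fa-f][0-9A-Fa-f])+")


def uri_to_iri(uri: str) -> str:
    """Replace UTF-8-based percent-encoded runs by the characters they
    encode; leave runs that are not valid UTF-8 as they are."""

    def decode_run(match):
        run = match.group(0)
        octets = bytes.fromhex(run.replace("%", ""))
        try:
            # Strict decoding fails for legacy encodings such as
            # iso-8859-1 and for overlong UTF-8 forms.
            return octets.decode("utf-8")
        except UnicodeDecodeError:
            return run

    return _ESCAPED_RUN.sub(decode_run, uri)
```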
+
+6.5.  Relative IRI References
+
+   Processing of relative IRI references against a base is handled
+   straightforwardly; the algorithms of [RFC3986] can be applied
+   directly, treating the characters additionally allowed in IRI
+   references in the same way that unreserved characters are in URI
+   references.
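+
+   As a sketch of this, Python's urllib.parse.urljoin operates on
+   character strings and passes non-ASCII characters through
+   unchanged, so it can resolve relative IRI references directly.  It
+   is assumed here to be close enough to the [RFC3986] algorithm for
+   illustration; the wrapper name resolve_iri_reference is
+   hypothetical.
+

```python
from urllib.parse import urljoin


def resolve_iri_reference(base: str, reference: str) -> str:
    """Resolve a relative IRI reference against a base IRI without
    converting either of them to a URI first."""
    return urljoin(base, reference)
```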
+
+7.  URI/IRI Processing Guidelines (Informative)
+
+   This informative section provides guidelines for supporting IRIs in
+   the same software components and operations that currently process
+   URIs: Software interfaces that handle URIs, software that allows
+   users to enter URIs, software that creates or generates URIs,
+   software that displays URIs, formats and protocols that transport
+   URIs, and software that interprets URIs.  These may all require
+   modification before functioning properly with IRIs.  The
+   considerations in this section also apply to URI references and IRI
+   references.
+
+7.1.  URI/IRI Software Interfaces
+
+   Software interfaces that handle URIs, such as URI-handling APIs and
+   protocols transferring URIs, need interfaces and protocol elements
+   that are designed to carry IRIs.
+
+   In case the current handling in an API or protocol is based on
+   US-ASCII, UTF-8 is recommended as the character encoding for IRIs, as
+   it is compatible with US-ASCII, is in accordance with the
+   recommendations of [RFC2277], and makes converting to URIs easy.  In
+   any case, the API or protocol definition must clearly define the
+   character encoding to be used.
+
+   The transfer from URI-only to IRI-capable components requires no
+   mapping, although the conversion described in section 3.2 above may
+   be performed.  It is preferable not to perform this inverse
+   conversion when there is a chance that this cannot be done correctly.
+
+7.2.  URI/IRI Entry
+
+   Some components allow users to enter URIs into the system by typing
+   or dictation, for example.  This software must be updated to allow
+   for IRI entry.
+
+   A person viewing a visual representation of an IRI (as a sequence of
+   glyphs, in some order, in some visual display) or hearing an IRI will
+   use an entry method for characters in the user's language to input
+   the IRI.  Depending on the script and the input method used, this may
+   be a more or less complicated process.
+
+   The process of IRI entry must ensure, as much as possible, that the
+   restrictions defined in section 2.2 are met.  This may be done by
+   choosing appropriate input methods or variants/settings thereof, by
+   appropriately converting the characters being input, by eliminating
+   characters that cannot be converted, and/or by issuing a warning or
+   error message to the user.
+
+   As an example of variant settings, input method editors for East
+   Asian Languages usually allow the input of Latin letters and related
+   characters in full-width or half-width versions.  For IRI input, the
+   input method editor should be set so that it produces half-width
+   Latin letters and punctuation and full-width Katakana.
+
+   An input field used primarily or solely for URIs/IRIs, or any other
+   place where IRIs are entered frequently, may allow the user to view
+   an IRI as it is mapped to a URI.  This will help users when some of
+   the software they use does not yet accept IRIs.
+
+   An IRI input component interfacing to components that handle URIs,
+   but not IRIs, must map the IRI to a URI before passing it to these
+   components.
+
+   For the input of IRIs with right-to-left characters, please see
+   section 4.3.
+
+7.3.  URI/IRI Transfer between Applications
+
+   Many applications, particularly mail user agents, try to detect URIs
+   appearing in plain text.  For this, they use some heuristics based on
+   URI syntax.  They then allow the user to click on such URIs and
+   retrieve the corresponding resource in an appropriate (usually
+   scheme-dependent) application.
+
+   Such applications have to be upgraded to use the IRI syntax as a base
+   for heuristics.  In particular, a non-ASCII character should not be
+   taken as the indication of the end of an IRI.  Such applications also
+   have to make sure that they correctly convert the detected IRI from
+   the character encoding of the document or application where the IRI
+   appears to the character encoding used by the system-wide IRI
+   invocation mechanism, or to a URI (according to section 3.1) if the
+   system-wide invocation mechanism only accepts URIs.
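+
+   A detection heuristic of this kind might look like the following
+   Python sketch.  The candidate pattern stops at white space and a
+   few delimiters rather than at the first non-ASCII character.  The
+   list of schemes, the set of trailing punctuation that is trimmed,
+   and the name find_iris are all assumptions made for illustration.
+

```python
import re

# Scheme, "://", then anything up to white space or a common delimiter.
IRI_CANDIDATE = re.compile(r'\b(?:https?|ftp)://[^\s<>"]+')


def find_iris(text: str) -> list:
    """Return IRI-like substrings of plain text, trimming punctuation
    that most likely belongs to the surrounding sentence."""
    return [m.group(0).rstrip(".,;:!?)") for m in IRI_CANDIDATE.finditer(text)]
```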
+
+   The clipboard is another frequently used way to transfer URIs and
+   IRIs from one application to another.  On most platforms, the
+   clipboard is able to store and transfer text in many languages and
+   scripts.  Correctly used, the clipboard transfers characters, not
+   bytes, which will do the right thing with IRIs.
+
+7.4.  URI/IRI Generation
+
+   Systems that offer resources through the Internet, where those
+   resources have logical names, sometimes automatically generate URIs
+   for the resources they offer.  For example, some HTTP servers can
+   generate a directory listing for a file directory and then respond to
+   the generated URIs with the files.
+
+   Many legacy character encodings are in use in various file systems.
+   Many currently deployed systems do not transform the local character
+   representation of the underlying system before generating URIs.
+
+   For maximum interoperability, systems that generate resource
+   identifiers should make the appropriate transformations.  For
+   example, if a file system contains a file named
+   "r&#xE9;sum&#xE9;.html", a server should expose this as
+   "r%C3%A9sum%C3%A9.html" in a URI, which allows use of
+   "r&#xE9;sum&#xE9;.html" in an IRI, even if locally the file name is
+   kept in a character encoding other than UTF-8.
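+
+   The transformation can be sketched in Python as follows.  The file
+   name is decoded from the local character encoding, normalized to
+   NFC, and then percent-encoded on the basis of UTF-8.  The name
+   uri_path_segment and the idea of passing the file system encoding
+   as a parameter are assumptions of this sketch.
+

```python
import unicodedata
from urllib.parse import quote


def uri_path_segment(raw_name: bytes, filesystem_encoding: str) -> str:
    """Turn a locally encoded file name into a UTF-8-based,
    percent-encoded URI path segment."""
    name = raw_name.decode(filesystem_encoding)
    name = unicodedata.normalize("NFC", name)
    # quote() encodes str input as UTF-8 by default; safe="" also
    # escapes "/" so that the result stays a single segment.
    return quote(name, safe="")
```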
+
+   This recommendation particularly applies to HTTP servers.  For FTP
+   servers, similar considerations apply; see [RFC2640].
+
+7.5.  URI/IRI Selection
+
+   In some cases, resource owners and publishers have control over the
+   IRIs used to identify their resources.  This control is mostly
+   exercised by controlling the resource names, such as file names,
+   directly.
+
+   In these cases, it is recommended to avoid choosing IRIs that are
+   easily confused.  For example, for US-ASCII, the lower-case ell ("l")
+   is easily confused with the digit one ("1"), and the upper-case oh
+   ("O") is easily confused with the digit zero ("0").  Publishers
+   should avoid confusing users with "br0ken" or "1ame" identifiers.
+
+   Outside the US-ASCII repertoire, there are many more opportunities
+   for confusion; a complete set of guidelines is too lengthy to include
+   here.  As long as names are limited to characters from a single
+   script, native writers of a given script or language will know best
+   when ambiguities can appear, and how they can be avoided.  What may
+   look ambiguous to a stranger may be completely obvious to the average
+   native user.  On the other hand, in some cases, the UCS contains
+   variants for compatibility reasons; for example, for typographic
+   purposes.  These should be avoided wherever possible.  Although there
+   may be exceptions, newly created resource names should generally be
+   in NFKC [UTR15] (which means that they are also in NFC).
+
+   As an example, the UCS contains the "fi" ligature at U+FB01 for
+   compatibility reasons.  Wherever possible, IRIs should use the two
+   letters "f" and "i" rather than the "fi" ligature.  An example where
+   the latter may be used is in the query part of an IRI for an explicit
+   search for a word written with the "fi" ligature.
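+
+   Whether a proposed resource name is already in NFKC can be checked
+   with a few lines of Python, using the standard unicodedata module.
+   The function name is_nfkc is hypothetical.
+

```python
import unicodedata


def is_nfkc(name: str) -> bool:
    """True if NFKC normalization leaves the name unchanged."""
    return unicodedata.normalize("NFKC", name) == name
```

+   A name containing U+FB01 fails this check, and normalizing it
+   yields the two separate letters "f" and "i".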
+
+   In certain cases, there is a chance that characters from different
+   scripts look the same.  The best known example is the similarity of
+   the Latin "A", the Greek "Alpha", and the Cyrillic "A".  To avoid
+   such cases, IRIs should be created only where all the characters in a
+   single component are used together in a given language.  This usually
+   means that all of these characters will be from the same script, but
+   there are languages that mix characters from different scripts (such
+   as Japanese).  This is similar to the heuristics used to distinguish
+   between letters and numbers in the examples above.  Also, for Latin,
+   Greek, and Cyrillic, using lowercase letters results in fewer
+   ambiguities than using uppercase letters would.
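+
+   A rough Python check for the Latin/Greek/Cyrillic case is sketched
+   below.  Because the standard library does not expose the Unicode
+   Script property, the first word of each character name is used as
+   a stand-in; this is an assumption of the sketch, as are the names
+   scripts_in and mixes_confusable_scripts.  Languages that
+   legitimately mix other scripts are not flagged.
+

```python
import unicodedata

CONFUSABLE_SCRIPTS = ("LATIN", "GREEK", "CYRILLIC")


def scripts_in(component: str) -> set:
    """Return which of the three scripts occur among the letters."""
    found = set()
    for ch in component:
        if not ch.isalpha():
            continue
        # e.g. "CYRILLIC SMALL LETTER A" -> "CYRILLIC"
        first_word = unicodedata.name(ch, "").split(" ", 1)[0]
        if first_word in CONFUSABLE_SCRIPTS:
            found.add(first_word)
    return found


def mixes_confusable_scripts(component: str) -> bool:
    """True if letters from more than one of the scripts are present."""
    return len(scripts_in(component)) > 1
```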
+
+7.6.  Display of URIs/IRIs
+
+   In situations where the rendering software is not expected to display
+   non-ASCII parts of the IRI correctly using the available layout and
+   font resources, these parts should be percent-encoded before being
+   displayed.
+
+   For display of Bidi IRIs, please see section 4.1.
+
+7.7.  Interpretation of URIs and IRIs
+
+   Software that interprets IRIs as the names of local resources should
+   accept IRIs in multiple forms and convert and match them with the
+   appropriate local resource names.
+
+   First, multiple representations include both IRIs in the native
+   character encoding of the protocol and also their URI counterparts.
+
+   Second, they may include URIs constructed with character encodings
+   other than UTF-8.  These URIs may be produced by user agents that do
+   not conform to this specification and that use legacy character
+   encodings to convert non-ASCII characters to URIs.  Whether this is
+   necessary, and what character encodings to cover, depends on a number
+   of factors, such as the legacy character encodings used locally and
+   the distribution of various versions of user agents.  For example,
+   software for Japanese may accept URIs in Shift_JIS and/or EUC-JP in
+   addition to UTF-8.
+
+   Third, they may include additional mappings to be more user-friendly
+   and robust against transmission errors.  These would be similar to
+   how some servers currently treat URIs as case insensitive or perform
+   additional matching to account for spelling errors.  For characters
+   beyond the US-ASCII repertoire, this may, for example, include
+   ignoring the accents on received IRIs or resource names.  Please note
+   that such mappings, including case mappings, are language dependent.
+
+   It can be difficult to identify a resource unambiguously if too many
+   mappings are taken into consideration.  However, percent-encoded and
+   not percent-encoded parts of IRIs can always be clearly
+   distinguished.  Also, the regularity of UTF-8 (see [Duerst97]) makes
+   the potential for collisions lower than it may seem at first.
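+
+   The second point can be sketched in Python as follows.  The
+   received name is unescaped to octets, which are then decoded with
+   UTF-8 first and with a configurable list of legacy encodings
+   afterwards.  Every successful decoding becomes a candidate to match
+   against local resource names.  The name candidate_names and the
+   default list of encodings are assumptions of this sketch.
+

```python
from urllib.parse import unquote_to_bytes


def candidate_names(escaped: str, legacy_encodings=("shift_jis", "euc_jp")) -> list:
    """Return the possible resource names for a received path segment,
    with the UTF-8 interpretation first if it is valid."""
    octets = unquote_to_bytes(escaped)
    candidates = []
    for encoding in ("utf-8",) + tuple(legacy_encodings):
        try:
            name = octets.decode(encoding)
        except UnicodeDecodeError:
            continue  # not a valid octet sequence in this encoding
        if name not in candidates:
            candidates.append(name)
    return candidates
```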
+
+7.8.  Upgrading Strategy
+
+   Where this recommendation places further constraints on software for
+   which many instances are already deployed, it is important to
+   introduce upgrades carefully and to be aware of the various
+   interdependencies.
+
+   If IRIs cannot be interpreted correctly, they should not be created,
+   generated, or transported.  This suggests that upgrading URI
+   interpreting software to accept IRIs should have highest priority.
+
+   On the other hand, a single IRI is interpreted only by a single or
+   very few interpreters that are known in advance, although it may be
+   entered and transported very widely.
+
+   Therefore, IRIs benefit most from a broad upgrade of software to be
+   able to enter and transport IRIs.  However, before an individual IRI
+   is published, care should be taken to upgrade the corresponding
+   interpreting software in order to cover the forms expected to be
+   received by various versions of entry and transport software.
+
+   The upgrade of generating software to generate IRIs instead of using
+   a local character encoding should happen only after the service is
+   upgraded to accept IRIs.  Similarly, IRIs should only be generated
+   when the service accepts IRIs and the intervening infrastructure and
+   protocol is known to transport them safely.
+
+   Software converting from URIs to IRIs for display should be upgraded
+   only after upgraded entry software has been widely deployed to the
+   population that will see the displayed result.
+
+   Where there is a free choice of character encodings, it is often
+   possible to reduce the effort and dependencies for upgrading to IRIs
+   by using UTF-8 rather than another encoding.  For example, when a new
+   file-based Web server is set up, using UTF-8 as the character
+   encoding for file names will make the transition to IRIs easier.
+   Likewise, when a new Web form is set up using UTF-8 as the character
+   encoding of the form page, the returned query URIs will use UTF-8 as
+   the character encoding (unless the user, for whatever reason, changes
+   the character encoding) and will therefore be compatible with IRIs.
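+
+   The effect of the form page encoding on the returned query can be
+   illustrated with Python's urllib.parse.urlencode, used here as a
+   stand-in for what a user agent does when it submits a form.  The
+   wrapper name query_for is hypothetical.
+

```python
from urllib.parse import urlencode


def query_for(fields: dict, page_encoding: str) -> str:
    """Build a query string, encoding field values in the character
    encoding of the form page before percent-encoding them."""
    return urlencode(fields, encoding=page_encoding)
```

+   With UTF-8 the result can be converted to an IRI; with iso-8859-1
+   the percent-encoded octets cannot be converted to characters.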
+
+   These recommendations, when taken together, will allow for the
+   extension from URIs to IRIs in order to handle characters other than
+   US-ASCII while minimizing interoperability problems.  For
+   considerations regarding the upgrade of URI scheme definitions, see
+   section 6.4.
+
+8.  Security Considerations
+
+   The security considerations discussed in [RFC3986] also apply to
+   IRIs.  In addition, the following issues require particular care for
+   IRIs.
+
+   Incorrect encoding or decoding can lead to security problems.  In
+   particular, some UTF-8 decoders do not check against overlong byte
+   sequences.  As an example, a "/" is encoded with the byte 0x2F both
+   in UTF-8 and in US-ASCII, but some UTF-8 decoders also wrongly
+   interpret the sequence 0xC0 0xAF as a "/".  A sequence such as
+   "%C0%AF.." may pass some security tests and then be interpreted as
+   "/.." in a path if UTF-8 decoders are fault-tolerant, if conversion
+   and checking are not done in the right order, and/or if reserved
+   characters and unreserved characters are not clearly distinguished.
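+
+   One defensive ordering is sketched below in Python.  The path is
+   split on literal "/" characters first, each segment is then
+   unescaped and decoded with a strict UTF-8 decoder, and only the
+   decoded result is checked.  The name decode_path_segments and the
+   particular set of rejected segments are assumptions of this sketch.
+

```python
from urllib.parse import unquote_to_bytes


def decode_path_segments(raw_path: str) -> list:
    """Decode a percent-encoded path into segments, raising ValueError
    for invalid UTF-8 and for dot segments that appear after decoding."""
    segments = []
    for raw_segment in raw_path.split("/"):
        # Strict decoding raises UnicodeDecodeError (a ValueError) for
        # overlong sequences such as 0xC0 0xAF.
        segment = unquote_to_bytes(raw_segment).decode("utf-8", errors="strict")
        if segment in (".", "..") or "/" in segment:
            raise ValueError("unsafe path segment: %r" % raw_segment)
        segments.append(segment)
    return segments
```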
+
+   There are various ways in which "spoofing" can occur with IRIs.
+   "Spoofing" means that somebody may add a resource name that looks the
+   same or similar to the user, but that points to a different resource.
+   The added resource may pretend to be the real resource by looking
+   very similar but may contain all kinds of changes that may be
+   difficult to spot and that can cause all kinds of problems.  Most
+   spoofing possibilities for IRIs are extensions of those for URIs.
+
+   Spoofing can occur for various reasons.  First, a user's
+   normalization expectations or actual normalization when entering an
+   IRI or transcoding an IRI from a legacy character encoding do not
+   match the normalization used on the server side.  Conceptually, this
+   is no different from the problems surrounding the use of
+   case-insensitive web servers.  For example, a popular web page with a
+   mixed-case name ("http://big.example.com/PopularPage.html") might be
+   "spoofed" by someone who is able to create
+   "http://big.example.com/popularpage.html".  However, the use of
+   unnormalized character sequences, and of additional mappings for user
+   convenience, may increase the chance for spoofing.  Protocols and
+   servers that allow the creation of resources with names that are not
+   normalized are particularly vulnerable to such attacks.  This is an
+   inherent security problem of the relevant protocol, server, or
+   resource and is not specific to IRIs, but it is mentioned here for
+   completeness.
+
+   Spoofing can occur in various IRI components, such as the domain name
+   part or a path part.  For considerations specific to the domain name
+   part, see [RFC3491].  For the path part, administrators of sites that
+   allow independent users to create resources in the same sub area may
+   have to be careful to check for spoofing.
+
+   Spoofing can occur because in the UCS many characters look very
+   similar.  Details are discussed in Section 7.5.  Again, this is very
+   similar to spoofing possibilities on US-ASCII, e.g., using "br0ken"
+   or "1ame" URIs.
+
+   Spoofing can occur when URIs with percent-encodings based on various
+   character encodings are accepted to deal with older user agents.  In
+   some cases, particularly for Latin-based resource names, this is
+   usually easy to detect because UTF-8-encoded names, when interpreted
+   and viewed as legacy character encodings, produce mostly garbage.
+   When concurrently used character encodings have a similar structure
+   but there are no characters that have exactly the same encoding,
+   detection is more difficult.
+
+   Spoofing can occur with bidirectional IRIs, if the restrictions in
+   section 4.2 are not followed.  The same visual representation may be
+   interpreted as different logical representations, and vice versa.  It
+   is also very important that a correct Unicode bidirectional
+   implementation be used.
+
+9.  Acknowledgements
+
+   We would like to thank Larry Masinter for his work as coauthor of
+   many earlier versions of this document (draft-masinter-url-i18n-xx).
+
+   The discussion on the issue addressed here started a long time ago.
+   There was a thread in the HTML working group in August 1995 (under
+   the topic of "Globalizing URIs") and in the www-international mailing
+   list in July 1996 (under the topic of "Internationalization and
+   URLs"), and there were ad-hoc meetings at the Unicode conferences in
+   September 1995 and September 1997.
+
+   Many thanks go to Francois Yergeau, Matitiahu Allouche, Roy Fielding,
+   Tim Berners-Lee, Mark Davis, M.T. Carrasco Benitez, James Clark, Tim
+   Bray, Chris Wendt, Yaron Goland, Andrea Vine, Misha Wolf, Leslie
+   Daigle, Ted Hardie, Bill Fenner, Margaret Wasserman, Russ Housley,
+   Makoto MURATA, Steven Atkin, Ryan Stansifer, Tex Texin, Graham Klyne,
+   Bjoern Hoehrmann, Chris Lilley, Ian Jacobs, Adam Costello, Dan
+   Oscarson, Elliotte Rusty Harold, Mike J. Brown, Roy Badami, Jonathan
+   Rosenne, Asmus Freytag, Simon Josefsson, Carlos Viegas Damasio, Chris
+   Haynes, Walter Underwood, and many others for help with understanding
+   the issues and possible solutions, and with getting the details
+   right.
+
+   This document is a product of the Internationalization Working Group
+   (I18N WG) of the World Wide Web Consortium (W3C).  Thanks to the
+   members of the W3C I18N Working Group and Interest Group for their
+   contributions and their work on [CharMod].  Thanks also go to the
+   members of many other W3C Working Groups for adopting IRIs, and to
+   the members of the Montreal IAB Workshop on Internationalization and
+   Localization for their review.
+
+10.  References
+
+10.1.  Normative References
+
+   [ASCII]        American National Standards Institute, "Coded
+                  Character Set -- 7-bit American Standard Code for
+                  Information Interchange", ANSI X3.4, 1986.
+
+   [ISO10646]     International Organization for Standardization,
+                  "ISO/IEC 10646:2003: Information Technology -
+                  Universal Multiple-Octet Coded Character Set (UCS)",
+                  ISO Standard 10646, December 2003.
+
+   [RFC2119]      Bradner, S., "Key words for use in RFCs to Indicate
+                  Requirement Levels", BCP 14, RFC 2119, March 1997.
+
+   [RFC2234]      Crocker, D. and P. Overell, "Augmented BNF for Syntax
+                  Specifications: ABNF", RFC 2234, November 1997.
+
+   [RFC3490]      Faltstrom, P., Hoffman, P., and A. Costello,
+                  "Internationalizing Domain Names in Applications
+                  (IDNA)", RFC 3490, March 2003.
+
+   [RFC3491]      Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep
+                  Profile for Internationalized Domain Names (IDN)", RFC
+                  3491, March 2003.
+
+   [RFC3629]      Yergeau, F., "UTF-8, a transformation format of ISO
+                  10646", STD 63, RFC 3629, November 2003.
+
+   [RFC3986]      Berners-Lee, T., Fielding, R., and L. Masinter,
+                  "Uniform Resource Identifier (URI): Generic Syntax",
+                  STD 66, RFC 3986, January 2005.
+
+   [UNI9]         Davis, M., "The Bidirectional Algorithm", Unicode
+                  Standard Annex #9, March 2004,
+                  <http://www.unicode.org/reports/tr9/tr9-13.html>.
+
+   [UNIV4]        The Unicode Consortium, "The Unicode Standard, Version
+                  4.0.1, defined by: The Unicode Standard, Version 4.0
+                  (Reading, MA, Addison-Wesley, 2003. ISBN
+                  0-321-18578-1), as amended by Unicode 4.0.1
+                  (http://www.unicode.org/versions/Unicode4.0.1/)",
+                  March 2004.
+
+   [UTR15]        Davis, M. and M. Duerst, "Unicode Normalization
+                  Forms", Unicode Standard Annex #15, April 2003,
+                  <http://www.unicode.org/unicode/reports/
+                  tr15/tr15-23.html>.
+
+10.2.  Informative References
+
+   [BidiEx]       "Examples of bidirectional IRIs",
+                  <http://www.w3.org/International/iri-edit/
+                  BidiExamples>.
+
+   [CharMod]      Duerst, M., Yergeau, F., Ishida, R., Wolf, M., and T.
+                  Texin, "Character Model for the World Wide Web:
+                  Resource Identifiers", World Wide Web Consortium
+                  Candidate Recommendation, November 2004,
+                  <http://www.w3.org/TR/charmod-resid>.
+
+   [Duerst97]     Duerst, M., "The Properties and Promises of UTF-8",
+                  Proc.  11th International Unicode Conference,
+                  San Jose, September 1997,
+                  <http://www.ifi.unizh.ch/mml/mduerst/papers/
+                  PDF/IUC11-UTF-8.pdf>.
+
+   [Gettys]       Gettys, J., "URI Model Consequences",
+                  <http://www.w3.org/DesignIssues/ModelConsequences>.
+
+   [HTML4]        Raggett, D., Le Hors, A., and I. Jacobs, "HTML 4.01
+                  Specification", World Wide Web Consortium
+                  Recommendation, December 1999,
+                  <http://www.w3.org/TR/html401/appendix/
+                  notes.html#h-B.2>.
+
+   [RFC2045]      Freed, N. and N. Borenstein, "Multipurpose Internet
+                  Mail Extensions (MIME) Part One: Format of Internet
+                  Message Bodies", RFC 2045, November 1996.
+
+   [RFC2130]      Weider, C., Preston, C., Simonsen, K., Alvestrand, H.,
+                  Atkinson, R., Crispin, M., and P. Svanberg, "The
+                  Report of the IAB Character Set Workshop held 29
+                  February - 1 March, 1996", RFC 2130, April 1997.
+
+   [RFC2141]      Moats, R., "URN Syntax", RFC 2141, May 1997.
+
+   [RFC2192]      Newman, C., "IMAP URL Scheme", RFC 2192, September
+                  1997.
+
+   [RFC2277]      Alvestrand, H., "IETF Policy on Character Sets and
+                  Languages", BCP 18, RFC 2277, January 1998.
+
+   [RFC2368]      Hoffman, P., Masinter, L., and J. Zawinski, "The
+                  mailto URL scheme", RFC 2368, July 1998.
+
+   [RFC2384]      Gellens, R., "POP URL Scheme", RFC 2384, August 1998.
+
+   [RFC2396]      Berners-Lee, T., Fielding, R., and L. Masinter,
+                  "Uniform Resource Identifiers (URI): Generic Syntax",
+                  RFC 2396, August 1998.
+
+   [RFC2397]      Masinter, L., "The "data" URL scheme", RFC 2397,
+                  August 1998.
+
+   [RFC2616]      Fielding,  R., Gettys, J., Mogul, J., Frystyk, H.,
+                  Masinter, L., Leach, P., and T. Berners-Lee,
+                  "Hypertext Transfer Protocol -- HTTP/1.1", RFC 2616,
+                  June 1999.
+
+   [RFC2640]      Curtin, B., "Internationalization of the File Transfer
+                  Protocol", RFC 2640, July 1999.
+
+   [RFC2718]      Masinter, L., Alvestrand, H., Zigmond, D., and R.
+                  Petke, "Guidelines for new URL Schemes", RFC 2718,
+                  November 1999.
+
+   [UNIXML]       Duerst, M. and A. Freytag, "Unicode in XML and other
+                  Markup Languages", Unicode Technical Report #20, World
+                  Wide Web Consortium Note, June 2003,
+                  <http://www.w3.org/TR/unicode-xml/>.
+
+   [XLink]        DeRose, S., Maler, E., and D. Orchard, "XML Linking
+                  Language (XLink) Version 1.0", World Wide Web
+                  Consortium Recommendation, June 2001,
+                  <http://www.w3.org/TR/xlink/#link-locators>.
+
+   [XML1]         Bray, T., Paoli, J., Sperberg-McQueen, C., Maler, E.,
+                  and F. Yergeau, "Extensible Markup Language (XML) 1.0
+                  (Third Edition)", World Wide Web Consortium
+                  Recommendation, February 2004,
+                  <http://www.w3.org/TR/REC-xml#sec-external-ent>.
+
+   [XMLNamespace] Bray, T., Hollander, D., and A. Layman, "Namespaces in
+                  XML", World Wide Web Consortium Recommendation,
+                  January 1999, <http://www.w3.org/TR/REC-xml-names>.
+
+   [XMLSchema]    Biron, P. and A. Malhotra, "XML Schema Part 2:
+                  Datatypes", World Wide Web Consortium Recommendation,
+                  May 2001, <http://www.w3.org/TR/xmlschema-2/#anyURI>.
+
+
+
+
+Duerst & Suignard           Standards Track                    [Page 42]
+
+RFC 3987         Internationalized Resource Identifiers     January 2005
+
+
+   [XPointer]     Grosso, P., Maler, E., Marsh, J. and N. Walsh,
+                  "XPointer Framework", World Wide Web Consortium
+                  Recommendation, March 2003,
+                  <http://www.w3.org/TR/xptr-framework/#escaping>.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Duerst & Suignard           Standards Track                    [Page 43]
+
+RFC 3987         Internationalized Resource Identifiers     January 2005
+
+
+Appendix A.  Design Alternatives
+
+   This section shortly summarizes major design alternatives and the
+   reasons for why they were not chosen.
+
+Appendix A.1.  New Scheme(s)
+
+   Introducing new schemes (for example, httpi:, ftpi:,...) or a new
+   metascheme (e.g., i:, leading to URI/IRI prefixes such as i:http:,
+   i:ftp:,...) was proposed to make IRI-to-URI conversion scheme
+   dependent or to distinguish between percent-encodings resulting from
+   IRI-to-URI conversion and percent-encodings from legacy character
+   encodings.
+
+   New schemes are not needed to distinguish URIs from true IRIs (i.e.,
+   IRIs that contain non-ASCII characters).  The benefit of being able
+   to detect the origin of percent-encodings is marginal, as UTF-8 can
+   be detected with very high reliability.  Deploying new schemes is
+   extremely hard, so not requiring new schemes for IRIs makes
+   deployment of IRIs vastly easier.  Making conversion scheme dependent
+   is highly inadvisable and would be encouraged by separate schemes for
+   IRIs.  Using a uniform convention for conversion from IRIs to URIs
+   makes IRI implementation orthogonal to the introduction of actual new
+   schemes.
+
+Appendix A.2.  Character Encodings Other Than UTF-8
+
+   At an early stage, UTF-7 was considered as an alternative to UTF-8
+   when IRIs are converted to URIs.  UTF-7 would not have needed
+   percent-encoding and in most cases would have been shorter than
+   percent-encoded UTF-8.
+
+   Using UTF-8 avoids a double layering and overloading of the use of
+   the "+" character.  UTF-8 is fully compatible with US-ASCII and has
+   therefore been recommended by the IETF, and is being used widely.
+
+   UTF-7 has never been used much and is now clearly being discouraged.
+   Requiring implementations to convert from UTF-8 to UTF-7 and back
+   would be an additional implementation burden.
+
+Appendix A.3.  New Encoding Convention
+
+   Instead of using the existing percent-encoding convention of URIs,
+   which is based on octets, the idea was to create a new encoding
+   convention; for example, to use "%u" to introduce UCS code points.
+
+
+
+
+
+
+Duerst & Suignard           Standards Track                    [Page 44]
+
+RFC 3987         Internationalized Resource Identifiers     January 2005
+
+
+   Using the existing octet-based percent-encoding mechanism does not
+   need an upgrade of the URI syntax and does not need corresponding
+   server upgrades.
+
+Appendix A.4.  Indicating Character Encodings in the URI/IRI
+
+   Some proposals suggested indicating the character encodings used in
+   an URI or IRI with some new syntactic convention in the URI itself,
+   similar to the "charset" parameter for e-mails and Web pages.  As an
+   example, the label in square brackets in
+   "http://www.example.org/ros[iso-8859-1]&#xE9;" indicated that the
+   following "&#xE9;" had to be interpreted as iso-8859-1.
+
+   If UTF-8 is used exclusively, an upgrade to the URI syntax is not
+   needed.  It avoids potentially multiple labels that have to be copied
+   correctly in all cases, even on the side of a bus or on a napkin,
+   leading to usability problems (and being prohibitively annoying).
+   Exclusively using UTF-8 also reduces transcoding errors and
+   confusion.
+
+Authors' Addresses
+
+   Martin Duerst  (Note: Please write "Duerst" with u-umlaut wherever
+                  possible, for example as "D&#252;rst" in XML and
+                  HTML.)
+   World Wide Web Consortium
+   5322 Endo
+   Fujisawa, Kanagawa  252-8520
+   Japan
+
+   Phone: +81 466 49 1170
+   Fax:   +81 466 49 1171
+   EMail: duerst@w3.org
+   URI:   http://www.w3.org/People/D%C3%BCrst/
+   (Note: This is the percent-encoded form of an IRI.)
+
+
+   Michel Suignard
+   Microsoft Corporation
+   One Microsoft Way
+   Redmond, WA  98052
+   U.S.A.
+
+   Phone: +1 425 882-8080
+   EMail: michelsu@microsoft.com
+   URI:   http://www.suignard.com
+
+
+
+
+
+Duerst & Suignard           Standards Track                    [Page 45]
+
+RFC 3987         Internationalized Resource Identifiers     January 2005
+
+
+Full Copyright Statement
+
+   Copyright (C) The Internet Society (2005).
+
+   This document is subject to the rights, licenses and restrictions
+   contained in BCP 78, and except as set forth therein, the authors
+   retain all their rights.
+
+   This document and the information contained herein are provided on an
+   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
+   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
+   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
+   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
+   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
+   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
+
+Intellectual Property
+
+   The IETF takes no position regarding the validity or scope of any
+   Intellectual Property Rights or other rights that might be claimed to
+   pertain to the implementation or use of the technology described in
+   this document or the extent to which any license under such rights
+   might or might not be available; nor does it represent that it has
+   made any independent effort to identify any such rights.  Information
+   on the IETF's procedures with respect to rights in IETF Documents can
+   be found in BCP 78 and BCP 79.
+
+   Copies of IPR disclosures made to the IETF Secretariat and any
+   assurances of licenses to be made available, or the result of an
+   attempt made to obtain a general license or permission for the use of
+   such proprietary rights by implementers or users of this
+   specification can be obtained from the IETF on-line IPR repository at
+   http://www.ietf.org/ipr.
+
+   The IETF invites any interested party to bring to its attention any
+   copyrights, patents or patent applications, or other proprietary
+   rights that may cover technology that may be required to implement
+   this standard.  Please address the information to the IETF at ietf-
+   ipr@ietf.org.
+
+
+Acknowledgement
+
+   Funding for the RFC Editor function is currently provided by the
+   Internet Society.
+
+
+
+
+
+
+Duerst & Suignard           Standards Track                    [Page 46]
+
diff --git a/trunk/txt/vfs-ideas.txt b/trunk/txt/vfs-ideas.txt
new file mode 100644
index 00000000..7b41a064
--- /dev/null
+++ b/trunk/txt/vfs-ideas.txt
@@ -0,0 +1,425 @@
+Subject: Plans for gnome-vfs replacement
+
+Recently there have been a lot of discussions about the platform and
+the correct stacking order and quality of the modules. Gnome-vfs
+is a clear problem in this discussion. Having spent the last 4 years
+as the gnome-vfs maintainer, and even longer as the primary gnome-vfs
+user (in Nautilus) I'm well aware of the problems it has. I think that
+we've reached a point where the problems in the gnome-vfs architecture
+and its position in the stack are now ranking as one of the most
+problematic aspects of the gnome platform, especially considering the
+enhancements and quality improvements seen in other parts of the
+platform.
+
+So, I think the time has come for a serious look at what gnome-vfs
+could be. I've spent much time last week thinking about the weaknesses
+and problems of the current gnome-vfs and possibilities inherent in a
+redesign, both having learnt from 7 years of gnome-vfs existence and
+the improvements in the platform (both Gnome and surrounding
+technologies) since 1999 when it was designed.
+
+As soon as you spend some time looking at this problem it is evident that
+to solve the platform ordering issues we really need a clean cut from
+the current gnome-vfs. I think the ideal level for a VFS would be in
+glib, in a separate library similar to gthread or gobject. That way
+gtk+ would be able to integrate with it and all gnome apps would have
+access to it, but it wouldn't affect small apps that just want to use
+the glib core. Furthermore, not being libglib lets us use GObjects in
+the vfs, which means we can make a more modern API. Of course, this
+places quite some limitations on the vfs, especially in terms of
+dependencies and how integration with a UI should work.
+
+Any thoughts on the design of a vfs to replace gnome-vfs must be based
+on a solid understanding of the problems with the current system, to
+avoid redoing old mistakes and to make sure that we solve all the
+known problems of the current system. So, I'm gonna start by
+describing what I see as the main architectural and "hard" problems in
+gnome-vfs.
+
+The first, and most often discussed problem is of course
+compatibility. The various desktops use different vfs implementations,
+so they can't read each other's files. Many applications use no VFS
+at all and have no way to access files on vfs shares. And the
+existence of multiple vfs implementations makes it unlikely that they
+will start using one.
+
+Gnome-vfs has no concept of Display Names, something which is very
+useful in a gui based system. In some places we auto-generate virtual
+desktop files to get this feature, but that is very hackish and needs
+support in all apps to understand them. This is also quite closely
+related to handling filename charset encoding, another very weak point
+in gnome-vfs. The ideal way to reference a file is the actual real
+identifier on disk, database, remote share or what have you, as that
+can be passed between implementations, used with other access methods,
+etc. But to display something useful to the user you really need a
+user understandable utf-8 encoded string and a way to map that to/from
+the filename.
+
+There is no support for icons at the level of gnome-vfs. This means
+that all users above the vfs must implement it themselves (e.g. in
+nautilus and in the file selector). Each implementation has its own
+bugs, maintenance load and risk for different behaviour. It also
+means that vfs backends cannot supply their own icons, something which
+might be very useful for e.g. a network share based on some new fancy
+web service. 
+
+The abstraction that gnome-vfs uses is very similar to the posix model,
+which is problematic in several ways. The posix model matches poorly
+with the sort of highlevel operations that gnome applications want to
+do with the vfs. The vfs is not typically used for what I would call
+"implementation files", which are things like configuration files,
+data files shipped with apps, system files, etc, but rather for what
+I'd like to call "user document files". These are the kind of files
+you open, save, or download from the internet to look at. Applications
+that use these would like highlevel operations that match the kind
+of operations you use on them, like read-entire-file, save-file,
+copy-file, etc. 
+
+The posix model is also a bad match when implementing gnome-vfs
+modules. It requires some features from the implementation that can be
+difficult or impossible to implement. For example, it's not really
+possible to support seeking when writing to a file on a webdav share,
+because a webdav put operation is essentially just streaming a copy of
+the new file contents. We currently work around this by locally
+caching all the file data being written and then sending it all in the
+close() call. This is clearly suboptimal for a lot of reasons, like
+applications not expecting close() to take a long time, and not
+checking its error conditions very closely.
+
+Another problem is that the posix model doesn't contain explicit
+operations for some of the things applications need to do. So instead
+applications rely on well known knowledge of the behaviour of posix
+to implement these operations. However, such behaviour might not be
+guaranteed on some gnome-vfs backends. A common example is the atomic
+save operation. A typical way to implement this on posix is to write
+to a temporary file in the same directory, and then rename over the
+target file, thus guaranteeing an atomic replacement of the file on
+disk, or a failure that didn't affect the original file. Of course,
+if a gnome-vfs application would use this and the backend was a
+webdav share to a subversion repository you would get some really
+weird versioning history in the repository for no good reason. If the
+backend had its own implementation of the save operation we could get 
+both optimal behaviour on each backend, and an application API that
+doesn't require arcane knowledge of the atomicity of renames.
+
+One of the most problematic aspects of gnome-vfs is its authentication
+framework. The way it works is that you register callbacks to handle
+the authentication dialog, and whenever any operation needs to do
+authentication these callbacks will be called. The idea is that a
+console application would register a set of callbacks that print
+prompts on the console, and a Gtk+ application would have a set of
+callbacks that displays dialogs. There is a set of standard dialog
+based callbacks in libgnomeui that you can install by calling
+gnome_authentication_manager_init(). From an initial look this seems
+like a reasonable approach, but it turns out that this creates a host
+of different problems.
+
+One problem is how you connect this to the application. A lot of
+people are unaware that you have to call gnome_authentication_manager_init() to
+get authentication dialogs, or don't want to depend on libgnomeui to
+do so. So a lot of applications don't work with authentication. Those
+who do call it generally have pretty poor integration with the
+authentication dialogs. For instance, the general authentication
+dialogs can't be marked as parents of whatever dialog caused the
+authentication (because they have no way of knowing what caused
+it), and all sorts of problems appear when there is a modal dialog
+displayed already.
+
+Another problem is the combination of blocking gnome-vfs calls and
+authentication. When a call to a blocking operation like read()
+results in a password dialog we have to start up a recursive mainloop
+to display it. Not only is this unexpected for the application, it
+also brings with it all the types of reentrancy issues that we had in
+bonobo. Even worse, there is no way to make this threadsafe. To make
+it threadsafe the callback would have to take the gdk lock before
+doing any Gtk+ calls, but this would cause a deadlock if the
+application called it with the gdk lock held. If we don't take the
+gdk lock then you can't do blocking vfs calls on any thread but the
+mainloop, or you have to take the gdk lock on any gnome-vfs call.
+
+The authentication callbacks can appear at *any* gnome-vfs entry
+point, which makes it very hard to write gnome-vfs applications that
+don't accidentally trigger a lot of authentication dialogs. For
+instance, the tree sidebar in nautilus has to take particular care not
+to stat or otherwise look at the toplevel items until the user
+explicitly expands them, otherwise you'd get authentication dialogs
+every time you opened a window. It's also easy to get multiple
+authentication dialogs for the same entity.
+
+The way threads are used in gnome-vfs is problematic, both from the
+point of view of writing backends, and for users of the library. For
+users it forces the use of threading, even if the application doesn't
+use the asynchronous calls that use the threading. It also enforces
+the need for a gnome_vfs_init() function, as thread initialization
+must be done very early.
+
+For backend implementations the use of threads forces every
+backend to be threadsafe. Many of the backends are inherently single
+threaded, either because they use non-threadsafe libraries like the
+smb backend, or because the server being wrapped forces serialized
+access (like an ftp backend where you really only want one connection
+to the server). 
+
+Backends run in context of the application using gnome-vfs, which can
+be a gtk+ app, but just as well a console application, so they have no
+control or guarantee of their environment. For instance, they cannot
+rely on the existence of a mainloop, so there is no way to use
+e.g. timeouts to handle invalidation of caches. One way we have tried
+to solve this is to move some backends to the gnome-vfs daemon, where
+they can rely on the existence of the mainloop.
+
+Gnome-vfs uses something called "gnome vfs uris" to identify
+files. These are similar, but not entirely identical to the types of
+uri used in webbrowsers. For instance, we often make up our own types
+of URIs when there is no official standard for them (although such
+standards might appear later, with incompatible behaviour). We also
+have a "well defined" posix-like type of behaviour that isn't the same
+as for web uris. The most extreme example would be mailto:, but even
+things like ftp:// uris are different. The ftp uri rfc explains how
+ftp:///dir/file.txt refers to $(login_dir)/dir/file.txt, and that you
+have to use ftp:///%2fdir/file.txt to refer to the absolute path
+/dir/file.txt on the server. Clearly we can't have pathname handling
+semantics that vary depending on the backend (no app would get it
+right), so we ignore the rfcs on this.
+
+Then there is the thing with escaping and unescaping uris. Although
+technically not very complex it is just very hard to get right all
+the time. Among the most common questions on the gnome-vfs list is
+what the various escape/unescape functions do, what arguments have to
+be escaped, and how to display uris "nicely" (i.e. without escapes,
+although that makes them invalid uris). This is made extra complicated
+due to the poor handling of filename encodings and display names, and
+the fact that only "less common" cases (like spaces in filenames)
+break if you get it wrong.
+
+Last but not least, the fact that gnome-vfs uses something called a
+"uri" gives people the wrong impression of what the library is
+designed for. It causes people to complain when it doesn't have some
+support for mailto: links, and it makes people want support for
+cookies, extra http headers and other things typically used by a web
+browser. This isn't really the kind of use that vfs is targeted at. A
+library specific to that sort of use would probably fit these apps much
+better. 
+
+Most gnome-vfs state is tied to the application that uses it, which I
+think is quite unexpected by the user. For instance, when you log into
+a network share in nautilus and then click on a file to open it, the
+opening application will have to re-connect and re-authenticate to
+the share, much to the user's surprise. I really think most people
+expect a login like that to be somehow session global. We do sometimes
+misuse gnome-keyring to "solve" the authentication issue, but even
+then we still have multiple connections to the network share, which
+can cause problems, for instance with ftp shares that use round-robin
+dns where the mirrors aren't fully synced up. Again, some backends
+(smb) are now in the daemon which solves this issue.
+
+gnome_vfs_xfer() is possibly the worst API call in the whole gnome
+platform. It's a single, buggy, do-it-all function with shitloads of
+combinations of flags and arguments to do all sorts of things, with
+little or no semantic specifications or testcases. It's also to a large
+extent unnecessary for most applications and could easily be part of
+the file manager instead of a generic library. I'm also not sure that
+the "first do preflight calculation, then execute operation" model it
+uses is right. It is inherently racy, since the target or source could
+easily change during the preflight, and it makes error reporting and
+handling much more complicated. 
+
+The behaviour of symlink resolution in the UI has been discussed many
+times. Should clicking on a symlink "foo" in $dir go to $dir/foo or to
+the target directory? The Nautilus maintainers have decided that the
+best way to approach this is to have symlinks be used for "filesystem
+implementation" (like a symlink for /home -> /mnt/hdb2) and thus not
+be resolved on activation. However, we should (this hasn't been
+finished yet) support a different form of links (called "shortcuts" in
+the UI) that always resolve on activation. At the moment there is no
+support for anything like that in gnome-vfs, so we abuse desktop files
+for this. We even generate virtual in-memory desktop files in the smb
+backend to get this behaviour. Proper support for shortcuts in the
+vfs API would let apps automatically work without ugly desktop file
+hacks.
+
+Over the years gnome-vfs has accumulated a lot of cruft. It links to a
+lot of libraries, including openssl, gconf+ORBit2, avahi, dbus, popt,
+libxml, kerberos, libz and libresolv. Very few applications need all
+of these, yet every application that uses gnome-vfs links to all of
+them. Furthermore, some of the functionality in gnome-vfs, like the
+wrapper for dns-sd, resolving, network utilities, ls parsing
+functions, ssl support, pty handling are perhaps not best suited for a
+vfs library, nor do they always have great apis and quality
+implementations. We could definitely clean this up and minimize the
+APIs.
+
+At some point in time gnome_vfs_uri_is_local() started detecting and
+returning TRUE for NFS mounts and other types of local network
+mounts. This is both slow and unexpected, and has led to problems and
+unnecessary changes in many places. 
+
+The way the cancellation API for asynchronous operations is set up
+creates races and fragile code. The main issue is that if you call
+cancel before the operation callback has been called the callback will
+not be called. However, the callback typically wants to free some sort
+of user_data object passed to it, so that has to be handled also when
+you call cancellation. Couple this with the fact that there are no
+destroy notifies and you can't cancel after the operation callback has
+been called and you get an extremely tricky setup of combined
+callbacks. Furthermore, if threads are used there are some inherent
+races wrt detecting if the callback has been called when cancel is
+called, making it essentially impossible to get this right.
+
+There are also a bunch of issues with the current gnome-vfs that could
+technically be fixed, like the lack of support for hidden file flags
+and backend-extensible metadata, no standard vfs dialogs like progress
+bars, etc.
+
+Last week I started thinking about a new design for a gnome-vfs
+replacement that would solve most of these issues, and at the same
+time give a correct ordering of the platform stack. I've come up with
+a highlevel architecture that I think will work, even though I haven't
+yet finished it in detail or gotten the API totally worked out. It's
+somewhat of a radical departure from gnome-vfs as it is today, so
+bear with me as I try to explain the model and the ideas behind it.
+
+The gnome-vfs model is what I would call stateless. You can at any
+time throw a URI at it and it will do everything required to access
+the location. There is no need to, nor is there a way to set up
+anything like a "session" with a remote share. Of course, in practice
+this is not the way network shares work, so all sorts of session
+initiation, caching and other magic happens under the covers to make
+it look stateless. This is the source of all the problems with the
+gnome-vfs authentication model.
+
+I'd like to propose using a stateful model, where you have to
+explicitly initiate a session ("mount" a share) before you can start
+accessing files. This will give a well specified time when all forms
+of authentication will happen, when applications expect it and when
+they can use a more expressive and suitable API for this kind of
+operation. The actual i/o operations will then never cause any sort of
+authentication issues, and can thus be purely non-graphical
+(i.e. glib-only apps can do i/o). I imagine all/most actual mounting
+of shares will happen in the file manager and the file selector, or at
+gnome-session startup, so applications don't really need to handle
+this themselves.
+
+Not only is the model stateful; I'd also like all state to be session
+global. That is, all mounts and network connections are shared between
+all applications in the session. So, if you pass a file reference from
+one app to another there is no need to log in again or anything like
+that. I think this is what users expect.
+
+Having a global stateful model means all non-local vfs accesses go
+through the vfs daemon. This works pretty well with the smb backend in
+the current gnome-vfs, and smb is the backend most likely to have high
+bandwidth traffic, so this doesn't seem to be a large performance
+problem. Although we do have to take the performance aspect into
+consideration when designing the daemon.
+
+In order to avoid all the problems with threading described above the
+vfs daemon will not use threads. In fact, I think the best approach is
+to let each active mountpoint be its own process. That way we get
+robustness (one mount can't crash the others) and simplify the backend
+creation greatly (each backend fully controls its context). It also
+will let us do concurrent access to e.g. two smb shares (like a copy
+from one to the other). We can't really do this atm since the thread
+lock in the smb backend serializes such access. But with two smb
+processes this is not a problem.
+
+There might be an issue with using separate processes for the
+mountpoints bloating up the desktop, but I don't think that it will be
+much of a problem. None of these processes will use the gui libraries
+that are the real sources of unshared dirty memory use. I tried a
+simple process that just used gobject and ran a mainloop. It only used
+78k of dirty memory. Also, each server need only link to and
+initialize the few libraries it needs, further keeping memory use down
+and avoiding bloat in all applications (e.g. apps need not link to
+openssl). 
+
+As a consequence of the stateful model we don't need the stateless
+properties that URIs have as identifiers. To avoid all the problems
+coming from the use of URIs we use a much simpler form of
+identifier. Namely filenames, in a hierarchical tree with
+mountpoints. These filenames are from an extended set of strings that
+includes the set of normal filenames, but also includes some platform
+dependent extensions. On win32 the full set might be some form of
+stringified version of the ITEMIDLIST from the windows shell api, and
+on unix we would use some out of band prefix to mark a non-local
+filename.  
+
+For example, we could use "//.network/" as a prefix for the vfs
+filename namespace. A smb share might then be accessed as
+"//.network/smb/computer:share/dir/file.txt", or a ftp share as
+"//.network/ftp/alex@ftp.gnome.org/dir/file.txt". With a setup like
+"//.network/$method/$mount_object/" it would be quite easy to find the
+process handling the mount. Just ask for a dbus named object like
+"org.glib.gvsf.smb.computer:share". It is also very easy to detect
+local filenames and short-circuit to in-process i/o.
+
+These filenames would be the real identifier for the files, and as
+such not really presentable to the user as is. You'd need to ask for
+the display name via the vfs to get a user readable utf8-encoded
+string for display. 
+
+The set of operations on files and folders would be both simplified
+and extended. We'd remove complicated things like read+write access to
+a file, and give less posix-like guarantees. We also make seek and
+truncate support optional in the backend. But then we will extend the
+set of operations possible to allow things like copy on the remote
+side (to avoid a download+upload operation on copy) and to have 
+a set of highlevel operations that applications want, like "save" that
+implements the best way to save for each particular backend.
+
+We support metadata like display name, mimetypes, icon, and some
+general information like length and mtime. But we make support for
+getting the full "struct stat" buffer backend optional, as that isn't
+a good abstraction for most backends. Also, the API will be designed
+on the idea that network latency is expensive, so that there will only
+be one call to stat() or readdir() needed to read all the metadata
+requested by the application. (Whereas posix will have readdir return
+only the names and force you to stat each file in a separate
+roundtrip.)
+
+We likely don't want the full gnome/unix vfs implementation in
+glib, instead glib will only ship an implementation of the vfs API for
+local file access, and one that communicates to the vfs
+daemon(s). Then we ship the daemon and the implementations of the
+various backends externally.
+
+We will also write a single gnome-vfs backend that allows access to
+all the glib vfs shares by using a uri like gvfs:///XXX that just maps
+to //.network/XXX. We can also implement a similar backend for kio so
+that kde applications can read and write to the shares.
+
+Furthermore, if FUSE is supported on the system we can write a FUSE
+filesystem so that we can access the files as $HOME/.network/XXX. This
+can be made extra nice for an application (e.g. acrobat) that uses
+the gtk+ file selector but not the vfs: the file selector can
+detect a filename like this, reverse-map it into a vfs pathname,
+and use the vfs for folder access.
+
+I've been doing some initial sketching of the glib API, and I've
+started by introducing base GInputStream and GOutputStream similar to
+the stream objects in Java and .NET. These have async i/o support and
+will make the API for reading and writing files nicer and more
+modern. There is also a GSeekable interface that streams can
+optionally implement if they support seeking.
+
+I've also introduced a GFile object that wraps a file path. This means
+you don't have to do tedious string operations on the pathnames to
+navigate the filesystem. It also means we can use the openat() class
+of file operations when traversing the filesystem tree, avoiding some
+forms of races when we do things like recursive copies.
+
+To support the stateful model and still have some form of caching we
+will also need to add some cache-specific API so that you can trigger
+a reload of information from a directory. Otherwise a reload operation
+in the file manager wouldn't always get the latest state on something
+like an ftp share where we cache things aggressively.
+
+I have some initial code here for some of the basic APIs, but it's far
+from finished and I'd like to spend some more time working on it
+before I present it. However, I think the general architecture is
+pretty sound and in a state where it can be discussed.
+
+Hopefully this description of my plans is enough to make people
+understand some of my ideas and allow us to start a discussion about
+the future of gnome-vfs. Also, consider it a heads up that I and other
+people will likely be working on this in the future.
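The notes above make seek support optional in the backend and mention a
GSeekable interface that streams implement only if they can seek. A minimal
C sketch of that idea follows. It models the optional capability as a
function pointer that may be NULL. The names Stream, stream_from_memory,
stream_can_seek and stream_seek are hypothetical, chosen for illustration
only; they are not part of glib and are not the planned API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct Stream Stream;

/* Hypothetical stream type: read is required, seek is optional. */
struct Stream {
    /* Copy up to n bytes into buf; return the count (0 at end of data). */
    size_t (*read)(Stream *self, void *buf, size_t n);
    /* NULL when the backend cannot seek. */
    int (*seek)(Stream *self, size_t offset);
    const char *data;
    size_t len, pos;
};

static size_t mem_read(Stream *s, void *buf, size_t n)
{
    size_t left = s->len - s->pos;
    if (n > left)
        n = left;
    memcpy(buf, s->data + s->pos, n);
    s->pos += n;
    return n;
}

static int mem_seek(Stream *s, size_t offset)
{
    if (offset > s->len)
        return -1;
    s->pos = offset;
    return 0;
}

/* Build a memory-backed stream; seekable == 0 simulates a backend
 * (for example a network pipe) that cannot seek. */
Stream stream_from_memory(const char *data, size_t len, int seekable)
{
    Stream s = { mem_read, seekable ? mem_seek : NULL, data, len, 0 };
    assert(data != NULL);
    return s;
}

/* Callers query the capability instead of assuming it. */
int stream_can_seek(const Stream *s)
{
    return s->seek != NULL;
}

/* Returns 0 on success, -1 if seeking fails or is unsupported. */
int stream_seek(Stream *s, size_t offset)
{
    return s->seek ? s->seek(s, offset) : -1;
}
```

An application would call stream_can_seek() first and fall back to
re-opening and re-reading the stream when it returns 0, rather than relying
on every backend behaving like a local file.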
diff --git a/trunk/txt/vfs-names.txt b/trunk/txt/vfs-names.txt
new file mode 100644
index 00000000..25833d7c
--- /dev/null
+++ b/trunk/txt/vfs-names.txt
@@ -0,0 +1,142 @@
+Local filenames (in utf8 mode)
+1) standard: /etc/passwd
+2) utf8 and spaces: "/tmp/a åäö.txt" (encoding==utf8)
+3) latin-1 and spaces: "/tmp/a åäö.txt" (encoding==iso8859-1)
+4) filename without encoding: "/tmp/bad:\001\010\011\012\013" (as a C string)
+5) mountpoint: /mnt/cdrom (cd has title "CD Title")
+
+Ftp mount to ftp.gnome.org
+(where filenames are stored as utf8, this is detected by using
+ ftp protocol extensions (there is an rfc) or by having the user
+ specify the encoding at mount time)
+
+6) normal dir: /pub/sources
+7) valid utf8 name: /dir/a åäö.txt
+8) latin-1 name: /dir/a åäö.txt
+
+Ftp mount to ftp.gnome.org (with filenames in latin-1)
+9) latin-1 name: /dir/a åäö.txt
+
+backend that stores display name separate from real name. Examples
+could be a flickr backend, a file backend that handles desktop files,
+or a virtual location like computer:// (which is implemented using
+virtual desktop files atm).
+
+10) /tmp/foo.desktop (with Name[en]="Display Name")
+
+special cases:
+ftp names relative to login dir
+
+Places where display filenames (i.e. UTF-8 strings) are used:
+
+A) Absolute filename, for editing (nautilus text entry, file selector entry)
+B) Semi-Absolute filename, for display (nautilus window title)
+C) Relative file name, for display (in nautilus/file selector icon/list view)
+D) Relative file name, for editing (rename in nautilus)
+E) Relative file name, for creating absolute name (filename completion for A)
+   This needs to know the exact form of the parent (i.e. it differs for filename vs uri).
+   I won't list this below as it's always the same as A from the last slash to the end.
+
+This is how these work with gnome-vfs uris:
+
+   A                                                     B                             C                             D        
+1) file:///etc/passwd                                    passwd                        passwd                        passwd   
+2) file:///tmp/a%20%C3%A5%C3%A4%C3%B6.txt                a åäö.txt                     a åäö.txt                     a åäö.txt
+3) file:///tmp/a%20%E5%E4%F6.txt                         a ???.txt                     a ???.txt (invalid unicode)   a ???.txt
+4) file:///tmp/bad%3A%01%08%09%0A%0B                     bad:?????                     bad:????? (invalid unicode)   bad:?????
+5) file:///mnt/cdrom                                     CD Title (cdrom)              CD Title (cdrom)              CD Title
+6) ftp://ftp.gnome.org/pub/sources                       sources on ftp.gnome.org      sources                       sources
+7) ftp://ftp.gnome.org/dir/a%20%C3%A5%C3%A4%C3%B6.txt    a åäö.txt on ftp.gnome.org    a åäö.txt                     a åäö.txt
+8) ftp://ftp.gnome.org/dir/a%20%E5%E4%F6.txt             a ???.txt on ftp.gnome.org    a ???.txt (invalid unicode)   a ???.txt
+9) ftp://ftp.gnome.org/dir/a%20%E5%E4%F6.txt             a åäö.txt on ftp.gnome.org    a åäö.txt                     a åäö.txt
+10)file:///tmp/foo.desktop                               Display Name                  Display Name                  Display Name
+
+The stuff in column A is pretty insane. It works fine as an identifier
+for the computer to use, but nobody would want to have to type that in
+or look at that all the time. That is why Nautilus also allows
+entering some filenames as absolute unix pathnames, although not all
+filenames can be specified this way. If this is used whenever possible,
+column A looks like this:
+
+   A
+1) /etc/passwd
+2) /tmp/a åäö.txt
+3) file:///tmp/a%20%E5%E4%F6.txt
+4) file:///tmp/bad%3A%01%08%09%0A%0B
+5) /mnt/cdrom
+6) ftp://ftp.gnome.org/pub/sources
+7) ftp://ftp.gnome.org/dir/a%20%C3%A5%C3%A4%C3%B6.txt
+8) ftp://ftp.gnome.org/dir/a%20%E5%E4%F6.txt
+9) ftp://ftp.gnome.org/dir/a%20%E5%E4%F6.txt
+10)/tmp/foo.desktop
+
+As we see, this helps for most normal local paths, but it becomes
+problematic when the filenames are in the wrong encoding. For
+non-local files it doesn't help at all. We still have to look at these
+horrible escapes, even when we know the encoding of the filename.
+
+Examples 7-9 in this version show the problem with URIs. Suppose
+we allowed an invalid URI like "ftp://ftp.gnome.org/dir/a åäö.txt"
+(utf8-encoded string). Given the state inherent in the mountpoint we
+know what encoding is used for the ftp server, so if someone types it
+in we know which file they mean. However, suppose someone pastes a URI
+like that into firefox, or mails it to someone; now we can't
+reconstruct the real valid URI anymore. If you drag and drop it
+however, the code can send the real valid uri so that firefox can load
+it correctly.
+
+So, this introduces two kinds of URIs that are "mostly similar" but
+break in many nonobvious cases. This is very unfortunate, and imho not
+acceptable. I think it's ok to accept a URI typed in like
+"ftp://ftp.gnome.org/dir/a åäö.txt" and convert it to the right uri,
+but it's not right to display such a uri in the nautilus location bar,
+as that can result in that invalid uri getting into other places.
+
+Since I dislike showing invalid URIs in the UI I think it makes sense
+to create a new absolute pathname display and entry format. Ideally
+such a system should allow any ascii or utf8 local filename to be
+represented as itself. Furthermore it would allow input of URIs, but
+immediately convert them to the display format (similar to how
+inputting a file:// uri in nautilus displays as a normal filename).
+
+One solution would be to use some other prefix than / for
+non-local files, and to use some form of escaping only for non-utf8
+chars and non-printables. Here is an example:
+
+   A
+1) /etc/passwd
+2) /tmp/a åäö.txt
+3) /tmp/a \xE5\xE4\xF6.txt
+4) /tmp/bad:\x01\x08\x09\x0A\x0B
+5) /mnt/cdrom
+6) :ftp:ftp.gnome.org/pub/sources
+7) :ftp:ftp.gnome.org/dir/a åäö.txt
+8) :ftp:ftp.gnome.org/dir/a \xE5\xE4\xF6.txt
+9) :ftp:ftp.gnome.org/dir/a åäö.txt
+10)/tmp/foo.desktop
+
+Under the hood this would use proper, valid escaped URIs. However, we
+would display things in the UI that made some sense to users, only
+falling back to escaping in the last possible case.
+
+The API could look something like:
+
+GFile *g_file_new_from_filename (char *filename);
+GFile *g_file_new_from_uri (char *uri);
+GFile *g_file_parse_display_name (char *display_name);
+
+Another approach (mentioned by Jürg Billeter on irc yesterday) is to
+move from a pure textual representation of the full uri to a more
+structured UI. For example the ftp://ftp.gnome.org/ part of the URI
+could be converted to a single item in the entry looking like
+[#ftp.gnome.org] (where # is an ftp icon). Then the rest of the entry
+would edit just the path on the ftp server, as a local filename. The
+disadvantage here is that it's a bit harder to know how to type in a
+full pathname including what method to use and what server (you'd type
+in a URI). This isn't necessarily a huge problem if you rarely type in
+remote URIs (instead you can follow links, browse the network, add
+favourites, etc).
+
+I don't know how hard this is to do from a Gtk+ perspective
+though. It's somewhat similar to what the evolution address entry does.
+
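The escaped display format proposed in the notes above (valid UTF-8 shown
as itself, with \xNN escapes only for non-UTF-8 bytes and non-printables)
could be produced along the following lines. This is a minimal sketch under
stated assumptions, not the gvfs implementation: display_name_from_bytes
and utf8_seq_len are hypothetical names, only C0 control characters and DEL
are treated as non-printable, and escaping a literal backslash as \x5C is
an assumption made here to keep the output unambiguous (the notes do not
say how that case should be handled).

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Length (1-4) of the valid UTF-8 sequence starting at s, or 0 if the
 * bytes at s do not form one. n is the number of bytes available. */
static size_t utf8_seq_len(const unsigned char *s, size_t n)
{
    size_t len, i;
    unsigned long cp;

    if (s[0] < 0x80)
        return 1;
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; cp = s[0] & 0x07; }
    else
        return 0;               /* stray continuation or invalid lead byte */
    if (len > n)
        return 0;               /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates and out-of-range code points. */
    if ((len == 2 && cp < 0x80) || (len == 3 && cp < 0x800) ||
        (len == 4 && cp < 0x10000) || cp > 0x10FFFF ||
        (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    return len;
}

/* Convert raw filename bytes to the display form. Returns a newly
 * allocated string that the caller frees, or NULL if out of memory. */
char *display_name_from_bytes(const char *raw)
{
    const unsigned char *s = (const unsigned char *)raw;
    size_t n, i = 0, o = 0;
    char *out;

    assert(raw != NULL);
    n = strlen(raw);
    out = malloc(n * 4 + 1);    /* worst case: every byte becomes \xNN */
    if (out == NULL)
        return NULL;
    while (i < n) {
        size_t len = utf8_seq_len(s + i, n - i);
        int printable = len > 1 ||
                        (len == 1 && s[i] >= 0x20 && s[i] != 0x7F);
        if (printable && s[i] != '\\') {
            memcpy(out + o, s + i, len);    /* copy the character as-is */
            o += len;
            i += len;
        } else {
            /* Escape exactly one byte and resynchronise on the next. */
            o += (size_t)sprintf(out + o, "\\x%02X", (unsigned)s[i]);
            i++;
        }
    }
    out[o] = '\0';
    return out;
}
```

With this sketch the Latin-1 bytes of example 3 come out as
/tmp/a \xE5\xE4\xF6.txt, while the UTF-8 name of example 2 passes through
unchanged.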