1 files changed, 304 insertions, 0 deletions
diff --git a/docs/url-parsing-with-wsgi.txt b/docs/url-parsing-with-wsgi.txt
new file mode 100644
index 0000000..856971f
--- /dev/null
+++ b/docs/url-parsing-with-wsgi.txt
@@ -0,0 +1,304 @@
+URL Parsing With WSGI And Paste
++++++++++++++++++++++++++++++++
+
+:author: Ian Bicking <ianb@colorstudy.com>
+:revision: $Rev$
+:date: $LastChangedDate$
+
+.. contents::
+
+Introduction and Audience
+=========================
+
+This document is intended for web framework authors and integrators,
+and people who want to understand the internal architecture of Paste.
+
+.. include:: include/contact.txt
+
+URL Parsing
+===========
+
+.. note::
+
+   Sometimes people use "URL", and sometimes "URI".  I think URLs are
+   a subset of URIs.  But in practice you'll almost never see URIs
+   that aren't URLs, and certainly not in Paste.  URIs that aren't
+   URLs are abstract Identifiers, that cannot necessarily be used to
+   Locate the resource.  This document is *all* about locating.
+
+Most generally, URL parsing is about taking a URL and determining what
+"resource" the URL refers to.  "Resource" is a rather vague term,
+intentionally.  It's really just a metaphor -- in reality there aren't
+any "resources" in HTTP; there are only requests and responses.
+
+In Paste, everything is about WSGI.  But that can seem too fancy.
+There are four core things involved: the *request* (personified in the
+WSGI environment), the *response* (personified inthe
+``start_response`` callback and the return iterator), the WSGI
+application, and the server that calls that application.  The
+application and request are objects, while the server and response are
+really more like actions than concrete objects.
+
+In this context, URL parsing is about mapping a URL to an
+*application* and a *request*.  The request actually gets modified as
+it moves through different parts of the system.  Two dictionary keys
+in particular relate to URLs -- ``SCRIPT_NAME`` and ``PATH_INFO`` --
+but any part of the environment can be modified as it is passed
+through the system.
+
+Dispatching
+===========
+
+.. note::
+
+   WSGI isn't object oriented?  Well, if you look at it, you'll notice
+   there's no objects except built-in types, so it shouldn't be a
+   surprise.  Additionally, the interface and promises of the objects
+   we do see are very minimal.  An application doesn't have any
+   interface except one method -- ``__call__`` -- and that method
+   *does* things, it doesn't give any other information.
+
+Because WSGI is action-oriented, rather than object-oriented, it's
+more important what we *do*.  "Finding" an application is probably an
+intermediate step, but "running" the application is our ultimate goal,
+and the only real judge of success.  An application that isn't run is
+useless to us, because it doesn't have any other useful methods.
+
+So what we're really doing is *dispatching* -- we're handing the
+request and responsibility for the response off to another object
+(another actor, really).  In the process we can actually retain some
+control -- we can capture and transform the response, and we can
+modify the request -- but that's not what the typical URL resolver will
+do.  
+
+Motivations
+===========
+
+The most obvious kind of URL parsing is finding a WSGI application.
+
+Typically when a framework first supports WSGI or is integrated into
+Paste, it is "monolithic" with respect to URLs.  That is, you define
+(in Paste, or maybe in Apache) a "root" URL, and everything under that
+goes into the framework.  What the framework does internally, Paste
+does not know -- it probably finds internal objects to dispatch to, 
+but the framework is opaque to Paste.  Not just to Paste, but to
+any code that isn't in that framework.
+
+That means that we can't mix code from multiple frameworks, or as
+easily share services, or use WSGI middleware that doesn't apply to
+the entire framework/application.
+
+An example of someplace we might want to use an "application" that
+isn't part of the framework would be uploading large files.  It's
+possible to keep track of upload progress, and report that back to the
+user, but no framework typically is capable of this.  This is usually
+because the POST request is completely read and parsed before it
+invokes any application code.
+
+This is resolvable in WSGI -- a WSGI application can provide its own
+code to read and parse the POST request, and simultaneously report
+progress (usually in a way that *another* WSGI application/request can
+read and report to the user on that progress).  This is an example
+where you want to allow "foreign" applications to be intermingled with
+framework application code.
+
+Finding Applications
+====================
+
+OK, enough theory.  How does a URL parser work?  Well, it is a WSGI
+application, and a WSGI server, in the typical "WSGI middleware"
+style.  Except that it determines which application it will serve
+for each request.
+
+Let's consider Paste's ``URLParser`` (in ``paste.urlparser``).  This
+class takes a directory name as its only required argument, and
+instances are WSGI applications.
+
+When a request comes in, the parser looks at ``PATH_INFO`` to see
+what's left to parse.  ``SCRIPT_NAME`` represents where we are *now*;
+it's the part of the URL that has been parsed.
+
+There's a couple special cases:
+
+The empty string:
+
+    URLParser serves directories.  When ``PATH_INFO`` is empty, that
+    means we got a request with no trailing ``/``, like say ``/blog``
+    If URLParser serves the ``blog`` directory, then this won't do --
+    the user is requesting the ``blog`` *page*.  We have to redirect
+    them to ``/blog/``.
+
+A single ``/``:
+
+    So, we got a trailing ``/``.  This means we need to serve the
+    "index" page.  In URLParser, this is some file named ``index``,
+    though that's really an implementation detail.  You could create
+    an index dynamically (like Apache's file listings), or whatever.
+
+Otherwise we get a string like ``/path...``.  Note that ``PATH_INFO``
+*must* start with a ``/``, or it must be empty.
+
+URLParser pulls off the first part of the path.  E.g., if
+``PATH_INFO`` is ``/blog/edit/285``, then the first part is ``blog``.
+It appends this to ``SCRIPT_NAME``, and strips it off ``PATH_INFO``
+(which becomes ``/edit/285``).
+
+It then searches for a file that matches "blog".  In URLParser, this
+means it looks for a filename which matches that name (ignoring the
+extension).  It then uses the type of that file (determined by
+extension) to create a WSGI application.
+
+One case is that the file is a directory.  In that case, the
+application is *another* URLParser instance, this time with the new
+directory.
+
+URLParser actually allows per-extension "plugins" -- these are just
+functions that get a filename, and produce a WSGI application.  One of
+these is ``make_py`` -- this function imports the module, and looks
+for special symbols; if it finds a symbol ``application``, it assumes
+this is a WSGI application that is ready to accept the request.  If it
+finds a symbol that matches the name of the module (e.g., ``edit``),
+then it assumes that is an application *factory*, meaning that when
+you call it with no arguments you get a WSGI application.
+
+Another function takes "unknown" files (files for which no better
+constructor exists) and creates an application that simply responds
+with the contents of that file (and the appropriate ``Content-Type``).
+
+In any case, ``URLParser`` delegates as soon as it can.  It doesn't
+parse the entire path -- it just finds the *next* application, which
+in turn may delegate to yet another application.
+
+Here's a very simple implementation of URLParser::
+
+    class URLParser(object):
+        def __init__(self, dir):
+            self.dir = dir
+        def __call__(self, environ, start_response):
+            segment = wsgilib.path_info_pop(environ)
+            if segment is None: # No trailing /
+                # do a redirect...
+            for filename in os.listdir(self.dir):
+                if os.path.splitext(filename)[0] == segment:
+                    return self.serve_application(
+                        environ, start_response, filename)
+            # do a 404 Not Found
+        def serve_application(self, environ, start_response, filename):
+            basename, ext = os.path.splitext(filename)
+            filename = os.path.join(self.dir, filename)
+            if os.path.isdir(filename):
+                return URLParser(filename)(environ, start_response)
+            elif ext == '.py':
+                module = import_module(filename)
+                if hasattr(module, 'application'):
+                    return module.application(environ, start_response)
+                elif hasattr(module, basename):
+                    return getattr(module, basename)(
+                        environ, start_response)
+            else:
+                return wsgilib.send_file(filename)
+
+Modifying The Request
+=====================
+
+Well, URLParser is one kind of parser.  But others are possible, and
+aren't too hard to write.
+
+Lets imagine a URL like ``/2004/05/01/edit``.  It's likely that
+``/2004/05/01`` doesn't point to anything on file, but is really more
+of a "variable" that gets passed to ``edit``.  So we can pull them off
+and put them somewhere.  This is a good place for a WSGI extension.
+Lets put them in ``environ["app.url_date"]``.
+
+We'll pass one other applications in -- once we get the date (if any)
+we need to pass the request onto an application that can actually
+handle it.  This "application" might be a URLParser or similar system
+(that figures out what ``/edit`` means).
+
+::
+
+    class GrabDate(object):
+        def __init__(self, subapp):
+            self.subapp = subapp
+        def __call__(self, environ, start_response):
+            date_parts = []
+            while len(date_parts) < 3:
+               first, rest = wsgilib.path_info_split(environ['PATH_INFO'])
+               try:
+                   date_parts.append(int(first))
+                   wsgilib.path_info_pop(environ)
+               except (ValueError, TypeError):
+	           break
+            environ['app.date_parts'] = date_parts
+            return self.subapp(environ, start_response)
+
+This is really like traditional "middleware", in that it sits between
+the server and just one application.
+
+Assuming you put this class in the ``myapp.grabdate`` module, you
+could install it by adding this to your configuration::
+
+    middleware.append('myapp.grabdate.GrabDate')
+
+Object Publishing
+=================
+
+Besides looking in the filesystem, "object publishing" is another
+popular way to do URL parsing.  This is pretty easy to implement as
+well -- it usually just means use ``getattr`` with the popped
+segments.  But we'll implement a rough approximation of `Quixote's
+<http://www.mems-exchange.org/software/quixote/>`_ URL parsing::
+
+    class ObjectApp(object):
+        def __init__(self, obj):
+            self.obj = obj
+        def __call__(self, environ, start_response):
+            next = wsgilib.path_info_pop(environ)
+            if next is None: 
+                # This is the object, lets serve it...
+                return self.publish(obj, environ, start_response)
+            next = next or '_q_index' # the default index method
+            if next in obj._q_export and getattr(obj, next, None):
+                return ObjectApp(getattr(obj, next))(
+                    environ, start_reponse)
+            next_obj = obj._q_traverse(next)
+            if not next_obj:
+                # Do a 404
+            return ObjectApp(next_obj)(environ, start_response)
+
+        def publish(self, obj, environ, start_response):
+            if callable(obj):
+                output = str(obj())
+            else:
+                output = str(obj)
+            start_response('200 OK', [('Content-type', 'text/html')])
+            return [output]
+
+The ``publish`` object is a little weak, and functions like
+``_q_traverse`` aren't passed interesting information about the
+request, but this is only a rough approximation of the framework.
+Things to note:
+
+* The object has standard attributes and methods -- ``_q_exports``
+  (attributes that are public to the web) and ``_q_traverse``
+  (a way of overriding the traversal without having an attribute for
+  each possible path segment).
+
+* The object isn't rendered until the path is completely consumed
+  (when ``next`` is ``None``).  This means ``_q_traverse`` has to
+  consume extra segments of the path.  In this version ``_q_traverse``
+  is only given the next piece of the path; Quixote gives it the
+  entire path (as a list of segments).  
+
+* ``publish`` is really a small and lame way to turn a Quixote object
+  into a WSGI application.  For any serious framework you'd want to do
+  a better job than what I do here.
+
+* It would be even better if you used something like `Adaptation
+  <http://www.python.org/peps/pep-0246.html>`_ to convert objects into
+  applications.  This would include removing the explicit creation of
+  new ``ObjectApp`` instances, which could also be a kind of fall-back
+  adaptation. 
+
+Anyway, this example is less complete, but maybe it will get you
+thinking.