[svn r2835] cleanup in parser code, ET-compatible target parser interface (SAX-like), tutorial section on parsing

--HG-- branch : trunk
author: scoder <none@none> 2007-09-11 21:55:37 +0200
committer: scoder <none@none> 2007-09-11 21:55:37 +0200
commit: d4138303e5b3cff342baae4f0ab061909d2f28c3 (patch)
tree: 42b60be68252c2728ce67fe06a5c281389480b34 /doc
parent: debae622334df649add61cb213d3588aa6306782 (diff)
download: python-lxml-d4138303e5b3cff342baae4f0ab061909d2f28c3.tar.gz
1 files changed, 215 insertions, 14 deletions
diff --git a/doc/tutorial.txt b/doc/tutorial.txt
index d1f97a6d..e87a3217 100644
--- a/doc/tutorial.txt
+++ b/doc/tutorial.txt
@@ -6,24 +6,32 @@ The lxml.etree Tutorial
   Stefan Behnel
 
 This tutorial briefly overviews the main concepts of the `ElementTree API`_ as
-implemented by lxml.etree, and some simple enhancements that make your life as
-a programmer easier.
+implemented by ``lxml.etree``, and some simple enhancements that make your
+life as a programmer easier.
 
 .. _`ElementTree API`: http://effbot.org/zone/element-index.htm#documentation
 
 .. contents::
 .. 
-   1  Elements and ElementTrees
-     1.1  The Element class
-     1.2  The ElementTree class
-   2  Parsing and XML literals
-     2.1  The XML() function
-     2.2  The parse() function
-   3  Namespaces
-   4  The find*() methods
-     4.1  findall()
-     4.2  find()
-     4.3  findtext()
+   1  The Element class
+     1.1  Elements are lists
+     1.2  Elements carry attributes
+     1.3  Elements contain text
+     1.4  Tree iteration
+   2  The ElementTree class
+   3  Parsing from strings and files
+     3.1  The fromstring() function
+     3.2  The XML() function
+     3.3  The parse() function
+     3.4  Parser objects
+     3.5  Incremental parsing
+     3.6  Event-driven parsing
+   4  Namespaces
+   5  The E-factory
+   6  ElementPath
+     6.1  findall()
+     6.2  find()
+     6.3  findtext()
 
 
 A common way to import ``lxml.etree`` is as follows::
@@ -380,15 +388,208 @@ upcoming lxml 2.0.  Before, both would serialise without DTD content, which
 made lxml loose DTD information in an input-output cycle.
 
 
-Parsing files and XML literals
+Parsing from strings and files
 ==============================
 
+``lxml.etree`` supports parsing XML in a number of ways and from all important
+sources, namely strings, files and file-like objects.  The main parse
+functions are ``fromstring()`` and ``parse()``, both called with the source as
+first argument.  By default, they use the standard parser, but you can always
+pass a different parser as second argument.
+
+
+The fromstring() function
+-------------------------
+
+The ``fromstring()`` function is the easiest way to parse a string::
+
+    >>> some_xml_data = "<root>data</root>"
+
+    >>> root = etree.fromstring(some_xml_data)
+    >>> print root.tag
+    root
+    >>> print etree.tostring(root)
+    <root>data</root>
+
+
 The XML() function
 ------------------
 
+The ``XML()`` function behaves like the ``fromstring()`` function, but is
+commonly used to write XML literals right into the source::
+
+    >>> root = etree.XML("<root>data</root>")
+    >>> print root.tag
+    root
+    >>> print etree.tostring(root)
+    <root>data</root>
+
+
 The parse() function
 --------------------
 
+The ``parse()`` function is used to parse from files and file-like objects::
+
+    >>> some_file_like = StringIO("<root>data</root>")
+
+    >>> tree = etree.parse(some_file_like)
+
+    >>> print etree.tostring(tree)
+    <root>data</root>
+
+Note that ``parse()`` returns an ElementTree object, not an Element object as
+the string parser functions do::
+
+    >>> root = tree.getroot()
+    >>> print root.tag
+    root
+    >>> print etree.tostring(root)
+    <root>data</root>
+
+
+Parser objects
+--------------
+
+By default, ``lxml.etree`` uses a standard parser with a default setup.  If
+you want to configure the parser, you can create a new instance::
+
+    >>> parser = etree.XMLParser(remove_blank_text=True) # lxml.etree only!
+
+This creates a parser that removes empty text between tags while parsing,
+which can reduce the size of the tree and avoid dangling tail text if you know
+that whitespace-only content is not meaningful for your data.  An example::
+
+    >>> root = etree.XML("<root>  <a/>   <b>  </b>     </root>", parser)
+
+    >>> print etree.tostring(root)
+    <root><a/><b>  </b></root>
+
+Note that the whitespace content inside the ``<b>`` tag was not removed, as
+content at leaf elements tends to be data content (even if blank).  You can
+easily remove it in an additional step by traversing the tree::
+
+    >>> for element in root.getiterator("*"):
+    ...     if element.text is not None and not element.text.strip():
+    ...         element.text = None
+
+    >>> print etree.tostring(root)
+    <root><a/><b/></root>
+
+See ``help(etree.XMLParser)`` to find out about the available parser options.
+
+
+Incremental parsing
+-------------------
+
+``lxml.etree`` provides two ways for incremental step-by-step parsing.  One is
+through file-like objects, where it calls the ``read()`` method repeatedly.
+This is best used where the data arrives from a source like ``urllib`` or any
+other file-like object that can provide data on request.  Note that the parser
+will block and wait until data becomes available in this case::
+
+    >>> class DataSource:
+    ...     data = iter(["<roo", "t><", "a/", "><", "/root>"])
+    ...     def read(self, requested_size):
+    ...         try:
+    ...             return self.data.next()
+    ...         except StopIteration:
+    ...             return ""
+
+    >>> tree = etree.parse(DataSource())
+
+    >>> print etree.tostring(tree)
+    <root><a/></root>
+
+The second way is through a feed parser interface, given by the ``feed(data)``
+and ``close()`` methods::
+
+    >>> parser = etree.XMLParser()
+
+    >>> parser.feed("<roo")
+    >>> parser.feed("t><")
+    >>> parser.feed("a/")
+    >>> parser.feed("><")
+    >>> parser.feed("/root>")
+
+    >>> root = parser.close()
+
+    >>> print etree.tostring(root)
+    <root><a/></root>
+
+Here, you can interrupt the parsing process at any time and continue it later
+on with another call to the ``feed()`` method.  This comes in handy if you
+want to avoid blocking calls to the parser, e.g. in frameworks like Twisted,
+or whenever data comes in slowly or in chunks and you want to do other things
+while waiting for the next chunk.
+
+After calling ``close()``, you can reuse the parser by calling ``feed()`` again::
+
+    >>> parser.feed("<root/>")
+    >>> root = parser.close()
+    >>> print etree.tostring(root)
+    <root/>
+
+
+Event-driven parsing
+--------------------
+
+Sometimes, all you need from a document is a small fraction somewhere deep
+inside the tree, so parsing the whole tree into memory, traversing it and
+dropping it can be too much overhead.  ``lxml.etree`` supports this use case
+with two event-driven parser interfaces, one that generates parser events
+while building the tree (``iterparse``), and one that does not build the tree
+at all, and instead calls feedback methods on a target object in a SAX-like
+fashion.
+
+Here is a simple ``iterparse()`` example::
+
+    >>> some_file_like = StringIO("<root><a>data</a></root>")
+
+    >>> for event, element in etree.iterparse(some_file_like):
+    ...     print "%s, %4s, %s" % (event, element.tag, element.text)
+    end,    a, data
+    end, root, None
+
+By default, ``iterparse()`` only generates events when it is done parsing an
+element, but you can control this through the ``events`` keyword argument::
+
+    >>> some_file_like = StringIO("<root><a>data</a></root>")
+
+    >>> for event, element in etree.iterparse(some_file_like,
+    ...                                       events=("start", "end")):
+    ...     print "%5s, %4s, %s" % (event, element.tag, element.text)
+    start, root, None
+    start,    a, data
+      end,    a, data
+      end, root, None
+
+Note that the text, tail and children of an Element are not necessarily there
+yet when receiving the ``start`` event.  Only the ``end`` event guarantees
+that the Element has been parsed completely.  It also allows you to
+``clear()`` or modify the content of an Element to save memory.
+
+If memory is a real bottleneck, or if building the tree is not desired at all,
+the target parser interface of ``lxml.etree`` can be used.  It creates
+SAX-like events by calling the methods of a target object.  By implementing
+some or all of these methods, you can control which events are generated::
+
+    >>> class ParserTarget:
+    ...     events = []
+    ...     def start(self, tag, attrib):
+    ...         self.events.append(("start", tag, attrib))
+    ...     def close(self):
+    ...         return self.events
+
+    >>> parser = etree.XMLParser(target=ParserTarget())
+    >>> events = etree.fromstring('<root test="true"/>', parser)
+
+    >>> for event in events:
+    ...     print 'event: %s - tag: %s' % (event[0], event[1])
+    ...     for attr, value in event[2].iteritems():
+    ...         print ' * %s = %s' % (attr, value)
+    event: start - tag: root
+     * test = true
+
 
 Namespaces
 ==========