===============
html5lib Parser
===============

`html5lib`_ is a Python package that implements the HTML5 parsing algorithm,
which is heavily influenced by current browsers and based on the `WHATWG
HTML5 specification`_.

.. _html5lib: http://code.google.com/p/html5lib/
.. _BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/
.. _WHATWG HTML5 specification: http://www.whatwg.org/specs/web-apps/current-work/

lxml can benefit from the parsing capabilities of ``html5lib`` through
the ``lxml.html.html5parser`` module.  It provides an interface similar
to that of the ``lxml.html`` module: the functions ``fromstring()``,
``parse()``, ``document_fromstring()``, ``fragment_fromstring()`` and
``fragments_fromstring()`` work like the regular HTML parsing
functions.


Differences to regular HTML parsing
===================================

There are a few differences between the tree returned by this module and
the one returned by the regular HTML parsing functions in ``lxml.html``.
html5lib normalizes some elements and element structures to a common
format.  For example, even if a table does not have a ``tbody``, html5lib
will inject one automatically:

.. sourcecode:: pycon

    >>> from lxml.html import tostring, html5parser
    >>> tostring(html5parser.fromstring("<table><td>foo"))
    '<table><tbody><tr><td>foo</td></tr></tbody></table>'

The parameters that the functions accept also differ.
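
The ``tbody`` injection described above can also be checked programmatically.
The following is a minimal sketch, assuming that lxml and the optional
html5lib package are both installed.  ``local_names()`` is a hypothetical
helper, not part of lxml; it strips any namespace from the tag names so that
the check does not depend on how the parser is configured.

```python
try:
    # html5parser imports html5lib at module level, so this import fails
    # when that optional dependency is missing.
    from lxml import etree
    from lxml.html import html5parser
except ImportError:
    etree = html5parser = None


def local_names(root):
    """Hypothetical helper: local tag names of all elements, in document order."""
    return [etree.QName(el.tag).localname
            for el in root.iter()
            if isinstance(el.tag, str)]  # skip comments and processing instructions


if html5parser is not None:
    table = html5parser.fromstring("<table><td>foo")
    # The input contains neither <tbody> nor <tr>; the parsed tree has both.
    print(local_names(table))
```

Comparing local names rather than serialized output keeps the check
independent of namespace prefixes and of the string type that ``tostring()``
returns.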


Function Reference
==================

``parse(filename_url_or_file)``:
    Parses the named file or url, or if the object has a ``.read()``
    method, parses from that.

``document_fromstring(html, guess_charset=True)``:
    Parses a document from the given string.  This always creates a
    correct HTML document, which means the root element is ``<html>``,
    and there is a body and possibly a head.

    If a bytestring is passed and ``guess_charset`` is true the chardet
    library (if installed) will guess the charset if ambiguities exist.

``fragment_fromstring(string, create_parent=False, guess_charset=False)``:
    Returns an HTML fragment from a string.  The fragment must contain
    just a single element, unless ``create_parent`` is given;
    e.g., ``fragment_fromstring(string, create_parent='div')`` will
    wrap the element in a ``<div>``.  If ``create_parent`` is ``True``, the
    default parent tag (``div``) is used.

    If a bytestring is passed and ``guess_charset`` is true the chardet
    library (if installed) will guess the charset if ambiguities exist.

``fragments_fromstring(string, no_leading_text=False, guess_charset=False, parser=None)``:
    Returns a list of the elements found in the fragment.  The first item in
    the list may be a string (the leading text).  If ``no_leading_text`` is
    true, leading text is an error, so the result is always a list of
    elements only.

    If a bytestring is passed and ``guess_charset`` is true the chardet
    library (if installed) will guess the charset if ambiguities exist.

``fromstring(string)``:
    Returns ``document_fromstring`` or ``fragment_fromstring``, based
    on whether the string looks like a full document, or just a
    fragment.
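
As a rough illustration of how the functions in this reference relate to each
other, the sketch below calls three of them on the same small input.  It
assumes that lxml and html5lib are installed; ``summarize()`` is a
hypothetical helper introduced only for this example.

```python
try:
    from lxml import etree
    from lxml.html import html5parser  # requires the optional html5lib package
except ImportError:
    etree = html5parser = None


def summarize(markup):
    """Hypothetical helper: parse ``markup`` three ways and report the results."""
    # Always a full document, so the root element is <html>.
    doc = html5parser.document_fromstring(markup)
    # Several top-level nodes are accepted once a parent element is requested.
    wrapped = html5parser.fragment_fromstring(markup, create_parent="div")
    # A flat list; leading text, if any, is the first item (a string).
    parts = html5parser.fragments_fromstring(markup)
    has_lead = bool(parts) and isinstance(parts[0], str)
    return {
        "root": etree.QName(doc.tag).localname,
        "wrapper": wrapped.tag,
        "wrapped_children": len(wrapped),
        "leading_text": parts[0] if has_lead else None,
    }


if html5parser is not None:
    print(summarize("intro <b>bold</b><i>italic</i>"))
```

Without ``create_parent``, the same input would be rejected by
``fragment_fromstring()``, because it contains more than a single element.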

Additionally, all parsing functions accept a ``parser`` keyword argument
that can be set to a custom parser instance.  To create custom parsers,
you can subclass ``HTMLParser`` and ``XHTMLParser`` from the same
module.  Note that these classes derive from the parser classes provided
by html5lib.
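
A custom parser might be wired in as follows.  This is a sketch under stated
assumptions: lxml and html5lib are installed, ``QuietParser`` and
``parse_and_report()`` are hypothetical names, and the ``errors`` list read at
the end is assumed to be an attribute of the underlying html5lib parser rather
than something this module documents.

```python
try:
    from lxml import etree
    from lxml.html import html5parser  # requires the optional html5lib package
except ImportError:
    etree = html5parser = None


def make_parser():
    """Build an instance of a hypothetical HTMLParser subclass."""

    class QuietParser(html5parser.HTMLParser):
        # Non-strict mode: parse errors are recorded instead of raised.
        def __init__(self):
            super().__init__(strict=False)

    return QuietParser()


def parse_and_report(markup):
    """Parse with the custom parser; return the root tag and the error count."""
    parser = make_parser()
    root = html5parser.document_fromstring(markup, parser=parser)
    # ``errors`` is assumed to be html5lib's list of recorded parse errors;
    # getattr() keeps the sketch working if that attribute is absent.
    return etree.QName(root.tag).localname, len(getattr(parser, "errors", []))


if html5parser is not None:
    print(parse_and_report("<p>no doctype here"))
```

The subclass is defined inside a function only so that the example still
imports cleanly when html5lib is missing; in real code it would normally live
at module level.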