===============
html5lib Parser
===============
`html5lib`_ is a Python package that implements the HTML5 parsing algorithm,
which is heavily influenced by current browsers and based on the `WHATWG
HTML5 specification`_.

.. _html5lib: http://code.google.com/p/html5lib/
.. _BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/
.. _WHATWG HTML5 specification: http://www.whatwg.org/specs/web-apps/current-work/

lxml can benefit from the parsing capabilities of html5lib through
the ``lxml.html.html5parser`` module. Its interface is similar to that
of the ``lxml.html`` module: it provides ``fromstring()``,
``parse()``, ``document_fromstring()``, ``fragment_fromstring()`` and
``fragments_fromstring()``, which work like the regular HTML parsing
functions.
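
As a rough illustration of the interface described above, the following
sketch parses a small document from a string with ``document_fromstring()``.
The input markup is invented for this example, and ``etree.QName`` is used
only to read tag names independently of any namespace the parser may attach
to them. It assumes that both lxml and html5lib are installed.

```python
from lxml import etree
from lxml.html import html5parser

# Made-up input: no <html>, <head> or <body> tags, and an unclosed <p>.
doc = html5parser.document_fromstring("<title>Demo</title><p>Hello, <b>world</b>!")

# QName(...).localname strips a namespace from the tag name, if one is present.
root_name = etree.QName(doc).localname
child_names = [etree.QName(child).localname for child in doc]

# Collect all text content of the parsed tree in document order.
text = "".join(doc.itertext())

print(root_name, child_names)
print(text)
```

The missing structural elements are filled in by the parser, so the root
element and its children can be inspected like any other lxml tree.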

Differences to regular HTML parsing
===================================

The returned tree differs in a few ways from what the regular HTML
parsing functions in ``lxml.html`` produce. html5lib normalizes some elements
and element structures to a common format. For example, even if a table
does not have a ``tbody``, html5lib will inject one automatically:

.. sourcecode:: pycon

    >>> from lxml.html import tostring, html5parser
    >>> tostring(html5parser.fromstring("<table><td>foo"))
    '<table><tbody><tr><td>foo</td></tr></tbody></table>'

Also, the parameters that the functions accept are different.

Function Reference
==================

``parse(filename_url_or_file)``:
    Parses the named file or URL, or, if the object has a ``.read()``
    method, parses from that.

``document_fromstring(html, guess_charset=True)``:
    Parses a document from the given string. This always creates a
    correct HTML document, which means the parent node is ``<html>``,
    and there is a body and possibly a head.

    If a bytestring is passed and ``guess_charset`` is true, the chardet
    library (if installed) will guess the charset if ambiguities exist.

``fragment_fromstring(string, create_parent=False, guess_charset=False)``:
    Returns an HTML fragment from a string. The fragment must contain
    just a single element, unless ``create_parent`` is given;
    e.g., ``fragment_fromstring(string, create_parent='div')`` will
    wrap the element in a ``<div>``. If ``create_parent`` is true, the
    default parent tag (div) is used.

    If a bytestring is passed and ``guess_charset`` is true, the chardet
    library (if installed) will guess the charset if ambiguities exist.

``fragments_fromstring(string, no_leading_text=False, parser=None)``:
    Returns a list of the elements found in the fragment. The first item in
    the list may be a string. If ``no_leading_text`` is true, then it will
    be an error if there is leading text, and the result will always be a
    list of only elements.

    If a bytestring is passed and ``guess_charset`` is true, the chardet
    library (if installed) will guess the charset if ambiguities exist.

``fromstring(string)``:
    Returns ``document_fromstring`` or ``fragment_fromstring``, based
    on whether the string looks like a full document, or just a
    fragment.
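
The fragment functions listed above can be tried out with a sketch along
the following lines. The input strings are made up for illustration, and the
code assumes that leading text which is not allowed is reported as
``lxml.etree.ParserError``. Tag names are again read through ``etree.QName``
so that the example does not depend on whether the parser namespaces them.

```python
from lxml import etree
from lxml.html import html5parser

# Mixed content: the leading text comes back as a plain string item.
parts = html5parser.fragments_fromstring("intro <b>bold</b> and <i>italic</i>")
leading = parts[0]
element_names = [etree.QName(el).localname for el in parts[1:]]

# The same kind of input is rejected once leading text is disallowed.
try:
    html5parser.fragments_fromstring("intro <b>bold</b>", no_leading_text=True)
    leading_text_rejected = False
except etree.ParserError:
    leading_text_rejected = True

# create_parent wraps several top-level elements in one new parent element.
wrapped = html5parser.fragment_fromstring(
    "<p>one</p><p>two</p>", create_parent="div")

print(repr(leading), element_names, leading_text_rejected)
print(wrapped.tag, len(wrapped))
```

Text that follows an element is not a separate list item; as usual in lxml,
it is stored in the ``.tail`` of the preceding element.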

Additionally, all parsing functions accept a ``parser`` keyword argument
that can be set to a custom parser instance. To create custom parsers,
you can subclass the ``HTMLParser`` and ``XHTMLParser`` classes from the
same module. Note that these are the parser classes provided by html5lib.
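
A minimal sketch of a custom parser might look like the following. The class
name ``PlainTagParser`` is hypothetical, and ``namespaceHTMLElements`` is a
constructor option of html5lib's own ``HTMLParser`` rather than something
described in this document, so treat its use here as an assumption about the
installed html5lib version.

```python
from lxml.html import html5parser


class PlainTagParser(html5parser.HTMLParser):
    """Hypothetical subclass that asks html5lib not to namespace HTML tags."""

    def __init__(self, **kwargs):
        # Assumption: html5lib's HTMLParser accepts namespaceHTMLElements.
        kwargs.setdefault("namespaceHTMLElements", False)
        super().__init__(**kwargs)


# Pass the custom parser instance through the ``parser`` keyword argument.
parser = PlainTagParser()
doc = html5parser.document_fromstring("<p>hi", parser=parser)

# With namespacing turned off, plain tag names work in path expressions.
paragraph = doc.find("body/p")
print(doc.tag, paragraph.text)
```

Subclassing keeps the configuration in one place, so the same parser setup
can be reused across calls to the different parsing functions.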