author    Leonard Richardson <leonard.richardson@canonical.com>  2011-08-16 09:13:05 -0400
committer Leonard Richardson <leonard.richardson@canonical.com>  2011-08-16 09:13:05 -0400
commit    03b1b76ff969651b1ec4071c0b2e749590461f7d (patch)
tree      a35b43951571c39da69ae8c1f8a38698bfe5075f
parent    ff9bd68fc0853d4b723a3edb1510005a90759bc3 (diff)
download  beautifulsoup4-03b1b76ff969651b1ec4071c0b2e749590461f7d.tar.gz
Moved documentation to README where tarball downloaders will see it.
-rw-r--r--  CHANGELOG   155
-rw-r--r--  README.txt  178
2 files changed, 172 insertions, 161 deletions
@@ -1,159 +1,6 @@
 = 4.0 =
-This is a nearly-complete rewrite that removes Beautiful Soup's custom
-HTML parser in favor of a system that lets you write a little glue
-code and plug in whatever HTML or XML parser you want.
-
-Beautiful Soup 4.0 comes with glue code for four parsers: Python's
-HTMLParser, lxml's HTML and XML parsers, and html5lib's HTML
-parser. HTMLParser is the default, but I recommend you install one of
-the other parsers, or you'll have problems handling real-world HTML.
-
-== The module name has changed ==
-
-Previously you imported the BeautifulSoup class from a module also
-called BeautifulSoup. To save keystrokes and make it clear which
-version of the API is in use, the module is now called 'bs4':
-
- >>> from bs4 import BeautifulSoup
-
-== It works with Python 3 ==
-
-Beautiful Soup 3.1.0 worked with Python 3, but the parser it used was
-so bad that it barely worked at all. Beautiful Soup 4 works with
-Python 3, and since its parser is pluggable, you don't sacrifice
-quality.
-
-Special thanks to Thomas Kluyver for getting Python 3 support to the
-finish line.
-
-== Better method names ==
-
-Methods and attributes have been renamed to comply with PEP 8. The old names
-still work. Here are the renames:
-
- * replaceWith -> replace_with
- * replaceWithChildren -> replace_with_children
- * findAll -> find_all
- * findAllNext -> find_all_next
- * findAllPrevious -> find_all_previous
- * findNext -> find_next
- * findNextSibling -> find_next_sibling
- * findNextSiblings -> find_next_siblings
- * findParent -> find_parent
- * findParents -> find_parents
- * findPrevious -> find_previous
- * findPreviousSibling -> find_previous_sibling
- * findPreviousSiblings -> find_previous_siblings
- * nextSibling -> next_sibling
- * previousSibling -> previous_sibling
-
-Methods have been renamed for compatibility with Python 3.
-
- * Tag.has_key() -> Tag.has_attr()
-
- (This was misleading, anyway, because has_key() looked at
- a tag's attributes and __in__ looked at a tag's contents.)
-
-Some attributes have also been renamed:
-
- * Tag.isSelfClosing -> Tag.is_empty_element
- * UnicodeDammit.unicode -> UnicodeDammit.unicode_markup
- * Tag.next -> Tag.next_element
- * Tag.previous -> Tag.previous_element
-
-So have some arguments to popular methods:
-
- * BeautifulSoup(parseOnlyThese=...) -> BeautifulSoup(parse_only=...)
- * BeautifulSoup(fromEncoding=...) -> BeautifulSoup(from_encoding=...)
-
-== Generators are now properties ==
-
-The generators have been given more sensible (and PEP 8-compliant)
-names, and turned into properties:
-
- * childGenerator() -> children
- * nextGenerator() -> next_elements
- * nextSiblingGenerator() -> next_siblings
- * previousGenerator() -> previous_elements
- * previousSiblingGenerator() -> previous_siblings
- * recursiveChildGenerator() -> recursive_children
- * parentGenerator() -> parents
-
-So instead of this:
-
- for parent in tag.parentGenerator():
-     ...
-
-You can write this:
-
- for parent in tag.parents:
-     ...
-
-(But the old code will still work.)
-
-== tag.string is recursive ==
-
-tag.string now operates recursively. If tag A contains a single tag B
-and nothing else, then A.string is the same as B.string. So:
-
-<a><b>foo</b></a>
-
-The value of a.string used to be None, and now it's "foo".
-
-== Empty-element tags ==
-
-Beautiful Soup's handling of empty-element tags (aka self-closing
-tags) has been improved, especially when parsing XML. Previously you
-had to explicitly specify a list of empty-element tags when parsing
-XML. You can still do that, but if you don't, Beautiful Soup now
-considers any empty tag to be an empty-element tag.
-
-The determination of empty-element-ness is now made at runtime rather
-than parse time. If you add a child to an empty-element tag, it stops
-being an empty-element tag.
-
-== Entities are always converted to Unicode ==
-
-An HTML or XML entity is always converted into the corresponding
-Unicode character. There are no longer any smartQuotesTo or
-convertEntities arguments. (Unicode, Dammit still has smart_quotes_to,
-but its default is now to turn smart quotes into Unicode.)
-
-== CDATA sections are normal text, if they're understood at all. ==
-
-Currently, the lxml and html5lib HTML parsers ignore CDATA sections in
-markup:
-
- <p><![CDATA[foo]]></p> => <p></p>
-
-A future version of html5lib will turn CDATA sections into text nodes,
-but only within tags like <svg> and <math>:
-
- <svg><![CDATA[foo]]></svg> => <svg>foo</svg>
-
-The default XML parser (which uses lxml behind the scenes) turns CDATA
-sections into ordinary text elements:
-
- <p><![CDATA[foo]]></p> => <p>foo</p>
-
-In theory it's possible to preserve the CDATA sections when using the
-XML parser, but I don't see how to get it to work in practice.
-
-== Miscellaneous other stuff ==
-
-If the BeautifulSoup instance has .is_xml set to True, an appropriate
-XML declaration will be emitted when the tree is transformed into a
-string:
-
- <?xml version="1.0" encoding="utf-8"?>
- <markup>
- ...
- </markup>
-
-The ['lxml', 'xml'] tree builder sets .is_xml to True; the other tree
-builders set it to False. If you want to parse XHTML with an HTML
-parser, you can set it manually.
+See README.TXT.
 
 = 3.2.0 =

@@ -1,10 +1,3 @@
-= About Beautiful Soup 4 =
-
-Earlier versions of Beautiful Soup included a custom HTML
-parser. Beautiful Soup 4 uses Python's default HTMLParser, which does
-fairly poorly on real-world HTML. By installing lxml or html5lib you
-can get more accurate parsing and possibly better performance as well.
-
 = Introduction =
 
 >>> from bs4 import BeautifulSoup
@@ -29,4 +22,175 @@ can get more accurate parsing and possibly better performance as well.
 >>> soup.i
 <i>HTML</i>
 
+ >>> soup = BeautifulSoup("<tag1>Some<tag2/>bad<tag3>XML", "xml")
+ >>> print soup.prettify()
+ <?xml version="1.0" encoding="utf-8"?>
+ <tag1>
+  Some
+  <tag2 />
+  bad
+  <tag3>
+   XML
+  </tag3>
+ </tag1>
+
+= About Beautiful Soup 4 =
+
+This is a nearly-complete rewrite that removes Beautiful Soup's custom
+HTML parser in favor of a system that lets you write a little glue
+code and plug in any HTML or XML parser you want.
+
+Beautiful Soup 4.0 comes with glue code for four parsers:
+
+ * Python's standard HTMLParser
+ * lxml's HTML and XML parsers
+ * html5lib's HTML parser
+
+HTMLParser is the default, but I recommend you install one of the
+other parsers, or you'll have problems handling real-world markup.
+
+== The module name has changed ==
+
+Previously you imported the BeautifulSoup class from a module also
+called BeautifulSoup. To save keystrokes and make it clear which
+version of the API is in use, the module is now called 'bs4':
+
+ >>> from bs4 import BeautifulSoup
+
+== It works with Python 3 ==
+
+Beautiful Soup 3.1.0 worked with Python 3, but the parser it used was
+so bad that it barely worked at all. Beautiful Soup 4 works with
+Python 3, and since its parser is pluggable, you don't sacrifice
+quality.
+
+Special thanks to Thomas Kluyver for getting Python 3 support to the
+finish line.
+
+== Better method names ==
+
+Methods and attributes have been renamed to comply with PEP 8. The old names
+still work.
+Here are the renames:
+
+ * replaceWith -> replace_with
+ * replaceWithChildren -> replace_with_children
+ * findAll -> find_all
+ * findAllNext -> find_all_next
+ * findAllPrevious -> find_all_previous
+ * findNext -> find_next
+ * findNextSibling -> find_next_sibling
+ * findNextSiblings -> find_next_siblings
+ * findParent -> find_parent
+ * findParents -> find_parents
+ * findPrevious -> find_previous
+ * findPreviousSibling -> find_previous_sibling
+ * findPreviousSiblings -> find_previous_siblings
+ * nextSibling -> next_sibling
+ * previousSibling -> previous_sibling
+
+Methods have been renamed for compatibility with Python 3.
+
+ * Tag.has_key() -> Tag.has_attr()
+
+ (This was misleading, anyway, because has_key() looked at
+ a tag's attributes and __in__ looked at a tag's contents.)
+
+Some attributes have also been renamed:
+
+ * Tag.isSelfClosing -> Tag.is_empty_element
+ * UnicodeDammit.unicode -> UnicodeDammit.unicode_markup
+ * Tag.next -> Tag.next_element
+ * Tag.previous -> Tag.previous_element
+
+So have some arguments to popular methods:
+
+ * BeautifulSoup(parseOnlyThese=...) -> BeautifulSoup(parse_only=...)
+ * BeautifulSoup(fromEncoding=...) -> BeautifulSoup(from_encoding=...)
+
+== Generators are now properties ==
+
+The generators have been given more sensible (and PEP 8-compliant)
+names, and turned into properties:
+
+ * childGenerator() -> children
+ * nextGenerator() -> next_elements
+ * nextSiblingGenerator() -> next_siblings
+ * previousGenerator() -> previous_elements
+ * previousSiblingGenerator() -> previous_siblings
+ * recursiveChildGenerator() -> recursive_children
+ * parentGenerator() -> parents
+
+So instead of this:
+
+ for parent in tag.parentGenerator():
+     ...
+
+You can write this:
+
+ for parent in tag.parents:
+     ...
+
+(But the old code will still work.)
+
+== tag.string is recursive ==
+
+tag.string now operates recursively. If tag A contains a single tag B
+and nothing else, then A.string is the same as B.string.
+So:
+
+<a><b>foo</b></a>
+
+The value of a.string used to be None, and now it's "foo".
+
+== Empty-element tags ==
+
+Beautiful Soup's handling of empty-element tags (aka self-closing
+tags) has been improved, especially when parsing XML. Previously you
+had to explicitly specify a list of empty-element tags when parsing
+XML. You can still do that, but if you don't, Beautiful Soup now
+considers any empty tag to be an empty-element tag.
+
+The determination of empty-element-ness is now made at runtime rather
+than parse time. If you add a child to an empty-element tag, it stops
+being an empty-element tag.
+
+== Entities are always converted to Unicode ==
+
+An HTML or XML entity is always converted into the corresponding
+Unicode character. There are no longer any smartQuotesTo or
+convertEntities arguments. (Unicode, Dammit still has smart_quotes_to,
+but its default is now to turn smart quotes into Unicode.)
+
+== CDATA sections are normal text, if they're understood at all. ==
+
+Currently, the lxml and html5lib HTML parsers ignore CDATA sections in
+markup:
+
+ <p><![CDATA[foo]]></p> => <p></p>
+
+A future version of html5lib will turn CDATA sections into text nodes,
+but only within tags like <svg> and <math>:
+
+ <svg><![CDATA[foo]]></svg> => <svg>foo</svg>
+
+The default XML parser (which uses lxml behind the scenes) turns CDATA
+sections into ordinary text elements:
+
+ <p><![CDATA[foo]]></p> => <p>foo</p>
+
+In theory it's possible to preserve the CDATA sections when using the
+XML parser, but I don't see how to get it to work in practice.
+
+== Miscellaneous other stuff ==
+
+If the BeautifulSoup instance has .is_xml set to True, an appropriate
+XML declaration will be emitted when the tree is transformed into a
+string:
+
+ <?xml version="1.0" encoding="utf-8"?>
+ <markup>
+ ...
+ </markup>
+
+The ['lxml', 'xml'] tree builder sets .is_xml to True; the other tree
+builders set it to False.
+If you want to parse XHTML with an HTML
+parser, you can set it manually.
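
The renamed API that the README text above walks through (find_all, the parents property, recursive tag.string, has_attr) can be sketched in a few lines. This is a minimal illustration, not part of the patch; it assumes a current bs4 install and uses Python's built-in html.parser tree builder:

```python
# Sketch of the renamed bs4 API described in the README above.
# Assumes the bs4 package is installed; "html.parser" selects the glue
# code for Python's built-in HTMLParser (the default mentioned above).
from bs4 import BeautifulSoup

markup = "<html><body><a href='#'><b>foo</b></a><p>bar</p></body></html>"
soup = BeautifulSoup(markup, "html.parser")

# find_all replaces findAll
print([p.string for p in soup.find_all("p")])

# tag.string is recursive: <a> contains only <b>foo</b>, so a.string is "foo"
a = soup.a
print(a.string)

# parents replaces parentGenerator() and is a property, not a method call
print([parent.name for parent in a.parents])

# has_attr replaces has_key, and looks at attributes (not contents)
print(a.has_attr("href"), a.has_attr("class"))
```

The old camelCase names would still run against this markup, but the snippet sticks to the PEP 8 names since those are the ones the changelog recommends going forward.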