author Leonard Richardson <leonard.richardson@canonical.com> 2011-08-16 09:13:05 -0400
committer Leonard Richardson <leonard.richardson@canonical.com> 2011-08-16 09:13:05 -0400
commit 03b1b76ff969651b1ec4071c0b2e749590461f7d
tree a35b43951571c39da69ae8c1f8a38698bfe5075f
parent ff9bd68fc0853d4b723a3edb1510005a90759bc3
download beautifulsoup4-03b1b76ff969651b1ec4071c0b2e749590461f7d.tar.gz
Moved documentation to README where tarball downloaders will see it.
-rw-r--r-- CHANGELOG 155
-rw-r--r-- README.txt 178
2 files changed, 172 insertions, 161 deletions
diff --git a/CHANGELOG b/CHANGELOG
index 559fc8f..701ed36 100644
--- a/CHANGELOG
+++ b/CHANGELOG
@@ -1,159 +1,6 @@
= 4.0 =
-This is a nearly-complete rewrite that removes Beautiful Soup's custom
-HTML parser in favor of a system that lets you write a little glue
-code and plug in whatever HTML or XML parser you want.
-
-Beautiful Soup 4.0 comes with glue code for four parsers: Python's
-HTMLParser, lxml's HTML and XML parsers, and html5lib's HTML
-parser. HTMLParser is the default, but I recommend you install one of
-the other parsers, or you'll have problems handling real-world HTML.
-
-== The module name has changed ==
-
-Previously you imported the BeautifulSoup class from a module also
-called BeautifulSoup. To save keystrokes and make it clear which
-version of the API is in use, the module is now called 'bs4':
-
- >>> from bs4 import BeautifulSoup
-
-== It works with Python 3 ==
-
-Beautiful Soup 3.1.0 worked with Python 3, but the parser it used was
-so bad that it barely worked at all. Beautiful Soup 4 works with
-Python 3, and since its parser is pluggable, you don't sacrifice
-quality.
-
-Special thanks to Thomas Kluyver for getting Python 3 support to the
-finish line.
-
-== Better method names ==
-
-Methods and attributes have been renamed to comply with PEP 8. The old names
-still work. Here are the renames:
-
- * replaceWith -> replace_with
- * replaceWithChildren -> replace_with_children
- * findAll -> find_all
- * findAllNext -> find_all_next
- * findAllPrevious -> find_all_previous
- * findNext -> find_next
- * findNextSibling -> find_next_sibling
- * findNextSiblings -> find_next_siblings
- * findParent -> find_parent
- * findParents -> find_parents
- * findPrevious -> find_previous
- * findPreviousSibling -> find_previous_sibling
- * findPreviousSiblings -> find_previous_siblings
- * nextSibling -> next_sibling
- * previousSibling -> previous_sibling
-
-Methods have been renamed for compatibility with Python 3.
-
- * Tag.has_key() -> Tag.has_attr()
-
- (This was misleading, anyway, because has_key() looked at
- a tag's attributes and __contains__ looked at a tag's contents.)
-
-Some attributes have also been renamed:
-
- * Tag.isSelfClosing -> Tag.is_empty_element
- * UnicodeDammit.unicode -> UnicodeDammit.unicode_markup
- * Tag.next -> Tag.next_element
- * Tag.previous -> Tag.previous_element
-
-So have some arguments to popular methods:
-
- * BeautifulSoup(parseOnlyThese=...) -> BeautifulSoup(parse_only=...)
- * BeautifulSoup(fromEncoding=...) -> BeautifulSoup(from_encoding=...)
-
-== Generators are now properties ==
-
-The generators have been given more sensible (and PEP 8-compliant)
-names, and turned into properties:
-
- * childGenerator() -> children
- * nextGenerator() -> next_elements
- * nextSiblingGenerator() -> next_siblings
- * previousGenerator() -> previous_elements
- * previousSiblingGenerator() -> previous_siblings
- * recursiveChildGenerator() -> recursive_children
- * parentGenerator() -> parents
-
-So instead of this:
-
- for parent in tag.parentGenerator():
-     ...
-
-You can write this:
-
- for parent in tag.parents:
-     ...
-
-(But the old code will still work.)
-
-== tag.string is recursive ==
-
-tag.string now operates recursively. If tag A contains a single tag B
-and nothing else, then A.string is the same as B.string. So:
-
-<a><b>foo</b></a>
-
-The value of a.string used to be None, and now it's "foo".
-
-== Empty-element tags ==
-
-Beautiful Soup's handling of empty-element tags (aka self-closing
-tags) has been improved, especially when parsing XML. Previously you
-had to explicitly specify a list of empty-element tags when parsing
-XML. You can still do that, but if you don't, Beautiful Soup now
-considers any empty tag to be an empty-element tag.
-
-The determination of empty-element-ness is now made at runtime rather
-than parse time. If you add a child to an empty-element tag, it stops
-being an empty-element tag.
-
-== Entities are always converted to Unicode ==
-
-An HTML or XML entity is always converted into the corresponding
-Unicode character. There are no longer any smartQuotesTo or
-convertEntities arguments. (Unicode, Dammit still has smart_quotes_to,
-but its default is now to turn smart quotes into Unicode.)
-
-== CDATA sections are normal text, if they're understood at all. ==
-
-Currently, the lxml and html5lib HTML parsers ignore CDATA sections in
-markup:
-
- <p><![CDATA[foo]]></p> => <p></p>
-
-A future version of html5lib will turn CDATA sections into text nodes,
-but only within tags like <svg> and <math>:
-
- <svg><![CDATA[foo]]></svg> => <svg>foo</svg>
-
-The default XML parser (which uses lxml behind the scenes) turns CDATA
-sections into ordinary text elements:
-
- <p><![CDATA[foo]]></p> => <p>foo</p>
-
-In theory it's possible to preserve the CDATA sections when using the
-XML parser, but I don't see how to get it to work in practice.
-
-== Miscellaneous other stuff ==
-
-If the BeautifulSoup instance has .is_xml set to True, an appropriate
-XML declaration will be emitted when the tree is transformed into a
-string:
-
- <?xml version="1.0" encoding="utf-8"?>
- <markup>
- ...
- </markup>
-
-The ['lxml', 'xml'] tree builder sets .is_xml to True; the other tree
-builders set it to False. If you want to parse XHTML with an HTML
-parser, you can set it manually.
+See README.txt.
= 3.2.0 =
diff --git a/README.txt b/README.txt
index 8baa022..5c99381 100644
--- a/README.txt
+++ b/README.txt
@@ -1,10 +1,3 @@
-= About Beautiful Soup 4 =
-
-Earlier versions of Beautiful Soup included a custom HTML
-parser. Beautiful Soup 4 uses Python's default HTMLParser, which does
-fairly poorly on real-world HTML. By installing lxml or html5lib you
-can get more accurate parsing and possibly better performance as well.
-
= Introduction =
>>> from bs4 import BeautifulSoup
@@ -29,4 +22,175 @@ can get more accurate parsing and possibly better performance as well.
>>> soup.i
<i>HTML</i>
+ >>> soup = BeautifulSoup("<tag1>Some<tag2/>bad<tag3>XML", "xml")
+ >>> print soup.prettify()
+ <?xml version="1.0" encoding="utf-8"?>
+ <tag1>
+  Some
+  <tag2 />
+  bad
+  <tag3>
+   XML
+  </tag3>
+ </tag1>
+
+= About Beautiful Soup 4 =
+
+This is a nearly-complete rewrite that removes Beautiful Soup's custom
+HTML parser in favor of a system that lets you write a little glue
+code and plug in any HTML or XML parser you want.
+
+Beautiful Soup 4.0 comes with glue code for four parsers:
+
+ * Python's standard HTMLParser
+ * lxml's HTML and XML parsers
+ * html5lib's HTML parser
+
+HTMLParser is the default, but I recommend you install one of the
+other parsers, or you'll have problems handling real-world markup.
+
+== The module name has changed ==
+
+Previously you imported the BeautifulSoup class from a module also
+called BeautifulSoup. To save keystrokes and make it clear which
+version of the API is in use, the module is now called 'bs4':
+
+ >>> from bs4 import BeautifulSoup
+
+== It works with Python 3 ==
+
+Beautiful Soup 3.1.0 worked with Python 3, but the parser it used was
+so bad that it barely worked at all. Beautiful Soup 4 works with
+Python 3, and since its parser is pluggable, you don't sacrifice
+quality.
+
+Special thanks to Thomas Kluyver for getting Python 3 support to the
+finish line.
+
+== Better method names ==
+
+Methods and attributes have been renamed to comply with PEP 8. The old names
+still work. Here are the renames:
+
+ * replaceWith -> replace_with
+ * replaceWithChildren -> replace_with_children
+ * findAll -> find_all
+ * findAllNext -> find_all_next
+ * findAllPrevious -> find_all_previous
+ * findNext -> find_next
+ * findNextSibling -> find_next_sibling
+ * findNextSiblings -> find_next_siblings
+ * findParent -> find_parent
+ * findParents -> find_parents
+ * findPrevious -> find_previous
+ * findPreviousSibling -> find_previous_sibling
+ * findPreviousSiblings -> find_previous_siblings
+ * nextSibling -> next_sibling
+ * previousSibling -> previous_sibling
+
+Methods have been renamed for compatibility with Python 3.
+
+ * Tag.has_key() -> Tag.has_attr()
+
+ (This was misleading, anyway, because has_key() looked at
+ a tag's attributes and __contains__ looked at a tag's contents.)
+
+Some attributes have also been renamed:
+
+ * Tag.isSelfClosing -> Tag.is_empty_element
+ * UnicodeDammit.unicode -> UnicodeDammit.unicode_markup
+ * Tag.next -> Tag.next_element
+ * Tag.previous -> Tag.previous_element
+
+So have some arguments to popular methods:
+
+ * BeautifulSoup(parseOnlyThese=...) -> BeautifulSoup(parse_only=...)
+ * BeautifulSoup(fromEncoding=...) -> BeautifulSoup(from_encoding=...)
+
+== Generators are now properties ==
+
+The generators have been given more sensible (and PEP 8-compliant)
+names, and turned into properties:
+
+ * childGenerator() -> children
+ * nextGenerator() -> next_elements
+ * nextSiblingGenerator() -> next_siblings
+ * previousGenerator() -> previous_elements
+ * previousSiblingGenerator() -> previous_siblings
+ * recursiveChildGenerator() -> recursive_children
+ * parentGenerator() -> parents
+
+So instead of this:
+
+ for parent in tag.parentGenerator():
+     ...
+
+You can write this:
+
+ for parent in tag.parents:
+     ...
+
+(But the old code will still work.)
+
+== tag.string is recursive ==
+
+tag.string now operates recursively. If tag A contains a single tag B
+and nothing else, then A.string is the same as B.string. So:
+
+<a><b>foo</b></a>
+
+The value of a.string used to be None, and now it's "foo".
+
+== Empty-element tags ==
+
+Beautiful Soup's handling of empty-element tags (aka self-closing
+tags) has been improved, especially when parsing XML. Previously you
+had to explicitly specify a list of empty-element tags when parsing
+XML. You can still do that, but if you don't, Beautiful Soup now
+considers any empty tag to be an empty-element tag.
+
+The determination of empty-element-ness is now made at runtime rather
+than parse time. If you add a child to an empty-element tag, it stops
+being an empty-element tag.
+
+== Entities are always converted to Unicode ==
+
+An HTML or XML entity is always converted into the corresponding
+Unicode character. There are no longer any smartQuotesTo or
+convertEntities arguments. (Unicode, Dammit still has smart_quotes_to,
+but its default is now to turn smart quotes into Unicode.)
+
+== CDATA sections are normal text, if they're understood at all. ==
+
+Currently, the lxml and html5lib HTML parsers ignore CDATA sections in
+markup:
+
+ <p><![CDATA[foo]]></p> => <p></p>
+
+A future version of html5lib will turn CDATA sections into text nodes,
+but only within tags like <svg> and <math>:
+
+ <svg><![CDATA[foo]]></svg> => <svg>foo</svg>
+
+The default XML parser (which uses lxml behind the scenes) turns CDATA
+sections into ordinary text elements:
+
+ <p><![CDATA[foo]]></p> => <p>foo</p>
+
+In theory it's possible to preserve the CDATA sections when using the
+XML parser, but I don't see how to get it to work in practice.
+
+== Miscellaneous other stuff ==
+
+If the BeautifulSoup instance has .is_xml set to True, an appropriate
+XML declaration will be emitted when the tree is transformed into a
+string:
+
+ <?xml version="1.0" encoding="utf-8"?>
+ <markup>
+ ...
+ </markup>
+The ['lxml', 'xml'] tree builder sets .is_xml to True; the other tree
+builders set it to False. If you want to parse XHTML with an HTML
+parser, you can set it manually.
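
Note on the moved text: the "tag.string is recursive" and "Empty-element tags" sections each describe a rule that is easy to state in code. The sketch below is a toy model of those two rules only. The Node class and its attributes are hypothetical stand-ins invented for illustration; this is not Beautiful Soup's Tag class and makes no claim about how the library implements either rule.

```python
# Toy model only: Node is a hypothetical stand-in, not Beautiful Soup's Tag.
class Node:
    def __init__(self, name, *children):
        self.name = name
        # Children are either Node objects or plain str text.
        self.children = list(children)

    @property
    def string(self):
        # With zero or several children there is no single string.
        if len(self.children) != 1:
            return None
        child = self.children[0]
        if isinstance(child, str):
            return child
        # Recursive step: if B is A's only child, A.string is B.string.
        return child.string

    @property
    def is_empty_element(self):
        # Computed each time it is asked for (runtime), not fixed at parse time.
        return len(self.children) == 0


# Models <a><b>foo</b></a>
a = Node("a", Node("b", "foo"))
print(a.string)

# Models an empty tag that later gains a child.
br = Node("br")
print(br.is_empty_element)
br.children.append("text")
print(br.is_empty_element)
```

Because is_empty_element is a property derived from the current children, adding a child changes the answer without any extra bookkeeping, which is the behavior the "Empty-element tags" section describes.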