-rw-r--r-- AUTHORS | 10
-rw-r--r-- doc/source/index.rst | 79
2 files changed, 64 insertions, 25 deletions
diff --git a/AUTHORS b/AUTHORS
index cfe018d..9623a7c 100644
--- a/AUTHORS
+++ b/AUTHORS
@@ -3,17 +3,19 @@ Behold, mortal, the origins of Beautiful Soup...
Leonard Richardson is the primary programmer.
-Sam Ruby helps with a lot of edge cases.
+Aaron DeVore is awesome.
Mark Pilgrim provided the encoding detection code that forms the base
of UnicodeDammit.
+Thomas Kluyver and Ezio Melotti finished the work of getting Beautiful
+Soup 4 working under Python 3.
+
+Sam Ruby helped with a lot of edge cases.
+
Jonathan Ellis was awarded the prestigious Beau Potage D'Or for his
work in solving the nestable tags conundrum.
-Thomas Kluyver finished the work of getting Beautiful Soup 4 working
-under Python 3.
-
The following people have contributed patches to Beautiful Soup:
Istvan Albert, Andrew Lin, Anthony Baxter, Andrew Boyko, Tony Chang,
diff --git a/doc/source/index.rst b/doc/source/index.rst
index ba923dc..625a6f5 100644
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@@ -19,6 +19,18 @@ violates your expectations.
The examples in this documentation should work the same way in Python
2.7 and Python 3.2.
+You might be looking for the documentation for `Beautiful Soup 3
+<http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html>`_. If
+you want to learn about the differences between Beautiful Soup 3 and
+Beautiful Soup 4, see `Porting code to BS4`_.
+
+Getting help
+------------
+
+If you have questions about Beautiful Soup, or run into problems,
+`send mail to the discussion group
+<http://groups.google.com/group/beautifulsoup/>`_.
+
Quick Start
===========
@@ -151,7 +163,7 @@ BS3, so it's still available, but if you're writing new code you
should install ``beautifulsoup4``.)
You can also `download the Beautiful Soup 4 source tarball
-<http://www.crummy.com/software/BeautifulSoup/download/BeautifulSoup.tar.gz>`_
+<http://www.crummy.com/software/BeautifulSoup/download/beautifulsoup4.tar.gz>`_
and install it with ``setup.py``. The license for Beautiful Soup
allows you to package the entire library with your application, so you
can also download the tarball and insert the ``bs4`` directory into
@@ -1951,7 +1963,7 @@ entities::
# u'<p>I just &#x201C;love&#x201D; Microsoft Word</p>'
You might find this feature useful, but Beautiful Soup doesn't use
-it. Beautiful Soup prefers the default behavior, which is toconvert
+it. Beautiful Soup prefers the default behavior, which is to convert
Microsoft smart quotes to Unicode characters along with everything
else::
@@ -2073,7 +2085,7 @@ you're not using lxml as the underlying parser, my advice is to
:ref:`start <parser-installation>`. Beautiful Soup parses documents
significantly faster using lxml than using html.parser or html5lib.
-Sometimes `Unicode, Dammit` can only detect the encoding of a file by
+Sometimes `Unicode, Dammit`_ can only detect the encoding of a file by
doing a byte-by-byte examination of the file. This slows Beautiful
Soup to a crawl. My tests indicate that this only happened on 2.x
versions of Python, and that it happened most often with documents
@@ -2127,20 +2139,30 @@ becomes this::
from bs4 import BeautifulSoup
-If you get the ``ImportError`` "No module named BeautifulSoup", your
-problem is that you're trying to run Beautiful Soup 3 code, but you
-only have Beautiful Soup 4 installed.
+* If you get the ``ImportError`` "No module named BeautifulSoup", your
+ problem is that you're trying to run Beautiful Soup 3 code, but you
+ only have Beautiful Soup 4 installed.
+
+* If you get the ``ImportError`` "No module named bs4", your problem
+ is that you're trying to run Beautiful Soup 4 code, but you only
+ have Beautiful Soup 3 installed.
+
+Although BS4 is mostly backwards-compatible with BS3, most of its
+methods have been deprecated and given new names for `PEP 8 compliance
+<http://www.python.org/dev/peps/pep-0008/>`_. There are numerous other
+renames and changes, and a few of them break backwards compatibility.
-If you get the ``ImportError`` "No module named bs4", your problem is
-that you're trying to run Beautiful Soup 4 code, but you only have
-Beautiful Soup 3 installed.
+Here's what you'll need to know to convert your BS3 code and habits to BS4:
-Although BS4 is almost entirely backwards-compatible with BS3, most of
-its methods have been deprecated and given new names for PEP 8
-compliance. There are numerous other renames and changes, a few of
-which break backwards compatibility.
+You need a parser
+^^^^^^^^^^^^^^^^^
-Here are the changes:
+Beautiful Soup 3 used Python's ``SGMLParser``, a module that was
+deprecated and removed in Python 3.0. Beautiful Soup 4 uses
+``html.parser`` by default, but you can plug in lxml or html5lib and
+use that instead. Until ``html.parser`` is improved to handle
+real-world HTML better, that's what I recommend you do. See `Be sure
+to install a good parser!`_
Method names
^^^^^^^^^^^^
@@ -2210,7 +2232,7 @@ You can write this::
(But the old code will still work.)
-Some of the generators used to yield None after they were done, and
+Some of the generators used to yield ``None`` after they were done, and
then stop. That was a bug. Now the generators just stop.
There are two new generators, :ref:`.strings and
@@ -2235,6 +2257,22 @@ Beautiful Soup considers any empty tag to be an empty-element tag. If
you add a child to an empty-element tag, it stops being an
empty-element tag.
+Entities
+^^^^^^^^
+
+An incoming HTML or XML entity is always converted into the
+corresponding Unicode character. Beautiful Soup 3 had a number of
+overlapping ways of dealing with entities, which have been
+removed. The ``BeautifulSoup`` constructor no longer recognizes the
+``smartQuotesTo`` or ``convertEntities`` arguments. (`Unicode,
+Dammit`_ still has ``smart_quotes_to``, but its default is now to turn
+smart quotes into Unicode.)
+
+If you want to turn those Unicode characters back into HTML entities
+on output, rather than turning them into UTF-8 characters, you need to
+use ``.encode``, as described in `Substituting HTML entities`. This
+may change before the final release.
+
Miscellaneous
^^^^^^^^^^^^^
@@ -2242,12 +2280,11 @@ Miscellaneous
contains a single tag B and nothing else, then A.string is the same as
B.string. (Previously, it was None.)
-An incoming HTML or XML entity is always converted into the
-corresponding Unicode character. The ``BeautifulSoup`` constructor no
-longer recognizes the ``smartQuotesTo`` or ``convertEntities``
-arguments. (`Unicode, Dammit`_ still has ``smart_quotes_to``, but its
-default is now to turn smart quotes into Unicode.)
-
The ``BeautifulSoup`` constructor no longer recognizes the
`markupMassage` argument. It's now the parser's responsibility to
handle markup correctly.
+
+The rarely-used alternate parser classes like
+``ICantBelieveItsBeautifulSoup`` and ``BeautifulSOAP`` have been
+removed. It's now the parser's decision how to handle ambiguous
+markup.
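
The two ``ImportError`` cases described in the porting section can be handled in code that has to run against either version. The following is a minimal sketch, not part of the commit: the helper name ``load_beautiful_soup`` is hypothetical, and it assumes only the module names ``bs4`` and ``BeautifulSoup`` mentioned in the documentation above.

```python
import importlib


def load_beautiful_soup():
    """Return (BeautifulSoup class, major version), or (None, None)."""
    # Try the BS4 module name first, then fall back to the BS3 module name.
    for module_name, major_version in (("bs4", 4), ("BeautifulSoup", 3)):
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            # "No module named ..." means this version is not installed.
            continue
        return module.BeautifulSoup, major_version
    return None, None


soup_class, major_version = load_beautiful_soup()
if soup_class is None:
    print("Neither Beautiful Soup 4 nor Beautiful Soup 3 is installed.")
else:
    print("Using Beautiful Soup", major_version)
```

Knowing which major version was loaded lets calling code decide whether it can rely on the renamed BS4 methods or must stick to the BS3 names.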