field | value | date
---|---|---
author | Leonard Richardson <leonard.richardson@canonical.com> | 2013-06-02 22:19:37 -0400
committer | Leonard Richardson <leonard.richardson@canonical.com> | 2013-06-02 22:19:37 -0400
commit | b42a4ece63de739ad7a37973a4e10af23346ffd1 |
tree | a65794b5422a1e12a8ddf943c9afd0e0f798f6c4 /doc |
parent | b8b0711b903509e4b88e878fb6ca3731738ca99e |
parent | 847a8e08e21de9036783feeecd8de93b112f3868 |
download | beautifulsoup4-b42a4ece63de739ad7a37973a4e10af23346ffd1.tar.gz |
Merged in big encoding-detection refactoring branch.
Diffstat (limited to 'doc')
-rw-r--r-- | doc/source/index.rst | 18 |
1 files changed, 5 insertions, 13 deletions
diff --git a/doc/source/index.rst b/doc/source/index.rst
index a91854c..1b38df7 100644
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@@ -2478,9 +2478,11 @@ become Unicode::
  dammit.original_encoding
  # 'utf-8'
 
-The more data you give Unicode, Dammit, the more accurately it will
-guess. If you have your own suspicions as to what the encoding might
-be, you can pass them in as a list::
+Unicode, Dammit's guesses will get a lot more accurate if you install
+the ``chardet`` or ``cchardet`` Python libraries. The more data you
+give Unicode, Dammit, the more accurately it will guess. If you have
+your own suspicions as to what the encoding might be, you can pass
+them in as a list::
 
  dammit = UnicodeDammit("Sacr\xe9 bleu!", ["latin-1", "iso-8859-1"])
  print(dammit.unicode_markup)
@@ -2823,16 +2825,6 @@ significantly faster using lxml than using html.parser or html5lib.
 You can speed up encoding detection significantly by installing the
 `cchardet <http://pypi.python.org/pypi/cchardet/>`_ library.
 
-Sometimes `Unicode, Dammit`_ can only detect the encoding of a file by
-doing a byte-by-byte examination of the file. This slows Beautiful
-Soup to a crawl. My tests indicate that this only happened on 2.x
-versions of Python, and that it happened most often with documents
-using Russian or Chinese encodings. If this is happening to you, you
-can fix it by installing cchardet, or by using Python 3 for your
-script. If you happen to know a document's encoding, you can pass
-it into the ``BeautifulSoup`` constructor as ``from_encoding``, and
-bypass encoding detection altogether.
-
 `Parsing only part of a document`_ won't save you much time parsing
 the document, but it can save a lot of memory, and it'll make
 `searching` the document much faster.
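The first hunk mentions passing your own suspected encodings as a list. As a rough illustration of that idea only, here is a standard-library sketch that tries suggested encodings in order before falling back. It is not how Unicode, Dammit is implemented, and the function name ``decode_with_candidates`` and its ``fallback`` parameter are hypothetical, introduced just for this example.

```python
def decode_with_candidates(data, candidates, fallback="utf-8"):
    """Try each suggested encoding in order; return (text, encoding_used).

    Hypothetical helper for illustration; not part of Beautiful Soup.
    """
    for encoding in list(candidates) + [fallback]:
        try:
            return data.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: move on to the next candidate.
            continue
    # Last resort: substitute replacement characters rather than fail.
    return data.decode(fallback, errors="replace"), None


# The byte 0xe9 is not valid ASCII or UTF-8 here, but it is valid Latin-1.
text, used = decode_with_candidates(b"Sacr\xe9 bleu!", ["ascii", "latin-1"])
print(text, used)
```

Order matters in a sketch like this: ``latin-1`` can decode any byte string without error, so listing it first would mask every later candidate.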