author | Leonard Richardson <leonardr@segfault.org> | 2015-06-27 09:55:40 -0400 |
---|---|---|
committer | Leonard Richardson <leonardr@segfault.org> | 2015-06-27 09:55:40 -0400 |
commit | 017e4526af39ab75286ebfd2d64db25da116f27b (patch) | |
tree | 92441998f88babb05bb4a3d86949eeb9c4fd4985 /doc | |
parent | 800d1971dcbdc6316a013a4c6ce86e8c18d48dca (diff) | |
download | beautifulsoup4-017e4526af39ab75286ebfd2d64db25da116f27b.tar.gz |
Added an exclude_encodings argument to UnicodeDammit and to the
Beautiful Soup constructor, which lets you prohibit the detection of
an encoding that you know is wrong. [bug=1469408]
Diffstat (limited to 'doc')
-rw-r--r-- | doc/source/index.rst | 13 |
1 files changed, 13 insertions, 0 deletions
diff --git a/doc/source/index.rst b/doc/source/index.rst
index 1b7b1e6..821dad4 100644
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@@ -2397,6 +2397,19 @@ We can fix this by passing in the correct ``from_encoding``::
  soup.original_encoding
  'iso8859-8'
 
+If you don't know what the correct encoding is, but you know that
+Unicode, Dammit is guessing wrong, you can pass the wrong guesses in
+as ``exclude_encodings``::
+
+ soup = BeautifulSoup(markup, exclude_encodings=["ISO-8859-7"])
+ soup.h1
+ <h1>םולש</h1>
+ soup.original_encoding
+ 'WINDOWS-1255'
+
+(This isn't 100% correct, but Windows-1255 is a compatible superset of
+ISO-8859-8, so it's close enough.)
+
 In rare cases (usually when a UTF-8 document contains text written in
 a completely different encoding), the only way to get Unicode may be
 to replace some characters with the special Unicode character
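The idea behind ``exclude_encodings`` can be illustrated with a small standard-library sketch. This is not Beautiful Soup's detection code: ``decode_with_exclusions`` and the fixed candidate list are hypothetical names invented for this example, and real detection is more involved. It only shows why skipping a known-wrong guess lets a later, better candidate win.

```python
import codecs


def decode_with_exclusions(data, candidates, exclude_encodings=()):
    """Hypothetical helper: return (text, codec_name) for the first
    candidate encoding that decodes `data` strictly, skipping any
    encoding listed in `exclude_encodings`."""
    # Normalize names through the codec registry so that "ISO-8859-7"
    # and "iso-8859-7" compare equal.
    excluded = {codecs.lookup(name).name for name in exclude_encodings}
    for name in candidates:
        canonical = codecs.lookup(name).name
        if canonical in excluded:
            continue  # the caller knows this guess is wrong
        try:
            return data.decode(canonical), canonical
        except UnicodeDecodeError:
            continue  # bytes are invalid in this encoding; try the next
    return None, None


# Hebrew markup stored as ISO-8859-8 bytes.
markup = "<h1>שלום</h1>".encode("iso-8859-8")

# Assumed guess order for this sketch: the wrong guess comes first.
candidates = ["iso-8859-7", "windows-1255"]

# Without exclusions the bytes decode "successfully" as Greek letters.
wrong_text, wrong_enc = decode_with_exclusions(markup, candidates)

# Excluding the known-wrong guess lets the Hebrew-compatible codec win.
right_text, right_enc = decode_with_exclusions(
    markup, candidates, exclude_encodings=["ISO-8859-7"]
)
```

Here the first call silently produces Greek text because every byte is also valid ISO-8859-7, which is exactly the situation where a strict decode cannot detect the mistake and an explicit exclusion is needed.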