summaryrefslogtreecommitdiff
path: root/doc
diff options
context:
space:
mode:
authorLeonard Richardson <leonardr@segfault.org>2015-06-27 09:55:40 -0400
committerLeonard Richardson <leonardr@segfault.org>2015-06-27 09:55:40 -0400
commit017e4526af39ab75286ebfd2d64db25da116f27b (patch)
tree92441998f88babb05bb4a3d86949eeb9c4fd4985 /doc
parent800d1971dcbdc6316a013a4c6ce86e8c18d48dca (diff)
downloadbeautifulsoup4-017e4526af39ab75286ebfd2d64db25da116f27b.tar.gz
Added an exclude_encodings argument to UnicodeDammit and to the
Beautiful Soup constructor, which lets you prohibit the detection of an encoding that you know is wrong. [bug=1469408]
Diffstat (limited to 'doc')
-rw-r--r--doc/source/index.rst13
1 files changed, 13 insertions, 0 deletions
diff --git a/doc/source/index.rst b/doc/source/index.rst
index 1b7b1e6..821dad4 100644
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@@ -2397,6 +2397,19 @@ We can fix this by passing in the correct ``from_encoding``::
soup.original_encoding
'iso8859-8'
+If you don't know what the correct encoding is, but you know that
+Unicode, Dammit is guessing wrong, you can pass the wrong guesses in
+as ``exclude_encodings``::
+
+ soup = BeautifulSoup(markup, exclude_encodings=["ISO-8859-7"])
+ soup.h1
+ <h1>םולש</h1>
+ soup.original_encoding
+ 'WINDOWS-1255'
+
+(This isn't 100% correct, but Windows-1255 is a compatible superset of
+ISO-8859-8, so it's close enough.)
+
In rare cases (usually when a UTF-8 document contains text written in
a completely different encoding), the only way to get Unicode may be
to replace some characters with the special Unicode character