path: root/bs4/dammit.py
Commit message (Author, Age, Files, Lines)
* Fixed code that was causing deprecation warnings in recent Python 3 versions. Includes a patch from Ville Skyttä. [bug=1778909] [bug=1689496] (Leonard Richardson, 2018-07-14, 1 file, -3/+3)
|
* Indentation change contributed by Pranav Salunke. (Leonard Richardson, 2016-12-19, 1 file, -2/+2)
|\
| * Minor change. Extra indent for character so it looks nicer. (Pranav Salunke, 2016-04-06, 1 file, -2/+2)
| |
* | Use a dedicated logger instead of the root logger. [bug=1511661] (Leonard Richardson, 2016-07-17, 1 file, -1/+1)
| |
* | Use a dedicated logger instead of the root logger. [bug=1511661] (Leonard Richardson, 2016-07-17, 1 file, -3/+4)
| |
* | Removed imports to pdb, since pdb is not available in some environments. [bug=1491700] (Leonard Richardson, 2016-07-16, 1 file, -1/+0)
| |
* | Rename COPYING.txt to LICENSE. Add a reference to LICENSE in every source file. (Leonard Richardson, 2016-07-16, 1 file, -0/+2)
|/
* Add a __license__ statement to all source files. (Leonard Richardson, 2015-09-28, 1 file, -0/+1)
|
* Unicode data cannot have a byte-order mark. Returning early stops a warning from happening. (Leonard Richardson, 2015-07-03, 1 file, -0/+3)
|
* Added an exclude_encodings argument to UnicodeDammit and to the Beautiful Soup constructor, which lets you prohibit the detection of an encoding that you know is wrong. [bug=1469408] (Leonard Richardson, 2015-06-27, 1 file, -3/+9)
|
* Added a sanity check helper method that makes sure all the elements of a tree are properly connected via .next_element and .previous_element. (Leonard Richardson, 2015-06-26, 1 file, -1/+2)
|
* Fixed a crash in Unicode, Dammit's encoding detector when the name of the encoding itself contained invalid bytes. [bug=1360913] (Leonard Richardson, 2015-06-25, 1 file, -1/+1)
|
* Fixed a bug that caused Unicode data put into UnicodeDammit to return None instead of the original data. [bug=1214983] (Leonard Richardson, 2013-10-02, 1 file, -6/+9)
|
* Inlined some commonly called code to save a function call. (Leonard Richardson, 2013-06-03, 1 file, -4/+4)
|
* Limit how much of the document is searched via regular expression for a declared encoding. (Leonard Richardson, 2013-06-03, 1 file, -4/+11)
|
* Turns out we had two bits of code to strip byte-order marks. (Leonard Richardson, 2013-06-02, 1 file, -34/+43)
|
* It turns out most of the untested code wasn't doing anything useful. (Leonard Richardson, 2013-06-02, 1 file, -108/+20)
|
* Create a new lxml parser object for every new parsing strategy. (Leonard Richardson, 2013-05-31, 1 file, -5/+16)
|
* Refactored code a bit. (Leonard Richardson, 2013-05-30, 1 file, -14/+13)
|
* Split out the code that guesses at encodings from the code that tries to decode a bytestring based on those encodings. This is necessary because lxml wants to do the decoding itself. (Leonard Richardson, 2013-05-30, 1 file, -128/+189)
|
* The default XML formatter will now replace ampersands even if they appear to be part of entities. That is, "&lt;" will become "&amp;lt;". [bug=1182183] (Leonard Richardson, 2013-05-20, 1 file, -0/+25)
|
* Doc fixes. (Leonard Richardson, 2012-11-03, 1 file, -1/+0)
|
* Fixed cchardet import. (Leonard Richardson, 2012-08-17, 1 file, -3/+3)
|
* Mentioned cchardet in docs. (Leonard Richardson, 2012-07-03, 1 file, -1/+1)
|
* When sniffing encodings, if the cchardet library is installed, use it instead of chardet. It's much faster. [bug=1020748] (Leonard Richardson, 2012-07-03, 1 file, -10/+22)
|
* Use logging.warning() instead of warning.warn() to notify the user that characters were replaced with REPLACEMENT CHARACTER. [bug=1013862] (Leonard Richardson, 2012-07-03, 1 file, -4/+3)
|
* Comments, processing instructions, document type declarations, and markup declarations are now treated as preformatted strings, the way CData blocks are. [bug=1001025] Also in this commit: renamed detwingle method to detwingle(). (Leonard Richardson, 2012-05-24, 1 file, -11/+18)
|
* Fixed the handling of &quot; with the built-in parser. [bug=993871] (Leonard Richardson, 2012-05-03, 1 file, -7/+7)
|
* Added experimental support for fixing Windows-1252 characters embedded in UTF-8 documents. (Leonard Richardson, 2012-04-27, 1 file, -0/+196)
|
* Fixed a bug in decoding data that contained a byte-order mark, such as data encoded in UTF-16LE. [bug=988980] (Leonard Richardson, 2012-04-26, 1 file, -20/+28)
|
* Unicode, Dammit now has an option to turn MS smart quotes into ASCII characters. (Leonard Richardson, 2012-04-16, 1 file, -8/+148)
|
* Attribute values are now run through the provided output formatter. Previously they were always run through the 'minimal' formatter. [bug=980237] (Leonard Richardson, 2012-04-16, 1 file, -33/+37)
|
* Issue a warning if characters were replaced with REPLACEMENT CHARACTER during Unicode conversion. (Leonard Richardson, 2012-02-16, 1 file, -0/+5)
|
* As a last-ditch attempt to turn data into Unicode, use errors=replace instead of errors=strict. (Leonard Richardson, 2012-02-09, 1 file, -9/+25)
|
* Unicode, Dammit now detects the encoding in HTML 5-style <meta> tags like <meta charset="utf-8" />. [bug=837268] (Leonard Richardson, 2012-02-09, 1 file, -2/+4)
|
* Minor Unicode, Dammit cleanup. (Leonard Richardson, 2012-02-09, 1 file, -11/+11)
|
* Improved Unicode, Dammit's behavior when you give it Unicode to begin with. (Leonard Richardson, 2012-02-09, 1 file, -2/+4)
|
* Various changes so most tests pass on Python 3. (Thomas Kluyver, 2011-06-29, 1 file, -33/+33)
|
* OK, figured that out. (Leonard Richardson, 2011-05-21, 1 file, -7/+6)
|\
| * Changed dammit.py to require fewer changes to be Python 3 compatible. (Leonard Richardson, 2011-05-21, 1 file, -7/+6)
| |
* | PEP8ifying (Aaron DeVore, 2011-03-05, 1 file, -45/+46)
|/
* Added a tree builder for the built-in HTMLParser, and tests. (Leonard Richardson, 2011-02-27, 1 file, -3/+5)
|
* Renamed the beautifulsoup module to bs4 to save typing. (Leonard Richardson, 2011-02-27, 1 file, -0/+410)