summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorSeth M Morton <seth.m.morton@gmail.com>2017-08-18 23:22:39 -0700
committerSeth M Morton <seth.m.morton@gmail.com>2017-08-18 23:24:52 -0700
commit06a67bf5d4a3ba7de1e3104f87df911b86d1511b (patch)
tree41dd9117eadb0e729b9e0e4a454f5bea6404727a
parent3a75ddb0dda38e59bd1e034390933ec39a1ab0ff (diff)
downloadnatsort-06a67bf5d4a3ba7de1e3104f87df911b86d1511b.tar.gz
Update documentation to discuss Unicode normalization.
-rw-r--r--README.rst1
-rw-r--r--docs/source/howitworks.rst89
2 files changed, 86 insertions, 4 deletions
diff --git a/README.rst b/README.rst
index 3a283f2..61ddb96 100644
--- a/README.rst
+++ b/README.rst
@@ -234,6 +234,7 @@ Other Useful Things
+++++++++++++++++++
- recursively descend into lists of lists
+ - automatic unicode normalization of input data
- `controlling the case-sensitivity <http://natsort.readthedocs.io/en/stable/examples.html#case-sort>`_
- `sorting file paths correctly <http://natsort.readthedocs.io/en/stable/examples.html#path-sort>`_
- `allow custom sorting keys <http://natsort.readthedocs.io/en/stable/examples.html#custom-sort>`_
diff --git a/docs/source/howitworks.rst b/docs/source/howitworks.rst
index f15db8f..22de2a5 100644
--- a/docs/source/howitworks.rst
+++ b/docs/source/howitworks.rst
@@ -655,12 +655,14 @@ StdLib there can't be too many dragons, right?
- https://github.com/SethMMorton/natsort/issues/22
- https://github.com/SethMMorton/natsort/issues/23
- https://github.com/SethMMorton/natsort/issues/36
+ - https://github.com/SethMMorton/natsort/issues/44
- https://bugs.python.org/issue2481
- https://bugs.python.org/issue23195
- - http://stackoverflow.com/questions/3412933/python-not-sorting-unicode-properly-strcoll-doesnt-help
- - http://stackoverflow.com/questions/22203550/sort-dictionary-by-key-using-locale-collation
- - http://stackoverflow.com/questions/33459384/unicode-character-not-in-range-when-calling-locale-strxfrm
- - http://stackoverflow.com/questions/36431810/sort-numeric-lines-with-thousand-separators
+ - https://stackoverflow.com/questions/3412933/python-not-sorting-unicode-properly-strcoll-doesnt-help
+ - https://stackoverflow.com/questions/22203550/sort-dictionary-by-key-using-locale-collation
+ - https://stackoverflow.com/questions/33459384/unicode-character-not-in-range-when-calling-locale-strxfrm
+ - https://stackoverflow.com/questions/36431810/sort-numeric-lines-with-thousand-separators
+ - https://stackoverflow.com/questions/45734562/how-can-i-get-a-reasonable-string-sorting-with-python
These can be summed up as follows:
@@ -787,6 +789,84 @@ the ``else:`` block of :func:`coerce_to_int`/:func:`coerce_to_float`.
Of course, applying both *LOWERCASEFIRST* and *GROUPLETTERS* is just
a matter of turning on both functions.
+Basic Unicode Support
++++++++++++++++++++++
+
+Unicode is hard and complicated. Here's an example.
+
+.. code-block:: python
+
+ >>> b = [b'\x66', b'\x65', b'\xc3\xa9', b'\x65\xcc\x81', b'\x61', b'\x7a']
+ >>> a = [x.decode('utf8') for x in b]
+ >>> a # doctest: +SKIP
+ ['f', 'e', 'é', 'é', 'a', 'z']
+ >>> sorted(a) # doctest: +SKIP
+ ['a', 'e', 'é', 'f', 'z', 'é']
+
+
+There are more than one way to represent the character 'é' in Unicode.
+In fact, many characters have multiple representations. This is a challenge
+because comparing the two representations would return ``False`` even though
+they *look* the same.
+
+.. code-block:: python
+
+ >>> a[2] == a[3]
+ False
+
+Alas, since characters are compared based on the numerical value of their
+representation, sorting Unicode often gives unexpected results (like seeing
+'é' come both *before* and *after* 'z').
+
+The original approach that :mod:`natsort` took with respect to non-ASCII
+Unicode characters was to say "just use
+the :mod:`locale` or :mod:`PyICU` library" and then cross it's fingers
+and hope those libraries take care of it. As you will find in the following
+sections, that comes with its own baggage, and turned out to not always work anyway
+(see https://stackoverflow.com/q/45734562/1399279). A more robust approach is to
+handle the Unicode out-of-the-box without invoking a heavy-handed library
+like :mod:`locale` or :mod:`PyICU`. To do this, we must use *normalization*.
+
+To fully understand Unicode normalization, `check out some official Unicode documentation`_.
+Just kidding... that's too much text. The following StackOverflow answers do
+a good job at explaining Unicode normalization in simple terms:
+https://stackoverflow.com/a/7934397/1399279 and
+https://stackoverflow.com/a/7931547/1399279. Put simply, normalization
+ensures that Unicode characters with multiple representations are in
+some canonical and consistent representation so that (for example) comparisons
+of the characters can be performed in a sane way. The following discussion
+assumes you at least read the StackOverflow answers.
+
+Looking back at our 'é' example, we can see that the two versions were
+constructed with the byte strings ``b'\xc3\xa9'`` and ``b'\x65\xcc\x81'``.
+The former representation is actually
+`LATIN SMALL LETTER E WITH ACUTE <http://www.fileformat.info/info/unicode/char/e9/index.htm>`_
+and is a single character in the Unicode standard. This is known as the
+*compressed form* and corresponds to the 'NFC' normalization scheme.
+The latter representation is actually the letter 'e' followed by
+`COMBINING ACUTE ACCENT <http://www.fileformat.info/info/unicode/char/0301/index.htm>`_
+and so is two characters in the Unicode standard. This is known as the
+*decompressed form* and corresponds to the 'NFD' normalization scheme.
+Since the first character in the decompressed form is actually the letter 'e',
+when compared to other ASCII characters it fits where you might expect.
+Unfortunately, all Unicode compressed form characters come after the
+ASCII characters and so they always will be placed after 'z' when sorting.
+
+It seems that most Unicode data is stored and shared in the compressed form
+which makes it challenging to sort. This can be solved by normalizing all
+incoming Unicode data to the decompressed form ('NFD') and *then* sorting.
+
+.. code-block:: python
+
+ >>> import unicodedata
+ >>> c = [unicodedata.normalize('NFD', x) for x in a]
+ >>> c # doctest: +SKIP
+ ['f', 'e', 'é', 'é', 'a', 'z']
+ >>> sorted(c) # doctest: +SKIP
+ ['a', 'e', 'é', 'é', 'f', 'z']
+
+Huzzah! Sane sorting without having to resort to :mod:`locale`!
+
Using Locale to Compare Strings
+++++++++++++++++++++++++++++++
@@ -1052,3 +1132,4 @@ what the rest of the world assumes.
.. _Thousands separator support: https://github.com/SethMMorton/natsort/issues/36
.. _really good: https://hypothesis.readthedocs.io/en/latest/
.. _testing strategy: http://doc.pytest.org/en/latest/
+.. _check out some official Unicode documentation: http://unicode.org/reports/tr15/