summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorSeth Morton <seth.m.morton@gmail.com>2023-02-26 15:08:25 -0800
committerSeth Morton <seth.m.morton@gmail.com>2023-02-26 15:08:25 -0800
commit48349a14c6a23d6924bf29ca8aaa06b6a401e551 (patch)
treecea6dcea62b9573543259e3cd00d7f80b1213e4f
parent837a3875e5414fecf0c87b565c8b61fde16ed016 (diff)
downloadnatsort-48349a14c6a23d6924bf29ca8aaa06b6a401e551.tar.gz
Move most documentation to the wiki
The pages are left with a note about the redirection so as to not destroy the internet.
-rw-r--r--docs/examples.rst420
-rw-r--r--docs/howitworks.rst1169
-rw-r--r--docs/locale_issues.rst91
-rw-r--r--docs/shell.rst152
4 files changed, 10 insertions, 1822 deletions
diff --git a/docs/examples.rst b/docs/examples.rst
index 689983a..b372ca7 100644
--- a/docs/examples.rst
+++ b/docs/examples.rst
@@ -6,421 +6,5 @@
Examples and Recipes
====================
-If you want more detailed examples than given on this page, please see
-https://github.com/SethMMorton/natsort/tree/master/tests.
-
-.. contents::
- :local:
-
-Basic Usage
------------
-
-In the most basic use case, simply import :func:`~natsorted` and use
-it as you would :func:`sorted`:
-
-.. code-block:: pycon
-
- >>> a = ['2 ft 7 in', '1 ft 5 in', '10 ft 2 in', '2 ft 11 in', '7 ft 6 in']
- >>> sorted(a)
- ['1 ft 5 in', '10 ft 2 in', '2 ft 11 in', '2 ft 7 in', '7 ft 6 in']
- >>> from natsort import natsorted, ns
- >>> natsorted(a)
- ['1 ft 5 in', '2 ft 7 in', '2 ft 11 in', '7 ft 6 in', '10 ft 2 in']
-
-Sort Version Numbers
---------------------
-
-As of :mod:`natsort` version >= 4.0.0, :func:`~natsorted` will work for
-well-behaved version numbers, like ``MAJOR.MINOR.PATCH``.
-
-.. _rc_sorting:
-
-Sorting More Expressive Versioning Schemes
-++++++++++++++++++++++++++++++++++++++++++
-
-By default, if you wish to sort versions that are not as simple as
-``MAJOR.MINOR.PATCH`` (or similar), you may not get the results you expect:
-
-.. code-block:: pycon
-
- >>> a = ['1.2', '1.2rc1', '1.2beta2', '1.2beta1', '1.2alpha', '1.2.1', '1.1', '1.3']
- >>> natsorted(a)
- ['1.1', '1.2', '1.2.1', '1.2alpha', '1.2beta1', '1.2beta2', '1.2rc1', '1.3']
-
-To make the '1.2' pre-releases come before '1.2.1', you need to use the
-following recipe:
-
-.. code-block:: pycon
-
- >>> natsorted(a, key=lambda x: x.replace('.', '~'))
- ['1.1', '1.2', '1.2alpha', '1.2beta1', '1.2beta2', '1.2rc1', '1.2.1', '1.3']
-
-If you also want '1.2' after all the alpha, beta, and rc candidates, you can
-modify the above recipe:
-
-.. code-block:: pycon
-
- >>> natsorted(a, key=lambda x: x.replace('.', '~')+'z')
- ['1.1', '1.2alpha', '1.2beta1', '1.2beta2', '1.2rc1', '1.2', '1.2.1', '1.3']
-
-Please see `this issue <https://github.com/SethMMorton/natsort/issues/13>`_ to
-see why this works.
-
-Sorting Rigorously Defined Versioning Schemes (e.g. SemVer or PEP 440)
-""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
-
-If you know you are using a versioning scheme that follows a well-defined format
-for which there is third-party module support, you should use those modules
-to assist in sorting. Some examples might be
-`PEP 440 <https://packaging.pypa.io/en/latest/version>`_ or
-`SemVer <https://python-semver.readthedocs.io/en/latest/api.html>`_.
-
-If we are being honest, using these methods to parse a version means you don't
-need to use :mod:`natsort` - you should probably just use :func:`sorted`
-directly. Here's an example with SemVer:
-
-.. code-block:: pycon
-
- >>> from semver import VersionInfo
- >>> a = ['3.4.5-pre.1', '3.4.5', '3.4.5-pre.2+build.4']
- >>> sorted(a, key=VersionInfo.parse)
- ['3.4.5-pre.1', '3.4.5-pre.2+build.4', '3.4.5']
-
-.. _path_sort:
-
-Sort OS-Generated Paths
------------------------
-
-In some cases when sorting file paths with OS-Generated names, the default
-:mod:`~natsorted` algorithm may not be sufficient. In cases like these,
-you may need to use the ``ns.PATH`` option:
-
-.. code-block:: pycon
-
- >>> a = ['./folder/file (1).txt',
- ... './folder/file.txt',
- ... './folder (1)/file.txt',
- ... './folder (10)/file.txt']
- >>> natsorted(a)
- ['./folder (1)/file.txt', './folder (10)/file.txt', './folder/file (1).txt', './folder/file.txt']
- >>> natsorted(a, alg=ns.PATH)
- ['./folder/file.txt', './folder/file (1).txt', './folder (1)/file.txt', './folder (10)/file.txt']
-
-Locale-Aware Sorting (Human Sorting)
-------------------------------------
-
-.. note::
- Please read :ref:`locale_issues` before using ``ns.LOCALE``, :func:`humansorted`,
- or :func:`index_humansorted`.
-
-You can instruct :mod:`natsort` to use locale-aware sorting with the
-``ns.LOCALE`` option. In addition to making this understand non-ASCII
-characters, it will also properly interpret non-'.' decimal separators
-and also properly order case. It may be more convenient to just use
-the :func:`humansorted` function:
-
-.. code-block:: pycon
-
- >>> from natsort import humansorted
- >>> import locale
- >>> locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
- 'en_US.UTF-8'
- >>> a = ['Apple', 'corn', 'Corn', 'Banana', 'apple', 'banana']
- >>> natsorted(a, alg=ns.LOCALE)
- ['apple', 'Apple', 'banana', 'Banana', 'corn', 'Corn']
- >>> humansorted(a)
- ['apple', 'Apple', 'banana', 'Banana', 'corn', 'Corn']
-
-You may find that if you do not explicitly set the locale your results may not
-be as you expect... I have found that it depends on the system you are on.
-If you use `PyICU <https://pypi.org/project/PyICU>`_ (see below) then
-you should not need to do this.
-
-.. _case_sort:
-
-Controlling Case When Sorting
------------------------------
-
-For non-numbers, by default :mod:`natsort` used ordinal sorting (i.e.
-it sorts by the character's value in the ASCII table). For example:
-
-.. code-block:: pycon
-
- >>> a = ['Apple', 'corn', 'Corn', 'Banana', 'apple', 'banana']
- >>> natsorted(a)
- ['Apple', 'Banana', 'Corn', 'apple', 'banana', 'corn']
-
-There are times when you wish to ignore the case when sorting,
-you can easily do this with the ``ns.IGNORECASE`` option:
-
-.. code-block:: pycon
-
- >>> natsorted(a, alg=ns.IGNORECASE)
- ['Apple', 'apple', 'Banana', 'banana', 'corn', 'Corn']
-
-Note that's since Python's sorting is stable, the order of equivalent
-elements after lowering the case is the same order they appear in the
-original list.
-
-Upper-case letters appear first in the ASCII table, but many natural
-sorting methods place lower-case first. To do this, use
-``ns.LOWERCASEFIRST``:
-
-.. code-block:: pycon
-
- >>> natsorted(a, alg=ns.LOWERCASEFIRST)
- ['apple', 'banana', 'corn', 'Apple', 'Banana', 'Corn']
-
-It may be undesirable to have the upper-case letters grouped together
-and the lower-case letters grouped together; most would expect all
-"a"s to bet together regardless of case, and all "b"s, and so on. To
-achieve this, use ``ns.GROUPLETTERS``:
-
-.. code-block:: pycon
-
- >>> natsorted(a, alg=ns.GROUPLETTERS)
- ['Apple', 'apple', 'Banana', 'banana', 'Corn', 'corn']
-
-You might combine this with ``ns.LOWERCASEFIRST`` to get what most
-would expect to be "natural" sorting:
-
-.. code-block:: pycon
-
- >>> natsorted(a, alg=ns.G | ns.LF)
- ['apple', 'Apple', 'banana', 'Banana', 'corn', 'Corn']
-
-Customizing Float Definition
-----------------------------
-
-You can make :func:`~natsorted` search for any float that would be
-a valid Python float literal, such as 5, 0.4, -4.78, +4.2E-34, etc.
-using the ``ns.FLOAT`` key. You can disable the exponential component
-of the number with ``ns.NOEXP``.
-
-.. code-block:: pycon
-
- >>> a = ['a50', 'a51.', 'a+50.4', 'a5.034e1', 'a+50.300']
- >>> natsorted(a, alg=ns.FLOAT)
- ['a50', 'a5.034e1', 'a51.', 'a+50.300', 'a+50.4']
- >>> natsorted(a, alg=ns.FLOAT | ns.SIGNED)
- ['a50', 'a+50.300', 'a5.034e1', 'a+50.4', 'a51.']
- >>> natsorted(a, alg=ns.FLOAT | ns.SIGNED | ns.NOEXP)
- ['a5.034e1', 'a50', 'a+50.300', 'a+50.4', 'a51.']
-
-For convenience, the ``ns.REAL`` option is provided which is a shortcut
-for ``ns.FLOAT | ns.SIGNED`` and can be used to sort on real numbers.
-This can be easily accessed with the :func:`~realsorted` convenience
-function. Please note that the behavior of the :func:`~realsorted` function
-was the default behavior of :func:`~natsorted` for :mod:`natsort`
-version < 4.0.0:
-
-.. code-block:: pycon
-
- >>> natsorted(a, alg=ns.REAL)
- ['a50', 'a+50.300', 'a5.034e1', 'a+50.4', 'a51.']
- >>> from natsort import realsorted
- >>> realsorted(a)
- ['a50', 'a+50.300', 'a5.034e1', 'a+50.4', 'a51.']
-
-.. _custom_sort:
-
-Using a Custom Sorting Key
---------------------------
-
-Like the built-in ``sorted`` function, ``natsorted`` can accept a custom
-sort key so that:
-
-.. code-block:: pycon
-
- >>> from operator import attrgetter, itemgetter
- >>> a = [['a', 'num4'], ['b', 'num8'], ['c', 'num2']]
- >>> natsorted(a, key=itemgetter(1))
- [['c', 'num2'], ['a', 'num4'], ['b', 'num8']]
- >>> class Foo:
- ... def __init__(self, bar):
- ... self.bar = bar
- ... def __repr__(self):
- ... return "Foo('{}')".format(self.bar)
- >>> b = [Foo('num3'), Foo('num5'), Foo('num2')]
- >>> natsorted(b, key=attrgetter('bar'))
- [Foo('num2'), Foo('num3'), Foo('num5')]
-
-.. _unit_sorting:
-
-Accounting for Units When Sorting
-+++++++++++++++++++++++++++++++++
-
-:mod:`natsort` does not come with any pre-built mechanism to sort units,
-but you can write your own `key` to do this. Below, I will demonstrate sorting
-imperial lengths (e.g. feet an inches), but of course you can extend this to any
-set of units you need. This example is based on code
-`from this issue <https://github.com/SethMMorton/natsort/issues/100#issuecomment-530659310>`_,
-and uses the function :func:`natsort.numeric_regex_chooser` to build a regular
-expression that will parse numbers in the same manner as :mod:`natsort` itself.
-
-.. code-block:: pycon
-
- >>> import re
- >>> import natsort
- >>>
- >>> # Define how each unit will be transformed
- >>> conversion_mapping = {
- ... "in": 1,
- ... "inch": 1,
- ... "inches": 1,
- ... "ft": 12,
- ... "feet": 12,
- ... "foot": 12,
- ... }
- >>>
- >>> # This regular expression searches for numbers and units
- >>> all_units = "|".join(conversion_mapping.keys())
- >>> float_re = natsort.numeric_regex_chooser(natsort.FLOAT | natsort.SIGNED)
- >>> unit_finder = re.compile(r"({})\s*({})".format(float_re, all_units), re.IGNORECASE)
- >>>
- >>> def unit_replacer(matchobj):
- ... """
- ... Given a regex match object, return a replacement string where units are modified
- ... """
- ... number = matchobj.group(1)
- ... unit = matchobj.group(2)
- ... new_number = float(number) * conversion_mapping[unit]
- ... return "{} in".format(new_number)
- ...
- >>> # Demo time!
- >>> data = ['1 ft', '5 in', '10 ft', '2 in']
- >>> [unit_finder.sub(unit_replacer, x) for x in data]
- ['12.0 in', '5.0 in', '120.0 in', '2.0 in']
- >>>
- >>> natsort.natsorted(data, key=lambda x: unit_finder.sub(unit_replacer, x))
- ['2 in', '5 in', '1 ft', '10 ft']
-
-Generating a Natsort Key
-------------------------
-
-If you need to sort a list in-place, you cannot use :func:`~natsorted`; you
-need to pass a key to the :meth:`list.sort` method. The function
-:func:`~natsort_keygen` is a convenient way to generate these keys for you:
-
-.. code-block:: pycon
-
- >>> from natsort import natsort_keygen
- >>> a = ['a50', 'a51.', 'a50.4', 'a5.034e1', 'a50.300']
- >>> natsort_key = natsort_keygen(alg=ns.FLOAT)
- >>> a.sort(key=natsort_key)
- >>> a
- ['a50', 'a50.300', 'a5.034e1', 'a50.4', 'a51.']
-
-:func:`~natsort_keygen` has the same API as :func:`~natsorted` (minus the
-`reverse` option).
-
-Sorting Multiple Lists According to a Single List
--------------------------------------------------
-
-Sometimes you have multiple lists, and you want to sort one of those
-lists and reorder the other lists according to how the first was sorted.
-To achieve this you could use the :func:`~index_natsorted` in combination
-with the convenience function
-:func:`~order_by_index`:
-
-.. code-block:: pycon
-
- >>> from natsort import index_natsorted, order_by_index
- >>> a = ['a2', 'a9', 'a1', 'a4', 'a10']
- >>> b = [4, 5, 6, 7, 8]
- >>> c = ['hi', 'lo', 'ah', 'do', 'up']
- >>> index = index_natsorted(a)
- >>> order_by_index(a, index)
- ['a1', 'a2', 'a4', 'a9', 'a10']
- >>> order_by_index(b, index)
- [6, 4, 7, 5, 8]
- >>> order_by_index(c, index)
- ['ah', 'hi', 'do', 'lo', 'up']
-
-Returning Results in Reverse Order
-----------------------------------
-
-Just like the :func:`sorted` built-in function, you can supply the
-``reverse`` option to return the results in reverse order:
-
-.. code-block:: pycon
-
- >>> a = ['a2', 'a9', 'a1', 'a4', 'a10']
- >>> natsorted(a, reverse=True)
- ['a10', 'a9', 'a4', 'a2', 'a1']
-
-Sorting Bytes
--------------
-
-Python is rather strict about comparing strings and bytes, and this
-can make it difficult to deal with collections of both. Because of the
-challenge of guessing which encoding should be used to decode a bytes
-array to a string, :mod:`natsort` does *not* try to guess and automatically
-convert for you; in fact, the official stance of :mod:`natsort` is to
-not support sorting bytes. Instead, some decoding convenience functions
-have been provided to you (see :ref:`bytes_help`) that allow you to
-provide a codec for decoding bytes through the ``key`` argument that
-will allow :mod:`natsort` to convert byte arrays to strings for sorting;
-these functions know not to raise an error if the input is not a byte
-array, so you can use the key on any arbitrary collection of data.
-
-.. code-block:: pycon
-
- >>> from natsort import as_ascii
- >>> a = [b'a', 14.0, 'b']
- >>> # natsorted(a) would raise a TypeError (bytes() < str())
- >>> natsorted(a, key=as_ascii) == [14.0, b'a', 'b']
- True
-
-Additionally, regular expressions cannot be run on byte arrays, making it
-so that :mod:`natsort` cannot parse them for numbers. As a result, if you
-run :mod:`natsort` on a list of bytes, you will get results that are like
-Python's default sorting behavior. Of course, you can use the decoding
-functions to solve this:
-
-.. code-block:: pycon
-
- >>> from natsort import as_utf8
- >>> a = [b'a56', b'a5', b'a6', b'a40']
- >>> natsorted(a) # doctest: +SKIP
- [b'a40', b'a5', b'a56', b'a6']
- >>> natsorted(a, key=as_utf8) == [b'a5', b'a6', b'a40', b'a56']
- True
-
-If you need a codec different from ASCII or UTF-8, you can use
-:func:`decoder` to generate a custom key:
-
-.. code-block:: pycon
-
- >>> from natsort import decoder
- >>> a = [b'a56', b'a5', b'a6', b'a40']
- >>> natsorted(a, key=decoder('latin1')) == [b'a5', b'a6', b'a40', b'a56']
- True
-
-Sorting a Pandas DataFrame
---------------------------
-
-Starting from Pandas version 1.1.0, the
-`sorting methods accept a "key" argument <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html>`_,
-so you can simply pass :func:`natsort_keygen` to the sorting methods and sort:
-
-.. code-block:: python
-
- import pandas as pd
- from natsort import natsort_keygen
- s = pd.Series(['2 ft 7 in', '1 ft 5 in', '10 ft 2 in', '2 ft 11 in', '7 ft 6 in'])
- s.sort_values(key=natsort_keygen())
- # 1 1 ft 5 in
- # 0 2 ft 7 in
- # 3 2 ft 11 in
- # 4 7 ft 6 in
- # 2 10 ft 2 in
- # dtype: object
-
-Similarly, if you need to sort the index there is
-`sort_index <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_index.html>`_
-of a DataFrame.
-
-If you are on an older version of Pandas, check out please check out
-`this answer on StackOverflow <https://stackoverflow.com/a/29582718/1399279>`_
-for ways to do this without the ``key`` argument to ``sort_values``.
+This page has been moved to the
+`natsort wiki <https://github.com/SethMMorton/natsort/wiki/How-Does-Natsort-Work%3F>`_.
diff --git a/docs/howitworks.rst b/docs/howitworks.rst
index d1c4559..4617a1b 100644
--- a/docs/howitworks.rst
+++ b/docs/howitworks.rst
@@ -6,1172 +6,11 @@
How Does Natsort Work?
======================
-.. contents::
- :local:
-
-:mod:`natsort` works by breaking strings into smaller sub-components (numbers
-or everything else), and returning these components in a tuple. Sorting
-tuples in Python is well-defined, and this fact is used to sort the input
-strings properly. But how does one break a string into sub-components?
-And what does one do to those components once they are split? Below I
-will explain the algorithm that was chosen for the :mod:`natsort` module,
-and some of the thinking that went into those design decisions. I will
-also mention some of the stumbling blocks I ran into because
-`getting sorting right is surprisingly hard`_.
-
-If you are impatient, you can skip to :ref:`tldr1` for the algorithm
-in the simplest case, and :ref:`tldr2`
-to see what extra code is needed to handle special cases.
-
-First, How Does Natural Sorting Work At a High Level?
------------------------------------------------------
-
-If I want to compare '2 ft 7 in' to '2 ft 11 in', I might do the following
-
-.. code-block:: pycon
-
- >>> '2 ft 7 in' < '2 ft 11 in'
- False
-
-We as humans know that the above should be true, but why does Python think it
-is false? Here is how it is performing the comparison:
-
-::
-
- '2' <=> '2' ==> equal, so keep going
- ' ' <=> ' ' ==> equal, so keep going
- 'f' <=> 'f' ==> equal, so keep going
- 't' <=> 't' ==> equal, so keep going
- ' ' <=> ' ' ==> equal, so keep going
- '7' <=> '1' ==> different, use result of '7' < '1'
-
-'7' evaluates as greater than '1' so the statement is false. When sorting, if
-a value is less than another it is placed first, so in our above example
-'2 ft 11 in' would end up before '2 ft 7 in', which is not correct. What to do?
-
-The best way to handle this is to break the string into sub-components
-of numbers and non-numbers, and then convert the numeric parts into
-:func:`float` or :func:`int` types. This will force Python to
-actually understand the context of what it is sorting and then "do the
-right thing." Luckily, it handles sorting lists of strings right
-out-of-the-box, so the only hard part is actually making this string-to-list
-transformation and then Python will handle the rest.
-
-::
-
- '2 ft 7 in' ==> (2, ' ft ', 7, ' in')
- '2 ft 11 in' ==> (2, ' ft ', 11, ' in')
-
-When Python compares the two, it roughly follows the below logic:
-
-::
-
- 2 <=> 2 ==> equal, so keep going
- ' ft ' <=> ' ft ' ==> a string is a special type of sequence - evaluate each character individually
- ||
- -->
- ' ' <=> ' ' ==> equal, so keep going
- 'f' <=> 'f' ==> equal, so keep going
- 't' <=> 't' ==> equal, so keep going
- ' ' <=> ' ' ==> equal, so keep going
- <== Back to parent sequence
- 7 <=> 11 ==> different, use the result of 7 < 11
-
-Clearly, seven is less than eleven, so our comparison is as we expect, and we
-would get the sorting order we wanted.
-
-At its heart, :mod:`natsort` is simply a tool to break strings into tuples,
-turning numbers in strings (i.e. ``'79'``) into *ints* and *floats* as it does this.
-
-Natsort's Approach
-------------------
-
-.. contents::
- :local:
-
-Decomposing Strings Into Sub-Components
-+++++++++++++++++++++++++++++++++++++++
-
-The first major hurtle to overcome is to decompose the string into
-sub-components. Remarkably, this turns out to be the easy part, owing mostly
-to Python's easy access to regular expressions. Breaking an arbitrary string
-based on a pattern is pretty straightforward.
-
-.. code-block:: pycon
-
- >>> import re
- >>> re.split(r'(\d+)', '2 ft 11 in')
- ['', '2', ' ft ', '11', ' in']
-
-Clear (assuming you can read regular expressions) and concise.
-
-The reason I began developing :mod:`natsort` in the first place was because I
-needed to handle the natural sorting of strings containing *real numbers*, not
-just unsigned integers as the above example contains. By real numbers, I mean
-those like ``-45.4920E-23``. :mod:`natsort` can handle just about any number
-definition; to that end, here are all the regular expressions used in
-:mod:`natsort`:
-
-.. code-block:: pycon
-
- >>> unsigned_int = r'([0-9]+)'
- >>> signed_int = r'([-+]?[0-9]+)'
- >>> unsigned_float = r'((?:[0-9]+\.?[0-9]*|\.[0-9]+)(?:[eE][-+]?[0-9]+)?)'
- >>> signed_float = r'([-+]?(?:[0-9]+\.?[0-9]*|\.[0-9]+)(?:[eE][-+]?[0-9]+)?)'
- >>> unsigned_float_no_exponent = r'((?:[0-9]+\.?[0-9]*|\.[0-9]+))'
- >>> signed_float_no_exponent = r'([-+]?(?:[0-9]+\.?[0-9]*|\.[0-9]+))'
-
-Note that ``"inf"`` and ``"nan"`` are deliberately omitted from the float
-definition because you wouldn't want (for example) ``"banana"`` to be converted
-into ``['ba', 'nan', 'a']``, Let's see an example:
-
-.. code-block:: pycon
-
- >>> re.split(signed_float, 'The mass of 3 electrons is 2.732815068E-30 kg')
- ['The mass of ', '3', ' electrons is ', '2.732815068E-30', ' kg']
-
-.. note::
-
- It is a bit of a lie to say the above are the complete regular expressions. In the
- actual code there is also handling for non-ASCII unicode characters (such as ⑦),
- but I will ignore that aspect of :mod:`natsort` in this discussion.
-
-Now, when the user wants to change the definition of a number, it is as easy as
-changing the pattern supplied to the regular expression engine.
-
-Choosing the right default is hard, though (well, in this case it shouldn't
-have been but I was rather thick-headed). In retrospect, it should have been
-obvious that since essentially all the code examples I had/have seen for
-natural sorting were for *unsigned integers*, I should have made the default
-definition of a number an *unsigned integer*. But, in the brash days of my
-youth I assumed that since my use case was real numbers, everyone else would
-be happier sorting by real numbers; so, I made the default definition of a
-number a *signed float with exponent*. `This astonished`_ `a lot`_ `of people`_
-(`and some people aren't very nice when they are astonished`_).
-Starting with :mod:`natsort` version 4.0.0 the default number definition was
-changed to an *unsigned integer* which satisfies the "least astonishment"
-principle, and I have not heard a complaint since.
-
-Coercing Strings Containing Numbers Into Numbers
-++++++++++++++++++++++++++++++++++++++++++++++++
-
-There has been some debate on Stack Overflow as to what method is best to
-coerce a string to a number if it can be coerced, and leaving it alone otherwise
-(see `this one for coercion`_ and `this one for checking`_ for some high traffic questions),
-but it mostly boils down to two different solutions, shown here:
-
-.. code-block:: pycon
-
- >>> def coerce_try_except(x):
- ... try:
- ... return int(x)
- ... except ValueError:
- ... return x
- ...
- >>> def coerce_regex(x):
- ... # Note that precompiling the regex is more performant,
- ... # but I do not show that here for clarity's sake.
- ... return int(x) if re.match(r'[-+]?\d+$', x) else x
- ...
-
-Here are some timing results run on my machine:
-
-.. code-block:: pycon
-
- In [0]: numbers = list(map(str, range(100))) # A list of numbers as strings
-
- In [1]: not_numbers = ['banana' + x for x in numbers]
-
- In [2]: %timeit [coerce_try_except(x) for x in numbers]
- 10000 loops, best of 3: 51.1 µs per loop
-
- In [3]: %timeit [coerce_try_except(x) for x in not_numbers]
- 1000 loops, best of 3: 289 µs per loop
-
- In [4]: %timeit [coerce_regex(x) for x in not_numbers]
- 10000 loops, best of 3: 67.6 µs per loop
-
- In [5]: %timeit [coerce_regex(x) for x in numbers]
- 10000 loops, best of 3: 123 µs per loop
-
-What can we learn from this? The ``try: except`` method (arguably the most
-"pythonic" of the solutions) is best for numeric input, but performs over 5X
-slower for non-numeric input. Conversely, the regular expression method, though
-slower than ``try: except`` for both input types, is more efficient for
-non-numeric input than for input that can be converted to an ``int``. Further,
-even though the regular expression method is slower for both input types, it is
-always at least twice as fast as the worst case for the ``try: except``.
-
-Why do I care? Shouldn't I just pick a method and not worry about it? Probably.
-However, I am very conscious about the performance of :mod:`natsort`, and want
-it to be a true drop-in replacement for :func:`sorted` without having to incur
-a performance penalty. For the purposes of :mod:`natsort`, there is no clear
-winner between the two algorithms - the data being passed to this function will
-likely be a mix of numeric and non-numeric string content. Do I use the
-``try: except`` method and hope the speed gains on numbers will offset the
-non-number performance, or do I use regular expressions and take the more
-stable performance?
-
-It turns out that within the context of :mod:`natsort`, some assumptions can be
-made that make a hybrid approach attractive. Because all strings are pre-split
-into numeric and non-numeric content *before* being passed to this coercion
-function, the assumption can be made that *if a string begins with a digit or a
-sign, it can be coerced into a number*.
-
-.. code-block:: pycon
-
- >>> def coerce_to_int(x):
- ... if x[0] in '0123456789+-':
- ... try:
- ... return int(x)
- ... except ValueError:
- ... return x
- ... else:
- ... return x
- ...
-
-So how does this perform compared to the standard coercion methods?
-
-.. code-block:: pycon
-
- In [6]: %timeit [coerce_to_int(x) for x in numbers]
- 10000 loops, best of 3: 71.6 µs per loop
-
- In [7]: %timeit [coerce_to_int(x) for x in not_numbers]
- 10000 loops, best of 3: 26.4 µs per loop
-
-The hybrid method eliminates most of the time wasted on numbers checking
-that it is in fact a number before passing to :func:`int`, and eliminates
-the time wasted in the exception stack for input that is not a number.
-
-That's as fast as we can get, right? In pure Python, probably. At least, it's
-close. But because I am crazy and a glutton for punishment, I decided to see
-if I could get any faster writing a C extension. It's called
-`fastnumbers`_ and contains a C implementation of the above coercion functions
-called :func:`fast_int`. How does it fair? Pretty well.
-
-.. code-block:: pycon
-
- In [8]: %timeit [fast_int(x) for x in numbers]
- 10000 loops, best of 3: 30.9 µs per loop
-
- In [9]: %timeit [fast_int(x) for x in not_numbers]
- 10000 loops, best of 3: 30 µs per loop
-
-During development of :mod:`natsort`, I wanted to ensure that using it did not
-get in the way of a user's program by introducing a performance penalty to
-their code. To that end, I do not feel like my adventures down the rabbit hole
-of optimization of coercion functions was a waste; I can confidently look users
-in the eye and say I considered every option in ensuring :mod:`natsort` is as
-efficient as possible. This is why if `fastnumbers`_ is installed it will be
-used for this step, and otherwise the hybrid method will be used.
-
-.. note::
-
- Modifying the hybrid coercion function for floats is straightforward.
-
- .. code-block:: pycon
-
- >>> def coerce_to_float(x):
- ... if x[0] in '.0123456789+-' or x.lower().lstrip()[:3] in ('nan', 'inf'):
- ... try:
- ... return float(x)
- ... except ValueError:
- ... return x
- ... else:
- ... return x
- ...
-
-.. _tldr1:
-
-TL;DR 1 - The Simple "No Special Cases" Algorithm
-+++++++++++++++++++++++++++++++++++++++++++++++++
-
-At this point, our :mod:`natsort` algorithm is essentially the following:
-
-.. code-block:: pycon
-
- >>> import re
- >>> def natsort_key(x, as_float=False, signed=False):
- ... if as_float:
- ... regex = signed_float if signed else unsigned_float
- ... else:
- ... regex = signed_int if signed else unsigned_int
- ... split_input = re.split(regex, x)
- ... split_input = filter(None, split_input) # removes null strings
- ... coerce = coerce_to_float if as_float else coerce_to_int
- ... return tuple(coerce(s) for s in split_input)
- ...
-
-I have written the above for clarity and not performance.
-This pretty much matches `most natural sort solutions for python on Stack Overflow`_
-(except the above includes customization of the definition of a number).
+This page has been moved to the
+`natsort wiki <https://github.com/SethMMorton/natsort/wiki/How-Does-Natsort-Work%3F>`_.
Special Cases Everywhere!
-------------------------
-.. contents::
- :local:
-
-.. image:: special_cases_everywhere.jpg
-
-If what I described in :ref:`TL;DR 1 <tldr1>` were
-all that :mod:`natsort` needed to
-do then there probably wouldn't be much need for a third-party module, right?
-Probably. But it turns out that in real-world data there are a lot of
-special cases that need to be handled, and in true `80%/20%`_ fashion, the
-majority of the code in :mod:`natsort` is devoted to handling special cases
-like those described below.
-
-Sorting Filesystem Paths
-++++++++++++++++++++++++
-
-`The first major special case I encountered was sorting filesystem paths`_
-(if you go to the link, you will see I didn't handle it well for a year...
-this was before I fully realized how much functionality I could really add
-to :mod:`natsort`). Let's apply the :func:`natsort_key` from above to some
-filesystem paths that you might see being auto-generated from your operating
-system:
-
-.. code-block:: pycon
-
- >>> paths = ['Folder (10)/file.tar.gz',
- ... 'Folder/file.tar.gz',
- ... 'Folder (1)/file (1).tar.gz',
- ... 'Folder (1)/file.tar.gz']
- >>> sorted(paths, key=natsort_key)
- ['Folder (1)/file (1).tar.gz', 'Folder (1)/file.tar.gz', 'Folder (10)/file.tar.gz', 'Folder/file.tar.gz']
-
-Well that's not right! What is ``'Folder/file.tar.gz'`` doing at the end?
-It has to do with the numerical ASCII code assigned to the space and
-``/`` characters in the `ASCII table`_. According to the `ASCII table`_, the
-space character (number 32) comes before the ``/`` character (number 47). If
-we remove the common prefix in all of the above strings (``'Folder'``), we
-can see why this happens:
-
-.. code-block:: pycon
-
- >>> ' (1)/file.tar.gz' < '/file.tar.gz'
- True
- >>> ' ' < '/'
- True
-
-This isn't very convenient... how do we solve it? We can split the path
-across the path separators and then sort. A convenient way do to this is
-with the :data:`Path.parts <pathlib.PurePath.parts>` property from
-:mod:`pathlib`:
-
-.. code-block:: pycon
-
- >>> import pathlib
- >>> sorted(paths, key=lambda x: tuple(natsort_key(s) for s in pathlib.Path(x).parts))
- ['Folder/file.tar.gz', 'Folder (1)/file (1).tar.gz', 'Folder (1)/file.tar.gz', 'Folder (10)/file.tar.gz']
-
-Almost! It seems like there is some funny business going on in the final
-filename component as well. We can solve that nicely and quickly with
-:data:`Path.suffixes <pathlib.PurePath.suffixes>` and :data:`Path.stem
-<pathlib.PurePath.stem>`.
-
-.. code-block:: pycon
-
- >>> def decompose_path_into_components(x):
- ... path_split = list(pathlib.Path(x).parts)
- ... # Remove the final filename component from the path.
- ... final_component = pathlib.Path(path_split.pop())
- ... # Split off all the extensions.
- ... suffixes = final_component.suffixes
- ... stem = final_component.name.replace(''.join(suffixes), '')
- ... # Remove the '.' prefix of each extension, and make that
- ... # final component a list of the stem and each suffix.
- ... final_component = [stem] + [x[1:] for x in suffixes]
- ... # Replace the split final filename component.
- ... path_split.extend(final_component)
- ... return path_split
- ...
- >>> def natsort_key_with_path_support(x):
- ... return tuple(natsort_key(s) for s in decompose_path_into_components(x))
- ...
- >>> sorted(paths, key=natsort_key_with_path_support)
- ['Folder/file.tar.gz', 'Folder (1)/file.tar.gz', 'Folder (1)/file (1).tar.gz', 'Folder (10)/file.tar.gz']
-
-This works because in addition to breaking the input by path separators,
-the final filename component is separated from its extensions as well.
-*Then*, each of these separated components is sent to the
-:mod:`natsort` algorithm, so the result is a tuple of tuples. Once that
-is done, we can see how comparisons can be done in the expected manner.
-
-.. code-block:: pycon
-
- >>> a = natsort_key_with_path_support('Folder (1)/file (1).tar.gz')
- >>> a
- (('Folder (', 1, ')'), ('file (', 1, ')'), ('tar',), ('gz',))
- >>>
- >>> b = natsort_key_with_path_support('Folder/file.tar.gz')
- >>> b
- (('Folder',), ('file',), ('tar',), ('gz',))
- >>>
- >>> a > b
- True
-
-.. note::
-
- The actual :meth:`decompose_path_into_components`-equivalent function in
- :mod:`natsort` actually has a few more heuristics than shown here so that
- it is not over-zealous in what it defines as a path suffix, but this has
- been omitted in this how-to for clarity.
-
-Comparing Different Types
-+++++++++++++++++++++++++
-
-`The second major special case I encountered was sorting of different types`_.
-On Python 2 (i.e. legacy Python), this mostly didnt't matter *too*
-much since it uses an arbitrary heuristic to allow traditionally un-comparable
-types to be compared (such as comparing ``'a'`` to ``1``). However, on Python 3
-(i.e. Python) it simply won't let you perform such nonsense, raising a
-:exc:`TypeError` instead.
-
-You can imagine that a module that breaks strings into tuples of numbers and
-strings is walking a dangerous line if it does not have special handling for
-comparing numbers and strings. My imagination was not so great at first.
-Let's take a look at all the ways this can fail with real-world data.
-
-.. code-block:: pycon
-
- >>> def natsort_key_with_poor_real_number_support(x):
- ... split_input = re.split(signed_float, x)
- ... split_input = filter(None, split_input) # removes null strings
- ... return tuple(coerce_to_float(s) for s in split_input)
- >>>
- >>> sorted([5, '4'], key=natsort_key_with_poor_real_number_support)
- Traceback (most recent call last):
- ...
- TypeError: ...
- >>>
- >>> sorted(['12 apples', 'apples'], key=natsort_key_with_poor_real_number_support)
- Traceback (most recent call last):
- ...
- TypeError: ...
- >>>
- >>> sorted(['version5.3.0', 'version5.3rc1'], key=natsort_key_with_poor_real_number_support)
- Traceback (most recent call last):
- ...
- TypeError: ...
-
-Let's break these down.
-
-#. The integer ``5`` is sent to ``re.split`` which expects only strings
- or bytes, which is a no-no.
-#. ``natsort_key_with_poor_real_number_support('12 apples') < natsort_key_with_poor_real_number_support('apples')``
- is the same as ``(12.0, ' apples') < ('apples',)``, and thus a number gets
- compared to a string [#f1]_ which also is a no-no.
-#. This one scores big on the astonishment scale, especially if one
- accidentally uses signed integers or real numbers when they mean
- to use unsigned integers.
- ``natsort_key_with_poor_real_number_support('version5.3.0') < natsort_key_with_poor_real_number_support('version5.3rc1')``
- is the same as ``('version', 5.3, 0.0) < ('version', 5.3, 'rc', 1.0)``,
- so in the third element a number gets compared to a string, once again
- the same old no-no. (The same would happen with ``'version5-3'`` and
- ``'version5-a'``, which would become ``('version', 5, -3)`` and
- ``('version', 5, '-a')``).
-
-As you might expect, the solution to the first issue is to wrap the
-``re.split`` call in a ``try: except:`` block and handle the number specially
-if a :exc:`TypeError` is raised. The second and third cases *could* be handled
-in a "special case" manner, meaning only respond and do something different
-if these problems are detected. But a less error-prone method is to ensure
-that the data is correct-by-construction, and this can be done by ensuring
-that the returned tuples *always* start with a string, and then alternate
-in a string-number-string-number-string pattern; this can be achieved by
-adding an empty string wherever the pattern is not followed [#f2]_. This ends
-up working out pretty nicely because empty strings are always "less" than
-any non-empty string, and we typically want numbers to come before strings.
-
-Let's take a look at how this works out.
-
-.. code-block:: pycon
-
- >>> from natsort.utils import sep_inserter
- >>> list(sep_inserter(iter(['apples']), ''))
- ['apples']
- >>>
- >>> list(sep_inserter(iter([12, ' apples']), ''))
- ['', 12, ' apples']
- >>>
- >>> list(sep_inserter(iter(['version', 5, -3]), ''))
- ['version', 5, '', -3]
- >>>
- >>> from natsort import natsort_keygen, ns
- >>> natsort_key_with_good_real_number_support = natsort_keygen(alg=ns.REAL)
- >>>
- >>> sorted([5, '4'], key=natsort_key_with_good_real_number_support)
- ['4', 5]
- >>>
- >>> sorted(['12 apples', 'apples'], key=natsort_key_with_good_real_number_support)
- ['12 apples', 'apples']
- >>>
- >>> sorted(['version5.3.0', 'version5.3rc1'], key=natsort_key_with_good_real_number_support)
- ['version5.3.0', 'version5.3rc1']
-
-How the "good" version works will be given in
-`TL;DR 2 - Handling Crappy, Real-World Input`_.
-
-Handling NaN
-++++++++++++
-
-`A rather unexpected special case I encountered was sorting collections containing NaN`_.
-Let's see what happens when you try to sort a plain old list of numbers when there
-is a **NaN** floating around in there.
-
-.. code-block:: pycon
-
- >>> danger = [7, float('nan'), 22.7, 19, -14, 59.123, 4]
- >>> sorted(danger)
- [7, nan, -14, 4, 19, 22.7, 59.123]
-
-Clearly that isn't correct, and for once it isn't my fault!
-`It's hard to compare floating point numbers`_. By definition, **NaN** is unorderable
-to any other number, and is never equal to any other number, including itself.
-
-.. code-block:: pycon
-
- >>> nan = float('nan')
- >>> 5 > nan
- False
- >>> 5 < nan
- False
- >>> 5 == nan
- False
- >>> 5 != nan
- True
- >>> nan == nan
- False
- >>> nan != nan
- True
-
-The implication of all this for us is that if there is an **NaN** in the
-data-set we are trying to sort, the data-set will end up being sorted in
-two separate yet individually sorted sequences - the one *before* the **NaN**,
-and the one *after*. This is because the ``<`` operation that is used
-to sort always returns :const:`False` with **NaN**.
-
-Because :mod:`natsort` aims to sort sequences in a way that does not surprise
-the user, keeping this behavior is not acceptable (I don't require my users
-to know how **NaN** will behave in a sorting algorithm). The simplest way to
-satisfy the "least astonishment" principle is to substitute **NaN** with
-some other value. But what value is *least* astonishing? I chose to replace
-**NaN** with :math:`-\infty` so that these poorly behaved elements always
-end up at the front where the users will most likely be alerted to their
-presence.
-
-.. code-block:: pycon
-
- >>> def fix_nan(x):
- ... if x != x: # only true for NaN
- ... return float('-inf')
- ... else:
- ... return x
- ...
-
-Let's check out :ref:`TL;DR 2 <tldr2>` to see how this can be
-incorporated into the simple key function from :ref:`TL;DR 1 <tldr1>`.
-
-.. _tldr2:
-
-TL;DR 2 - Handling Crappy, Real-World Input
-+++++++++++++++++++++++++++++++++++++++++++
-
-Let's see how our elegant key function from :ref:`TL;DR 1 <tldr1>` has
-become bastardized in order to support handling mixed real-world data
-and user customizations.
-
-.. code-block:: pycon
-
- >>> def natsort_key(x, as_float=False, signed=False, as_path=False):
- ... if as_float:
- ... regex = signed_float if signed else unsigned_float
- ... else:
- ... regex = signed_int if signed else unsigned_int
- ... try:
- ... if as_path:
- ... x = decompose_path_into_components(x) # Decomposes into list of strings
- ... # If this raises a TypeError, input is not a string.
- ... split_input = re.split(regex, x)
- ... except TypeError:
- ... try:
- ... # Does this need to be applied recursively (list-of-list)?
- ... return tuple(map(natsort_key, x))
- ... except TypeError:
- ... # Must be a number
- ... ret = ('', fix_nan(x)) # Maintain string-number-string pattern
- ... return (ret,) if as_path else ret # as_path returns tuple-of-tuples
- ... else:
- ... split_input = filter(None, split_input) # removes null strings
- ... # Note that the coerce_to_int/coerce_to_float functions
- ... # are also modified to use the fix_nan function.
- ... if as_float:
- ... coerced_input = (coerce_to_float(s) for s in split_input)
- ... else:
- ... coerced_input = (coerce_to_int(s) for s in split_input)
- ... return tuple(sep_inserter(coerced_input, ''))
- ...
-
-And this doesn't even show handling :class:`bytes` type! Notice that we have
-to do non-obvious things like modify the return form of numbers when ``as_path``
-is given, just to avoid comparing strings and numbers for the case in which a
-user provides input like ``['/home/me', 42]``.
-
-Let's take it out for a spin!
-
-.. code-block:: pycon
-
- >>> danger = [7, float('nan'), 22.7, '19', '-14', '59.123', 4]
- >>> sorted(danger, key=lambda x: natsort_key(x, as_float=True, signed=True))
- [nan, '-14', 4, 7, '19', 22.7, '59.123']
- >>>
- >>> paths = ['Folder (1)/file.tar.gz',
- ... 'Folder/file.tar.gz',
- ... 123456]
- >>> sorted(paths, key=lambda x: natsort_key(x, as_path=True))
- [123456, 'Folder/file.tar.gz', 'Folder (1)/file.tar.gz']
-
-Here Be Dragons: Adding Locale Support
---------------------------------------
-
-.. contents::
- :local:
-
-Probably the most challenging special case I had to handle was getting
-:mod:`natsort` to handle sorting the non-numerical parts of input
-correctly, and also allowing it to sort the numerical bits in different
-locales. This was in no way what I originally set out to do with this
-library, so I was
-`caught a bit off guard when the request was initially made`_.
-I discovered the :mod:`locale` library, and assumed that if it's part of
-Python's StdLib there can't be too many dragons, right?
-
-.. admonition:: INCOMPLETE LIST OF DRAGONS
-
- - https://github.com/SethMMorton/natsort/issues/21
- - https://github.com/SethMMorton/natsort/issues/22
- - https://github.com/SethMMorton/natsort/issues/23
- - https://github.com/SethMMorton/natsort/issues/36
- - https://github.com/SethMMorton/natsort/issues/44
- - https://bugs.python.org/issue2481
- - https://bugs.python.org/issue23195
- - https://stackoverflow.com/questions/3412933/python-not-sorting-unicode-properly-strcoll-doesnt-help
- - https://stackoverflow.com/questions/22203550/sort-dictionary-by-key-using-locale-collation
- - https://stackoverflow.com/questions/33459384/unicode-character-not-in-range-when-calling-locale-strxfrm
- - https://stackoverflow.com/questions/36431810/sort-numeric-lines-with-thousand-separators
- - https://stackoverflow.com/questions/45734562/how-can-i-get-a-reasonable-string-sorting-with-python
-
-These can be summed up as follows:
-
-#. :mod:`locale` is a thin wrapper over your operating system's *locale*
- library, so if *that* is broken (like it is on BSD and OSX) then
- :mod:`locale` is broken in Python.
-#. Because of a bug in legacy Python (i.e. Python 2), there was no uniform
- way to use the :mod:`locale` sorting functionality between legacy Python
- and Python (luckily this is no longer an issue now that Python 2 is EOL).
-#. People have differing opinions of how capitalization should affect word
- order.
-#. There is no built-in way to handle locale-dependent thousands separators
- and decimal points *robustly*.
-#. Proper handling of Unicode is complicated.
-#. Proper handling of :mod:`locale` is complicated.
-
-Easily over half of the code in :mod:`natsort` is in some way dealing with some
-aspect of :mod:`locale` or basic case handling. It would have been impossible
-to get right without a `really good`_ `testing strategy`_.
-
-Don't expect any more TL;DR's... if you want to see how all this is fully
-incorporated into the :mod:`natsort` algorithm then please take a look
-`at the code`_. However, I will hint at how specific steps are taken in
-each section.
-
-Let's see how we can handle some of the dragons, one-by-one.
-
-Basic Case Control Support
-++++++++++++++++++++++++++
-
-Without even thinking about the mess that is adding :mod:`locale` support,
-:mod:`natsort` can introduce support for controlling how case is interpreted.
-
-First, let's take a look at how it is sorted by default (due to
-where characters lie on the `ASCII table`_).
-
-.. code-block:: pycon
-
- >>> a = ['Apple', 'corn', 'Corn', 'Banana', 'apple', 'banana']
- >>> sorted(a)
- ['Apple', 'Banana', 'Corn', 'apple', 'banana', 'corn']
-
-All uppercase letters come before lowercase letters in the `ASCII table`_,
-so all capitalized words appear first. Not everyone agrees that this
-is the correct order. Some believe that the capitalized words should
-be last (``['apple', 'banana', 'corn', 'Apple', 'Banana', 'Corn']``).
-Some believe that both the lowercase and uppercase versions
-should appear together
-(``['Apple', 'apple', 'Banana', 'banana', 'Corn', 'corn']``).
-Some believe that both should be true ☹. Some people don't care at all [#f3]_.
-
-Solving the first case (I call it *LOWERCASEFIRST*) is actually pretty
-easy... just call the :meth:`str.swapcase` method on the input.
-
-.. code-block:: pycon
-
- >>> sorted(a, key=lambda x: x.swapcase())
- ['apple', 'banana', 'corn', 'Apple', 'Banana', 'Corn']
-
-The last (i call it *IGNORECASE*) is pretty easy.
-Simply call :meth:`str.casefold` on the input (it's like :meth:`std.lowercase`
-but does a better job on non-latin character sets).
-
-.. code-block:: pycon
-
- >>> sorted(a, key=lambda x: x.casefold())
- ['Apple', 'apple', 'Banana', 'banana', 'corn', 'Corn']
-
-The middle case (I call it *GROUPLETTERS*) is less straightforward.
-The most efficient way to handle this is to duplicate each character
-with its lowercase version and then the original character.
-
-.. code-block:: pycon
-
- >>> import itertools
- >>> def groupletters(x):
- ... return ''.join(itertools.chain.from_iterable((y.casefold(), y) for y in x))
- ...
- >>> groupletters('Apple')
- 'aAppppllee'
- >>> groupletters('apple')
- 'aappppllee'
- >>> sorted(a, key=groupletters)
- ['Apple', 'apple', 'Banana', 'banana', 'Corn', 'corn']
-
-The effect of this is that both ``'Apple'`` and ``'apple'`` are
-placed adjacent to each other because their transformations both begin
-with ``'a'``, and then the second character can be used to order them
-appropriately with respect to each other.
-
-There's a problem with this, though. Within the context of :mod:`natsort`
-we are trying to correctly sort numbers and those should be left alone.
-
-.. code-block:: pycon
-
- >>> a = ['Apple5', 'apple', 'Apple4E10', 'Banana']
- >>> sorted(a, key=lambda x: natsort_key(x, as_float=True))
- ['Apple5', 'Apple4E10', 'Banana', 'apple']
- >>> sorted(a, key=lambda x: natsort_key(groupletters(x), as_float=True))
- ['Apple4E10', 'Apple5', 'apple', 'Banana']
- >>> groupletters('Apple4E10')
- 'aAppppllee44eE1100'
-
-We messed up the numbers! Looks like :func:`groupletters` needs to be applied
-*after* the strings are broken into their components. I'm not going to show
-how this is done here, but basically it requires applying the function in
-the ``else:`` block of :func:`coerce_to_int`/:func:`coerce_to_float`.
-
-.. code-block:: pycon
-
- >>> better_groupletters = natsort_keygen(alg=ns.GROUPLETTERS | ns.REAL)
- >>> better_groupletters('Apple4E10')
- ('aAppppllee', 40000000000.0)
- >>> sorted(a, key=better_groupletters)
- ['Apple5', 'Apple4E10', 'apple', 'Banana']
-
-Of course, applying both *LOWERCASEFIRST* and *GROUPLETTERS* is just
-a matter of turning on both functions.
-
-Basic Unicode Support
-+++++++++++++++++++++
-
-Unicode is hard and complicated. Here's an example.
-
-.. code-block:: pycon
-
- >>> b = [b'\x66', b'\x65', b'\xc3\xa9', b'\x65\xcc\x81', b'\x61', b'\x7a']
- >>> a = [x.decode('utf8') for x in b]
- >>> a # doctest: +SKIP
- ['f', 'e', 'é', 'é', 'a', 'z']
- >>> sorted(a) # doctest: +SKIP
- ['a', 'e', 'é', 'f', 'z', 'é']
-
-There are more than one way to represent the character 'é' in Unicode.
-In fact, many characters have multiple representations. This is a challenge
-because comparing the two representations would return ``False`` even though
-they *look* the same.
-
-.. code-block:: pycon
-
- >>> a[2] == a[3]
- False
-
-Alas, since characters are compared based on the numerical value of their
-representation, sorting Unicode often gives unexpected results (like seeing
-'é' come both *before* and *after* 'z').
-
-The original approach that :mod:`natsort` took with respect to non-ASCII
-Unicode characters was to say "just use
-the :mod:`locale` or :mod:`PyICU` library" and then cross it's fingers
-and hope those libraries take care of it. As you will find in the following
-sections, that comes with its own baggage, and turned out to not always work
-anyway (see https://stackoverflow.com/q/45734562/1399279). A more robust
-approach is to handle the Unicode out-of-the-box without invoking a
-heavy-handed library like :mod:`locale` or :mod:`PyICU`.
-To do this, we must use *normalization*.
-
-To fully understand Unicode normalization,
-`check out some official Unicode documentation`_.
-Just kidding... that's too much text. The following StackOverflow answers do
-a good job at explaining Unicode normalization in simple terms:
-https://stackoverflow.com/a/7934397/1399279 and
-https://stackoverflow.com/a/7931547/1399279. Put simply, normalization
-ensures that Unicode characters with multiple representations are in
-some canonical and consistent representation so that (for example) comparisons
-of the characters can be performed in a sane way. The following discussion
-assumes you at least read the StackOverflow answers.
-
-Looking back at our 'é' example, we can see that the two versions were
-constructed with the byte strings ``b'\xc3\xa9'`` and ``b'\x65\xcc\x81'``.
-The former representation is actually
-`LATIN SMALL LETTER E WITH ACUTE <https://www.fileformat.info/info/unicode/char/e9/index.htm>`_
-and is a single character in the Unicode standard. This is known as the
-*compressed form* and corresponds to the 'NFC' normalization scheme.
-The latter representation is actually the letter 'e' followed by
-`COMBINING ACUTE ACCENT <https://www.fileformat.info/info/unicode/char/0301/index.htm>`_
-and so is two characters in the Unicode standard. This is known as the
-*decompressed form* and corresponds to the 'NFD' normalization scheme.
-Since the first character in the decompressed form is actually the letter 'e',
-when compared to other ASCII characters it fits where you might expect.
-Unfortunately, all Unicode compressed form characters come after the
-ASCII characters and so they always will be placed after 'z' when sorting.
-
-It seems that most Unicode data is stored and shared in the compressed form
-which makes it challenging to sort. This can be solved by normalizing all
-incoming Unicode data to the decompressed form ('NFD') and *then* sorting.
-
-.. code-block:: pycon
-
- >>> import unicodedata
- >>> c = [unicodedata.normalize('NFD', x) for x in a]
- >>> c # doctest: +SKIP
- ['f', 'e', 'é', 'é', 'a', 'z']
- >>> sorted(c) # doctest: +SKIP
- ['a', 'e', 'é', 'é', 'f', 'z']
-
-Huzzah! Sane sorting without having to resort to :mod:`locale`!
-
-Using Locale to Compare Strings
-+++++++++++++++++++++++++++++++
-
-The :mod:`locale` module is actually pretty cool, and provides lowly
-spare-time programmers like myself a way to handle the daunting task
-of proper locale-dependent support of their libraries and utilities.
-Having said that, it can be a bit of a bear to get right,
-`although they do point out in the documentation that it will be painful to use`_.
-Aside from the caveats spelled out in that link, it turns out that just
-comparing strings with :mod:`locale` in a cross-platform and
-cross-python-version manner is not as straightforward as one might hope.
-
-First, how to use :mod:`locale` to compare strings? It's actually
-pretty straightforward. Simply run the input through the :mod:`locale`
-transformation function :func:`locale.strxfrm`.
-
-.. code-block:: pycon
-
- >>> import locale, sys
- >>> locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
- 'en_US.UTF-8'
- >>> a = ['a', 'b', 'ä']
- >>> sorted(a)
- ['a', 'b', 'ä']
- >>> # The below fails on OSX, so don't run doctest on darwin.
- >>> is_osx = sys.platform == 'darwin'
- >>> sorted(a, key=locale.strxfrm) if not is_osx else ['a', 'ä', 'b']
- ['a', 'ä', 'b']
- >>>
- >>> a = ['apple', 'Banana', 'banana', 'Apple']
- >>> sorted(a, key=locale.strxfrm) if not is_osx else ['apple', 'Apple', 'banana', 'Banana']
- ['apple', 'Apple', 'banana', 'Banana']
-
-It turns out that locale-aware sorting groups numbers in the same
-way as turning on *GROUPLETTERS* and *LOWERCASEFIRST*.
-The trick is that you have to apply :func:`locale.strxfrm` only to non-numeric
-characters; otherwise, numbers won't be parsed properly. Therefore, it must
-be applied as part of the :func:`coerce_to_int`/:func:`coerce_to_float`
-functions in a manner similar to :func:`groupletters`.
-
-Unicode Support With Local
-++++++++++++++++++++++++++
-
-Remember how in the `Basic Unicode Support`_ section I mentioned that we
-use the "decompressed" Unicode normalization form (e.g. NFD) on all inputs
-to ensure the order is as expected?
-
-If you have been following along so far, you probably expect that it is not
-that easy. You would be correct.
-
-It turns out that some locales (but not all) expect the input to be in
-"compressed form" (e.g. NFC) or the ordering is not as you might expect.
-`Check out this issue for a real-world example`_. Here's a relevant
-snippet of code
-
-.. code-block:: pycon
-
- In [1]: import locale, unicodedata
-
- In [2]: a = ['Aš', 'Cheb', 'Česko', 'Cibulov', 'Znojmo', 'Žilina']
-
- In [3]: locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
- Out[3]: 'en_US.UTF-8'
-
- In [4]: sorted(a, key=locale.strxfrm)
- Out[4]: ['Aš', 'Česko', 'Cheb', 'Cibulov', 'Žilina', 'Znojmo']
-
- In [5]: sorted(a, key=lambda x: locale.strxfrm(unicodedata.normalize("NFD", x)))
- Out[5]: ['Aš', 'Česko', 'Cheb', 'Cibulov', 'Žilina', 'Znojmo']
-
- In [6]: sorted(a, key=lambda x: locale.strxfrm(unicodedata.normalize("NFC", x)))
- Out[6]: ['Aš', 'Česko', 'Cheb', 'Cibulov', 'Žilina', 'Znojmo']
-
- In [7]: locale.setlocale(locale.LC_ALL, 'de_DE.UTF-8')
- Out[7]: 'de_DE.UTF-8'
-
- In [8]: sorted(a, key=locale.strxfrm)
- Out[8]: ['Aš', 'Česko', 'Cheb', 'Cibulov', 'Žilina', 'Znojmo']
-
- In [9]: sorted(a, key=lambda x: locale.strxfrm(unicodedata.normalize("NFD", x)))
- Out[9]: ['Aš', 'Česko', 'Cheb', 'Cibulov', 'Žilina', 'Znojmo']
-
- In [10]: sorted(a, key=lambda x: locale.strxfrm(unicodedata.normalize("NFC", x)))
- Out[10]: ['Aš', 'Česko', 'Cheb', 'Cibulov', 'Žilina', 'Znojmo']
-
- In [11]: locale.setlocale(locale.LC_ALL, 'cs_CZ.UTF-8')
- Out[11]: 'cs_CZ.UTF-8'
-
- In [12]: sorted(a, key=locale.strxfrm)
- Out[12]: ['Aš', 'Cibulov', 'Česko', 'Cheb', 'Znojmo', 'Žilina']
-
- In [13]: sorted(a, key=lambda x: locale.strxfrm(unicodedata.normalize("NFD", x)))
- Out[13]: ['Aš', 'Česko', 'Cibulov', 'Cheb', 'Žilina', 'Znojmo']
-
- In [14]: sorted(a, key=lambda x: locale.strxfrm(unicodedata.normalize("NFC", x)))
- Out[14]: ['Aš', 'Cibulov', 'Česko', 'Cheb', 'Znojmo', 'Žilina']
-
-Two out of three locales sort the same data in the same order no matter how the unicode
-input was normalized, but Czech seems to care how the input is formatted!
-
-So, everthing mentioned in `Basic Unicode Support`_ is conditional on whether
-or not the user wants to use the :mod:`locale` library or not. If not, then
-"NFD" normalization is used. If they do, "NFC" normalization is used.
-
-Handling Broken Locale On OSX
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-But what if the underlying *locale* implementation that :mod:`locale`
-relies upon is simply broken? It turns out that the *locale* library on
-OSX (and other BSD systems) is broken (and for some reason has never been
-fixed?), and so :mod:`locale` does not work as expected.
-
-How do I define doesn't work as expected?
-
-.. code-block:: pycon
-
- >>> a = ['apple', 'Banana', 'banana', 'Apple']
- >>> sorted(a)
- ['Apple', 'Banana', 'apple', 'banana']
- >>>
- >>> sorted(a, key=locale.strxfrm) if is_osx else sorted(a)
- ['Apple', 'Banana', 'apple', 'banana']
-
-IT'S SORTING AS IF :func:`locale.stfxfrm` WAS NEVER USED!! (and it's worse
-once non-ASCII characters get thrown into the mix.) I'm really not
-sure why this is considered OK for the OSX/BSD maintainers to not fix,
-but it's more than frustrating for poor developers who have been dragged
-into the *locale* game kicking and screaming. *<deep breath>*.
-
-So, how to deal with this situation? There are two ways to do so.
-
-#. Detect if :mod:`locale` is sorting incorrectly (i.e. ``dumb``) by seeing
- if ``'A'`` is sorted before ``'a'`` (incorrect) or not.
-
- .. code-block:: pycon
-
- >>> # This is genuinely the name of this function.
- >>> # See natsort.compat.locale.py
- >>> def dumb_sort():
- ... return locale.strxfrm('A') < locale.strxfrm('a')
- ...
-
- If a ``dumb`` *locale* implementation is found, then automatically
- turn on *LOWERCASEFIRST* and *GROUPLETTERS*.
-#. Use an alternate library if installed. `ICU <http://site.icu-project.org/>`_
- is a great and powerful library that has a pretty decent Python port
- called (you guessed it) `PyICU <https://pypi.org/project/PyICU/>`_.
- If a user has this library installed on their computer, :mod:`natsort`
- chooses to use that instead of :mod:`locale`. With a little bit of
- planning, one can write a set of wrapper functions that call
- the correct library under the hood such that the business logic never
- has to know what library is being used (see `natsort.compat.locale.py`_).
-
-Let me tell you, this little complication really makes a challenge of testing
-the code, since one must set up different environments on different operating
-systems in order to test all possible code paths. Not to mention that
-certain checks *will* fail for certain operating systems and environments
-so one must be diligent in either writing the tests not to fail, or ignoring
-those tests when on offending environments.
-
-Handling Locale-Aware Numbers
-+++++++++++++++++++++++++++++
-
-`Thousands separator support`_ is a problem that I knew would someday be
-requested but had decided to push off until a rainy day. One day it finally
-rained, and I decided to tackle the problem.
-
-So what is the problem? Consider the number ``1,234,567`` (assuming the
-``','`` is the thousands separator). Try to run that through :func:`int`
-and you will get a :exc:`ValueError`. To handle this properly the thousands
-separators must be removed.
-
-.. code-block:: pycon
-
- >>> float('1,234,567'.replace(',', ''))
- 1234567.0
-
-What if, in our current locale, the thousands separator is ``'.'`` and
-the ``','`` is the decimal separator (like for the German locale *de_DE*)?
-
-.. code-block:: pycon
-
- >>> float('1.234.567'.replace('.', '').replace(',', '.'))
- 1234567.0
- >>> float('1.234.567,89'.replace('.', '').replace(',', '.'))
- 1234567.89
-
-This is pretty much what :func:`locale.atoi` and :func:`locale.atof` do
-under the hood. So what's the problem? Why doesn't :mod:`natsort` just
-use this method under its hood?
-Well, let's take a look at what would happen if we send some possible
-:mod:`natsort` input through our the above function:
-
-.. code-block:: pycon
-
- >>> natsort_key('1,234 apples, please.'.replace(',', ''))
- ('', 1234, ' apples please.')
- >>> natsort_key('Sir, €1.234,50 please.'.replace('.', '').replace(',', '.'), as_float=True)
- ('Sir. €', 1234.5, ' please')
-
-Any character matching the thousands separator was dropped, and anything
-matching the decimal separator was changed to ``'.'``! If these characters
-were critical to how your data was ordered, this would break :mod:`natsort`.
-
-The first solution one might consider would be to first decompose the
-input into sub-components (like we did for the *GROUPLETTERS* method
-above) and then only apply these transformations on the number components.
-This is a chicken-and-egg problem, though, because *we cannot appropriately
-separate out the numbers because of the thousands separators and
-non-'.' decimal separators* (well, at least not without making multiple
-passes over the data which I do not consider to be a valid option).
-
-Regular expressions to the rescue! With regular expressions, we can
-remove the thousands separators and change the decimal separator only
-when they are actually within a number. Once the input has been
-pre-processed with this regular expression, all the infrastructure
-shown previously will work.
-
-Beware, these regular expressions will make your eyes bleed.
-
-.. code-block:: pycon
-
- >>> decimal = ',' # Assume German locale, so decimal separator is ','
- >>> # Look-behind assertions cannot accept range modifiers, so instead of i.e.
- >>> # (?<!\.[0-9]{1,3}) I have to repeat the look-behind for 1, 2, and 3.
- >>> nodecimal = r'(?<!{dec}[0-9])(?<!{dec}[0-9]{{2}})(?<!{dec}[0-9]{{3}})'.format(dec=decimal)
- >>> strip_thousands = r'''
- ... (?<=[0-9]{{1}}) # At least 1 number
- ... (?<![0-9]{{4}}) # No more than 3 numbers
- ... {nodecimal} # Cannot follow decimal
- ... {thou} # The thousands separator
- ... (?=[0-9]{{3}} # Three numbers must follow
- ... ([^0-9]|$) # But a non-number after that
- ... )
- ... '''.format(nodecimal=nodecimal, thou=re.escape('.')) # Thousands separator is '.' in German locale.
- ...
- >>> re.sub(strip_thousands, '', 'Sir, €1.234,50 please.', flags=re.X)
- 'Sir, €1234,50 please.'
- >>>
- >>> # The decimal point must be preceded by a number or after
- >>> # a number. This option only needs to be performed in the
- >>> # case when the decimal separator for the locale is not '.'.
- >>> switch_decimal = r'(?<=[0-9]){decimal}|{decimal}(?=[0-9])'
- >>> switch_decimal = switch_decimal.format(decimal=decimal)
- >>> re.sub(switch_decimal, '.', 'Sir, €1234,50 please.', flags=re.X)
- 'Sir, €1234.50 please.'
- >>>
- >>> natsort_key('Sir, €1234.50 please.', as_float=True)
- ('Sir, €', 1234.5, ' please.')
-
-Final Thoughts
---------------
-
-My hope is that users of :mod:`natsort` never have to think about or worry
-about all the bookkeeping or any of the details described above, and that using
-:mod:`natsort` seems to magically "just work". For those of you who
-took the time to read this engineering description, I hope it has enlightened
-you to some of the issues that can be encountered when code is released
-into the wild and has to accept "real-world data", or to what happens
-to developers who naïvely make bold assumptions that are counter to
-what the rest of the world assumes.
-
-.. rubric:: Footnotes
-
-.. [#f1]
- *"But if you hadn't removed the leading empty string from re.split this
- wouldn't have happened!!"* I can hear you saying. Well, that's true. I don't
- have a *great* reason for having done that except that in an earlier
- non-optimal incarnation of the algorithm I needed to it, and it kind of
- stuck, and it made other parts of the code easier if the assumption that
- there were no empty strings was valid.
-.. [#f2]
- I'm not going to show how this is implemented in this document,
- but if you are interested you can look at the code to
- :func:`sep_inserter` in `util.py`_.
-.. [#f3]
- Handling each of these is straightforward, but coupled with the rapidly
- fracturing execution paths presented in :ref:`TL;DR 2 <tldr2>` one can
- imagine this will get out of hand quickly. If you take a look at
- `natsort.py`_ and `util.py`_ you can observe that to avoid this I take
- a more functional approach to construting the :mod:`natsort` algorithm
- as opposed to the procedural approach illustrated in
- :ref:`TL;DR 1 <tldr1>` and :ref:`TL;DR 2 <tldr2>`.
-
-.. _ASCII table: https://www.asciitable.com/
-.. _getting sorting right is surprisingly hard: http://www.compciv.org/guides/python/fundamentals/sorting-collections-with-sorted/
-.. _This astonished: https://github.com/SethMMorton/natsort/issues/19
-.. _a lot: https://stackoverflow.com/questions/29548742/python-natsort-sort-strings-recursively
-.. _of people: https://stackoverflow.com/questions/24045348/sort-set-of-numbers-in-the-form-xx-yy-in-python
-.. _and some people aren't very nice when they are astonished:
- https://github.com/xolox/python-naturalsort/blob/ed3e6b6ffaca3bdea3b76e08acbb8bd2a5fee463/README.rst#why-another-natsort-module
-.. _fastnumbers: https://github.com/SethMMorton/fastnumbers
-.. _as part of my testing: https://github.com/SethMMorton/natsort/blob/master/test_natsort/slow_splitters.py
-.. _this one for coercion: https://stackoverflow.com/questions/736043/checking-if-a-string-can-be-converted-to-float-in-python
-.. _this one for checking: https://stackoverflow.com/questions/354038/how-do-i-check-if-a-string-is-a-number-float
-.. _most natural sort solutions for python on Stack Overflow: https://stackoverflow.com/q/4836710/1399279
-.. _80%/20%: https://en.wikipedia.org/wiki/Pareto_principle
-.. _The first major special case I encountered was sorting filesystem paths: https://github.com/SethMMorton/natsort/issues/3
-.. _The second major special case I encountered was sorting of different types: https://github.com/SethMMorton/natsort/issues/7
-.. _A rather unexpected special case I encountered was sorting collections containing NaN:
- https://github.com/SethMMorton/natsort/issues/27
-.. _It's hard to compare floating point numbers: http://www.drdobbs.com/cpp/its-hard-to-compare-floating-point-numbe/240149806
-.. _caught a bit off guard when the request was initially made: https://github.com/SethMMorton/natsort/issues/14
-.. _at the code: https://github.com/SethMMorton/natsort/tree/master/natsort
-.. _natsort.py: https://github.com/SethMMorton/natsort/blob/master/natsort/natsort.py
-.. _util.py: https://github.com/SethMMorton/natsort/blob/master/natsort/util.py
-.. _although they do point out in the documentation that it will be painful to use:
- https://docs.python.org/3/library/locale.html#background-details-hints-tips-and-caveats
-.. _natsort.compat.locale.py: https://github.com/SethMMorton/natsort/blob/master/natsort/compat/locale.py
-.. _Thousands separator support: https://github.com/SethMMorton/natsort/issues/36
-.. _really good: https://hypothesis.readthedocs.io/en/latest/
-.. _testing strategy: https://docs.pytest.org/en/latest/
-.. _check out some official Unicode documentation: https://unicode.org/reports/tr15/
-.. _Check out this issue for a real-world example: https://github.com/SethMMorton/natsort/issues/140 \ No newline at end of file
+This page has been moved to the
+`natsort wiki <https://github.com/SethMMorton/natsort/wiki/How-Does-Natsort-Work%3F#special-cases-everywhere>`_.
diff --git a/docs/locale_issues.rst b/docs/locale_issues.rst
index 56cd5a9..3539904 100644
--- a/docs/locale_issues.rst
+++ b/docs/locale_issues.rst
@@ -6,92 +6,5 @@
Possible Issues with :func:`~natsort.humansorted` or ``ns.LOCALE``
==================================================================
-Being Locale-Aware Means Both Numbers and Non-Numbers
------------------------------------------------------
-
-In addition to modifying how characters are sorted, ``ns.LOCALE`` will take
-into account locale-dependent thousands separators (and locale-dependent
-decimal separators if ``ns.FLOAT`` is enabled). This means that if you are in a
-locale that uses commas as the thousands separator, a number like
-``123,456`` will be interpreted as ``123456``. If this is not what you want,
-you may consider using ``ns.LOCALEALPHA`` which will only enable locale-aware
-sorting for non-numbers (similarly, ``ns.LOCALENUM`` enables locale-aware
-sorting only for numbers).
-
-Regenerate Key With :func:`~natsort.natsort_keygen` After Changing Locale
--------------------------------------------------------------------------
-
-When :func:`~natsort.natsort_keygen` is called it returns a key function that
-hard-codes the provided settings. This means that the key returned when
-``ns.LOCALE`` is used contains the settings specified by the locale
-*loaded at the time the key is generated*. If you change the locale,
-you should regenerate the key to account for the new locale.
-
-Corollary: Do Not Reuse :func:`~natsort.natsort_keygen` After Changing Locale
-+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
-If you change locale, the old function will not work as expected.
-The :mod:`locale` library works with a global state. When
-:func:`~natsort.natsort_keygen` is called it does the best job that it can to
-make the returned function as static as possible and independent of the global
-state, but the :func:`locale.strxfrm` function must access this global state to
-work; therefore, if you change locale and use ``ns.LOCALE`` then you should
-discard the old key.
-
-.. note:: If you use `PyICU`_ then you may be able to reuse keys after changing
- locale.
-
-The :mod:`locale` Module From the StdLib Has Issues
----------------------------------------------------
-
-:mod:`natsort` will use `PyICU`_ for :func:`~natsort.humansorted` or
-``ns.LOCALE`` if it is installed. If not, it will fall back on the
-:mod:`locale` library from the Python stdlib. If you do not have `PyICU`_
-installed, please keep the following known problems and issues in mind.
-
-.. note:: Remember, if you have `PyICU`_ installed you shouldn't need to worry
- about any of these.
-
-Explicitly Set the Locale Before Using ``ns.LOCALE``
-++++++++++++++++++++++++++++++++++++++++++++++++++++
-
-I have found that unless you explicitly set a locale, the sorted order may not
-be what you expect. Setting this is straightforward
-(in the below example I use 'en_US.UTF-8', but you should use your
-locale):
-
-.. code-block:: pycon
-
- >>> import locale
- >>> locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
- 'en_US.UTF-8'
-
-.. _bug_note:
-
-The :mod:`locale` Module Is Broken on Mac OS X
-++++++++++++++++++++++++++++++++++++++++++++++
-
-It's not Python's fault, but the OS... the locale library for BSD-based systems
-(of which Mac OS X is one) is broken. See the following links:
-
- - https://stackoverflow.com/questions/3412933/python-not-sorting-unicode-properly-strcoll-doesnt-help
- - https://bugs.python.org/issue23195
- - https://github.com/SethMMorton/natsort/issues/21 (contains instructons on installing)
- - https://stackoverflow.com/questions/33459384/unicode-character-not-in-range-when-calling-locale-strxfrm
- - https://github.com/SethMMorton/natsort/issues/34
-
-Of course, installing `PyICU`_ fixes this, but if you don't want to or cannot
-install this there is some hope.
-
- 1. As of ``natsort`` version 4.0.0, ``natsort`` is configured
- to compensate for a broken ``locale`` library. When sorting non-numbers
- it will handle case as you expect, but it will still not be able to
- comprehend non-ASCII characters properly. Additionally, it has
- a built-in lookup table of thousands separators that are incorrect
- on OS X/BSD (but is possible it is not complete... please file an
- issue if you see it is not complete)
- 2. Use "\*.ISO8859-1" locale (i.e. 'en_US.ISO8859-1') rather than
- "\*.UTF-8" locale. I have found that these have fewer issues than
- "UTF-8", but your mileage may vary.
-
-.. _PyICU: https://pypi.org/project/PyICU
+This page has been moved to the
+`natsort wiki <https://github.com/SethMMorton/natsort/wiki/Possible-Issues-with-natsort.humansorted-or-ns.LOCALE>`_.
diff --git a/docs/shell.rst b/docs/shell.rst
index 0d7d3c9..bf40874 100644
--- a/docs/shell.rst
+++ b/docs/shell.rst
@@ -6,153 +6,5 @@
Shell Script
============
-The ``natsort`` shell script is automatically installed when you install
-:mod:`natsort` with pip.
-
-Below is the usage and some usage examples for the ``natsort`` shell script.
-
-Usage
------
-
-.. code-block::
-
- usage: natsort [-h] [--version] [-p] [-f LOW HIGH] [-F LOW HIGH] [-e EXCLUDE]
- [-r] [-t {digit,int,float,version,ver}] [--nosign] [--noexp]
- [--locale]
- [entries [entries ...]]
-
- Performs a natural sort on entries given on the command-line.
- A natural sort sorts numerically then alphabetically, and will sort
- by numbers in the middle of an entry.
-
- positional arguments:
- entries The entries to sort. Taken from stdin if nothing is
- given on the command line.
-
- optional arguments:
- -h, --help show this help message and exit
- --version show program's version number and exit
- -p, --paths Interpret the input as file paths. This is not
- strictly necessary to sort all file paths, but in
- cases where there are OS-generated file paths like
- "Folder/" and "Folder (1)/", this option is needed to
- make the paths sorted in the order you expect
- ("Folder/" before "Folder (1)/").
- -f LOW HIGH, --filter LOW HIGH
- Used for keeping only the entries that have a number
- falling in the given range.
- -F LOW HIGH, --reverse-filter LOW HIGH
- Used for excluding the entries that have a number
- falling in the given range.
- -e EXCLUDE, --exclude EXCLUDE
- Used to exclude an entry that contains a specific
- number.
- -r, --reverse Returns in reversed order.
- -t {digit,int,float,version,ver,real,f,i,r,d},
- --number-type {digit,int,float,version,ver,real,f,i,r,d},
- --number_type {digit,int,float,version,ver,real,f,i,r,d}
- Choose the type of number to search for. "float" will
- search for floating-point numbers. "int" will only
- search for integers. "digit", "version", and "ver" are
- synonyms for "int"."real" is a shortcut for "float"
- with --sign. "i" and "d" are synonyms for "int", "f"
- is a synonym for "float", and "r" is a synonym for
- "real".The default is int.
- --nosign Do not consider "+" or "-" as part of a number, i.e.
- do not take sign into consideration. This is the
- default.
- -s, --sign Consider "+" or "-" as part of a number, i.e. take
- sign into consideration. The default is unsigned.
- --noexp Do not consider an exponential as part of a number,
- i.e. 1e4, would be considered as 1, "e", and 4, not as
- 10000. This only effects the --number-type=float.
- -l, --locale Causes natsort to use locale-aware sorting. You will
- get the best results if you install PyICU.
-
-Description
------------
-
-``natsort`` was originally written to aid in computational chemistry
-research so that it would be easy to analyze large sets of output files
-named after the parameter used:
-
-.. code-block:: console
-
- $ ls *.out
- mode1000.35.out mode1243.34.out mode744.43.out mode943.54.out
-
-(Obviously, in reality there would be more files, but you get the idea.) Notice
-that the shell sorts in lexicographical order. This is the behavior of programs like
-``find`` as well as ``ls``. The problem is passing these files to an
-analysis program causes them not to appear in numerical order, which can lead
-to bad analysis. To remedy this, use ``natsort``:
-
-.. code-block:: console
-
- $ natsort *.out
- mode744.43.out
- mode943.54.out
- mode1000.35.out
- mode1243.34.out
- $ natsort -t r *.out | xargs your_program
-
-``-t r`` is short for ``--number-type real``. You can also place natsort in
-the middle of a pipe:
-
-.. code-block:: console
-
- $ find . -name "*.out" | natsort -t r | xargs your_program
-
-To sort version numbers, use the default ``--number-type``:
-
-.. code-block:: console
-
- $ ls *
- prog-1.10.zip prog-1.9.zip prog-2.0.zip
- $ natsort *
- prog-1.9.zip
- prog-1.10.zip
- prog-2.0.zip
-
-In general, all ``natsort`` shell script options mirror the :func:`~natsorted`
-API, with notable exception of the ``--filter``, ``--reverse-filter``, and ``--exclude``
-options. These three options are used as follows:
-
-.. code-block:: console
-
- $ ls *.out
- mode1000.35.out mode1243.34.out mode744.43.out mode943.54.out
- $ natsort -t r *.out -f 900 1100 # Select only numbers between 900-1100
- mode943.54.out
- mode1000.35.out
- $ natsort -t r *.out -F 900 1100 # Select only numbers NOT between 900-1100
- mode744.43.out
- mode1243.34.out
- $ natsort -t r *.out -e 1000.35 # Exclude 1000.35 from search
- mode744.43.out
- mode943.54.out
- mode1243.34.out
-
-If you are sorting paths with OS-generated filenames, you may require the
-``--paths``/``-p`` option:
-
-.. code-block:: console
-
- $ find . ! -path . -type f
- ./folder/file (1).txt
- ./folder/file.txt
- ./folder (1)/file.txt
- ./folder (10)/file.txt
- ./folder (2)/file.txt
- $ find . ! -path . -type f | natsort
- ./folder (1)/file.txt
- ./folder (2)/file.txt
- ./folder (10)/file.txt
- ./folder/file (1).txt
- ./folder/file.txt
- $ find . ! -path . -type f | natsort -p
- ./folder/file.txt
- ./folder/file (1).txt
- ./folder (1)/file.txt
- ./folder (2)/file.txt
- ./folder (10)/file.txt
+This page has been moved to the
+`natsort wiki <https://github.com/SethMMorton/natsort/wiki/Shell-Script>`_.