summaryrefslogtreecommitdiff
path: root/docs/howitworks.rst
blob: d1c4559866313ea80388441fa1c92a3efc594091 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156
1157
1158
1159
1160
1161
1162
1163
1164
1165
1166
1167
1168
1169
1170
1171
1172
1173
1174
1175
1176
1177
.. default-domain:: py
.. currentmodule:: natsort

.. _howitworks:

How Does Natsort Work?
======================

.. contents::
    :local:

:mod:`natsort` works by breaking strings into smaller sub-components (numbers
or everything else), and returning these components in a tuple. Sorting
tuples in Python is well-defined, and this fact is used to sort the input
strings properly. But how does one break a string into sub-components?
And what does one do to those components once they are split? Below I
will explain the algorithm that was chosen for the :mod:`natsort` module,
and some of the thinking that went into those design decisions. I will
also mention some of the stumbling blocks I ran into because
`getting sorting right is surprisingly hard`_.

If you are impatient, you can skip to :ref:`tldr1` for the algorithm
in the simplest case, and :ref:`tldr2`
to see what extra code is needed to handle special cases.

First, How Does Natural Sorting Work At a High Level?
-----------------------------------------------------

If I want to compare '2 ft 7 in' to '2 ft 11 in', I might do the following

.. code-block:: pycon

    >>> '2 ft 7 in' < '2 ft 11 in'
    False

We as humans know that the above should be true, but why does Python think it
is false?  Here is how it is performing the comparison:

::

    '2' <=> '2' ==> equal, so keep going
    ' ' <=> ' ' ==> equal, so keep going
    'f' <=> 'f' ==> equal, so keep going
    't' <=> 't' ==> equal, so keep going
    ' ' <=> ' ' ==> equal, so keep going
    '7' <=> '1' ==> different, use result of '7' < '1'

'7' evaluates as greater than '1' so the statement is false. When sorting, if
a value is less than another it is placed first, so in our above example
'2 ft 11 in' would end up before '2 ft 7 in', which is not correct. What to do?

The best way to handle this is to break the string into sub-components
of numbers and non-numbers, and then convert the numeric parts into
:func:`float` or :func:`int` types. This will force Python to
actually understand the context of what it is sorting and then "do the
right thing." Luckily, it handles sorting lists of strings right
out-of-the-box, so the only hard part is actually making this string-to-list
transformation and then Python will handle the rest.

::

    '2 ft 7 in'  ==> (2, ' ft ', 7,  ' in')
    '2 ft 11 in' ==> (2, ' ft ', 11, ' in')

When Python compares the two, it roughly follows the below logic:

::

    2       <=> 2      ==> equal, so keep going
    ' ft '  <=> ' ft ' ==> a string is a special type of sequence - evaluate each character individually
                       ||
                       -->
                          ' ' <=> ' ' ==> equal, so keep going
                          'f' <=> 'f' ==> equal, so keep going
                          't' <=> 't' ==> equal, so keep going
                          ' ' <=> ' ' ==> equal, so keep going
                      <== Back to parent sequence
    7 <=> 11 ==> different, use the result of 7 < 11

Clearly, seven is less than eleven, so our comparison is as we expect, and we
would get the sorting order we wanted.

At its heart, :mod:`natsort` is simply a tool to break strings into tuples,
turning numbers in strings (i.e. ``'79'``) into *ints* and *floats* as it does this.

Natsort's Approach
------------------

.. contents::
    :local:

Decomposing Strings Into Sub-Components
+++++++++++++++++++++++++++++++++++++++

The first major hurtle to overcome is to decompose the string into
sub-components. Remarkably, this turns out to be the easy part, owing mostly
to Python's easy access to regular expressions. Breaking an arbitrary string
based on a pattern is pretty straightforward.

.. code-block:: pycon

    >>> import re
    >>> re.split(r'(\d+)', '2 ft 11 in')
    ['', '2', ' ft ', '11', ' in']

Clear (assuming you can read regular expressions) and concise.

The reason I began developing :mod:`natsort` in the first place was because I
needed to handle the natural sorting of strings containing *real numbers*, not
just unsigned integers as the above example contains. By real numbers, I mean
those like ``-45.4920E-23``. :mod:`natsort` can handle just about any number
definition; to that end, here are all the regular expressions used in
:mod:`natsort`:

.. code-block:: pycon

    >>> unsigned_int               = r'([0-9]+)'
    >>> signed_int                 = r'([-+]?[0-9]+)'
    >>> unsigned_float             = r'((?:[0-9]+\.?[0-9]*|\.[0-9]+)(?:[eE][-+]?[0-9]+)?)'
    >>> signed_float               = r'([-+]?(?:[0-9]+\.?[0-9]*|\.[0-9]+)(?:[eE][-+]?[0-9]+)?)'
    >>> unsigned_float_no_exponent = r'((?:[0-9]+\.?[0-9]*|\.[0-9]+))'
    >>> signed_float_no_exponent   = r'([-+]?(?:[0-9]+\.?[0-9]*|\.[0-9]+))'

Note that ``"inf"`` and ``"nan"`` are deliberately omitted from the float
definition because you wouldn't want (for example) ``"banana"`` to be converted
into ``['ba', 'nan', 'a']``, Let's see an example:

.. code-block:: pycon

    >>> re.split(signed_float, 'The mass of 3 electrons is 2.732815068E-30 kg')
    ['The mass of ', '3', ' electrons is ', '2.732815068E-30', ' kg']

.. note::

    It is a bit of a lie to say the above are the complete regular expressions. In the
    actual code there is also handling for non-ASCII unicode characters (such as ⑦),
    but I will ignore that aspect of :mod:`natsort` in this discussion.

Now, when the user wants to change the definition of a number, it is as easy as
changing the pattern supplied to the regular expression engine.

Choosing the right default is hard, though (well, in this case it shouldn't
have been but I was rather thick-headed). In retrospect, it should have been
obvious that since essentially all the code examples I had/have seen for
natural sorting were for *unsigned integers*, I should have made the default
definition of a number an *unsigned integer*. But, in the brash days of my
youth I assumed that since my use case was real numbers, everyone else would
be happier sorting by real numbers; so, I made the default definition of a
number a *signed float with exponent*. `This astonished`_ `a lot`_ `of people`_
(`and some people aren't very nice when they are astonished`_).
Starting with :mod:`natsort` version 4.0.0 the default number definition was
changed to an *unsigned integer* which satisfies the "least astonishment"
principle, and I have not heard a complaint since.

Coercing Strings Containing Numbers Into Numbers
++++++++++++++++++++++++++++++++++++++++++++++++

There has been some debate on Stack Overflow as to what method is best to
coerce a string to a number if it can be coerced, and leaving it alone otherwise
(see `this one for coercion`_ and `this one for checking`_ for some high traffic questions),
but it mostly boils down to two different solutions, shown here:

.. code-block:: pycon

    >>> def coerce_try_except(x):
    ...     try:
    ...         return int(x)
    ...     except ValueError:
    ...         return x
    ...
    >>> def coerce_regex(x):
    ...     # Note that precompiling the regex is more performant,
    ...     # but I do not show that here for clarity's sake.
    ...     return int(x) if re.match(r'[-+]?\d+$', x) else x
    ...

Here are some timing results run on my machine:

.. code-block:: pycon

    In [0]: numbers = list(map(str, range(100)))  # A list of numbers as strings

    In [1]: not_numbers = ['banana' + x for x in numbers]

    In [2]: %timeit [coerce_try_except(x) for x in numbers]
    10000 loops, best of 3: 51.1 µs per loop

    In [3]: %timeit [coerce_try_except(x) for x in not_numbers]
    1000 loops, best of 3: 289 µs per loop

    In [4]: %timeit [coerce_regex(x) for x in not_numbers]
    10000 loops, best of 3: 67.6 µs per loop

    In [5]: %timeit [coerce_regex(x) for x in numbers]
    10000 loops, best of 3: 123 µs per loop

What can we learn from this? The ``try: except`` method (arguably the most
"pythonic" of the solutions) is best for numeric input, but performs over 5X
slower for non-numeric input. Conversely, the regular expression method, though
slower than ``try: except`` for both input types, is more efficient for
non-numeric input than for input that can be converted to an ``int``. Further,
even though the regular expression method is slower for both input types, it is
always at least twice as fast as the worst case for the ``try: except``.

Why do I care? Shouldn't I just pick a method and not worry about it? Probably.
However, I am very conscious about the performance of :mod:`natsort`, and want
it to be a true drop-in replacement for :func:`sorted` without having to incur
a performance penalty. For the purposes of :mod:`natsort`, there is no clear
winner between the two algorithms - the data being passed to this function will
likely be a mix of numeric and non-numeric string content. Do I use the
``try: except`` method and hope the speed gains on numbers will offset the
non-number performance, or do I use regular expressions and take the more
stable performance?

It turns out that within the context of :mod:`natsort`, some assumptions can be
made that make a hybrid approach attractive. Because all strings are pre-split
into numeric and non-numeric content *before* being passed to this coercion
function, the assumption can be made that *if a string begins with a digit or a
sign, it can be coerced into a number*.

.. code-block:: pycon

    >>> def coerce_to_int(x):
    ...     if x[0] in '0123456789+-':
    ...         try:
    ...             return int(x)
    ...         except ValueError:
    ...             return x
    ...     else:
    ...         return x
    ...

So how does this perform compared to the standard coercion methods?

.. code-block:: pycon

    In [6]: %timeit [coerce_to_int(x) for x in numbers]
    10000 loops, best of 3: 71.6 µs per loop

    In [7]: %timeit [coerce_to_int(x) for x in not_numbers]
    10000 loops, best of 3: 26.4 µs per loop

The hybrid method eliminates most of the time wasted on numbers checking
that it is in fact a number before passing to :func:`int`, and eliminates
the time wasted in the exception stack for input that is not a number.

That's as fast as we can get, right? In pure Python, probably. At least, it's
close. But because I am crazy and a glutton for punishment, I decided to see
if I could get any faster writing a C extension. It's called
`fastnumbers`_ and contains a C implementation of the above coercion functions
called :func:`fast_int`. How does it fair? Pretty well.

.. code-block:: pycon

    In [8]: %timeit [fast_int(x) for x in numbers]
    10000 loops, best of 3: 30.9 µs per loop

    In [9]: %timeit [fast_int(x) for x in not_numbers]
    10000 loops, best of 3: 30 µs per loop

During development of :mod:`natsort`, I wanted to ensure that using it did not
get in the way of a user's program by introducing a performance penalty to
their code. To that end, I do not feel like my adventures down the rabbit hole
of optimization of coercion functions was a waste; I can confidently look users
in the eye and say I considered every option in ensuring :mod:`natsort` is as
efficient as possible. This is why if `fastnumbers`_ is installed it will be
used for this step, and otherwise the hybrid method will be used.

.. note::

    Modifying the hybrid coercion function for floats is straightforward.

    .. code-block:: pycon

        >>> def coerce_to_float(x):
        ...     if x[0] in '.0123456789+-' or x.lower().lstrip()[:3] in ('nan', 'inf'):
        ...         try:
        ...             return float(x)
        ...         except ValueError:
        ...             return x
        ...     else:
        ...         return x
        ...

.. _tldr1:

TL;DR 1 - The Simple "No Special Cases" Algorithm
+++++++++++++++++++++++++++++++++++++++++++++++++

At this point, our :mod:`natsort` algorithm is essentially the following:

.. code-block:: pycon

    >>> import re
    >>> def natsort_key(x, as_float=False, signed=False):
    ...     if as_float:
    ...         regex = signed_float if signed else unsigned_float
    ...     else:
    ...         regex = signed_int if signed else unsigned_int
    ...     split_input = re.split(regex, x)
    ...     split_input = filter(None, split_input)  # removes null strings
    ...     coerce = coerce_to_float if as_float else coerce_to_int
    ...     return tuple(coerce(s) for s in split_input)
    ...

I have written the above for clarity and not performance.
This pretty much matches `most natural sort solutions for python on Stack Overflow`_
(except the above includes customization of the definition of a number).

Special Cases Everywhere!
-------------------------

.. contents::
    :local:

.. image:: special_cases_everywhere.jpg

If what I described in :ref:`TL;DR 1 <tldr1>` were
all that :mod:`natsort` needed to
do then there probably wouldn't be much need for a third-party module, right?
Probably. But it turns out that in real-world data there are a lot of
special cases that need to be handled, and in true `80%/20%`_ fashion, the
majority of the code in :mod:`natsort` is devoted to handling special cases
like those described below.

Sorting Filesystem Paths
++++++++++++++++++++++++

`The first major special case I encountered was sorting filesystem paths`_
(if you go to the link, you will see I didn't handle it well for a year...
this was before I fully realized how much functionality I could really add
to :mod:`natsort`). Let's apply the :func:`natsort_key` from above to some
filesystem paths that you might see being auto-generated from your operating
system:

.. code-block:: pycon

    >>> paths = ['Folder (10)/file.tar.gz',
    ...          'Folder/file.tar.gz',
    ...          'Folder (1)/file (1).tar.gz',
    ...          'Folder (1)/file.tar.gz']
    >>> sorted(paths, key=natsort_key)
    ['Folder (1)/file (1).tar.gz', 'Folder (1)/file.tar.gz', 'Folder (10)/file.tar.gz', 'Folder/file.tar.gz']

Well that's not right! What is ``'Folder/file.tar.gz'`` doing at the end?
It has to do with the numerical ASCII code assigned to the space and
``/`` characters in the `ASCII table`_. According to the `ASCII table`_, the
space character (number 32) comes before the ``/`` character (number 47). If
we remove the common prefix in all of the above strings (``'Folder'``), we
can see why this happens:

.. code-block:: pycon

    >>> ' (1)/file.tar.gz' < '/file.tar.gz'
    True
    >>> ' ' < '/'
    True

This isn't very convenient... how do we solve it? We can split the path
across the path separators and then sort. A convenient way do to this is
with the :data:`Path.parts <pathlib.PurePath.parts>` property from
:mod:`pathlib`:

.. code-block:: pycon

    >>> import pathlib
    >>> sorted(paths, key=lambda x: tuple(natsort_key(s) for s in pathlib.Path(x).parts))
    ['Folder/file.tar.gz', 'Folder (1)/file (1).tar.gz', 'Folder (1)/file.tar.gz', 'Folder (10)/file.tar.gz']

Almost! It seems like there is some funny business going on in the final
filename component as well. We can solve that nicely and quickly with
:data:`Path.suffixes <pathlib.PurePath.suffixes>` and :data:`Path.stem
<pathlib.PurePath.stem>`.

.. code-block:: pycon

    >>> def decompose_path_into_components(x):
    ...     path_split = list(pathlib.Path(x).parts)
    ...     # Remove the final filename component from the path.
    ...     final_component = pathlib.Path(path_split.pop())
    ...     # Split off all the extensions.
    ...     suffixes = final_component.suffixes
    ...     stem = final_component.name.replace(''.join(suffixes), '')
    ...     # Remove the '.' prefix of each extension, and make that
    ...     # final component a list of the stem and each suffix.
    ...     final_component = [stem] + [x[1:] for x in suffixes]
    ...     # Replace the split final filename component.
    ...     path_split.extend(final_component)
    ...     return path_split
    ...
    >>> def natsort_key_with_path_support(x):
    ...     return tuple(natsort_key(s) for s in decompose_path_into_components(x))
    ...
    >>> sorted(paths, key=natsort_key_with_path_support)
    ['Folder/file.tar.gz', 'Folder (1)/file.tar.gz', 'Folder (1)/file (1).tar.gz', 'Folder (10)/file.tar.gz']

This works because in addition to breaking the input by path separators,
the final filename component is separated from its extensions as well.
*Then*, each of these separated components is sent to the
:mod:`natsort` algorithm, so the result is a tuple of tuples. Once that
is done, we can see how comparisons can be done in the expected manner.

.. code-block:: pycon

    >>> a = natsort_key_with_path_support('Folder (1)/file (1).tar.gz')
    >>> a
    (('Folder (', 1, ')'), ('file (', 1, ')'), ('tar',), ('gz',))
    >>>
    >>> b = natsort_key_with_path_support('Folder/file.tar.gz')
    >>> b
    (('Folder',), ('file',), ('tar',), ('gz',))
    >>>
    >>> a > b
    True

.. note::

    The actual :meth:`decompose_path_into_components`-equivalent function in
    :mod:`natsort` actually has a few more heuristics than shown here so that
    it is not over-zealous in what it defines as a path suffix, but this has
    been omitted in this how-to for clarity.

Comparing Different Types
+++++++++++++++++++++++++

`The second major special case I encountered was sorting of different types`_.
On Python 2 (i.e. legacy Python), this mostly didnt't matter *too*
much since it uses an arbitrary heuristic to allow traditionally un-comparable
types to be compared (such as comparing ``'a'`` to ``1``). However, on Python 3
(i.e. Python) it simply won't let you perform such nonsense, raising a
:exc:`TypeError` instead.

You can imagine that a module that breaks strings into tuples of numbers and
strings is walking a dangerous line if it does not have special handling for
comparing numbers and strings. My imagination was not so great at first.
Let's take a look at all the ways this can fail with real-world data.

.. code-block:: pycon

    >>> def natsort_key_with_poor_real_number_support(x):
    ...     split_input = re.split(signed_float, x)
    ...     split_input = filter(None, split_input)  # removes null strings
    ...     return tuple(coerce_to_float(s) for s in split_input)
    >>>
    >>> sorted([5, '4'], key=natsort_key_with_poor_real_number_support)
    Traceback (most recent call last):
        ...
    TypeError: ...
    >>>
    >>> sorted(['12 apples', 'apples'], key=natsort_key_with_poor_real_number_support)
    Traceback (most recent call last):
        ...
    TypeError: ...
    >>>
    >>> sorted(['version5.3.0', 'version5.3rc1'], key=natsort_key_with_poor_real_number_support)
    Traceback (most recent call last):
        ...
    TypeError: ...

Let's break these down.

#. The integer ``5`` is sent to ``re.split`` which expects only strings
   or bytes, which is a no-no.
#. ``natsort_key_with_poor_real_number_support('12 apples') < natsort_key_with_poor_real_number_support('apples')``
   is the same as ``(12.0, ' apples') < ('apples',)``, and thus a number gets
   compared to a string [#f1]_ which also is a no-no.
#. This one scores big on the astonishment scale, especially if one
   accidentally uses signed integers or real numbers when they mean
   to use unsigned integers.
   ``natsort_key_with_poor_real_number_support('version5.3.0') < natsort_key_with_poor_real_number_support('version5.3rc1')``
   is the same as ``('version', 5.3, 0.0) < ('version', 5.3, 'rc', 1.0)``,
   so in the third element a number gets compared to a string, once again
   the same old no-no. (The same would happen with ``'version5-3'`` and
   ``'version5-a'``, which would become ``('version', 5, -3)`` and
   ``('version', 5, '-a')``).

As you might expect, the solution to the first issue is to wrap the
``re.split`` call in a ``try: except:`` block and handle the number specially
if a :exc:`TypeError` is raised. The second and third cases *could* be handled
in a "special case" manner, meaning only respond and do something different
if these problems are detected. But a less error-prone method is to ensure
that the data is correct-by-construction, and this can be done by ensuring
that the returned tuples *always* start with a string, and then alternate
in a string-number-string-number-string pattern; this can be achieved by
adding an empty string wherever the pattern is not followed [#f2]_. This ends
up working out pretty nicely because empty strings are always "less" than
any non-empty string, and we typically want numbers to come before strings.

Let's take a look at how this works out.

.. code-block:: pycon

    >>> from natsort.utils import sep_inserter
    >>> list(sep_inserter(iter(['apples']), ''))
    ['apples']
    >>>
    >>> list(sep_inserter(iter([12, ' apples']), ''))
    ['', 12, ' apples']
    >>>
    >>> list(sep_inserter(iter(['version', 5, -3]), ''))
    ['version', 5, '', -3]
    >>>
    >>> from natsort import natsort_keygen, ns
    >>> natsort_key_with_good_real_number_support = natsort_keygen(alg=ns.REAL)
    >>>
    >>> sorted([5, '4'], key=natsort_key_with_good_real_number_support)
    ['4', 5]
    >>>
    >>> sorted(['12 apples', 'apples'], key=natsort_key_with_good_real_number_support)
    ['12 apples', 'apples']
    >>>
    >>> sorted(['version5.3.0', 'version5.3rc1'], key=natsort_key_with_good_real_number_support)
    ['version5.3.0', 'version5.3rc1']

How the "good" version works will be given in
`TL;DR 2 - Handling Crappy, Real-World Input`_.

Handling NaN
++++++++++++

`A rather unexpected special case I encountered was sorting collections containing NaN`_.
Let's see what happens when you try to sort a plain old list of numbers when there
is a **NaN** floating around in there.

.. code-block:: pycon

    >>> danger = [7, float('nan'), 22.7, 19, -14, 59.123, 4]
    >>> sorted(danger)
    [7, nan, -14, 4, 19, 22.7, 59.123]

Clearly that isn't correct, and for once it isn't my fault!
`It's hard to compare floating point numbers`_. By definition, **NaN** is unorderable
to any other number, and is never equal to any other number, including itself.

.. code-block:: pycon

    >>> nan = float('nan')
    >>> 5 > nan
    False
    >>> 5 < nan
    False
    >>> 5 == nan
    False
    >>> 5 != nan
    True
    >>> nan == nan
    False
    >>> nan != nan
    True

The implication of all this for us is that if there is an **NaN** in the
data-set we are trying to sort, the data-set will end up being sorted in
two separate yet individually sorted sequences - the one *before* the **NaN**,
and the one *after*. This is because the ``<`` operation that is used
to sort always returns :const:`False` with **NaN**.

Because :mod:`natsort` aims to sort sequences in a way that does not surprise
the user, keeping this behavior is not acceptable (I don't require my users
to know how **NaN** will behave in a sorting algorithm). The simplest way to
satisfy the "least astonishment" principle is to substitute **NaN** with
some other value. But what value is *least* astonishing? I chose to replace
**NaN** with :math:`-\infty` so that these poorly behaved elements always
end up at the front where the users will most likely be alerted to their
presence.

.. code-block:: pycon

    >>> def fix_nan(x):
    ...     if x != x:  # only true for NaN
    ...         return float('-inf')
    ...     else:
    ...         return x
    ...

Let's check out :ref:`TL;DR 2 <tldr2>` to see how this can be
incorporated into the simple key function from :ref:`TL;DR 1 <tldr1>`.

.. _tldr2:

TL;DR 2 - Handling Crappy, Real-World Input
+++++++++++++++++++++++++++++++++++++++++++

Let's see how our elegant key function from :ref:`TL;DR 1 <tldr1>` has
become bastardized in order to support handling mixed real-world data
and user customizations.

.. code-block:: pycon

    >>> def natsort_key(x, as_float=False, signed=False, as_path=False):
    ...     if as_float:
    ...         regex = signed_float if signed else unsigned_float
    ...     else:
    ...         regex = signed_int if signed else unsigned_int
    ...     try:
    ...         if as_path:
    ...             x = decompose_path_into_components(x)  # Decomposes into list of strings
    ...         # If this raises a TypeError, input is not a string.
    ...         split_input = re.split(regex, x)
    ...     except TypeError:
    ...         try:
    ...             # Does this need to be applied recursively (list-of-list)?
    ...             return tuple(map(natsort_key, x))
    ...         except TypeError:
    ...             # Must be a number
    ...             ret = ('', fix_nan(x))  # Maintain string-number-string pattern
    ...             return (ret,) if as_path else ret  # as_path returns tuple-of-tuples
    ...     else:
    ...         split_input = filter(None, split_input)  # removes null strings
    ...         # Note that the coerce_to_int/coerce_to_float functions
    ...         # are also modified to use the fix_nan function.
    ...         if as_float:
    ...             coerced_input = (coerce_to_float(s) for s in split_input)
    ...         else:
    ...             coerced_input = (coerce_to_int(s) for s in split_input)
    ...         return tuple(sep_inserter(coerced_input, ''))
    ...

And this doesn't even show handling :class:`bytes` type! Notice that we have
to do non-obvious things like modify the return form of numbers when ``as_path``
is given, just to avoid comparing strings and numbers for the case in which a
user provides input like ``['/home/me', 42]``.

Let's take it out for a spin!

.. code-block:: pycon

    >>> danger = [7, float('nan'), 22.7, '19', '-14', '59.123', 4]
    >>> sorted(danger, key=lambda x: natsort_key(x, as_float=True, signed=True))
    [nan, '-14', 4, 7, '19', 22.7, '59.123']
    >>>
    >>> paths = ['Folder (1)/file.tar.gz',
    ...          'Folder/file.tar.gz',
    ...          123456]
    >>> sorted(paths, key=lambda x: natsort_key(x, as_path=True))
    [123456, 'Folder/file.tar.gz', 'Folder (1)/file.tar.gz']

Here Be Dragons: Adding Locale Support
--------------------------------------

.. contents::
    :local:

Probably the most challenging special case I had to handle was getting
:mod:`natsort` to handle sorting the non-numerical parts of input
correctly, and also allowing it to sort the numerical bits in different
locales. This was in no way what I originally set out to do with this
library, so I was
`caught a bit off guard when the request was initially made`_.
I discovered the :mod:`locale` library, and assumed that if it's part of
Python's StdLib there can't be too many dragons, right?

.. admonition:: INCOMPLETE LIST OF DRAGONS

    - https://github.com/SethMMorton/natsort/issues/21
    - https://github.com/SethMMorton/natsort/issues/22
    - https://github.com/SethMMorton/natsort/issues/23
    - https://github.com/SethMMorton/natsort/issues/36
    - https://github.com/SethMMorton/natsort/issues/44
    - https://bugs.python.org/issue2481
    - https://bugs.python.org/issue23195
    - https://stackoverflow.com/questions/3412933/python-not-sorting-unicode-properly-strcoll-doesnt-help
    - https://stackoverflow.com/questions/22203550/sort-dictionary-by-key-using-locale-collation
    - https://stackoverflow.com/questions/33459384/unicode-character-not-in-range-when-calling-locale-strxfrm
    - https://stackoverflow.com/questions/36431810/sort-numeric-lines-with-thousand-separators
    - https://stackoverflow.com/questions/45734562/how-can-i-get-a-reasonable-string-sorting-with-python

These can be summed up as follows:

#. :mod:`locale` is a thin wrapper over your operating system's *locale*
   library, so if *that* is broken (like it is on BSD and OSX) then
   :mod:`locale` is broken in Python.
#. Because of a bug in legacy Python (i.e. Python 2), there was no uniform
   way to use the :mod:`locale` sorting functionality between legacy Python
   and Python (luckily this is no longer an issue now that Python 2 is EOL).
#. People have differing opinions of how capitalization should affect word
   order.
#. There is no built-in way to handle locale-dependent thousands separators
   and decimal points *robustly*.
#. Proper handling of Unicode is complicated.
#. Proper handling of :mod:`locale` is complicated.

Easily over half of the code in :mod:`natsort` is in some way dealing with some
aspect of :mod:`locale` or basic case handling. It would have been impossible
to get right without a `really good`_ `testing strategy`_.

Don't expect any more TL;DR's... if you want to see how all this is fully
incorporated into the :mod:`natsort` algorithm then please take a look
`at the code`_.  However, I will hint at how specific steps are taken in
each section.

Let's see how we can handle some of the dragons, one-by-one.

Basic Case Control Support
++++++++++++++++++++++++++

Without even thinking about the mess that is adding :mod:`locale` support,
:mod:`natsort` can introduce support for controlling how case is interpreted.

First, let's take a look at how it is sorted by default (due to
where characters lie on the `ASCII table`_).

.. code-block:: pycon

    >>> a = ['Apple', 'corn', 'Corn', 'Banana', 'apple', 'banana']
    >>> sorted(a)
    ['Apple', 'Banana', 'Corn', 'apple', 'banana', 'corn']

All uppercase letters come before lowercase letters in the `ASCII table`_,
so all capitalized words appear first. Not everyone agrees that this
is the correct order. Some believe that the capitalized words should
be last (``['apple', 'banana', 'corn', 'Apple', 'Banana', 'Corn']``).
Some believe that both the lowercase and uppercase versions
should appear together
(``['Apple', 'apple', 'Banana', 'banana', 'Corn', 'corn']``).
Some believe that both should be true ☹. Some people don't care at all [#f3]_.

Solving the first case (I call it *LOWERCASEFIRST*) is actually pretty
easy... just call the :meth:`str.swapcase` method on the input.

.. code-block:: pycon

    >>> sorted(a, key=lambda x: x.swapcase())
    ['apple', 'banana', 'corn', 'Apple', 'Banana', 'Corn']

The last (i call it *IGNORECASE*) is pretty easy.
Simply call :meth:`str.casefold` on the input (it's like :meth:`std.lowercase`
but does a better job on non-latin character sets).

.. code-block:: pycon

    >>> sorted(a, key=lambda x: x.casefold())
    ['Apple', 'apple', 'Banana', 'banana', 'corn', 'Corn']

The middle case (I call it *GROUPLETTERS*) is less straightforward.
The most efficient way to handle this is to duplicate each character
with its lowercase version and then the original character.

.. code-block:: pycon

    >>> import itertools
    >>> def groupletters(x):
    ...     return ''.join(itertools.chain.from_iterable((y.casefold(), y) for y in x))
    ...
    >>> groupletters('Apple')
    'aAppppllee'
    >>> groupletters('apple')
    'aappppllee'
    >>> sorted(a, key=groupletters)
    ['Apple', 'apple', 'Banana', 'banana', 'Corn', 'corn']

The effect of this is that both ``'Apple'`` and ``'apple'`` are
placed adjacent to each other because their transformations both begin
with ``'a'``, and then the second character can be used to order them
appropriately with respect to each other.

There's a problem with this, though. Within the context of :mod:`natsort`
we are trying to correctly sort numbers and those should be left alone.

.. code-block:: pycon

    >>> a = ['Apple5', 'apple', 'Apple4E10', 'Banana']
    >>> sorted(a, key=lambda x: natsort_key(x, as_float=True))
    ['Apple5', 'Apple4E10', 'Banana', 'apple']
    >>> sorted(a, key=lambda x: natsort_key(groupletters(x), as_float=True))
    ['Apple4E10', 'Apple5', 'apple', 'Banana']
    >>> groupletters('Apple4E10')
    'aAppppllee44eE1100'

We messed up the numbers! Looks like :func:`groupletters` needs to be applied
*after* the strings are broken into their components. I'm not going to show
how this is done here, but basically it requires applying the function in
the ``else:`` block of :func:`coerce_to_int`/:func:`coerce_to_float`.

.. code-block:: pycon

    >>> better_groupletters = natsort_keygen(alg=ns.GROUPLETTERS | ns.REAL)
    >>> better_groupletters('Apple4E10')
    ('aAppppllee', 40000000000.0)
    >>> sorted(a, key=better_groupletters)
    ['Apple5', 'Apple4E10', 'apple', 'Banana']

Of course, applying both *LOWERCASEFIRST* and *GROUPLETTERS* is just
a matter of turning on both functions.

Basic Unicode Support
+++++++++++++++++++++

Unicode is hard and complicated. Here's an example.

.. code-block:: pycon

    >>> b = [b'\x66', b'\x65', b'\xc3\xa9', b'\x65\xcc\x81', b'\x61', b'\x7a']
    >>> a = [x.decode('utf8') for x in b]
    >>> a  # doctest: +SKIP
    ['f', 'e', 'é', 'é', 'a', 'z']
    >>> sorted(a)  # doctest: +SKIP
    ['a', 'e', 'é', 'f', 'z', 'é']

There are more than one way to represent the character 'é' in Unicode.
In fact, many characters have multiple representations. This is a challenge
because comparing the two representations would return ``False`` even though
they *look* the same.

.. code-block:: pycon

    >>> a[2] == a[3]
    False

Alas, since characters are compared based on the numerical value of their
representation, sorting Unicode often gives unexpected results (like seeing
'é' come both *before* and *after* 'z').

The original approach that :mod:`natsort` took with respect to non-ASCII
Unicode characters was to say "just use
the :mod:`locale` or :mod:`PyICU` library" and then cross it's fingers
and hope those libraries take care of it. As you will find in the following
sections, that comes with its own baggage, and turned out to not always work
anyway (see https://stackoverflow.com/q/45734562/1399279). A more robust
approach is to handle the Unicode out-of-the-box without invoking a
heavy-handed library like :mod:`locale` or :mod:`PyICU`.
To do this, we must use *normalization*.

To fully understand Unicode normalization,
`check out some official Unicode documentation`_.
Just kidding... that's too much text. The following StackOverflow answers do
a good job at explaining Unicode normalization in simple terms:
https://stackoverflow.com/a/7934397/1399279 and
https://stackoverflow.com/a/7931547/1399279. Put simply, normalization
ensures that Unicode characters with multiple representations are in
some canonical and consistent representation so that (for example) comparisons
of the characters can be performed in a sane way. The following discussion
assumes you at least read the StackOverflow answers.

Looking back at our 'é' example, we can see that the two versions were
constructed with the byte strings ``b'\xc3\xa9'`` and ``b'\x65\xcc\x81'``.
The former representation is actually
`LATIN SMALL LETTER E WITH ACUTE <https://www.fileformat.info/info/unicode/char/e9/index.htm>`_
and is a single character in the Unicode standard. This is known as the
*compressed form* and corresponds to the 'NFC' normalization scheme.
The latter representation is actually the letter 'e' followed by
`COMBINING ACUTE ACCENT <https://www.fileformat.info/info/unicode/char/0301/index.htm>`_
and so is two characters in the Unicode standard. This is known as the
*decompressed form* and corresponds to the 'NFD' normalization scheme.
Since the first character in the decompressed form is actually the letter 'e',
when compared to other ASCII characters it fits where you might expect.
Unfortunately, all Unicode compressed form characters come after the
ASCII characters and so they always will be placed after 'z' when sorting.

It seems that most Unicode data is stored and shared in the compressed form
which makes it challenging to sort. This can be solved by normalizing all
incoming Unicode data to the decompressed form ('NFD') and *then* sorting.

.. code-block:: pycon

    >>> import unicodedata
    >>> c = [unicodedata.normalize('NFD', x) for x in a]
    >>> c  # doctest: +SKIP
    ['f', 'e', 'é', 'é', 'a', 'z']
    >>> sorted(c)  # doctest: +SKIP
    ['a', 'e', 'é', 'é', 'f', 'z']

Huzzah! Sane sorting without having to resort to :mod:`locale`!

Using Locale to Compare Strings
+++++++++++++++++++++++++++++++

The :mod:`locale` module is actually pretty cool, and provides lowly
spare-time programmers like myself a way to handle the daunting task
of proper locale-dependent support of their libraries and utilities.
Having said that, it can be a bit of a bear to get right,
`although they do point out in the documentation that it will be painful to use`_.
Aside from the caveats spelled out in that link, it turns out that just
comparing strings with :mod:`locale` in a cross-platform and
cross-python-version manner is not as straightforward as one might hope.

First, how to use :mod:`locale` to compare strings? It's actually
pretty straightforward. Simply run the input through the :mod:`locale`
transformation function :func:`locale.strxfrm`.

.. code-block:: pycon

    >>> import locale, sys
    >>> locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
    'en_US.UTF-8'
    >>> a = ['a', 'b', 'ä']
    >>> sorted(a)
    ['a', 'b', 'ä']
    >>> # The below fails on OSX, so don't run doctest on darwin.
    >>> is_osx = sys.platform == 'darwin'
    >>> sorted(a, key=locale.strxfrm) if not is_osx else ['a', 'ä', 'b']
    ['a', 'ä', 'b']
    >>>
    >>> a = ['apple', 'Banana', 'banana', 'Apple']
    >>> sorted(a, key=locale.strxfrm) if not is_osx else ['apple', 'Apple', 'banana', 'Banana']
    ['apple', 'Apple', 'banana', 'Banana']

It turns out that locale-aware sorting groups numbers in the same
way as turning on *GROUPLETTERS* and *LOWERCASEFIRST*.
The trick is that you have to apply :func:`locale.strxfrm` only to non-numeric
characters; otherwise, numbers won't be parsed properly. Therefore, it must
be applied as part of the :func:`coerce_to_int`/:func:`coerce_to_float`
functions in a manner similar to :func:`groupletters`.

Unicode Support With Local
++++++++++++++++++++++++++

Remember how in the `Basic Unicode Support`_ section I mentioned that we
use the "decompressed" Unicode normalization form (e.g. NFD) on all inputs
to ensure the order is as expected?

If you have been following along so far, you probably expect that it is not
that easy. You would be correct.

It turns out that some locales (but not all) expect the input to be in
"compressed form" (e.g. NFC) or the ordering is not as you might expect.
`Check out this issue for a real-world example`_. Here's a relevant
snippet of code

.. code-block:: pycon

    In [1]: import locale, unicodedata

    In [2]: a = ['Aš', 'Cheb', 'Česko', 'Cibulov', 'Znojmo', 'Žilina']

    In [3]: locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
    Out[3]: 'en_US.UTF-8'

    In [4]: sorted(a, key=locale.strxfrm)
    Out[4]: ['Aš', 'Česko', 'Cheb', 'Cibulov', 'Žilina', 'Znojmo']

    In [5]: sorted(a, key=lambda x: locale.strxfrm(unicodedata.normalize("NFD", x)))
    Out[5]: ['Aš', 'Česko', 'Cheb', 'Cibulov', 'Žilina', 'Znojmo']

    In [6]: sorted(a, key=lambda x: locale.strxfrm(unicodedata.normalize("NFC", x)))
    Out[6]: ['Aš', 'Česko', 'Cheb', 'Cibulov', 'Žilina', 'Znojmo']

    In [7]: locale.setlocale(locale.LC_ALL, 'de_DE.UTF-8')
    Out[7]: 'de_DE.UTF-8'

    In [8]: sorted(a, key=locale.strxfrm)
    Out[8]: ['Aš', 'Česko', 'Cheb', 'Cibulov', 'Žilina', 'Znojmo']

    In [9]: sorted(a, key=lambda x: locale.strxfrm(unicodedata.normalize("NFD", x)))
    Out[9]: ['Aš', 'Česko', 'Cheb', 'Cibulov', 'Žilina', 'Znojmo']

    In [10]: sorted(a, key=lambda x: locale.strxfrm(unicodedata.normalize("NFC", x)))
    Out[10]: ['Aš', 'Česko', 'Cheb', 'Cibulov', 'Žilina', 'Znojmo']

    In [11]: locale.setlocale(locale.LC_ALL, 'cs_CZ.UTF-8')
    Out[11]: 'cs_CZ.UTF-8'

    In [12]: sorted(a, key=locale.strxfrm)
    Out[12]: ['Aš', 'Cibulov', 'Česko', 'Cheb', 'Znojmo', 'Žilina']

    In [13]: sorted(a, key=lambda x: locale.strxfrm(unicodedata.normalize("NFD", x)))
    Out[13]: ['Aš', 'Česko', 'Cibulov', 'Cheb', 'Žilina', 'Znojmo']

    In [14]: sorted(a, key=lambda x: locale.strxfrm(unicodedata.normalize("NFC", x)))
    Out[14]: ['Aš', 'Cibulov', 'Česko', 'Cheb', 'Znojmo', 'Žilina']

Two out of three locales sort the same data in the same order no matter how the unicode
input was normalized, but Czech seems to care how the input is formatted!

So, everthing mentioned in `Basic Unicode Support`_ is conditional on whether
or not the user wants to use the :mod:`locale` library or not. If not, then
"NFD" normalization is used. If they do, "NFC" normalization is used.

Handling Broken Locale On OSX
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

But what if the underlying *locale* implementation that :mod:`locale`
relies upon is simply broken? It turns out that the *locale* library on
OSX (and other BSD systems) is broken (and for some reason has never been
fixed?), and so :mod:`locale` does not work as expected.

How do I define doesn't work as expected?

.. code-block:: pycon

    >>> a = ['apple', 'Banana', 'banana', 'Apple']
    >>> sorted(a)
    ['Apple', 'Banana', 'apple', 'banana']
    >>>
    >>> sorted(a, key=locale.strxfrm) if is_osx else sorted(a)
    ['Apple', 'Banana', 'apple', 'banana']

IT'S SORTING AS IF :func:`locale.stfxfrm` WAS NEVER USED!! (and it's worse
once non-ASCII characters get thrown into the mix.) I'm really not
sure why this is considered OK for the OSX/BSD maintainers to not fix,
but it's more than frustrating for poor developers who have been dragged
into the *locale* game kicking and screaming. *<deep breath>*.

So, how to deal with this situation? There are two ways to do so.

#.  Detect if :mod:`locale` is sorting incorrectly (i.e. ``dumb``) by seeing
    if ``'A'`` is sorted before ``'a'`` (incorrect) or not.

    .. code-block:: pycon

        >>> # This is genuinely the name of this function.
        >>> # See natsort.compat.locale.py
        >>> def dumb_sort():
        ...     return locale.strxfrm('A') < locale.strxfrm('a')
        ...

    If a ``dumb`` *locale* implementation is found, then automatically
    turn on *LOWERCASEFIRST* and *GROUPLETTERS*.
#.  Use an alternate library if installed. `ICU <http://site.icu-project.org/>`_
    is a great and powerful library that has a pretty decent Python port
    called (you guessed it) `PyICU <https://pypi.org/project/PyICU/>`_.
    If a user has this library installed on their computer, :mod:`natsort`
    chooses to use that instead of :mod:`locale`. With a little bit of
    planning, one can write a set of wrapper functions that call
    the correct library under the hood such that the business logic never
    has to know what library is being used (see `natsort.compat.locale.py`_).

Let me tell you, this little complication really makes a challenge of testing
the code, since one must set up different environments on different operating
systems in order to test all possible code paths. Not to mention that
certain checks *will* fail for certain operating systems and environments
so one must be diligent in either writing the tests not to fail, or ignoring
those tests when on offending environments.

Handling Locale-Aware Numbers
+++++++++++++++++++++++++++++

`Thousands separator support`_ is a problem that I knew would someday be
requested but had decided to push off until a rainy day. One day it finally
rained, and I decided to tackle the problem.

So what is the problem? Consider the number ``1,234,567`` (assuming the
``','`` is the thousands separator). Try to run that through :func:`int`
and you will get a :exc:`ValueError`. To handle this properly the thousands
separators must be removed.

.. code-block:: pycon

    >>> float('1,234,567'.replace(',', ''))
    1234567.0

What if, in our current locale, the thousands separator is ``'.'`` and
the ``','`` is the decimal separator (like for the German locale *de_DE*)?

.. code-block:: pycon

    >>> float('1.234.567'.replace('.', '').replace(',', '.'))
    1234567.0
    >>> float('1.234.567,89'.replace('.', '').replace(',', '.'))
    1234567.89

This is pretty much what :func:`locale.atoi` and :func:`locale.atof` do
under the hood. So what's the problem? Why doesn't :mod:`natsort` just
use this method under its hood?
Well, let's take a look at what would happen if we send some possible
:mod:`natsort` input through our the above function:

.. code-block:: pycon

    >>> natsort_key('1,234 apples, please.'.replace(',', ''))
    ('', 1234, ' apples please.')
    >>> natsort_key('Sir, €1.234,50 please.'.replace('.', '').replace(',', '.'), as_float=True)
    ('Sir. €', 1234.5, ' please')

Any character matching the thousands separator was dropped, and anything
matching the decimal separator was changed to ``'.'``! If these characters
were critical to how your data was ordered, this would break :mod:`natsort`.

The first solution one might consider would be to first decompose the
input into sub-components (like we did for the *GROUPLETTERS* method
above) and then only apply these transformations on the number components.
This is a chicken-and-egg problem, though, because *we cannot appropriately
separate out the numbers because of the thousands separators and
non-'.' decimal separators* (well, at least not without making multiple
passes over the data which I do not consider to be a valid option).

Regular expressions to the rescue! With regular expressions, we can
remove the thousands separators and change the decimal separator only
when they are actually within a number. Once the input has been
pre-processed with this regular expression, all the infrastructure
shown previously will work.

Beware, these regular expressions will make your eyes bleed.

.. code-block:: pycon

    >>> decimal = ','  # Assume German locale, so decimal separator is ','
    >>> # Look-behind assertions cannot accept range modifiers, so instead of i.e.
    >>> # (?<!\.[0-9]{1,3}) I have to repeat the look-behind for 1, 2, and 3.
    >>> nodecimal = r'(?<!{dec}[0-9])(?<!{dec}[0-9]{{2}})(?<!{dec}[0-9]{{3}})'.format(dec=decimal)
    >>> strip_thousands = r'''
    ...     (?<=[0-9]{{1}})  # At least 1 number
    ...     (?<![0-9]{{4}})  # No more than 3 numbers
    ...     {nodecimal}      # Cannot follow decimal
    ...     {thou}           # The thousands separator
    ...     (?=[0-9]{{3}}    # Three numbers must follow
    ...      ([^0-9]|$)      # But a non-number after that
    ...     )
    ... '''.format(nodecimal=nodecimal, thou=re.escape('.'))  # Thousands separator is '.' in German locale.
    ...
    >>> re.sub(strip_thousands, '', 'Sir, €1.234,50 please.', flags=re.X)
    'Sir, €1234,50 please.'
    >>>
    >>> # The decimal point must be preceded by a number or after
    >>> # a number. This option only needs to be performed in the
    >>> # case when the decimal separator for the locale is not '.'.
    >>> switch_decimal = r'(?<=[0-9]){decimal}|{decimal}(?=[0-9])'
    >>> switch_decimal = switch_decimal.format(decimal=decimal)
    >>> re.sub(switch_decimal, '.', 'Sir, €1234,50 please.', flags=re.X)
    'Sir, €1234.50 please.'
    >>>
    >>> natsort_key('Sir, €1234.50 please.', as_float=True)
    ('Sir, €', 1234.5, ' please.')

Final Thoughts
--------------

My hope is that users of :mod:`natsort` never have to think about or worry
about all the bookkeeping or any of the details described above, and that using
:mod:`natsort` seems to magically "just work". For those of you who
took the time to read this engineering description, I hope it has enlightened
you to some of the issues that can be encountered when code is released
into the wild and has to accept "real-world data", or to what happens
to developers who naïvely make bold assumptions that are counter to
what the rest of the world assumes.

.. rubric:: Footnotes

.. [#f1]
    *"But if you hadn't removed the leading empty string from re.split this
    wouldn't have happened!!"* I can hear you saying. Well, that's true. I don't
    have a *great* reason for having done that except that in an earlier
    non-optimal incarnation of the algorithm I needed to it, and it kind of
    stuck, and it made other parts of the code easier if the assumption that
    there were no empty strings was valid.
.. [#f2]
    I'm not going to show how this is implemented in this document,
    but if you are interested you can look at the code to
    :func:`sep_inserter` in `util.py`_.
.. [#f3]
    Handling each of these is straightforward, but coupled with the rapidly
    fracturing execution paths presented in :ref:`TL;DR 2 <tldr2>` one can
    imagine this will get out of hand quickly. If you take a look at
    `natsort.py`_ and `util.py`_ you can observe that to avoid this I take
    a more functional approach to construting the :mod:`natsort` algorithm
    as opposed to the procedural approach illustrated in
    :ref:`TL;DR 1 <tldr1>` and :ref:`TL;DR 2 <tldr2>`.

.. _ASCII table: https://www.asciitable.com/
.. _getting sorting right is surprisingly hard: http://www.compciv.org/guides/python/fundamentals/sorting-collections-with-sorted/
.. _This astonished: https://github.com/SethMMorton/natsort/issues/19
.. _a lot: https://stackoverflow.com/questions/29548742/python-natsort-sort-strings-recursively
.. _of people: https://stackoverflow.com/questions/24045348/sort-set-of-numbers-in-the-form-xx-yy-in-python
.. _and some people aren't very nice when they are astonished:
    https://github.com/xolox/python-naturalsort/blob/ed3e6b6ffaca3bdea3b76e08acbb8bd2a5fee463/README.rst#why-another-natsort-module
.. _fastnumbers: https://github.com/SethMMorton/fastnumbers
.. _as part of my testing: https://github.com/SethMMorton/natsort/blob/master/test_natsort/slow_splitters.py
.. _this one for coercion: https://stackoverflow.com/questions/736043/checking-if-a-string-can-be-converted-to-float-in-python
.. _this one for checking: https://stackoverflow.com/questions/354038/how-do-i-check-if-a-string-is-a-number-float
.. _most natural sort solutions for python on Stack Overflow: https://stackoverflow.com/q/4836710/1399279
.. _80%/20%: https://en.wikipedia.org/wiki/Pareto_principle
.. _The first major special case I encountered was sorting filesystem paths: https://github.com/SethMMorton/natsort/issues/3
.. _The second major special case I encountered was sorting of different types: https://github.com/SethMMorton/natsort/issues/7
.. _A rather unexpected special case I encountered was sorting collections containing NaN:
   https://github.com/SethMMorton/natsort/issues/27
.. _It's hard to compare floating point numbers: http://www.drdobbs.com/cpp/its-hard-to-compare-floating-point-numbe/240149806
.. _caught a bit off guard when the request was initially made: https://github.com/SethMMorton/natsort/issues/14
.. _at the code: https://github.com/SethMMorton/natsort/tree/master/natsort
.. _natsort.py: https://github.com/SethMMorton/natsort/blob/master/natsort/natsort.py
.. _util.py: https://github.com/SethMMorton/natsort/blob/master/natsort/util.py
.. _although they do point out in the documentation that it will be painful to use:
   https://docs.python.org/3/library/locale.html#background-details-hints-tips-and-caveats
.. _natsort.compat.locale.py: https://github.com/SethMMorton/natsort/blob/master/natsort/compat/locale.py
.. _Thousands separator support: https://github.com/SethMMorton/natsort/issues/36
.. _really good: https://hypothesis.readthedocs.io/en/latest/
.. _testing strategy: https://docs.pytest.org/en/latest/
.. _check out some official Unicode documentation: https://unicode.org/reports/tr15/
.. _Check out this issue for a real-world example: https://github.com/SethMMorton/natsort/issues/140