author | Seth Morton <seth.m.morton@gmail.com> | 2022-01-29 17:20:14 -0800 |
---|---|---|
committer | Seth Morton <seth.m.morton@gmail.com> | 2022-01-29 17:31:37 -0800 |
commit | 9aad50ddc877cf97b9141604cd198b3b06d88e63 (patch) | |
tree | 0437ec097b1cc1fa8494bd6c6a0609af7b7388fb | |
parent | 961d3bbd28d134280ebff30552ae209ee4f26b5e (diff) | |
Add some limiting heuristics to the PATH suffix splitting
The prior algorithm went as follows: Obtain ALL suffixes from the base
component of the filename. Then, starting from the back, keep splitting
off suffixes until one is encountered that begins with a dot followed
by a digit (regular expression /\.\d/). Such a suffix was assumed to be
the fractional part of a floating point number, and not an extension,
and thus the splitting would stop at that point.
Some input has been seen where the filenames are composed nearly entirely
of Word.then.dot.and.then.dot. One entry amongst them contained
Word.then.dot.5.then.dot. This caused this one entry to be treated
differently from the rest of the entries due to the ".5", and the
sorting order was not as expected.
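
To make the failure mode concrete, here is a minimal sketch of the prior suffix selection, lifted from the removed lines of the diff and wrapped in a hypothetical helper named `old_suffixes`. The pattern behind `_d_match` is not shown in this commit, so `r"\.\d"` is an assumption based on the description above.

```python
import re
from pathlib import PurePath

# Assumption: the matcher tests for a dot followed by a digit.
_d_match = re.compile(r"\.\d").match


def old_suffixes(name):
    """Hypothetical wrapper around the prior suffix-selection logic."""
    suffixes = PurePath(name).suffixes
    try:
        # Position, counted from the back, of the last suffix that looks numeric.
        digit_index = next(
            i for i, x in enumerate(reversed(suffixes)) if _d_match(x)
        )
    except StopIteration:
        # No numeric-looking suffix: every suffix is split off.
        pass
    else:
        # Keep only the suffixes that follow the numeric-looking one.
        digit_index = len(suffixes) - digit_index
        suffixes = suffixes[digit_index:]
    return suffixes


plain = old_suffixes("Word.then.dot.and.then.dot")
with_digit = old_suffixes("Word.then.dot.5.then.dot")
print(len(plain), len(with_digit))
```

Under these assumptions the first name is split into five suffixes while the second keeps only two, so the two entries end up with differently shaped sort keys.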
The new algorithm is as follows: Obtain a maximum of two suffixes. Keep
these suffixes until one of them has a length greater than 4 (not
counting the leading ".") or starts with the regular expression /\.\d/.
This heuristic of course is not bullet-proof, but it will do a better
job on most real-world filenames than the previous algorithm.
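
The new heuristic can be sketched in isolation as follows. The loop is taken from the added lines of the diff. The helper name `new_suffixes` is hypothetical, and the `_d_match` pattern `r"\.\d"` is an assumption, since its definition lies outside the changed hunk.

```python
import re
from pathlib import PurePath

# Assumption: the matcher tests for a dot followed by a digit.
_d_match = re.compile(r"\.\d").match


def new_suffixes(name):
    """Hypothetical wrapper around the new suffix-selection loop."""
    suffixes = []
    for i, suffix in enumerate(reversed(PurePath(name).suffixes)):
        # Stop at a numeric-looking suffix, after two suffixes, or at a
        # suffix longer than five characters (including the leading ".").
        if _d_match(suffix) or i > 1 or len(suffix) > 5:
            break
        suffixes.append(suffix)
    suffixes.reverse()
    return suffixes


for name in (
    "Word.then.dot.and.then.dot",
    "Word.then.dot.5.then.dot",
    "archive.tar.gz",
    "notes.backup2022.txt",
):
    print(name, new_suffixes(name))
```

With this sketch both `Word.then.dot...` names yield the same two trailing suffixes, so they are treated alike, while a common double extension such as `.tar.gz` is still split off in full.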
-rw-r--r-- | natsort/utils.py | 29 |
1 files changed, 17 insertions, 12 deletions
diff --git a/natsort/utils.py b/natsort/utils.py
index 8d56b06..3832318 100644
--- a/natsort/utils.py
+++ b/natsort/utils.py
@@ -893,16 +893,21 @@ def path_splitter(
         path_parts = []
         base = str(s)
 
-    # Now, split off the file extensions until we reach a decimal number at
-    # the beginning of the suffix or there are no more extensions.
-    suffixes = PurePath(base).suffixes
-    try:
-        digit_index = next(i for i, x in enumerate(reversed(suffixes)) if _d_match(x))
-    except StopIteration:
-        pass
-    else:
-        digit_index = len(suffixes) - digit_index
-        suffixes = suffixes[digit_index:]
-
+    # Now, split off the file extensions until
+    # - we reach a decimal number at the beginning of the suffix
+    # - more than two suffixes have been seen
+    # - a suffix is more than five characters (including leading ".")
+    # - there are no more extensions
+    suffixes = []
+    for i, suffix in enumerate(reversed(PurePath(base).suffixes)):
+        if _d_match(suffix) or i > 1 or len(suffix) > 5:
+            break
+        suffixes.append(suffix)
+    suffixes.reverse()
+
+    # Remove the suffixes from the base component
     base = base.replace("".join(suffixes), "")
-    return filter(None, ichain(path_parts, [base], suffixes))
+    base_component = [base] if base else []
+
+    # Join all path comonents in an iterator
+    return filter(None, ichain(path_parts, base_component, suffixes))