author | Paul Wicking <paul.wicking@qt.io> | 2023-05-03 11:00:25 +0200 |
---|---|---|
committer | Paul Wicking <paul.wicking@qt.io> | 2023-05-13 22:01:10 +0200 |
commit | 7057d01fbb9f8f37c707b33e3b92c10a78919ddc (patch) | |
tree | a919ab62d892885bde7d9049011b3b658b920dce /src | |
parent | 941a9b5e5963f8c0798415e3cb69f031da1f4109 (diff) | |
download | qttools-7057d01fbb9f8f37c707b33e3b92c10a78919ddc.tar.gz |
QDoc: Append hash to canonical titles with non-alnum characters
When generating fragment identifiers from a title, QDoc normalizes the
string that's used as fragment identifier. This normalization is done by
`Doc::canonicalTitle()`. This method returns a string that is stripped
of non-alphanumeric characters, has runs of spaces replaced by a single
hyphen, and has any repeated or trailing hyphens removed.
This removes certain characters, such as 'ß' and '大'. For documentation
written in languages that use mostly non-latin1 characters, such as
Chinese, fragment identifiers may end up empty, so that links to these
anchors (e.g. from a table of contents) lead nowhere.
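The failure mode described above can be reproduced with a small standard C++ sketch. It is not QDoc's code: `canonicalTitleOld` is a hypothetical name, the function works on UTF-8 bytes in a `std::string` rather than on a `QString`, and the branch that handles dashes is an assumption about the part of the loop that the diff does not show.

```cpp
#include <cstddef>
#include <string>

// Sketch of the pre-patch normalization (hypothetical name, std::string
// instead of QString). Only [a-z0-9] survive; every run of other bytes
// collapses into a single dash, and leading/trailing dashes are dropped.
std::string canonicalTitleOld(const std::string &title)
{
    std::string result;
    result.reserve(title.size());

    bool dashAppended = false;
    bool begun = false;
    std::size_t lastAlnum = 0;

    for (unsigned char c : title) {
        if (c >= 'A' && c <= 'Z')
            c = static_cast<unsigned char>(c + ('a' - 'A'));
        const bool alnum = (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9');
        if (alnum) {
            result += static_cast<char>(c);
            begun = true;
            dashAppended = false;
            lastAlnum = result.size();
        } else if (!dashAppended) {
            // Assumed behavior: at most one dash per run, none at the start.
            if (begun)
                result += '-';
            dashAppended = true;
        }
    }
    result.resize(lastAlnum); // drop a trailing dash, if any
    return result;
}
```

With this sketch, an ASCII title such as "Hello, World!" becomes "hello-world", while a title made up only of Chinese characters yields an empty string, which is the empty fragment identifier the bug report is about.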
This patch adds test data to QDoc's generated output test to reproduce
the issue. The Chinese test data is courtesy of the bug reporter. The
test data also contains other characters from Latin scripts, as during
investigation of a solution to the bug, these appeared as separate
triggers of the misbehavior. The modified test also serves to catch
possible future regressions.
The patch modifies `Doc::canonicalTitle` so that it appends a hash to
"canonical" titles whose source string contains characters that are not
legal in a canonical title. In this context, legal characters are
lowercase a-z, digits 0-9, and the dash (`-`). Other symbols and
characters are removed. When QDoc encounters any character outside the
printable ASCII range (decimal 32-126, inclusive), it appends a hash of
the original string (the first eight hex digits of its MD5 digest) to
the fragment identifier it generates. The canonical title for a string
that mixes allowed and disallowed characters therefore consists of the
allowed characters followed by a hash of the original string.
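A minimal sketch of the patched behavior, in standard C++ so that it runs without Qt. The names `shortHash` and `canonicalTitleSketch` are hypothetical. The patch itself hashes with `QCryptographicHash` (MD5) and truncates to eight hex digits; this sketch swaps in 32-bit FNV-1a as a stand-in, so the hash values differ from what QDoc emits. It also inspects UTF-8 bytes instead of UTF-16 code units, which flags the same titles because every byte of a non-ASCII character falls outside 32-126.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Stand-in for the MD5 call in the patch: 32-bit FNV-1a as 8 hex digits.
std::string shortHash(const std::string &data)
{
    std::uint32_t h = 2166136261u; // FNV offset basis
    for (unsigned char byte : data) {
        h ^= byte;
        h *= 16777619u; // FNV prime
    }
    char buf[9];
    std::snprintf(buf, sizeof buf, "%08x", static_cast<unsigned>(h));
    return buf;
}

// Hypothetical re-implementation of the patched normalization.
std::string canonicalTitleSketch(const std::string &title)
{
    std::string result;
    result.reserve(title.size() + 9);

    bool dashAppended = false;
    bool begun = false;
    bool needsHash = false;
    std::size_t lastAlnum = 0;

    for (unsigned char c : title) {
        if (c < 32 || c > 126) // outside the printable ASCII subset
            needsHash = true;
        if (c >= 'A' && c <= 'Z')
            c = static_cast<unsigned char>(c + ('a' - 'A'));
        const bool alnum = (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9');
        if (alnum) {
            result += static_cast<char>(c);
            begun = true;
            dashAppended = false;
            lastAlnum = result.size();
        } else if (!dashAppended) {
            if (begun)
                result += '-';
            dashAppended = true;
        }
    }
    result.resize(lastAlnum);

    if (needsHash) {
        if (!result.empty())
            result += '-';
        result += shortHash(title); // hash of the original, not the result
    }
    return result;
}
```

A pure-ASCII title is unchanged by the hashing step, a title with only non-ASCII characters reduces to the eight-digit hash alone, and a mixed title such as "Straße" becomes "stra-e-" followed by the hash.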
The patch changes the loop in `canonicalTitle` to a ranged for loop over
a const-ref, and adds precision to a code comment (precision based on
timing the execution of the two implementations of this method one
million times).
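The kind of measurement mentioned above could be set up roughly as follows. This is a sketch under assumptions, not the benchmark that was actually run: it uses `std::regex` in place of `QRegularExpression`, `std::string` in place of `QString`, and the helper names (`canonicalViaRegex`, `canonicalViaLoop`, `secondsFor`) are hypothetical. Absolute and relative timings will differ from whatever was measured against Qt.

```cpp
#include <chrono>
#include <cstddef>
#include <regex>
#include <string>

// Regex-based variant: lowercase, collapse non-alnum runs, trim dashes.
std::string canonicalViaRegex(const std::string &title)
{
    static const std::regex nonAlnum("[^a-z0-9]+");
    std::string lowered = title;
    for (char &ch : lowered)
        if (ch >= 'A' && ch <= 'Z')
            ch = static_cast<char>(ch - 'A' + 'a');
    std::string result = std::regex_replace(lowered, nonAlnum, "-");
    if (!result.empty() && result.front() == '-')
        result.erase(0, 1);
    if (!result.empty() && result.back() == '-')
        result.pop_back();
    return result;
}

// Hand-written single-pass variant with a range-based for loop.
std::string canonicalViaLoop(const std::string &title)
{
    std::string result;
    result.reserve(title.size());
    bool dashAppended = false;
    bool begun = false;
    std::size_t lastAlnum = 0;
    for (unsigned char c : title) {
        if (c >= 'A' && c <= 'Z')
            c = static_cast<unsigned char>(c + ('a' - 'A'));
        const bool alnum = (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9');
        if (alnum) {
            result += static_cast<char>(c);
            begun = true;
            dashAppended = false;
            lastAlnum = result.size();
        } else if (!dashAppended) {
            if (begun)
                result += '-';
            dashAppended = true;
        }
    }
    result.resize(lastAlnum);
    return result;
}

// Runs fn(input) `iterations` times and returns the elapsed seconds.
template <typename F>
double secondsFor(F &&fn, const std::string &input, int iterations)
{
    std::size_t sink = 0; // keeps the calls from being optimized away
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        sink += fn(input).size();
    const auto stop = std::chrono::steady_clock::now();
    volatile std::size_t keep = sink;
    (void)keep;
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling `secondsFor(canonicalViaRegex, title, 1000000)` and `secondsFor(canonicalViaLoop, title, 1000000)` on the same ASCII title gives two numbers whose ratio can be compared with the figure quoted in the code comment.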
Finally, the patch adds documentation for `Doc::canonicalTitle`, as that
didn't exist previously.
[ChangeLog][QDoc] QDoc now appends a hash of the original title to the
fragment identifier generated for that title if the title contains
non-ascii characters. This means QDoc now generates fragment identifiers
for titles that are written in non-latin characters.
Fixes: QTBUG-64506
Change-Id: Idc62677b9950becea662d8ff5ead1f631ec26bc3
Reviewed-by: Topi Reiniö <topi.reinio@qt.io>
Diffstat (limited to 'src')
-rw-r--r-- | src/qdoc/qdoc/doc.cpp | 57 |
1 files changed, 50 insertions, 7 deletions
diff --git a/src/qdoc/qdoc/doc.cpp b/src/qdoc/qdoc/doc.cpp
index 762af2ebd..a4d196e36 100644
--- a/src/qdoc/qdoc/doc.cpp
+++ b/src/qdoc/qdoc/doc.cpp
@@ -13,6 +13,8 @@
 #include "quoter.h"
 #include "text.h"
 
+#include <qcryptographichash.h>
+
 QT_BEGIN_NAMESPACE
 
 using namespace Qt::StringLiterals;
@@ -407,10 +409,37 @@ CodeMarker *Doc::quoteFromFile(const Location &location, Quoter &quoter, Resolve
     return marker;
 }
 
+/*!
+    \brief Generates a url-friendly string representation from \a title.
+
+    "Url-friendly" in this context is a string that contains only a subset of
+    printable ascii characters.
+
+    The subset includes alphanumeric (alnum) characters ([a-zA-Z0-9]), printable
+    ascii characters, space, punctuation characters, and common symbols.
+    Non-alnum characters in this subset are replaced by a single dash. Leading
+    and trailing dashes are removed, such that the resulting string does not
+    start or end with a dash. Any capital character is replaced by its lowercase
+    counterpart.
+
+    If any character in \a title is non-latin, or latin and not found in the
+    aforementioned subset (e.g. 'ß', 'å', or 'ö'), a hash of \a title is
+    appended to the final string.
+
+    Returns a string that is normalized for the purpose of generating fragment
+    identifiers for \a title in URLs.
+ */
 QString Doc::canonicalTitle(const QString &title)
 {
-    // The code below is equivalent to the following chunk, but _much_
-    // faster (accounts for ~10% of total running time)
+    auto legal_ascii = [](const uint value) {
+        const uint start_ascii_subset{ 32 };
+        const uint end_ascii_subset{ 126 };
+
+        return value >= start_ascii_subset && value <= end_ascii_subset;
+    };
+
+    // The code below is equivalent to the following chunk, but
+    // has been measured to be approximately 4 times faster.
     //
     // QRegularExpression attributeExpr("[^A-Za-z0-9]+");
     // QString result = title.toLower();
@@ -421,11 +450,16 @@ QString Doc::canonicalTitle(const QString &title)
     QString result;
     result.reserve(title.size());
 
-    bool dashAppended = false;
-    bool begun = false;
-    qsizetype lastAlnum = 0;
-    for (int i = 0; i != title.size(); ++i) {
-        uint c = title.at(i).unicode();
+    bool dashAppended{false};
+    bool begun{false};
+    qsizetype lastAlnum{0};
+    bool has_non_alnum_content{false};
+
+    for (const auto &i : title) {
+        uint c = i.unicode();
+
+        if (!legal_ascii(c))
+            has_non_alnum_content = true;
         if (c >= 'A' && c <= 'Z')
             c += 'a' - 'A';
         bool alnum = (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9');
@@ -441,6 +475,15 @@ QString Doc::canonicalTitle(const QString &title)
         }
     }
     result.truncate(lastAlnum);
+
+    if (has_non_alnum_content) {
+        auto title_hash = QString::fromLocal8Bit(
+            QCryptographicHash::hash(title.toUtf8(), QCryptographicHash::Md5).toHex());
+        title_hash.truncate(8);
+        if (!result.isEmpty())
+            result.append(QLatin1Char('-'));
+        result.append(title_hash);
+    }
     return result;
 }