QDoc: Append hash to canonical titles with non-alnum characters

When generating fragment identifiers from a title, QDoc normalizes the string that's used as the fragment identifier. This normalization is done by `Doc::canonicalTitle()`. This method returns a string that is stripped of non-alphanumeric characters, has runs of spaces replaced by a single hyphen, and has any repeating or trailing hyphens removed. This causes the removal of certain characters, such as 'ß', '大', etc. For documentation written in languages that contain mostly non-latin1 characters, such as Chinese, this means fragment identifiers may be empty, such that links to these anchors (e.g. from a table of contents) lead nowhere.

This patch adds test data to QDoc's generated output test to reproduce the issue. The Chinese test data is courtesy of the bug reporter. The test data also contains other characters from Latin scripts, as during investigation of a solution to the bug, these appeared as separate triggers of the misbehavior. The modified test also serves to catch possible future regressions.

The patch modifies `Doc::canonicalTitle()` such that it appends a hash to "canonical" titles whose source title contains characters that are not considered legal. In the generated string itself, the only characters that remain are lowercase a-z, digits 0-9, and the dash (`-`); other symbols and characters are removed. When the title contains any character outside the printable ASCII range (decimal 32-126, inclusive), that is, a non-printable ASCII character or any non-ASCII character, QDoc appends a hash of the original string to the fragment identifier it generates. This means that the canonical title for a string that contains, for example, a mix of allowed and disallowed characters consists of the allowed characters followed by a hash of the original string.

The patch also changes the loop in `canonicalTitle` to a range-based for loop over a const reference, and makes a code comment more precise (the new figure is based on timing one million executions of each of the two implementations of this method). Finally, the patch adds documentation for `Doc::canonicalTitle`, as none existed previously.

[ChangeLog][QDoc] QDoc now appends a hash of the original title to the fragment identifier generated for that title if the title contains non-ascii characters. This means QDoc now generates fragment identifiers for titles that are written in non-latin characters.

Fixes: QTBUG-64506
Change-Id: Idc62677b9950becea662d8ff5ead1f631ec26bc3
Reviewed-by: Topi Reiniö <topi.reinio@qt.io>
author: Paul Wicking <paul.wicking@qt.io> 2023-05-03 11:00:25 +0200
committer: Paul Wicking <paul.wicking@qt.io> 2023-05-13 22:01:10 +0200
commit: 7057d01fbb9f8f37c707b33e3b92c10a78919ddc (patch)
tree: a919ab62d892885bde7d9049011b3b658b920dce /src
parent: 941a9b5e5963f8c0798415e3cb69f031da1f4109 (diff)
download: qttools-7057d01fbb9f8f37c707b33e3b92c10a78919ddc.tar.gz
1 files changed, 50 insertions, 7 deletions
diff --git a/src/qdoc/qdoc/doc.cpp b/src/qdoc/qdoc/doc.cpp
index 762af2ebd..a4d196e36 100644
--- a/src/qdoc/qdoc/doc.cpp
+++ b/src/qdoc/qdoc/doc.cpp
@@ -13,6 +13,8 @@
 #include "quoter.h"
 #include "text.h"
 
+#include <qcryptographichash.h>
+
 QT_BEGIN_NAMESPACE
 
 using namespace Qt::StringLiterals;
@@ -407,10 +409,37 @@ CodeMarker *Doc::quoteFromFile(const Location &location, Quoter &quoter, Resolve
     return marker;
 }
 
+/*!
+    \brief Generates a url-friendly string representation from \a title.
+
+    "Url-friendly" in this context is a string that contains only a subset of
+    printable ascii characters.
+
+    The subset includes alphanumeric (alnum) characters ([a-zA-Z0-9]), printable
+    ascii characters, space, punctuation characters, and common symbols.
+    Non-alnum characters in this subset are replaced by a single dash. Leading
+    and trailing dashes are removed, such that the resulting string does not
+    start or end with a dash. Any capital character is replaced by its lowercase
+    counterpart.
+
+    If any character in \a title is non-latin, or latin and not found in the
+    aforementioned subset (e.g. 'ß', 'å', or 'ö'), a hash of \a title is
+    appended to the final string.
+
+    Returns a string that is normalized for the purpose of generating fragment
+    identifiers for \a title in URLs.
+ */
 QString Doc::canonicalTitle(const QString &title)
 {
-    // The code below is equivalent to the following chunk, but _much_
-    // faster (accounts for ~10% of total running time)
+    auto legal_ascii = [](const uint value) {
+        const uint start_ascii_subset{ 32 };
+        const uint end_ascii_subset{ 126 };
+
+        return value >= start_ascii_subset && value <= end_ascii_subset;
+    };
+
+    // The code below is equivalent to the following chunk, but
+    // has been measured to be approximately 4 times faster.
     //
     //  QRegularExpression attributeExpr("[^A-Za-z0-9]+");
     //  QString result = title.toLower();
@@ -421,11 +450,16 @@ QString Doc::canonicalTitle(const QString &title)
     QString result;
     result.reserve(title.size());
 
-    bool dashAppended = false;
-    bool begun = false;
-    qsizetype lastAlnum = 0;
-    for (int i = 0; i != title.size(); ++i) {
-        uint c = title.at(i).unicode();
+    bool dashAppended{false};
+    bool begun{false};
+    qsizetype lastAlnum{0};
+    bool has_non_alnum_content{false};
+
+    for (const auto &i : title) {
+        uint c = i.unicode();
+
+        if (!legal_ascii(c))
+            has_non_alnum_content = true;
         if (c >= 'A' && c <= 'Z')
             c += 'a' - 'A';
         bool alnum = (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9');
@@ -441,6 +475,15 @@ QString Doc::canonicalTitle(const QString &title)
         }
     }
     result.truncate(lastAlnum);
+
+    if (has_non_alnum_content) {
+        auto title_hash = QString::fromLocal8Bit(
+                QCryptographicHash::hash(title.toUtf8(), QCryptographicHash::Md5).toHex());
+        title_hash.truncate(8);
+        if (!result.isEmpty())
+            result.append(QLatin1Char('-'));
+        result.append(title_hash);
+    }
     return result;
 }
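
To make the normalization described in the commit message easier to follow outside of a Qt build, here is a minimal standalone sketch in standard C++. It is not the code from the patch: the names `canonical_title_sketch` and `short_hash_sketch` are hypothetical, the title is assumed to arrive as a UTF-8 `std::string` rather than a `QString`, and a 32-bit FNV-1a hash stands in for the truncated MD5 digest that the patch obtains from `QCryptographicHash`. The part of the loop body that the diff does not show is reconstructed from the documented behavior, so treat it as an approximation.

```cpp
#include <cstdint>
#include <string>

// ASSUMPTION: FNV-1a (32 bit) stands in for the MD5 digest used by the patch.
// Returns 8 lowercase hex digits, the same length the patch keeps after
// truncate(8).
std::string short_hash_sketch(const std::string &utf8)
{
    std::uint32_t h = 2166136261u; // FNV-1a 32-bit offset basis
    for (unsigned char byte : utf8) {
        h ^= byte;
        h *= 16777619u; // FNV-1a 32-bit prime
    }
    static const char digits[] = "0123456789abcdef";
    std::string hex(8, '0');
    for (int i = 7; i >= 0; --i) {
        hex[i] = digits[h & 0xFu];
        h >>= 4;
    }
    return hex;
}

// Hypothetical stand-in for Doc::canonicalTitle(), operating on UTF-8 bytes.
std::string canonical_title_sketch(const std::string &utf8_title)
{
    std::string result;
    result.reserve(utf8_title.size());

    bool dash_appended = false;
    bool begun = false;
    std::string::size_type last_alnum = 0;
    bool needs_hash = false;

    for (unsigned char c : utf8_title) {
        // Anything outside printable ASCII (32-126) triggers the hash suffix.
        // Every byte of a multi-byte UTF-8 sequence falls in this category.
        if (c < 32 || c > 126)
            needs_hash = true;
        if (c >= 'A' && c <= 'Z')
            c = static_cast<unsigned char>(c + ('a' - 'A'));
        const bool alnum = (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9');
        if (alnum) {
            result += static_cast<char>(c);
            begun = true;
            dash_appended = false;
            last_alnum = result.size();
        } else if (!dash_appended) {
            // Collapse a run of non-alnum characters into one dash,
            // and never start the result with a dash.
            if (begun)
                result += '-';
            dash_appended = true;
        }
    }
    result.resize(last_alnum); // drop a trailing dash, if any

    if (needs_hash) {
        if (!result.empty())
            result += '-';
        result += short_hash_sketch(utf8_title);
    }
    return result;
}
```

With this sketch, a pure-ASCII title such as "Hello, World!" becomes "hello-world" with no suffix, while a title made up only of non-ASCII characters yields just the eight hex digits, so the fragment identifier is no longer empty. Because the stand-in hash differs from MD5, the suffix values will not match those that QDoc generates.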