author Panu Matilainen <pmatilai@redhat.com> 2019-02-22 19:44:16 +0200
committer Panu Matilainen <pmatilai@redhat.com> 2019-02-22 20:37:20 +0200
commit 84920f898315d09a57a3f1067433eaeb7de5e830
tree 887a7e6c406a1730a743d770c19919fe578a1b6a /python/rpmsystem-py.h
parent ba85c95963f9b62f237c0442f6b5aca3e355fa83
In Python 3, return all our string data as surrogate-escaped utf-8 strings

In the almost ten years of rpm sort of supporting Python 3 bindings, quite obviously nobody has actually tried to use them. There's a major mismatch between what the header API outputs (bytes) and what all the other APIs accept (strings), resulting in hysterical TypeErrors all over the place, including but not limited to labelCompare() (RhBug:1631292). Also a huge number of other places have been returning strings and silently assuming utf-8 through use of Py_BuildValue("s", ...), which will just irrevocably fail when non-utf8 data is encountered.

The politically Python 3-correct solution would be declaring all our data as bytes with unspecified encoding - that's exactly what it historically is. However, doing so would by definition break every single rpm script people have developed on Python 2. And when 99% of the rpm content in the world actually is utf-8 encoded even if it doesn't say so (and in recent times packages even advertise themselves as utf-8 encoded), the bytes-only route seems a wee bit too draconian, even to this grumpy old fella.

Instead, route all our string returns through a single helper macro which on Python 2 just does what we always did, but in Python 3 converts the data to surrogate-escaped utf-8 strings. This makes stuff "just work" out of the box pretty much everywhere even with Python 3 (including our own test-suite!), while still allowing the non-utf8 case to be handled. Handling the non-utf8 case is a bit uglier but still possible, which is exactly how you want corner-cases to be. There might be some uses for retrieving raw byte data from the header, but worrying about such an API is a case for some other rainy day; for now we mostly only care that stuff works again.

Also add test-cases for mixed data source labelCompare() and non-utf8 insert to + retrieve from header.
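
As an illustration of what handling the non-utf8 case might look like on the caller's side, here is a minimal pure-Python sketch. It does not use the rpm module at all: the sample bytes and the raw_bytes() helper are hypothetical and only mimic a value that has already been decoded with the "surrogateescape" error handler.

```python
# Hypothetical helper, not part of the rpm bindings.
def raw_bytes(value):
    """Recover the original bytes from a surrogate-escaped str."""
    return value.encode("utf-8", "surrogateescape")


# Latin-1 encoded "café": a trailing 0xE9 byte is not valid UTF-8.
original = b"caf\xe9"

# Roughly what a Python 3 caller would see once the data has been decoded.
as_str = original.decode("utf-8", "surrogateescape")

# The invalid byte survives as a lone surrogate (U+DCE9) ...
assert as_str == "caf\udce9"

# ... and encoding with the same handler restores the bytes exactly.
assert raw_bytes(as_str) == original

# Once the real encoding is known, the bytes can be decoded properly.
assert raw_bytes(as_str).decode("latin-1") == "café"
```

The round trip is lossless, which is why the escaped string can stand in for the raw data until the caller decides how to interpret it.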
Diffstat (limited to 'python/rpmsystem-py.h')
-rw-r--r-- python/rpmsystem-py.h 7
1 files changed, 7 insertions, 0 deletions
diff --git a/python/rpmsystem-py.h b/python/rpmsystem-py.h
index 955d60cd3..87c750571 100644
--- a/python/rpmsystem-py.h
+++ b/python/rpmsystem-py.h
@@ -19,4 +19,11 @@
#define PyInt_AsSsize_t PyLong_AsSsize_t
#endif
+/* In Python 3, we return all strings as surrogate-escaped utf-8 */
+#if PY_MAJOR_VERSION >= 3
+#define utf8FromString(_s) PyUnicode_DecodeUTF8(_s, strlen(_s), "surrogateescape")
+#else
+#define utf8FromString(_s) PyBytes_FromString(_s)
+#endif
+
#endif /* H_SYSTEM_PYTHON */
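
For readers more at home in Python than in the C API, the Python 3 branch of the macro can be approximated as below. This is a sketch under assumptions, not the bindings' code: utf8_from_bytes() is a hypothetical stand-in that takes a bytes object, whereas the real macro takes a NUL-terminated C string and measures it with strlen().

```python
def utf8_from_bytes(data):
    """Hypothetical pure-Python stand-in for the Python 3 branch of utf8FromString."""
    return data.decode("utf-8", "surrogateescape")


valid = "naïve".encode("utf-8")
invalid = b"na\xefve"  # Latin-1 bytes, not valid UTF-8

# Valid UTF-8 comes back as an ordinary str.
assert utf8_from_bytes(valid) == "naïve"

# A strict decode rejects the invalid data outright ...
try:
    invalid.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
assert strict_ok is False

# ... while the surrogate-escaping decode keeps the bad byte as U+DCEF.
assert utf8_from_bytes(invalid) == "na\udcefve"
```

A string carrying such surrogates cannot be encoded with the default strict handler, so writing it out needs errors="surrogateescape" (or an explicit re-decode) on the way back.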