pdfwrite - ToUnicode revamp

Bug 695461 "Why does the search results changes after pdf optimzation in ghostscript?"

The existing ToUnicode functions were written against the original ToUnicode specification, and assume that no more than 2 unsigned shorts will be returned as the Unicode code point for any given glyph. Sadly Adobe have revised the ToUnicode CMap, making it the same as a regular CMap and then extending it still further. It is now possible for a single glyph to map to a string of up to 512 bytes.

This commit revises the existing C 'decode_glyph' functions so that instead of returning a gs_char for the Unicode code point, they return a string of bytes. If the caller initially says that the string it is passing is zero bytes long, then we do not copy the bytes, we just return the required size of the string (in bytes). A return value of 0 from a decode_glyph function means that the glyph was not in the map and so could not be 'decoded'.

As a consequence of this change, and to permit more than 2 unsigned shorts for ToUnicode CMaps, the CMap lookup enumerator now needs to be able to allocate memory, so the 'next_lookup' methods all now take a gs_memory_t pointer to make this possible.

The ToUnicode cmap table also has to change. Formerly it was a simple array of 4 bytes per code, holding either 255 or 65535 codes. For simplicity I've chosen to keep it as a large contiguous array, but each entry is now the number of bytes required to store the longest defined Unicode value for that font, plus 2 bytes. The 2 bytes give the length of the reserved space actually used by each Unicode code point. The bytes are stored immediately following the length (so a Pascal string with a 2 byte length, if you like).

This may cause ToUnicode maps to use a lot of memory. In the short term we'll live with it, because these maps only exist with pdfwrite, and that is only really expected to run on decent sized platforms. We may need to do something better in future.
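
The query-then-copy convention described above for the decode_glyph functions can be sketched roughly as follows. This is an illustration only: demo_decode_glyph and demo_fetch_unicode are hypothetical names, the signature is not the Ghostscript one, the toy map holds a single glyph, and the behaviour when the supplied buffer is too small (nothing is copied) is an assumption the commit message does not spell out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a decode_glyph procedure (not the real
 * Ghostscript signature). Returns the number of bytes needed for the
 * glyph's Unicode value, or 0 if the glyph is not in the map. */
int demo_decode_glyph(unsigned int glyph, unsigned char *out,
                      unsigned int length)
{
    /* Toy map: glyph 1 decodes to "fi" (U+0066 U+0069) as UTF-16BE. */
    static const unsigned char fi[] = { 0x00, 0x66, 0x00, 0x69 };

    if (glyph != 1)
        return 0;                     /* not in the map */
    if (length == 0)
        return (int)sizeof(fi);       /* size query: copy nothing */
    /* Assumption: a buffer that is too small receives no bytes. */
    if (out != NULL && length >= sizeof(fi))
        memcpy(out, fi, sizeof(fi));
    return (int)sizeof(fi);
}

/* Caller side: ask for the size first, then allocate and fetch.
 * Returns a malloc'd buffer (caller frees), or NULL if the glyph is
 * not in the map or allocation fails; *psize is 0 in both cases. */
unsigned char *demo_fetch_unicode(unsigned int glyph, int *psize)
{
    unsigned char *buf;
    int size;

    assert(psize != NULL);
    *psize = 0;
    size = demo_decode_glyph(glyph, NULL, 0);
    if (size == 0)
        return NULL;
    buf = malloc((size_t)size);
    if (buf == NULL)
        return NULL;
    demo_decode_glyph(glyph, buf, (unsigned int)size);
    *psize = size;
    return buf;
}
```

The two-call shape lets the caller size its buffer exactly, which matters once a single glyph may decode to anything from 2 to 512 bytes.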
author: Ken Sharp <ken.sharp@artifex.com> 2016-05-25 16:51:20 +0100
committer: Ken Sharp <ken.sharp@artifex.com> 2016-05-26 09:23:34 +0100
commit: 9dba57f0f9a53c130ec2771c0ed1d7bd6bbef6ab (patch)
tree: 76ed1bc0ba4a053f93bfaae981d8883d130c036a /base/gxfcmap.h
parent: a60087bafbaabf7052e65eb7f08548ae596d91d4 (diff)
download: ghostpdl-9dba57f0f9a53c130ec2771c0ed1d7bd6bbef6ab.tar.gz
1 files changed, 10 insertions, 2 deletions
diff --git a/base/gxfcmap.h b/base/gxfcmap.h
index 890b0b809..0e0cff13e 100644
--- a/base/gxfcmap.h
+++ b/base/gxfcmap.h
@@ -197,6 +197,14 @@ struct gs_cmap_s {
     GS_CMAP_COMMON;
 };
 
+typedef struct gs_cmap_ToUnicode_s {
+    GS_CMAP_COMMON;
+    int num_codes;
+    int key_size;
+    int value_size;
+    bool is_identity;
+} gs_cmap_ToUnicode_t;
+
 /* ---------------- Enumerators ---------------- */
 
 /*
@@ -222,7 +230,7 @@ struct gs_cmap_ranges_enum_s {
 };
 
 typedef struct gs_cmap_lookups_enum_procs_s {
-    int (*next_lookup)(gs_cmap_lookups_enum_t *penum);
+    int (*next_lookup)(gs_memory_t *mem, gs_cmap_lookups_enum_t *penum);
     int (*next_entry)(gs_cmap_lookups_enum_t *penum);
 } gs_cmap_lookups_enum_procs_t;
 struct gs_cmap_lookups_enum_s {
@@ -286,7 +294,7 @@ int gs_cmap_enum_next_range(gs_cmap_ranges_enum_t *penum);
  */
 void gs_cmap_lookups_enum_init(const gs_cmap_t *pcmap, int which,
                                gs_cmap_lookups_enum_t *penum);
-int gs_cmap_enum_next_lookup(gs_cmap_lookups_enum_t *penum);
+int gs_cmap_enum_next_lookup(gs_memory_t *mem, gs_cmap_lookups_enum_t *penum);
 int gs_cmap_enum_next_entry(gs_cmap_lookups_enum_t *penum);
 
 /* ---------------- Implementation procedures ---------------- */
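
For readers trying to picture the table layout from the commit message (a contiguous array in which each entry is a 2 byte length followed by up to value_size bytes), here is a minimal sketch. The demo_* names are hypothetical and do not appear in the Ghostscript sources, and storing the 2 byte length high byte first is an assumption: the commit message does not state the byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical view of the table: num_codes entries, each taking
 * value_size + 2 bytes, where value_size is the longest Unicode
 * value (in bytes) defined for the font. */
typedef struct demo_tounicode_table_s {
    unsigned char *data;   /* num_codes * (value_size + 2) bytes */
    int num_codes;
    int value_size;
} demo_tounicode_table_t;

/* Bytes occupied by one entry: 2 byte length plus reserved space. */
size_t demo_entry_stride(const demo_tounicode_table_t *t)
{
    assert(t != NULL);
    return (size_t)t->value_size + 2;
}

/* Store a value for a code. Returns 0 on success, -1 if the code is
 * out of range or the value does not fit in the reserved space. */
int demo_table_set(demo_tounicode_table_t *t, int code,
                   const unsigned char *value, int len)
{
    unsigned char *entry;

    if (code < 0 || code >= t->num_codes || len < 0 || len > t->value_size)
        return -1;
    entry = t->data + (size_t)code * demo_entry_stride(t);
    entry[0] = (unsigned char)(len >> 8);    /* assumed: high byte first */
    entry[1] = (unsigned char)(len & 0xff);
    if (len > 0)
        memcpy(entry + 2, value, (size_t)len);
    return 0;
}

/* Look up a code. Returns the number of bytes used (0 means no value
 * is stored) and points *pvalue at the bytes following the length. */
int demo_table_get(const demo_tounicode_table_t *t, int code,
                   const unsigned char **pvalue)
{
    const unsigned char *entry;

    if (code < 0 || code >= t->num_codes)
        return 0;
    entry = t->data + (size_t)code * demo_entry_stride(t);
    *pvalue = entry + 2;
    return (entry[0] << 8) | entry[1];
}
```

Under this layout a 65535 code table whose longest value is 4 bytes needs about 65535 * 6 bytes, and a single long mapping raises the cost of every entry, which is the memory concern raised in the commit message.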