GhostPDF - early sanity check on Pages tree root node /Count

A number of OSS-fuzz files timeout due to having an excessively large /Count (eg 213804087) in the root node of the Pages tree. This doesn't cause us any real problems, except that it takes a long time to fail to render that many pages. On the other hand, validating the entire pages tree for a large file with many nodes could take a reasonable amount of time. Not huge but it would mean a performance hit on sensible files just to avoid this penalty on broken files. As a compromise; this commit checks the root node /Kids array, if it has sufficient entries to match the /Count then we assume its a flat tree and the Count is correct. If it has fewer then we assume it's a tree and we check each of the entries in the /Kids array. If its a leaf node then we add 1 to the running count. If it is an intermediate node then we add the /Count of the node to the running count. If the one-level check matches the root node /Count then we assume the Count is correct. If it does not, then we assume that the count of the first level is correct. We could, if desired, validate the entire tree instead at this point but I don't think it's worth it. If the tree is really broken then the file is going to fail. This commit prevents the timeout and if the corruption is limited to the Count of the root node then it will recover from that.
author: Ken Sharp <ken.sharp@artifex.com> 2022-09-08 10:11:15 +0100
committer: Chris Liddell <chris.liddell@artifex.com> 2022-09-08 14:11:06 +0100
commit: c315444579f3b54a890569d3ac3c5e938e227843 (patch)
tree: fe120df202dedfdb48c83cae1d4fd50fa0592348
parent: 582245bb6d96c06516004db021eda02f45917dbd (diff)
download: ghostpdl-c315444579f3b54a890569d3ac3c5e938e227843.tar.gz
2 files changed, 83 insertions, 2 deletions
diff --git a/pdf/pdf_doc.c b/pdf/pdf_doc.c
index 1658941a5..6e236456a 100644
--- a/pdf/pdf_doc.c
+++ b/pdf/pdf_doc.c
@@ -348,7 +348,8 @@ error:
 int pdfi_read_Pages(pdf_context *ctx)
 {
     pdf_obj *o, *o1;
-    int code;
+    pdf_array *a = NULL;
+    int code, pagecount = 0;
     double d;
 
     if (ctx->args.pdfdebug)
@@ -426,6 +427,86 @@ int pdfi_read_Pages(pdf_context *ctx)
         ctx->num_pages = (int)floor(d);
     }
 
+    /* A simple confidence check in the value of Count. We only do this because
+     * the OSS-fuzz tool keeps on coming up with files that time out because the
+     * initial Count is insanely huge, and we spend much time trying to find
+     * millions of pages which don't exist.
+     */
+    code = pdfi_dict_knownget_type(ctx, (pdf_dict *)o1, "Kids", PDF_ARRAY, (pdf_obj **)&a);
+    if (code == 0)
+        code = gs_note_error(gs_error_undefined);
+    if (code < 0) {
+        pdfi_countdown(o1);
+        return code;
+    }
+
+    /* Firstly check if the Kids array has enough nodes, in which case it's
+     * probably flat (the common case)
+     */
+    if (a->size != ctx->num_pages) {
+        int i = 0;
+        pdf_obj *p = NULL, *p1 = NULL;
+        pdf_num *c = NULL;
+
+        /* Either its not a flat tree, or the top node /Count is incorrect.
+         * Get each entry in the Kids array in turn and total the /Count of
+         * each node and add any leaf nodes.
+         */
+        for (i=0;i < a->size; i++) {
+            code = pdfi_array_get(ctx, a, i, &p);
+            if (code < 0)
+                continue;
+            if (pdfi_type_of(p) != PDF_DICT) {
+                pdfi_countdown(p);
+                p = NULL;
+                continue;
+            }
+            code = pdfi_dict_knownget_type(ctx, (pdf_dict *)p, "Type", PDF_NAME, (pdf_obj **)&p1);
+            if (code <= 0) {
+                pdfi_countdown(p);
+                p = NULL;
+                continue;
+            }
+            if (pdfi_name_is((pdf_name *)p1, "Page")) {
+                pagecount++;
+            } else {
+                if (pdfi_name_is((pdf_name *)p1, "Pages")) {
+                    code = pdfi_dict_knownget(ctx, (pdf_dict *)o1, "Count", (pdf_obj **)&c);
+                    if (code >= 0) {
+                        if (pdfi_type_of(c) == PDF_INT)
+                            pagecount += c->value.i;
+                        if (pdfi_type_of(c) == PDF_REAL)
+                            pagecount += (int)c->value.d;
+                        pdfi_countdown(c);
+                        c = NULL;
+                    }
+                }
+            }
+            pdfi_countdown(p1);
+            p1 = NULL;
+            pdfi_countdown(p);
+            p = NULL;
+        }
+    } else
+        pagecount = a->size;
+
+    pdfi_countdown(a);
+
+    /* If the count of the top level of the tree doesn't match the /Count
+     * of the root node then something is wrong. We could abort right now
+     * and will if this continues to be a problem, but initially let's assume
+     * the count of the top level is correct and the root node /Count is wrong.
+     * This will allow us to recover if only the root /Count gets corrupted.
+     * In future we could also try validating the entire tree at this point,
+     * though I suspect that's pointless; if the tree is corrupted we aren't
+     * likely to get much that's usable from it.
+     */
+    if (pagecount != ctx->num_pages) {
+        ctx->num_pages = pagecount;
+        code = pdfi_dict_put_int(ctx, (pdf_dict *)o1, "Count", ctx->num_pages);
+        pdfi_set_error(ctx, 0, NULL, E_PDF_BADPAGECOUNT, "pdfi_read_Pages", NULL);
+    }
+
     /* We don't pdfi_countdown(o1) now, because we've transferred our
      * reference to the pointer in the pdf_context structure.
      */
diff --git a/pdf/pdf_errors.h b/pdf/pdf_errors.h
index dada4d0b0..d62d6acb9 100644
--- a/pdf/pdf_errors.h
+++ b/pdf/pdf_errors.h
@@ -59,5 +59,5 @@ PARAM(E_BAD_HALFTONE,                  "Error setting a halftone"),
 PARAM(E_PDF_BADENCRYPT,                "Encrypt diciotnary not a dictionary"),
 PARAM(E_PDF_MISSINGTYPE,               "A dictionary is missing a required /Type key."),
 PARAM(E_PDF_NESTEDTOODEEP,             "Dictionaries/arrays nested too deeply"),
-
+PARAM(E_PDF_BADPAGECOUNT,              "page tree root node /Count did not match the actual number of pages in the tree."),
 #undef PARAM
author	Ken Sharp <ken.sharp@artifex.com>	2022-09-08 10:11:15 +0100
committer	Chris Liddell <chris.liddell@artifex.com>	2022-09-08 14:11:06 +0100
commit	c315444579f3b54a890569d3ac3c5e938e227843 (patch)
tree	fe120df202dedfdb48c83cae1d4fd50fa0592348
parent	582245bb6d96c06516004db021eda02f45917dbd (diff)
download	ghostpdl-c315444579f3b54a890569d3ac3c5e938e227843.tar.gz