GhostPDF - revamp PDF information extraction

A customer requested that we make pdf_info.ps work with the new PDF interpreter and generate the same information.

This commit modifies the way we extract information on a page-by-page basis so that it can include the names of spot inks and information about fonts used on the page. This is now returned to the PostScript environment using a PDF dictionary instead of a C structure.

The pdf_info.ps program has been updated so that it uses the new information in broadly the same way as the information from the old PDF interpreter.

There are differences: with the old interpreter, pdf_info.ps extracts font information itself, rather than having the interpreter do it. This is not possible with the new interpreter, which is why we have the PDF interpreter do it for us. In addition, the pdf_info.ps program only descended to the page level, whereas the new PDF interpreter evaluates all objects on the page, potentially meaning that more fonts (and technically spot inks) might be detected.

We now have an additional PostScript operator, '.PDFPageInfoExt', which returns 'extended' information about a page. This is the same as .PDFPageInfo but includes the font and spot ink information.

Running with -dPDFINFO using either Ghostscript or GhostPDF will print more information than before, including the spot inks and considerably more information about fonts than the pdf_info.ps program emits, including embedding status, descendant fonts (and their embedding status) and the presence of ToUnicode CMaps.

Updated documentation for all of the above.
author: Ken Sharp <ken.sharp@artifex.com> 2022-05-08 15:13:16 +0100
committer: Ken Sharp <ken.sharp@artifex.com> 2022-05-10 11:27:36 +0100
commit: 398bfc844bde6e2b2a4f6552ce326ad619471316 (patch)
tree: 88a4a82e5f1b00ac5654186c16d7c0afe1058d3f /doc
parent: bdc105a686f0c8fa1e29312302091685d27a9464 (diff)
download: ghostpdl-398bfc844bde6e2b2a4f6552ce326ad619471316.tar.gz
2 files changed, 39 insertions, 4 deletions
diff --git a/doc/Language.htm b/doc/Language.htm
index 6ed12d3a3..44c569d96 100644
--- a/doc/Language.htm
+++ b/doc/Language.htm
@@ -2219,14 +2219,13 @@ This function needs to write any required output intents, load and send Outlines
 and Keywords from the Info dict to the output device, copy Optional Content Properties (OCProperties) to the output device.
 If an AcroForm is present send all its fields and link widget annotations to fields, and finally copy the PageLabels. If we add support for anything else, it will be here too..
 </dd><dt><code>PDFcontext int .PDFPageInfo -</code></dt>
-<dd>     The integer argument is the page number to retrieve information for.
+<dd>     The integer argument is the page number to retrieve information for. This value starts from zero for the first page.
 Returns a dictionary with the following key/value pairs:
 <blockquote>
     <code>/UsesTransparency</code> true|false<br>
-	<code>/SpotColours</code> array of names, may be empty|<br>
+	<code>/NumSpots</code> integer containing the number of spot inks on this page<br>
 	<code>/MediaBox</code> [llx lly urx ury]<br>
 	<code>/HasAnnots</code> true|false<br>
-	<code>/FontsUsed</code> array of names, may be empty.<br>
 </blockquote>
 May also contain (if they are present in the Page dictionary)
 <blockquote>
@@ -2235,6 +2234,23 @@ May also contain (if they are present in the Page dictionary)
 	<code>/BleedBox</code> [llx lly urx ury]<br>
 	<code>/TrimBox</code> [llx lly urx ury]<br>
 	<code>/UserUnit</code> int<br>
+	<code>/Rotate</code> number<br>
+</blockquote>
+</dd>
+</dd><dt><code>PDFcontext int .PDFPageInfoExt -</code></dt>
+<dd>     As per .PDFPageInfo above but returns 'Extended' information. This consists of two additional arrays in the returned dictionary:
+<blockquote>
+	<code>/Spots</code> array of names, may be empty<br>
+	<code>/Fonts</code> array of dictionaries, one dictionary per font used on the page.
+</blockquote>
+Each font dictionary contains
+<blockquote>
+    <code>/BaseFont</code> string containing the name of the font.<br>
+    <code>/Subtype</code> string containing the type of the font, as per the PDF Reference.<br>
+    <code>/ObjectNum</code> If present, the object number of the font in the file (fonts may be defined inline and have no object number).<br>
+    <code>/Embedded</code> boolean indicating if the font's FontDescriptor includes a FontFile and is therefore embedded.<br>
+    Type 0 fonts also contain <br>
+    <code>/Descendants</code> An array containing a single font dictionary, contents as above.<br>
 </blockquote>
 </dd>
 <dt><code>PDFcontext int .PDFDrawPage -</code></dt>
diff --git a/doc/Use.htm b/doc/Use.htm
index 263bc5cef..33e4f679b 100644
--- a/doc/Use.htm
+++ b/doc/Use.htm
@@ -630,7 +630,26 @@ when true.</p>
 </dd>
 </dl>
 
- <dl>
+ <d1>
+<dt><code>-dPDFINFO</code></dt>
+<dd>Starting with release 9.56.0 this new switch will work with the PDF interpreter (GhostPDF) and with
+the PDF interpreter integrated into Ghostscript. When this switch is set the interpreter will emit
+information regarding the file, similar to that produced by the old pdf_info.ps program in the 'lib'
+folder.
+<p>
+The format is not entirely the same, and the search for fonts and spot colours is 'deeper' than the
+old program; pdf_info.ps stops at the page level whereas the PDFINFO switch will descend into objects
+such as Forms, Images, type 3 fonts and Patterns. In addition different instances of fonts with the
+same name are now enumerated.
+</p>
+<p>
+Unlike the pdf_info.ps program there is no need to add the input file to the list of permitted files
+for reading (using --permit-file-read).
+</p>
+</dd>
+</d1>
+
+<dl>
 <dt><code>-dPDFFitPage</code></dt>
 <dd>Rather than selecting a PageSize given by the PDF MediaBox, BleedBox (see -dUseBleedBox),
 TrimBox (see -dUseTrimBox), ArtBox (see -dUseArtBox), or CropBox (see -dUseCropBox),
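
The Language.htm change above documents the dictionary returned by .PDFPageInfoExt, including nested /Descendants entries for Type 0 fonts. As an illustration of how a consumer might traverse that shape, here is a minimal Python sketch. It does not call Ghostscript: `page_info` is hand-written sample data that merely mirrors the documented keys (PostScript names and strings are both modelled as Python strings), and every font name, object number and value in it is invented. The helper names `walk_fonts`, `describe` and `unembedded_fonts` are hypothetical and not part of Ghostscript.

```python
# Sample data shaped like the documented .PDFPageInfoExt result.
# ASSUMPTION: all values below are invented for illustration only.
page_info = {
    "UsesTransparency": False,
    "NumSpots": 1,
    "MediaBox": [0, 0, 612, 792],
    "HasAnnots": False,
    "Spots": ["PANTONE 185 C"],
    "Fonts": [
        # No /ObjectNum: the documentation says fonts may be defined inline.
        {"BaseFont": "Helvetica", "Subtype": "Type1", "Embedded": False},
        {
            "BaseFont": "ABCDEF+NotoSans",
            "Subtype": "Type0",
            "ObjectNum": 12,
            "Embedded": False,
            # Type 0 fonts carry a /Descendants array with one font dictionary.
            "Descendants": [
                {
                    "BaseFont": "ABCDEF+NotoSans",
                    "Subtype": "CIDFontType2",
                    "ObjectNum": 13,
                    "Embedded": True,
                }
            ],
        },
    ],
}


def walk_fonts(fonts, depth=0):
    """Yield (depth, font) for each font, recursing into /Descendants."""
    for font in fonts:
        yield depth, font
        yield from walk_fonts(font.get("Descendants", []), depth + 1)


def describe(font):
    """Format one font dictionary as a single human-readable line."""
    obj = font.get("ObjectNum")
    where = f"object {obj}" if obj is not None else "inline"
    status = "embedded" if font.get("Embedded", False) else "not embedded"
    return f"{font['BaseFont']} ({font['Subtype']}, {where}, {status})"


def unembedded_fonts(info):
    """Return the BaseFont of every font (at any depth) that is not embedded."""
    return [
        font["BaseFont"]
        for _, font in walk_fonts(info.get("Fonts", []))
        if not font.get("Embedded", False)
    ]


if __name__ == "__main__":
    for depth, font in walk_fonts(page_info["Fonts"]):
        print("  " * depth + describe(font))
```

The recursion treats /Descendants generically rather than assuming exactly one level, so the same walk works whether or not a font has descendants.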