Commit messages
Rework the pdfwrite OCR code extensively in order to create a large
'strip' bitmap from a number of glyphs in a single call to the text
routine. The hope is that this will provide better context for
Tesseract and improved recognition.
Due to the way that text enumerators work, and the requirement to exit
to the PostScript interpreter in order to render glyph bitmaps, I've had
to abandon efforts to run as a fully 'on demand' system. We can't wait
until we find a glyph with no Unicode value and then try to render all
the glyphs up to that point (and all the following ones as well). It is
probably possible to do this but it would mean rewriting the text
processing code which is quite hideous enough as it is.
So now we render each glyph in the text string, and store them in a
linked list. When we're done with the text we free the memory. If we
find a glyph with no Unicode value then on the first pass we take the
list of glyphs, create a big bitmap from them and send it to Tesseract.
That should then return all the character codes, which we keep. On
subsequent missing Unicode values we consult the stored list.
We need to deal specially with space glyphs (which make no marks) as
Tesseract (clearly!) can't spot those.
Modify makefile (devs.mak) so that we have a preprocessor flag we can
use for conditional compilation. Currently OCR_VERSION is 0 for no OCR
and 1 for Tesseract; there may be higher numbers in future.
Add a new function to the OCR interface to process and return multiple
glyphs at once from a bitmap. Don't delete the code for a single bitmap
because we may want to use that in future enhancements.
If we don't get the expected number of characters back from the OCR
engine then we currently abort the processing. Future enhancements:
fall back to using a single bitmap instead of a strip of text; if we get
*more* characters than expected, check for ligatures (fi, ffi etc.).
Even if we've already seen a glyph, if we have not yet assigned it a
Unicode value then attempt to OCR it. So if we fail a character in one
place we may be able to recognise it in another. This requires new code
in gsfcmap.c to determine if we have a Unicode code point assigned.
Make all the relevant code, especially the params code, only compile
if OCR is enabled (Tesseract and Leptonica present and built).
Remove some debugging print code.
Add documentation
Remove vestiges of earlier OCR attempt
Trying to identify each glyph bitmap individually didn't work as well
and is replaced by the new 'all the characters in the text operation'
approach. There were a few vestiges of the old approach lying around
and one of them was causing problems when OCR was not enabled. Remove
all of that cruft here.
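The glyph-accumulation scheme described above can be sketched roughly as follows. The structure and function names here are hypothetical, not pdfwrite's actual internals, and a one-byte-per-pixel raster stands in for the real glyph bitmaps:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-glyph record: one of these is kept per glyph
 * rendered for the current text operation. */
typedef struct glyph_bits {
    int width, height;           /* bitmap size in pixels (1 byte/pixel here) */
    unsigned char *data;         /* width*height bytes */
    struct glyph_bits *next;
} glyph_bits;

/* Concatenate all glyphs side by side into one 'strip' bitmap, giving
 * the OCR engine a whole run of text as context.  Returns the strip
 * (caller frees), or NULL on failure. */
static unsigned char *make_strip(const glyph_bits *head,
                                 int *strip_w, int *strip_h)
{
    int w = 0, h = 0, x = 0;
    const glyph_bits *g;
    unsigned char *strip;

    for (g = head; g != NULL; g = g->next) {
        w += g->width;
        if (g->height > h)
            h = g->height;
    }
    if (w == 0 || h == 0)
        return NULL;
    strip = calloc((size_t)w * h, 1);   /* unused rows stay blank */
    if (strip == NULL)
        return NULL;
    for (g = head; g != NULL; g = g->next) {
        int y;
        for (y = 0; y < g->height; y++)
            memcpy(strip + (size_t)y * w + x,
                   g->data + (size_t)y * g->width, (size_t)g->width);
        x += g->width;
    }
    *strip_w = w;
    *strip_h = h;
    return strip;
}
```

The real code must also place each glyph at its advance width and baseline, which is omitted here for brevity.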
Actually making use of the legacy engine requires that a different
eng.traineddata file be used, and that the engine be selected away
from LSTM.
Suitable traineddata can be found here, for instance (open the link,
and click the download button):
https://github.com/tesseract-ocr/tessdata/blob/master/eng.traineddata
Reset the stage of the state machine after processing a returned value
Set the unicode value used by the ToUnicode processing from the value
returned by OCR.
Much more complex than previously thought; process_text_return_width()
processes all the contents of the text in the enumerator on the first
pass, because it's trying to decide if we can use a fast case (all
widths are default) or not.
This means that if we want to jump out and OCR a glyph, we need to track
which index in the string process_text_return_width was dealing with,
rather than the text enumerator index. Fortunately we are already
using a copy of the enumerator to run the glyph, so we simply need
to capture the index and set the copied enumerator index from it.
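A minimal sketch of the index hand-off described above, with invented structure and field names (the real text enumerator is far larger):

```c
#include <assert.h>

/* Hypothetical slimmed-down text enumerator: pdfwrite runs glyphs
 * through a copy of the enumerator, so the index the width pass was
 * examining must be carried over to the copy explicitly. */
typedef struct text_enum_s {
    int index;        /* next character the enumerator will process */
    int xy_index;     /* other state, copied along unchanged */
} text_enum_t;

/* Capture the width-pass index and seed the copied enumerator with it,
 * so OCR runs against the glyph actually being examined rather than
 * wherever the original enumerator happens to be. */
static void seed_copy_for_ocr(const text_enum_t *orig, int width_pass_index,
                              text_enum_t *copy)
{
    *copy = *orig;
    copy->index = width_pass_index;   /* not orig->index! */
}
```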
This is the initial framework for pdfwrite to send a bitmap of a glyph
to an OCR engine in order to generate a Unicode code point for it.
This code must not be used as-is, in particular it prevents the function
gs_font_map_glyph_to_unicode from functioning properly in the absence
of OCR software, and the connection between pdfwrite and the OCR engine
is not present.
We need to add either compile-time or run-time detection of an OCR
engine and only use one if present, as well as some control to decide
when to use OCR. We might always use OCR, or only when there is no
Glyph2Unicode dictionary available, or simply when all other fallbacks
fail.
Thanks to Peter Cherepanov for spotting this. This patch is slightly
different to his in that it makes -dNOINTERPOLATE=false enable interpolation
at the default level (the original patch and the old code would have
disabled interpolation even with 'false'). This version makes the NOINTERPOLATE
and DOINTERPOLATE options operate more symmetrically; however, it is somewhat
moot since both of these options are intended to be replaced by the better
control over image interpolation provided by -dInterpolateControl=#
Operation tested with fts_17_1712.pdf using the various command line options.
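The essence of the symmetry fix, as a hypothetical predicate (not the actual Ghostscript code): only an explicitly true -dNOINTERPOLATE should disable interpolation, and a present-but-false flag must behave like an absent one.

```c
#include <assert.h>

/* Hypothetical sketch: before the fix, the mere *presence* of
 * -dNOINTERPOLATE disabled interpolation, even -dNOINTERPOLATE=false.
 * After the fix, only a true value disables it. */
static int interpolation_enabled(int flag_present, int flag_value)
{
    return !(flag_present && flag_value);
}
```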
This contributed device is odd in how it changes its color model.
Unfortunately it does not change the ICC profile. This mismatch
between the ICC profile and the color information that is
being changed by the device causes all sorts of problems. This
should fix the issue.
While we have 64bit configurations, these will only work for 32
bit builds at the moment, due to MSVC not supporting 64bit builds.
Bug #703105 "PDF file gives "Unable to determine object number..." and output is missing some images."
As per the bug thread; the PDF file has annotations which are deeply
nested forms. The final form stream draws an Image XObject but the
Form dictionary does not contain a /Resources dictionary so we are unable
to resolve the name.
The form which calls the final form *does* define the missing XObject,
this is pretty clearly illegal, but Acrobat copes with it. In fact the
Ghostscript PDF interpreter has code to deal with it too, but there
was a bug in it: it pops an object that was never pushed, resulting in
the code being unable to find the resource.
The fix is very simple. Also uploaded the simplified file for this bug
and the file for the original bug (700493) to the test repository.
We can get pointer reuse that can vary from run to run, so we
resort to just using null/non-null pointer hiding.
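A sketch of the kind of pointer hiding meant here (hypothetical helper, assuming the goal is debug output that can be diffed between runs):

```c
#include <assert.h>
#include <string.h>

/* Pointer values vary between runs (ASLR, allocation order, reuse), so
 * printing them makes debug output undiffable.  Reduce each pointer to
 * the only run-stable fact about it: null or non-null. */
static const char *hide_ptr(const void *p)
{
    return p == NULL ? "(null)" : "(non-null)";
}
```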
Bug #703017 "When print a file created with pdf2ps command with a PDF driver, the image shifts to the upper right as the number of pages increases."
The bug report here is, unfortunately, insufficient to duplicate and
resolve the problem. The missing information was supplied quite
independently by the user 'vyvup' on the #ghostscript IRC channel. See
the #ghostscript logs at around 08:45 on November 9th 2020.
https://ghostscript.com/irclogs/20201109.html
The important missing information is that the device must have the
Ghostscript-specific page device key /.HWMargins set. When this is set
and has non-zero values the offsets are applied cumulatively on every
page by the code in pdfread.ps. This is because we set up the page
'view' before we save the current setup in the ps2write dictionary
under the 'pagesave' key. When we restore back to that at the end of the
page it therefore does not remove the translation applied to account
for the /.HWMargins.
Here we just shift the save so that it occurs before we apply the page
size setup.
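The save/restore ordering problem can be modelled in a few lines; the names are invented, and a single translation component stands in for the full page setup:

```c
#include <assert.h>

/* Model of the .HWMargins bug: each page applies a margin translation.
 * If the state is saved *after* the translation, restoring at page end
 * leaves the translation in place and the next page adds another. */
typedef struct { double tx; } gstate;

static double last_page_offset(int npages, double margin,
                               int save_before_translate)
{
    gstate gs = { 0.0 }, saved = { 0.0 };
    double draw_tx = 0.0;
    int i;

    for (i = 0; i < npages; i++) {
        if (save_before_translate)
            saved = gs;              /* the fix: save first */
        gs.tx += margin;             /* apply .HWMargins offset */
        if (!save_before_translate)
            saved = gs;              /* the bug: save after translating */
        draw_tx = gs.tx;             /* offset this page is drawn at */
        gs = saved;                  /* restore 'pagesave' at page end */
    }
    return draw_tx;
}
```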
The example files of this bug do not have %%EndComments before other
BoundingBox comments which confuse the image placement logic resulting
in a blank page.
This patch, provided by Peter Cherepanov, fixes the problem by stopping
the DSC processing when (%%BeginProlog) (%%BeginSetup) or %%EndComments
is encountered.
Cluster regression passes, showing only 1 file that rotates differently
when -dEPSFitPage is used (just to ensure that DSC processing is OK).
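The stopping condition might look like this (hypothetical C helper; the actual patch lives in the PostScript DSC-processing code):

```c
#include <assert.h>
#include <string.h>

/* Stop scanning DSC comments for bounding boxes once the header ends:
 * later %%BoundingBox comments (e.g. from embedded documents) must not
 * override the document's own and confuse the image placement logic. */
static int dsc_header_ended(const char *line)
{
    return strncmp(line, "%%EndComments", 13) == 0
        || strncmp(line, "%%BeginProlog", 13) == 0
        || strncmp(line, "%%BeginSetup", 12) == 0;
}
```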
This should enable the runtests values to be consistent.
If we end up in the rectfill operation and we have transparency,
make sure that we take the path that uses gs_fill to ensure that
pdf14_fill_path is executed, which will update the marking parameters.
If the source space is CIEBased PS type, then be
sure to use the equivalent ICC color space in
determining the overprint drawn components.
Movie annotations are not implemented. This fix just disables the attempt to
preserve them, so that the pdfwrite output will be valid.
This file has a /Movie annotation that references a local file that
isn't included in the PDF, so it will never play properly anyway.
The annotation tries to reference a stream in its /Poster entry (for
the image preview of the Movie), and this was not being handled
correctly.
The file from Bug 693300 has a blank image when going to DeviceN devices
such as psdcmyk and tiffsep. The CompatibleOverprint blend mode must be
set before the transparency group is pushed.
Add a .special_op for SupportsDevn to let the PostScript PDF interpreter
detect that the device supports separations.
Make sure the mark fill rect for the knockout case handles
the hybrid case of additive process colors with subtractive spots
and overprint.
And make sure the group that is pushed in gstrans.c for text knockouts
uses compatible overprint if needed.
Ray did the PS parts and Michael did the pdf14dev parts.
Bug #702885 "ICC profile management can lead to a crash due to lack of reference counting of profiles"
See the bug report for details, this commit removes the dereference of
the ICC profile as recommended by Michael in that thread.
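For readers unfamiliar with the failure mode: with reference counting, one extra decrement ("dereference") frees an object another owner still holds. A toy model, not Ghostscript's actual rc machinery:

```c
#include <assert.h>

/* Minimal reference-counted profile, illustrating why a spurious
 * decrement frees an object that is still in use.  'freed' is for the
 * demonstration only; real code would free() the object. */
typedef struct {
    int refcount;
    int freed;
} profile;

static void profile_ref(profile *p)
{
    p->refcount++;
}

static void profile_unref(profile *p)
{
    if (--p->refcount == 0)
        p->freed = 1;   /* last owner gone: safe to destroy */
}
```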
Thanks to Peter Cherepanov for this fix.
The page skipping was caused by not disabling the PageHandler in
even or odd page selection modes due to a typo (Pagelist instead of
PageList). The PDF interpreter fed only the odd (or even) pages,
and then the PageHandler also skipped every other page.
Also correct log messages and associated operand stack mess-up.
The printer device needs to have the bg_print structure external to the
device so that when the device is freed the bg_print communication area
shared with the thread doesn't get freed. This is similar to the clist
band_range_list issue.
The SEGV was seen with 15-01.BIN as it unsubclasses the PCL_Mono_Palette
gs_pcl_mono_palette_device device.
The 'band_range_list' was a structure of two pointers within the device
(gx_device_clist_writer) so when it was copied, the 'ccl' pointer could
point to the band_range_list structure in the child device. This pointer
would no longer be valid once the child device was freed, as happens
when the device is unsubclassed.
Detected with 15-01.BIN as it called gx_device_unsubclass for the PCL_mono
subclass device. With -Z@ the band_range_list would be overwritten with
(known, but invalid pointer) data resulting in the SEGV.
Cured by putting the band_range_list into the clist_writer 'data' area.
This area is not GC'ed and since it points into other memory in the
clist writer 'cbuf' area, it is internal to the clist writer.
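The underlying C hazard in miniature (invented struct, same shape as the bug): a structure holding a pointer into itself cannot safely be copied wholesale.

```c
#include <assert.h>

/* The band_range_list bug in miniature: a structure containing a
 * pointer to one of its own members.  A plain struct copy leaves the
 * copy's pointer aimed at the *original*, which is fatal once the
 * original is freed. */
typedef struct {
    int list[2];
    int *ccl;          /* points into this same structure */
} writer;

static void writer_init(writer *w)
{
    w->list[0] = w->list[1] = 0;
    w->ccl = w->list;  /* self-referential pointer */
}
```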
This cures data races with gs_next_ids seen with helgrind and BGPrint
and/or NumRenderingThreads.
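The property being protected can be sketched with a C11 atomic; Ghostscript uses its own locking primitives rather than stdatomic, so this only shows the invariant:

```c
#include <assert.h>
#include <stdatomic.h>

/* Sketch of a race-free ID generator: making the shared counter's
 * update indivisible means concurrent renderer threads can never hand
 * out duplicate (or torn) IDs. */
static _Atomic unsigned long next_id = 1;

static unsigned long gs_next_id_sketch(void)
{
    /* fetch-and-add is one atomic step: no two threads can observe the
     * same value before either writes the increment back */
    return atomic_fetch_add(&next_id, 1);
}
```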
The 'heap_available' value is documented as a 'snapshot', but collecting
it should not be a thread hazard. This change locks around the collection
of the info to be returned in the gs_memory_status structure.
The unconditional call to enable multi-threaded rendering during set up of the
gx_device_printer as a clist caused the SEGV of bug 702181, but enabling
multi-threaded rendering here isn't correct since it is usually done when we
output the page (gdev_prn_output_page_aux). The problem is that BGPrint would
set up a new clist device to render the page, but needs to enable multi-threaded
rendering for the background clist device. Do this if NumRenderingThreads > 0.
gpmisc.c does allocations for gp_file objects and buffers used by
gp_fprintf, as well as gp_validate_path_len. The helgrind run with
-dBGPrint -dNumRenderingThreads=4 and PCL input showed up the gp_fprintf
problem since the clist rendering would call gp_fprintf using the same
allocator (PCL's chunk allocator which is non_gc_memory). The chunk
allocator is intentionally not thread safe (for performance).
The issue is that the pdf14 device will close and reopen the
target device under certain cases and the X11 devices were not
cleaning themselves up sufficiently. Also added a finalize
method where the call to XCloseDisplay should actually be made.
The pdf14 device does this close and open dance to ensure that
the page_has_transparency parameter will be set in the device.
It is possible that we could end up in the pdf14 device
push without page_has_transparency if the call is done from
PostScript code. The PDF interpreter
always sets the page_has_transparency value before doing the
push so this should not be a problem with PDF.
Thanks to Chris Liddell for helping with this.
This is technically required, but omitting it appears to be harmless.
However, since the fix is trivial, I have made it anyway.
The predefined headers for the ios build were missing the size_t updates.
We also don't want to try using CAL with ios (at least, for the moment).
Bug #702698 "convert Grayscale error in output"
Images with a /Mask where the Mask is a range of values (and hence a
chroma-keyed image), as opposed to the Mask being an external Image
Mask (and therefore a stencil), cannot be preserved as such when we are
colour converting.
There is no reliable way to be certain that the colour of the image
samples after conversion and the converted Mask values will relate in
the same way. It's entirely possible for multiple RGB values to map to
the same gray value, for instance, and if that happens to be a masked
value then pixels will be masked which were not originally.
This commit checks the ColorConversionStrategy, and if it is not
LeaveColorUnchanged then we further examine the Mask. If the Mask is a
range of values, and the image is not in the target colour space, then we
cannot preserve it. In this case we fall back to either preserving the
image and creating a clip path from the chroma key and the image
samples, or (if Version < 1.3) we fall right back to writing each image
sample as a filled rectangle.
This does, of course, result in larger output.
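A concrete instance of the many-to-one problem described above, using standard luma weights for illustration (the actual conversion Ghostscript performs depends on the colour spaces and profiles involved):

```c
#include <assert.h>

/* Colour conversion is many-to-one: distinct RGB triples can land on
 * the same gray value.  If that gray value is the chroma key, pixels
 * that were never masked become masked after conversion. */
static int rgb_to_gray(int r, int g, int b)
{
    /* integer form of the common 0.299/0.587/0.114 luma weights */
    return (299 * r + 587 * g + 114 * b) / 1000;
}
```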
Change it and the documentation. It may be that older Acrobat versions
defaulted to ignoring AcroForm, but current Acrobat doesn't.
Also fix pdf_draw draw_form_field to match the check in pdf_main
process_trailer_attrs for the file from Bug 692447, which has null
entries in its Fields array.
Fix indentation in pdf_main process_trailer_attrs area that processes
AcroForm dict.
We need to defer the EPSFitPage scaling, centering and rotation until
after both the %%BoundingBox and/or the %%HiResBoundingBox have been
processed. Doing one after the other results in slight rounding
differences depending on the resolution. Change to save the value
(preferring the HiResBoundingBox) and do the scaling/translate/rotate
only once, when %%EndComments is processed.
Make the same adjustment in image_render_interpolate_landscape_icc() as
was done in image_render_interpolate_icc() by the commit a936cf76.
Thanks to Peter Cherepanov for this bug report and the patch. Regression
testing looks fine, and the previous code looks like it would have
bumped the p_cm_interp value twice in some cases.
The global_glyph_code callback API relies on the name table being available in
the gs_lib_ctx and accessible via the gs_memory_t object.
This is not something that is true, nor can we make it true, for pdfi. Because
pdfi is required to integrate with the Postscript interpreter, we cannot have
the gs_lib_ctx "top_of_system" pointer point to the pdfi context.
So, change the global_glyph_code API so it takes a gs_font pointer, instead of
a gs_memory_t pointer. The Postscript implementation will work exactly the same,
just accessing the gs_memory_t through the gs_font pointer.
But it allows pdfi to grab its own context through the gs_font "client_data"
pointer.
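The shape of the API change, with simplified stand-in types (these are not the real gs_font/gs_memory_t definitions):

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the real Ghostscript types. */
typedef struct gs_memory_sketch { int dummy; } gs_memory_sketch;
typedef struct gs_font_sketch {
    gs_memory_sketch *memory;   /* PostScript path: gs_memory_t, one hop away */
    void *client_data;          /* pdfi path: its own interpreter context */
} gs_font_sketch;

/* old: int (*glyph_code)(gs_memory_sketch *mem, ...);
 * new: the callback takes the font, and each interpreter pulls what it
 * needs from it. */
typedef int (*glyph_code_fn)(gs_font_sketch *font);

static int ps_impl(gs_font_sketch *font)
{
    /* the PostScript implementation works exactly as before, just
     * reaching the allocator through the font */
    return font->memory != NULL;
}
```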
Also includes:
Work around a change in the zlib API in 1.2.11, where it's used
in the FreeType/zlib interface debugging code.
gx_device_text_begin may return an error if there is still
work to be done on a pattern device. Catch that before
the check on *ppte, which will not be a valid value in this
case.
As (mostly) spotted by Lisa Fenn, fix typos in comments and messages.
The sparc architecture requires pointers within structures to
be at 64bit alignment, even if they are 32bit pointers.
LCMS2 allows for this by having a CMS_PTR_ALIGNMENT symbol
that can be predefined. If it's not predefined, it defaults to
sizeof(void *).
We update the source here so that when building for sparc, it
defaults to 8. This shouldn't affect gs, as it sets the value
via configure/make. I believe our lcms2 repo as used in MuPDF
is autogenerated from this though, and this will help us there.
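The pattern being adjusted looks roughly like this; CMS_PTR_ALIGNMENT is the real lcms2 symbol, but the rounding macro name here is illustrative:

```c
#include <assert.h>

/* CMS_PTR_ALIGNMENT is predefinable; if not, it defaults to the pointer
 * size.  The change described above makes the default 8 when building
 * for sparc, where pointers inside structures need 64-bit alignment
 * even in 32-bit builds. */
#ifndef CMS_PTR_ALIGNMENT
#  ifdef __sparc__
#    define CMS_PTR_ALIGNMENT 8
#  else
#    define CMS_PTR_ALIGNMENT sizeof(void *)
#  endif
#endif

/* Illustrative helper (not an lcms2 name): round a size up to the
 * alignment boundary.  Valid for any power-of-two alignment. */
#define ALIGN_UP(x) (((x) + (CMS_PTR_ALIGNMENT - 1)) & ~(CMS_PTR_ALIGNMENT - 1))
```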
From 2.10.3, FreeType removed the FT_CALLBACK_DEF() macro, which is what
we used when defining our callbacks for FreeType.
No guidance forthcoming from the Freetype developer who made those changes,
so change to explicitly declaring the callbacks file static.
Should fix the reported build failures.
Text extraction now works for Python2.pdf and zlib.3.pdf.
Added GlyphWidths[] and SpanDeltaX[] arrays, containing information required
for generating the intermediate format used by the extract system.
When checking for pdf14 being in an opaque state, we check to see
whether we are either constructing an SMask or are within an
SMask. We need to use a dev spec op for this, as we might be within
a clipper device, and so not actually be directly a pdf14 device.
While documenting the process for creating a ZUGFeRD invoice from a PDF
file and an XML invoice, it became clear to me that it was beyond any
reasonable expectation that a user could manage it unaided. So this
program assists in the creation of a ZUGFeRD document.
The program requires two command line parameters; -sZUGFeRDProfile=
which contains a fully qualified path to an appropriate (correct colour
space) ICC profile, and -sZUGFeRDXMLFile= which contains a fully
qualified path to the XML invoice file.
An example command line is in the comments, and a usage message is printed
if the user fails to provide either of the required elements. Obviously the user
must also set -dPDFA=3 and -sColorConversionStrategy in order to create
a valid PDF/A-3b file.