Commit message | Author | Age | Files | Lines
* Improve use of pdfwrite with OCR [pdfwrite_ocr] | Ken Sharp | 2020-11-12 | 13 | -94/+698
|
| Rework the pdfwrite OCR code extensively in order to create a large 'strip' bitmap from a number of glyphs in a single call to the text routine. The hope is that this will provide better context for Tesseract and improved recognition.
|
| Due to the way that text enumerators work, and the requirement to exit to the PostScript interpreter in order to render glyph bitmaps, I've had to abandon efforts to run as a fully 'on demand' system. We can't wait until we find a glyph with no Unicode value and then try to render all the glyphs up to that point (and all the following ones as well). It is probably possible to do this, but it would mean rewriting the text processing code, which is quite hideous enough as it is.
|
| So now we render each glyph in the text string and store them in a linked list. When we're done with the text we free the memory. If we find a glyph with no Unicode value then on the first pass we take the list of glyphs, create a big bitmap from them and send it to Tesseract. That should then return all the character codes, which we keep. On subsequent missing Unicode values we consult the stored list.
|
| We need to deal specially with space glyphs (which make no marks) as Tesseract (clearly!) can't spot those.
|
| Modify makefile (devs.mak) so that we have a preprocessor flag we can use for conditional compilation. Currently OCR_VERSION is 0 for no OCR and 1 for Tesseract; there may be higher numbers in future.
|
| Add a new function to the OCR interface to process and return multiple glyphs at once from a bitmap. Don't delete the code for a single bitmap because we may want to use that in future enhancements.
|
| If we don't get the expected number of characters back from the OCR engine then we currently abort the processing.
|
| Future enhancements: fall back to using a single bitmap instead of a strip of text; if we get *more* characters than expected, check for ligatures (fi, ffi etc).
|
| Even if we've already seen a glyph, if we have not yet assigned it a Unicode value then attempt to OCR it. So if we fail a character in one place we may be able to recognise it in another. This requires new code in gsfcmap.c to determine if we have a Unicode code point assigned.
|
| Make all the relevant code, especially the params code, only compile if OCR is enabled (Tesseract and Leptonica present and built).
|
| Remove some debugging print code.
|
| Add documentation
|
| Remove vestiges of earlier OCR attempt
|
| Trying to identify each glyph bitmap individually didn't work as well and is replaced by the new 'all the characters in the text operation' approach. There were a few vestiges of the old approach lying around and one of them was causing problems when OCR was not enabled. Remove all of that cruft here.
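The 'strip' idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `glyph_node` type, the 8-bits-per-pixel layout, the top alignment of glyphs and the `build_strip` name are all hypothetical and are not taken from the pdfwrite source.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical node: one rendered glyph, 8 bits per pixel, row-major. */
typedef struct glyph_node {
    int width, height;
    const unsigned char *pixels;      /* width * height bytes */
    struct glyph_node *next;
} glyph_node;

/* Build one wide bitmap by placing each glyph left to right, top aligned.
 * Returns a malloc'd buffer (caller frees) or NULL; writes the strip size. */
unsigned char *build_strip(const glyph_node *list, int *strip_w, int *strip_h)
{
    int w = 0, h = 0, x = 0;
    const glyph_node *g;
    unsigned char *strip;

    assert(strip_w != NULL && strip_h != NULL);

    /* First pass: total width and tallest glyph give the strip size. */
    for (g = list; g != NULL; g = g->next) {
        w += g->width;
        if (g->height > h)
            h = g->height;
    }
    *strip_w = w;
    *strip_h = h;
    if (w == 0 || h == 0)
        return NULL;

    strip = calloc((size_t)w * (size_t)h, 1);   /* unmarked pixels stay 0 */
    if (strip == NULL)
        return NULL;

    /* Second pass: copy each glyph row into its column range. */
    for (g = list; g != NULL; g = g->next) {
        int row;
        for (row = 0; row < g->height; row++)
            memcpy(strip + (size_t)row * (size_t)w + (size_t)x,
                   g->pixels + (size_t)row * (size_t)g->width,
                   (size_t)g->width);
        x += g->width;
    }
    return strip;
}
```

A glyph that makes no marks (a space) would contribute only blank columns here, which is consistent with the note above that such glyphs need separate handling.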
* Add gs_param_list_dump() debug function. | Robin Watts | 2020-11-12 | 3 | -0/+34
|
* Add OCRLanguage and OCREngine parameters to pdfwrite. | Robin Watts | 2020-11-12 | 4 | -1/+71
|
* Tweak Tesseract build to include legacy engine. | Robin Watts | 2020-11-12 | 1 | -3/+4
|
| Actually making use of the legacy engine requires a different eng.traineddata file to be used, and the engine to be selected away from LSTM.
|
| Suitable traineddata can be found here, for instance (open the link, and click the download button):
|
| https://github.com/tesseract-ocr/tessdata/blob/master/eng.traineddata
* More work on pdfwrite + OCR | Ken Sharp | 2020-11-12 | 2 | -3/+16
|
| Reset the stage of the state machine after processing a returned value.
|
| Set the unicode value used by the ToUnicode processing from the value returned by OCR.
|
| Much more complex than previously thought; process_text_return_width() processes all the contents of the text in the enumerator on the first pass, because it's trying to decide if we can use a fast case (all widths are default) or not.
|
| This means that if we want to jump out and OCR a glyph, we need to track which index in the string process_text_return_width was dealing with, rather than the text enumerator index. Fortunately we are already using a copy of the enumerator to run the glyph, so we simply need to capture the index and set the copied enumerator index from it.
* Hook Tesseract up to pdfwrite. | Robin Watts | 2020-11-12 | 3 | -4/+147
|
* Interim commit for pdfwrite+OCR | Ken Sharp | 2020-11-12 | 5 | -27/+70
|
| This is the initial framework for pdfwrite to send a bitmap of a glyph to an OCR engine in order to generate a Unicode code point for it.
|
| This code must not be used as-is; in particular it prevents the function gs_font_map_glyph_to_unicode from functioning properly in the absence of OCR software, and the connection between pdfwrite and the OCR engine is not present.
|
| We need to add either compile-time or run-time detection of an OCR engine and only use one if present, as well as some control to decide when to use OCR. We might always use OCR, or only when there is no Glyph2Unicode dictionary available, or simply when all other fallbacks fail.
* Fix bug 703125: -dNOINTERPOLATE leaves 'true' boolean on the stack. | Ray Johnston | 2020-11-11 | 1 | -2/+2
|
| Thanks to Peter Cherepanov for spotting this. This patch is slightly different to his in that it makes -dNOINTERPOLATE=false enable interpolation at the default level (the original patch and the old code would have disabled interpolation even with 'false').
|
| This version makes the NOINTERPOLATE and DOINTERPOLATE options operate more symmetrically; however, it is sort of moot since both of these options are intended to be replaced by the better control on image interpolation provided by -dInterpolateControl=#
|
| Operation tested with fts_17_1712.pdf using the various command line options.
* Bug 701804: Fix for device that causes buffer overflows | Michael Vrhel | 2020-11-11 | 1 | -34/+45
|
| This contributed device is odd in how it changes its color model. Unfortunately it does not change the ICC profile. This mismatch between the ICC profile and the color information that is being changed by the device causes all sorts of problems. This should fix the issue.
* MSVC: Add sanitize configurations/targets. | Robin Watts | 2020-11-11 | 8 | -15/+326
|
| While we have 64bit configurations, these will only work for 32 bit builds at the moment, due to MSVC not supporting 64bit builds.
* PDF interpreter - fix searching for missing Resources in parent object | Ken Sharp | 2020-11-10 | 1 | -1/+0
|
| Bug #703105 "PDF file gives "Unable to determine object number..." and output is missing some images."
|
| As per the bug thread; the PDF file has annotations which are deeply nested forms. The final form stream draws an Image XObject but the Form dictionary does not contain a /Resources dictionary, so we are unable to resolve the name.
|
| The form which calls the final form *does* define the missing XObject; this is pretty clearly illegal but Acrobat copes with it. In fact the Ghostscript PDF interpreter has code to deal with it too, but there was a bug in it: it pops an object that was never pushed, resulting in the code being unable to find the resource.
|
| Fixed very simply here. Also uploaded the simplified file for this bug and the file for the original bug (700493) to the test repository.
* toolbin/localcluster/clusterpush.pl: exclude extract's src/build. | Julian Smith | 2020-11-09 | 1 | -0/+4
|
* api_test: Simplify pointer hiding. | Robin Watts | 2020-11-09 | 1 | -17/+1
|
| We can get pointer reuse that can vary from run to run, so we resort to just using null/non-null pointer hiding.
* ps2write - fix use of /.HWMargins with ps2write output | Ken Sharp | 2020-11-09 | 1 | -2/+2
|
| Bug #703017 "When print a file created with pdf2ps command with a PDF driver, the image shifts to the upper right as the number of pages increases."
|
| The bug report here is, unfortunately, insufficient to duplicate and resolve the problem. The missing information was supplied quite independently by the user 'vyvup' on the #ghostscript IRC channel. See the #ghostscript logs at around 08:45 on November 9th 2020.
|
| https://ghostscript.com/irclogs/20201109.html
|
| The important missing information is that the device must have the Ghostscript-specific page device key /.HWMargins set. When this is set and has non-zero values the offsets are applied cumulatively on every page by the code in pdfread.ps.
|
| This is because we set up the page 'view' before we save the current setup in the ps2write dictionary under the 'pagesave' key. When we restore back to that at the end of the page it therefore does not remove the translation applied to account for the /.HWMargins.
|
| Here we just shift the save so that it occurs before we apply the page size setup.
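The ordering problem described above can be modelled in a few lines. This is a toy sketch, not the PostScript in pdfread.ps: `origin` stands in for the current translation, `saved` for the 'pagesave' snapshot, and the function name and parameters are hypothetical.

```c
#include <assert.h>

/* Toy model of the per-page setup: returns the translation left over
 * after 'pages' pages. save_first selects the corrected ordering. */
double offset_after_pages(int pages, double margin, int save_first)
{
    double origin = 0.0, saved = 0.0;
    int i;

    assert(pages >= 0);
    for (i = 0; i < pages; i++) {
        if (save_first) {
            saved = origin;      /* save, then apply the margin shift */
            origin += margin;
        } else {
            origin += margin;    /* shift first: the save captures it */
            saved = origin;
        }
        /* ... page content would be drawn here ... */
        origin = saved;          /* restore at the end of the page */
    }
    return origin;
}
```

With the shift applied before the save, three pages with a margin of 10 leave a residual offset of 30; with the save taken first, the residual offset is 0, which matches the drift reported in the bug.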
* Fix bug 688166. EPS DSC comment processing not terminating properly. | Ray Johnston | 2020-11-07 | 1 | -1/+21
|
| The example files of this bug do not have %%EndComments before other BoundingBox comments, which confuses the image placement logic, resulting in a blank page.
|
| This patch, provided by Peter Cherepanov, fixes the problem by stopping the DSC processing when (%%BeginProlog), (%%BeginSetup) or %%EndComments is encountered.
|
| Cluster regression passes, showing only 1 file that rotates differently when -dEPSFitPage is used (just to ensure that DSC processing is OK).
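The stopping rule can be illustrated with a small predicate. The actual patch is in Ghostscript's PostScript code; this C version, including the `ends_header_comments` name, is only an assumed sketch of the same test.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if 'line' begins with one of the comments at which header
 * DSC processing should stop, 0 otherwise. */
int ends_header_comments(const char *line)
{
    static const char *const stops[] = {
        "%%EndComments", "%%BeginProlog", "%%BeginSetup"
    };
    size_t i;

    assert(line != NULL);
    for (i = 0; i < sizeof stops / sizeof stops[0]; i++)
        if (strncmp(line, stops[i], strlen(stops[i])) == 0)
            return 1;
    return 0;
}
```

A reader loop would call this on each line and ignore any later %%BoundingBox comments once it returns 1.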
* apitest: Hide pointer values in output. | Robin Watts | 2020-11-07 | 1 | -4/+40
|
| This should enable the runtests values to be consistent.
* Bug 702005 : rectfill and transparency | Michael Vrhel | 2020-11-06 | 1 | -1/+6
|
| If we end up in the rectfill operation and we have transparency, make sure that we take the path that uses gs_fill to ensure that pdf14_fill_path is executed, which will update the marking parameters.
* jbig2dec: Add casts to silence a compiler warning. | Sebastian Rasmussen | 2020-11-07 | 1 | -2/+2
|
* Bug 703087: CIEBased color space with overprint | Michael Vrhel | 2020-11-05 | 1 | -1/+4
|
| If the source space is CIEBased PS type, then be sure to use the equivalent ICC color space in determining the overprint drawn components.
* Bug 703086 -- Disable trying to preserve Movie annotations | Nancy Durgin | 2020-11-05 | 1 | -1/+2
|
| Movie annotations are not implemented. This fix just disables the attempt to preserve them, so that the pdfwrite output will be valid.
|
| This file has a /Movie annotation that references a local file that isn't included in the PDF, so it will never play properly anyway.
|
| The annotation tries to reference a stream in its /Poster entry (for the image preview of the Movie), and this was not being handled correctly.
* Fix Bug 702034. Missing image to DeviceN devices. | Ray Johnston | 2020-11-04 | 5 | -14/+73
|
| The file from Bug 693300 has a blank image when going to DeviceN devices such as psdcmyk and tiffsep. The CompatibleOverprint blend mode must be set before the transparency group is pushed.
|
| Add a .special_op for SupportsDevn to let the PostScript PDF interpreter detect that the device supports separations.
|
| Make sure the mark fill rect for the knockout case handles the hybrid case of additive process colors with subtractive spots and overprint.
|
| And make sure the group that is pushed in gstrans.c for text knockouts uses compatible overprint if needed.
|
| Ray did the PS parts and Michael did the pdf14dev parts.
* pdfwrite - Fix potential seg faults with ColorConversionStrategy | Ken Sharp | 2020-11-04 | 1 | -12/+0
|
| Bug #702885 "ICC profile management can lead to a crash due to lack of reference counting of profiles"
|
| See the bug report for details; this commit removes the dereference of the ICC profile as recommended by Michael in that thread.
* Update clusterpush.pl for extract jobs. | Robin Watts | 2020-11-03 | 1 | -4/+16
|
* Fix bug 702957, 702971. PageList problems with PDF input files. | Ray Johnston | 2020-11-03 | 1 | -6/+4
|
| Thanks to Peter Cherepanov for this fix.
|
| The page skipping was caused by not disabling the PageHandler in even or odd page selection modes due to a typo (Pagelist instead of PageList). The PDF interpreter fed only the odd (or even) pages, and then the PageHandler also skipped every other page.
|
| Also correct log messages and associated operand stack mess-up.
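The effect of the filter being applied twice can be shown with a toy model. The `keep_odd` helper below is hypothetical and simply keeps every other item, which is how the description above characterises both the interpreter's selection and the PageHandler's.

```c
#include <assert.h>
#include <stddef.h>

/* Keep the 1st, 3rd, 5th, ... items of in[], as an "odd pages" filter
 * would. Writes the kept items to out[] and returns how many there are. */
size_t keep_odd(const int *in, size_t n, int *out)
{
    size_t i, kept = 0;

    assert(in != NULL && out != NULL);
    for (i = 0; i < n; i += 2)
        out[kept++] = in[i];
    return kept;
}
```

Running pages 1 to 8 through the filter once gives 1, 3, 5, 7, the intended result. Running that output through it again, as happened when the PageHandler was left enabled, gives only 1 and 5.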
* Fix another SEGV with BGPrint due to device_unsubclass | Ray Johnston | 2020-11-02 | 4 | -63/+80
|
| The printer device needs to have the bg_print structure external to the device so that when the device is freed the bg_print communication area shared with the thread doesn't get freed. This is similar to the clist band_range_list issue.
|
| The SEGV was seen with 15-01.BIN as it unsubclasses the PCL_Mono_Palette gs_pcl_mono_palette_device device.
* Set orig_spec_op for printer class devices so it forwards properly. | Ray Johnston | 2020-10-31 | 1 | -1/+1
|
* Fix SEGV with gx_device_unsubclass when child is a clist device. | Ray Johnston | 2020-10-31 | 4 | -12/+13
|
| The 'band_range_list' was a structure of two pointers within the device (gx_device_clist_writer) so when it was copied, the 'ccl' pointer could point to the band_range_list structure in the child device. This pointer would no longer be valid when the child device was freed as the device unsubclass did.
|
| Detected with 15-01.BIN as it called gx_device_unsubclass for the PCL_mono subclass device. With -Z@ the band_range_list would be overwritten with (known, but invalid pointer) data resulting in the SEGV.
|
| Cured by putting the band_range_list into the clist_writer 'data' area. This area is not GC'ed and since it points into other memory in the clist writer 'cbuf' area, it is internal to the clist writer.
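The hazard described above, a struct holding a pointer to one of its own members and then being copied, can be reproduced in miniature. The `writer` and `range_list` types here are hypothetical stand-ins, not the real gx_device_clist_writer layout.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a list header made of two pointers. */
typedef struct range_list {
    void *head;
    void *tail;
} range_list;

typedef struct writer {
    range_list list;     /* embedded in the device, as in the old layout */
    range_list *ccl;     /* set to point at 'list' of the same struct */
} writer;

void writer_init(writer *w)
{
    assert(w != NULL);
    w->list.head = NULL;
    w->list.tail = NULL;
    w->ccl = &w->list;
}

/* Byte-for-byte copy: 'ccl' is copied as-is and is NOT re-pointed. */
void writer_copy(writer *dst, const writer *src)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, sizeof *dst);
}

/* 1 if the internal pointer refers to this struct's own list. */
int points_into_self(const writer *w)
{
    assert(w != NULL);
    return w->ccl == &w->list;
}
```

After `writer_copy(&parent, &child)`, `parent.ccl` still refers to `child.list`, so freeing the child leaves the parent with a dangling pointer. Moving the list into storage that outlives both structs, as the commit does, avoids the problem.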
* Make gs_next_ids thread safe by using the core->monitor. | Ray Johnston | 2020-10-31 | 3 | -2/+24
|
| This cures data races with gs_next_ids seen with helgrind and BGPrint and/or NumRenderingThreads.
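The general pattern of guarding an id counter with a lock looks roughly like this. It is a sketch using a POSIX mutex in place of Ghostscript's monitor, and `reserve_ids`, `id_lock` and `next_id` are hypothetical names rather than the real gs_next_ids implementation.

```c
#include <assert.h>
#include <pthread.h>

/* A POSIX mutex stands in for the monitor mentioned in the commit. */
static pthread_mutex_t id_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long next_id = 1;

/* Reserve 'count' consecutive ids and return the first one. The read
 * and the update happen under the lock so that two threads can never
 * be handed overlapping ranges. */
unsigned long reserve_ids(unsigned int count)
{
    unsigned long first;

    assert(count > 0);
    pthread_mutex_lock(&id_lock);
    first = next_id;
    next_id += count;
    pthread_mutex_unlock(&id_lock);
    return first;
}
```

Without the lock, two threads could both read the same `next_id` before either writes it back, which is the kind of race a tool such as helgrind reports.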
* Pacify helgrind so that gs_heap_status is thread safe. | Ray Johnston | 2020-10-31 | 1 | -2/+8
|
| The 'heap_available' is documented as a 'snapshot', but is not a thread risk. This change locks around the collection of the info to be returned in the gs_memory_status structure.
* Fix problem with BGPrint and multi-threaded rendering caused by commit cca27988 | Ray Johnston | 2020-10-31 | 1 | -7/+6
|
| The unconditional call to enable multi-threaded rendering during set up of the gx_device_printer as a clist caused the SEGV of bug 702181, but enabling multi-threaded rendering here isn't correct since it is usually done when we output the page (gdev_prn_output_page_aux).
|
| The problem is that BGPrint would set up a new clist device to render the page, but needs to enable multi-threaded rendering for the background clist device. Do this if NumRenderingThreads > 0.
* Fix gp_file allocations to use thread_safe_memory. | Ray Johnston | 2020-10-31 | 1 | -4/+4
|
| The gpmisc.c does allocations for gp_file objects and buffers used by gp_fprintf, as well as gp_validate_path_len. The helgrind run with -dBGPrint -dNumRenderingThreads=4 and PCL input showed up the gp_fprintf problem since the clist rendering would call gp_fprintf using the same allocator (PCL's chunk allocator which is non_gc_memory). The chunk allocator is intentionally not thread safe (for performance).
* Bug 702671: Make sure X11 device cleans up with closure | Michael Vrhel | 2020-10-31 | 3 | -12/+37
|
| The issue is that the pdf14 device will close and reopen the target device under certain cases and the X11 devices were not cleaning themselves up sufficiently. Also added a finalize method where the call to XCloseDisplay should actually be made.
|
| The pdf14 device does this close and open dance to ensure that the page_has_transparency parameter will be set in the device. It is possible that we could end up in the pdf14 device push without page_has_transparency if the call is done from Postscript code. The PDF interpreter always sets the page_has_transparency value before doing the push so this should not be a problem with PDF.
|
| Thanks to Chris Liddell for helping with this.
* Add /Type /Outlines to Outlines entry for pdfwrite | Nancy Durgin | 2020-10-29 | 1 | -1/+1
|
| This is technically required, but appears to be harmless to omit it. However, since the fix is trivial, I have fixed it.
* Fix ios build script and headers | Chris Liddell | 2020-10-29 | 3 | -6/+26
|
| The predefined headers for the ios build were missing the size_t updates.
|
| We also don't want to try using CAL with ios (at least, for the moment).
* Fix an option typo: "nonredundnat" -> "nonredundant" | Chris Liddell | 2020-10-28 | 1 | -1/+1
|
* pdfwrite - fix Type 4 Chroma-keyed images with Colour Conversion | Ken Sharp | 2020-10-28 | 1 | -1/+65
|
| Bug #702698 "convert Grayscale error in output"
|
| Images with a /Mask where the Mask is a range of values, and hence a chroma-keyed image, as opposed to the Mask being an external Image Mask, and therefore a stencil, cannot be preserved as such when we are colour converting.
|
| There is no reliable way to be certain that the colour of the image samples after conversion and the converted Mask values will relate in the same way. It's entirely possible for multiple RGB values to map to the same gray value for instance, and if that happens to be a masked value then pixels will be masked which were not originally.
|
| This commit checks the ColorConversionStrategy and if it is not LeaveColorUnchanged then we further examine the Mask. If the Mask is a range of values then we consider the colour space of the image. If ColorConversionStrategy is not LeaveColorUnchanged and the image has a range of values and it is not in the target colour space, then we cannot preserve it.
|
| In this case we fall back to either preserving the image, and creating a clip path from the chroma key and the image samples, or (if Version < 1.3) we fall right back to writing each image sample as a filled rectangle. This does, of course, result in larger output.
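The "multiple RGB values map to the same gray value" point can be made concrete. The integer luminance weights below (77, 150, 29 out of 256) are an assumption chosen for illustration; they are not Ghostscript's actual colour conversion, and both function names are hypothetical.

```c
#include <assert.h>

/* Assumed integer luminance weights, roughly 0.30 / 0.59 / 0.11.
 * This is NOT Ghostscript's conversion, only an illustration. */
int to_gray(int r, int g, int b)
{
    assert(r >= 0 && r <= 255);
    assert(g >= 0 && g <= 255);
    assert(b >= 0 && b <= 255);
    return (77 * r + 150 * g + 29 * b) >> 8;
}

/* Chroma key on RGB: a sample is masked when every component lies
 * within its [lo, hi] range. */
int rgb_masked(const int rgb[3], const int lo[3], const int hi[3])
{
    int i;

    for (i = 0; i < 3; i++)
        if (rgb[i] < lo[i] || rgb[i] > hi[i])
            return 0;
    return 1;
}
```

With a key that masks near-pure red, (255, 0, 0) is masked and (0, 130, 0) is not. Under the assumed weights both convert to the same gray level, so a converted key covering that level would also mask the green sample, which was visible in the original.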
* Change default ShowAcroForm to true to match Adobe Acrobat. | Ray Johnston | 2020-10-23 | 3 | -38/+41
|
| Change it and the documentation. It may be that older Acrobat defaulted to ignoring AcroForm, but current Adobe doesn't.
|
| Also fix pdf_draw draw_form_field to match check in pdf_main process_trailer_attrs for the file from Bug 692447 which has null entries in Fields array.
|
| Fix indentation in pdf_main process_trailer_attrs area that processes AcroForm dict.
* Fix Bug 702995: Inconsistent auto-rotation of EPS with EPSFitPage. | Ray Johnston | 2020-10-23 | 1 | -8/+44
|
| We need to defer the EPSFitPage scaling, centering and rotation until after both the %%BoundingBox and/or the %%HiResBoundingBox have been processed. Doing one after the other results in slight rounding diffs depending on the resolution.
|
| Change to save the value (preferring the HiResBoundingBox) and do the scaling/translate/rotate only once when %%EndComments is processed.
* Fix bug 702951. Valgrind error in image_render_interpolate_landscape_icc | Ray Johnston | 2020-10-23 | 1 | -1/+1
|
| Make the same adjustment in image_render_interpolate_landscape_icc() as was done in image_render_interpolate_icc() by the commit a936cf76.
|
| Thanks to Peter Cherepanov for this bug report and the patch.
|
| Regression testing looks fine and the previous code looks like it would bump the p_cm_interp value twice in some cases.
* Revise font dir global_glyph_code callback API | Chris Liddell | 2020-10-23 | 3 | -5/+5
|
| The global_glyph_code callback API relies on the name table being available in the gs_lib_ctx and accessible via the gs_memory_t object.
|
| This is not something that is true, nor can we make it true, for pdfi. Because pdfi is required to integrate with the Postscript interpreter, we cannot have the gs_lib_ctx "top_of_system" pointer point to the pdfi context.
|
| So, change the global_glyph_code API so it takes a gs_font pointer, instead of a gs_memory_t pointer. The Postscript implementation will work exactly the same, just accessing the gs_memory_t through the gs_font pointer.
|
| But it allows pdfi to grab its own context through the gs_font "client_data" pointer.
* Update freetype to 2.10.4 | Chris Liddell | 2020-10-23 | 119 | -0/+89590
|
| Also includes:
|
| Work around a change in the zlib API for 1.2.11 where it's used in the Freetype/zlib interface debugging code.
* Add error check on gx_device_text_begin return_error | Michael Vrhel | 2020-10-22 | 1 | -1/+1
|
| gx_device_text_begin may return an error if there is still work to be done on a pattern device. Catch that before the check on *ppte, which will not be a valid value in this case.
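The rule being applied, check the return code before looking at the out-parameter, can be sketched as below. `begin_text`, `start_text` and `text_enum` are hypothetical stand-ins and do not reproduce the signature of gx_device_text_begin.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical enumerator type. */
typedef struct text_enum {
    int index;
} text_enum;

/* Hypothetical stand-in for a "begin" call: on failure it returns a
 * negative code and leaves *ppte untouched. */
int begin_text(int fail, text_enum *storage, text_enum **ppte)
{
    assert(storage != NULL && ppte != NULL);
    if (fail)
        return -1;
    storage->index = 0;
    *ppte = storage;
    return 0;
}

/* Caller: the error code is tested first, so the out-parameter is only
 * read once the call is known to have set it. */
int start_text(int fail, text_enum *storage)
{
    text_enum *pte = NULL;
    int code = begin_text(fail, storage, &pte);

    if (code < 0)
        return code;          /* propagate before touching pte */
    return pte->index;
}
```

Reversing the two checks would read `pte` on the failure path, where it holds no meaningful value.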
* Fix typos in comments of zugferd program | Ken Sharp | 2020-10-22 | 1 | -11/+11
|
| As (mostly) spotted by Lisa Fenn, fix typos in comments and messages.
* lcms2: automatically align blocks appropriately on sparc. | Robin Watts | 2020-10-21 | 1 | -1/+5
|
| The sparc architecture requires pointers within structures to be at 64bit alignment, even if they are 32bit pointers.
|
| LCMS2 allows for this by having a CMS_PTR_ALIGNMENT symbol that can be predefined. If it's not predefined, it defaults to sizeof(void *).
|
| We update the source here so that when building for sparc, it defaults to 8. This shouldn't affect gs, as it sets the value via configure/make. I believe our lcms2 repo as used in MuPDF is autogenerated from this though, and this will help us there.
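A default of this kind can be sketched with the preprocessor. CMS_PTR_ALIGNMENT and its sizeof(void *) default come from the description above; the `__sparc__` / `__sparc` tests are assumed compiler macros, `align_up` is a hypothetical helper, and the actual lcms2 change may be written differently.

```c
#include <assert.h>
#include <stddef.h>

/* Only choose a default when the build has not predefined the symbol. */
#ifndef CMS_PTR_ALIGNMENT
#  if defined(__sparc__) || defined(__sparc)   /* assumed compiler macros */
#    define CMS_PTR_ALIGNMENT 8
#  else
#    define CMS_PTR_ALIGNMENT sizeof(void *)
#  endif
#endif

/* Round a size up to the next multiple of CMS_PTR_ALIGNMENT, which is
 * assumed to be a power of two. */
size_t align_up(size_t n)
{
    size_t a = (size_t)CMS_PTR_ALIGNMENT;

    assert(a != 0 && (a & (a - 1)) == 0);
    return (n + a - 1) & ~(a - 1);
}
```

Because the symbol is only defined when absent, a build that passes its own value (as the commit says gs does via configure/make) is unaffected.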
* Correct OCR docs for multiple languages. | Robin Watts | 2020-10-20 | 1 | -1/+1
|
* Bug 702985: drop use of FT_CALLBACK_DEF() def | Chris Liddell | 2020-10-20 | 1 | -3/+3
|
| From 2.10.3, Freetype disappeared the FT_CALLBACK_DEF() macro, which is what we used when defining our callbacks from Freetype.
|
| No guidance forthcoming from the Freetype developer who made those changes, so change to explicitly declaring the callbacks file static.
|
| Should fix the reported build failures.
* devices/vector/gdevtxtw.c: updated extract output to match mupdf. | Julian Smith | 2020-10-19 | 1 | -18/+86
|
| Text extraction now works for Python2.pdf and zlib.3.pdf.
|
| Added GlyphWidths[] and SpanDeltaX[] arrays, containing information required for generating intermediate format used by extract system.
* doc/VectorDevices.htm: added brief info about txtwrite TextFormat=4. | Julian Smith | 2020-10-19 | 1 | -1/+2
|
* Fix indeterminisms within halftoned rendering. | Robin Watts | 2020-10-19 | 1 | -35/+27
|
| When checking for pdf14 being in an opaque state, we check to see whether we are either constructing an SMask or are within an SMask. We need to use a dev spec op for this, as we might be within a clipper device, and so not actually be directly a pdf14 device.
* New file - program to assist in creating ZUGFeRD electronic invoices | Ken Sharp | 2020-10-19 | 1 | -0/+316
|
| While documenting the process for creating a ZUGFeRD invoice from a PDF file and an XML invoice it became clear to me that it was beyond any reasonable expectation of a user to be able to use it unaided. So this program assists in the creation of a ZUGFeRD document.
|
| The program requires two command line parameters: -sZUGFeRDProfile=, which contains a fully qualified path to an appropriate (correct colour space) ICC profile, and -sZUGFeRDXMLFile=, which contains a fully qualified path to the XML invoice file.
|
| An example command line is in the comments, and there is a usage message if the user fails to provide either of the required elements. Obviously the user must also set -dPDFA=3 and -sColorConversionStrategy in order to create a valid PDF/A-3b file.