author | Ken Sharp <ken.sharp@artifex.com> | 2020-06-25 13:56:08 +0100 |
---|---|---|
committer | Robin Watts <Robin.Watts@artifex.com> | 2020-11-12 12:13:57 +0000 |
commit | 75e260886473a74a8d8ec133b0bae9fdee30818b (patch) | |
tree | 594ca2d2d64cc6329d2a109cf8201ea6cd74c071 /base/gsparaml.c | |
parent | 8aa60c55d789c9c4d9e600162a1233a2da7ba516 (diff) | |
download | ghostpdl-75e260886473a74a8d8ec133b0bae9fdee30818b.tar.gz |
Squashed commit of pdfwrite_ocr branch.
This introduces OCR operation to the pdfwrite device.
The full development history can be seen on the pdfwrite_ocr branch.
The list of individual commits is as follows:
--------------------------------------------
Interim commit for pdfwrite+OCR
This is the initial framework for pdfwrite to send a bitmap of a glyph
to an OCR engine in order to generate a Unicode code point for it.
This code must not be used as-is, in particular it prevents the function
gs_font_map_glyph_to_unicode from functioning properly in the absence
of OCR software, and the connection between pdfwrite and the OCR engine
is not present.
We need to add either compile-time or run-time detection of an OCR
engine and only use one if present, as well as some control to decide
when to use OCR. We might always use OCR, or only when there is no
Glyph2Unicode dictionary available, or simply when all other fallbacks
fail.
--------------------------------------------
Hook Tesseract up to pdfwrite.
--------------------------------------------
More work on pdfwrite + OCR
Reset the stage of the state machine after processing a returned value
Set the unicode value used by the ToUnicode processing from the value
returned by OCR.
Much more complex than previously thought; process_text_return_width()
processes all the contents of the text in the enumerator on the first
pass, because it's trying to decide if we can use a fast case (all
widths are default) or not.
This means that if we want to jump out and OCR a glyph, we need to track
which index in the string process_text_return_width was dealing with,
rather than the text enumerator index. Fortunately we are already
using a copy of the enumerator to run the glyph, so we simply need
to capture the index and set the copied enumerator index from it.
--------------------------------------------
Tweak Tesseract build to include legacy engine.
Actually making use of the legacy engine requires a different set
of eng.traineddata be used, and for the engine to be selected away
from LSTM.
Suitable traineddata can be found here, for instance (open the link,
and click the download button):
https://github.com/tesseract-ocr/tessdata/blob/master/eng.traineddata
--------------------------------------------
Add OCRLanguage and OCREngine parameters to pdfwrite.
--------------------------------------------
Add gs_param_list_dump() debug function.
--------------------------------------------
Improve use of pdfwrite with OCR
Rework the pdfwrite OCR code extensively in order to create a large
'strip' bitmap from a number of glyphs in a single call to the text
routine. The hope is that this will provide better context for
Tesseract and improved recognition.
Due to the way that text enumerators work, and the requirement to exit
to the PostScript interpreter in order to render glyph bitmaps, I've had
to abandon efforts to run as a fully 'on demand' system. We can't wait
until we find a glyph with no Unicode value and then try to render all
the glyphs up to that point (and all the following ones as well). It is
probably possible to do this but it would mean rewriting the text
processing code which is quite hideous enough as it is.
So now we render each glyph in the text string, and store them in a
linked list. When we're done with the text we free the memory. If we
find a glyph with no Unicode value then on the first pass we take the
list of glyphs, create a big bitmap from them and send it to Tesseract.
That should then return all the character codes, which we keep. On
subsequent missing Unicode values we consult the stored list.
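
A minimal sketch of the list-then-strip idea described above, not the code
pdfwrite actually uses. The names glyph_node, glyph_list_append,
glyph_list_to_strip and glyph_list_free are hypothetical, and the bitmaps
are assumed to be one byte per pixel, row-major, laid side by side and
top-aligned:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical types: not the structures pdfwrite actually uses. */
typedef struct glyph_node {
    int width, height;          /* bitmap size in pixels */
    unsigned char *pixels;      /* width * height bytes, row-major, 1 = ink */
    struct glyph_node *next;
} glyph_node;

/* Append a copy of a rendered glyph to the end of the list.
 * Returns 0 on success, -1 on allocation failure. */
static int glyph_list_append(glyph_node **head, int width, int height,
                             const unsigned char *pixels)
{
    glyph_node *node = malloc(sizeof(*node));
    glyph_node **tail = head;

    assert(width > 0 && height > 0);
    if (node == NULL)
        return -1;
    node->pixels = malloc((size_t)width * height);
    if (node->pixels == NULL) {
        free(node);
        return -1;
    }
    memcpy(node->pixels, pixels, (size_t)width * height);
    node->width = width;
    node->height = height;
    node->next = NULL;
    while (*tail != NULL)
        tail = &(*tail)->next;
    *tail = node;
    return 0;
}

/* Lay the glyphs side by side, top-aligned, in one zero-filled bitmap.
 * Returns NULL for an empty list or on allocation failure; the caller
 * frees the result. */
static unsigned char *glyph_list_to_strip(const glyph_node *head,
                                          int *strip_w, int *strip_h)
{
    const glyph_node *g;
    unsigned char *strip;
    int w = 0, h = 0, x = 0, row;

    for (g = head; g != NULL; g = g->next) {
        w += g->width;
        if (g->height > h)
            h = g->height;
    }
    if (w == 0)
        return NULL;
    strip = calloc((size_t)w * h, 1);
    if (strip == NULL)
        return NULL;
    for (g = head; g != NULL; g = g->next) {
        for (row = 0; row < g->height; row++)
            memcpy(strip + (size_t)row * w + x,
                   g->pixels + (size_t)row * g->width, (size_t)g->width);
        x += g->width;
    }
    *strip_w = w;
    *strip_h = h;
    return strip;
}

/* Free every node once the text operation is finished. */
static void glyph_list_free(glyph_node *head)
{
    while (head != NULL) {
        glyph_node *next = head->next;

        free(head->pixels);
        free(head);
        head = next;
    }
}
```

In this sketch the strip would be handed to the OCR engine once, on the
first glyph that lacks a Unicode value, and the list freed when the text
operation completes.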
We need to deal specially with space glyphs (which make no marks) as
Tesseract (clearly!) can't spot those.
Modify makefile (devs.mak) so that we have a preprocessor flag we can
use for conditional compilation. Currently OCR_VERSION is 0 for no OCR,
1 for Tesseract, there may be higher numbers in future.
Add a new function to the OCR interface to process and return multiple
glyphs at once from a bitmap. Don't delete the code for a single bitmap
because we may want to use that in future enhancements.
If we don't get the expected number of characters back from the OCR
engine then we currently abort the processing. Future enhancements:
fall back to using a single bitmap instead of a strip of text; if we get
*more* characters than expected, check for ligatures (fi, ffi etc).
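
A rough sketch of the count check and the space handling, under stated
assumptions: the type text_glyph and the function assign_ocr_results are
hypothetical, and mapping non-marking glyphs to U+0020 is an assumption of
this example, since the text above does not say which value spaces get:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record for one glyph of a text operation. */
typedef struct {
    int is_space;           /* nonzero if the glyph makes no marks */
    unsigned int unicode;   /* filled in by assign_ocr_results() */
} text_glyph;

/* Distribute the OCR results over the glyphs in order. Space glyphs were
 * never shown to the engine, so they are skipped when counting and are
 * given U+0020 (an assumption of this sketch). Returns 0 on success, or
 * -1 without touching the glyphs if the engine returned a different
 * number of characters than there are marking glyphs. */
static int assign_ocr_results(text_glyph *glyphs, size_t n_glyphs,
                              const unsigned int *ocr, size_t n_ocr)
{
    size_t i, marking = 0, next = 0;

    assert(glyphs != NULL || n_glyphs == 0);
    for (i = 0; i < n_glyphs; i++)
        if (!glyphs[i].is_space)
            marking++;
    if (marking != n_ocr)
        return -1;              /* mismatch: abort, as described above */
    for (i = 0; i < n_glyphs; i++)
        glyphs[i].unicode = glyphs[i].is_space ? 0x20 : ocr[next++];
    return 0;
}
```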
Even if we've already seen a glyph, if we have not yet assigned it a
Unicode value then attempt to OCR it. So if we fail a character in one
place we may be able to recognise it in another. This requires new code
in gsfcmap.c to determine if we have a Unicode code point assigned.
Make all the relevant code, especially the params code, only compile
if OCR is enabled (Tesseract and Leptonica present and built).
Remove some debugging print code.
Add documentation
--------------------------------------------
Remove vestiges of earlier OCR attempt
Trying to identify each glyph bitmap individually didn't work as well
and is replaced by the new 'all the characters in the text operation'
approach. There were a few vestiges of the old approach lying around
and one of them was causing problems when OCR was not enabled. Remove
all of that cruft here.
Diffstat (limited to 'base/gsparaml.c')
-rw-r--r-- | base/gsparaml.c | 28 |
1 files changed, 28 insertions, 0 deletions
diff --git a/base/gsparaml.c b/base/gsparaml.c
index d7e5fcdbf..3d7e49b14 100644
--- a/base/gsparaml.c
+++ b/base/gsparaml.c
@@ -1046,3 +1046,31 @@ int gs_param_list_to_string(gs_param_list *plist, gs_param_name key, char *value
     *value = 0;
     return to_string(plist, key, &out);
 }
+
+int gs_param_list_dump(gs_param_list *plist)
+{
+    gs_param_enumerator_t enumerator;
+    gs_param_key_t key;
+    int code;
+    char buffer[4096];
+    int len;
+
+    param_init_enumerator(&enumerator);
+    while ((code = param_get_next_key(plist, &enumerator, &key)) == 0) {
+        char string_key[256]; /* big enough for any reasonable key */
+
+        if (key.size > sizeof(string_key) - 1) {
+            code = gs_note_error(gs_error_rangecheck);
+            break;
+        }
+        memcpy(string_key, key.data, key.size);
+        string_key[key.size] = 0;
+        dlprintf1("%s ", string_key);
+        code = gs_param_list_to_string(plist, string_key, buffer, &len);
+        if (code < 0)
+            break;
+        dlprintf1("%s ", buffer);
+    }
+    dlprintf("\n");
+    return code;
+}
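
The bounded key copy in the loop above can be shown in isolation. This is
a standalone sketch, not Ghostscript code: counted_key and key_to_cstring
are hypothetical stand-ins for a counted (not NUL-terminated) key and the
range check that precedes the memcpy:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a counted key: data is not NUL-terminated. */
typedef struct {
    const char *data;
    size_t size;
} counted_key;

/* Copy a counted key into a NUL-terminated buffer. Returns 0 on success,
 * or -1 if the key plus its terminator does not fit in out_size bytes. */
static int key_to_cstring(const counted_key *key, char *out, size_t out_size)
{
    assert(key != NULL && out != NULL);
    if (out_size == 0 || key->size > out_size - 1)
        return -1;
    memcpy(out, key->data, key->size);
    out[key->size] = '\0';
    return 0;
}
```

Rejecting an oversized key before copying keeps the terminator write inside
the buffer, which is the same reason the function in the diff checks
key.size against sizeof(string_key) - 1.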