Squashed commit of pdfwrite_ocr branch.

This introduces OCR operation to the pdfwrite device.

The full development history can be seen on the pdfwrite_ocr branch.

The list of individual commits is as follows:

--------------------------------------------
Interim commit for pdfwrite+OCR

This is the initial framework for pdfwrite to send a bitmap of a glyph to an OCR engine in order to generate a Unicode code point for it.

This code must not be used as-is. In particular, it prevents the function gs_font_map_glyph_to_unicode from functioning properly in the absence of OCR software, and the connection between pdfwrite and the OCR engine is not present.

We need to add either compile-time or run-time detection of an OCR engine and only use one if present, as well as some control to decide when to use OCR. We might always use OCR, or only when there is no Glyph2Unicode dictionary available, or simply when all other fallbacks fail.

--------------------------------------------
Hook Tesseract up to pdfwrite.

--------------------------------------------
More work on pdfwrite + OCR

Reset the stage of the state machine after processing a returned value.

Set the unicode value used by the ToUnicode processing from the value returned by OCR.

Much more complex than previously thought; process_text_return_width() processes all the contents of the text in the enumerator on the first pass, because it's trying to decide if we can use a fast case (all widths are default) or not.

This means that if we want to jump out and OCR a glyph, we need to track which index in the string process_text_return_width() was dealing with, rather than the text enumerator index. Fortunately we are already using a copy of the enumerator to run the glyph, so we simply need to capture the index and set the copied enumerator index from it.

--------------------------------------------
Tweak Tesseract build to include legacy engine.
Actually making use of the legacy engine requires a different eng.traineddata to be used, and the engine selection to be switched away from LSTM. Suitable traineddata can be found here, for instance (open the link, and click the download button):

https://github.com/tesseract-ocr/tessdata/blob/master/eng.traineddata

--------------------------------------------
Add OCRLanguage and OCREngine parameters to pdfwrite.

--------------------------------------------
Add gs_param_list_dump() debug function.

--------------------------------------------
Improve use of pdfwrite with OCR

Rework the pdfwrite OCR code extensively in order to create a large 'strip' bitmap from a number of glyphs in a single call to the text routine. The hope is that this will provide better context for Tesseract and improved recognition.

Due to the way that text enumerators work, and the requirement to exit to the PostScript interpreter in order to render glyph bitmaps, I've had to abandon efforts to run as a fully 'on demand' system. We can't wait until we find a glyph with no Unicode value and then try to render all the glyphs up to that point (and all the following ones as well). It is probably possible to do this but it would mean rewriting the text processing code, which is quite hideous enough as it is.

So now we render each glyph in the text string, and store them in a linked list. When we're done with the text we free the memory. If we find a glyph with no Unicode value then on the first pass we take the list of glyphs, create a big bitmap from them and send it to Tesseract. That should then return all the character codes, which we keep. On subsequent missing Unicode values we consult the stored list.

We need to deal specially with space glyphs (which make no marks) as Tesseract (clearly!) can't spot those.

Modify makefile (devs.mak) so that we have a preprocessor flag we can use for conditional compilation.
Currently OCR_VERSION is 0 for no OCR and 1 for Tesseract; there may be higher numbers in future.

Add a new function to the OCR interface to process and return multiple glyphs at once from a bitmap. Don't delete the code for a single bitmap because we may want to use that in future enhancements.

If we don't get the expected number of characters back from the OCR engine then we currently abort the processing. Future enhancements: fall back to using a single bitmap instead of a strip of text; if we get *more* characters than expected, check for ligatures (fi, ffi etc).

Even if we've already seen a glyph, if we have not yet assigned it a Unicode value then attempt to OCR it. So if we fail a character in one place we may be able to recognise it in another. This requires new code in gsfcmap.c to determine if we have a Unicode code point assigned.

Make all the relevant code, especially the params code, only compile if OCR is enabled (Tesseract and Leptonica present and built).

Remove some debugging print code.

Add documentation

--------------------------------------------
Remove vestiges of earlier OCR attempt

Trying to identify each glyph bitmap individually didn't work as well and is replaced by the new 'all the characters in the text operation' approach. There were a few vestiges of the old approach lying around and one of them was causing problems when OCR was not enabled. Remove all of that cruft here.
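
The 'strip' approach described above can be sketched in isolation as follows. This is only an illustration, not the pdfwrite code: glyph_node, ocr_strip_fn, ocr_resolve_list, fake_ocr and MAX_STRIP are all hypothetical names, and the OCR engine is replaced by a stub. It shows a single OCR call over all marking glyphs in a list, special handling of space glyphs, an abort when the returned count does not match, and results cached in the list for later lookups.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one rendered glyph of a text operation. */
typedef struct glyph_node {
    int is_space;            /* space glyphs make no marks, so OCR cannot see them */
    unsigned int unicode;    /* 0 means "no Unicode value known yet" */
    struct glyph_node *next;
} glyph_node;

/* Hypothetical OCR hook: recognise a strip of n marking glyphs and write
 * up to max code points to out. Returns the number of code points found. */
typedef int (*ocr_strip_fn)(int n, unsigned int *out, int max);

#define MAX_STRIP 64         /* arbitrary limit for this sketch */

/* Stub engine for demonstration only: "recognises" A, B, C, ... */
int fake_ocr(int n, unsigned int *out, int max)
{
    int i;

    for (i = 0; i < n && i < max; i++)
        out[i] = (unsigned int)('A' + i);
    return i;
}

/* Run OCR once over the whole list and cache the results in the nodes.
 * Returns 0 on success, -1 if the strip is too long or the engine
 * returns an unexpected number of characters. */
int ocr_resolve_list(glyph_node *head, ocr_strip_fn ocr)
{
    unsigned int codes[MAX_STRIP];
    int marking = 0, got, i = 0;
    glyph_node *g;

    assert(ocr != NULL);
    for (g = head; g != NULL; g = g->next)
        if (!g->is_space)
            marking++;
    if (marking > MAX_STRIP)
        return -1;
    got = ocr(marking, codes, MAX_STRIP);
    if (got != marking)
        return -1;                      /* count mismatch: abort */
    for (g = head; g != NULL; g = g->next) {
        if (g->is_space) {
            g->unicode = 0x20;          /* handled specially, never sent to OCR */
        } else {
            if (g->unicode == 0)        /* keep values we already had */
                g->unicode = codes[i];
            i++;
        }
    }
    return 0;
}
```

After one successful call, later glyphs with a missing Unicode value can be answered by walking the list rather than calling the engine again.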
author: Ken Sharp <ken.sharp@artifex.com> 2020-06-25 13:56:08 +0100
committer: Robin Watts <Robin.Watts@artifex.com> 2020-11-12 12:13:57 +0000
commit: 75e260886473a74a8d8ec133b0bae9fdee30818b (patch)
tree: 594ca2d2d64cc6329d2a109cf8201ea6cd74c071 /base/gsparaml.c
parent: 8aa60c55d789c9c4d9e600162a1233a2da7ba516 (diff)
download: ghostpdl-75e260886473a74a8d8ec133b0bae9fdee30818b.tar.gz
1 files changed, 28 insertions, 0 deletions
diff --git a/base/gsparaml.c b/base/gsparaml.c
index d7e5fcdbf..3d7e49b14 100644
--- a/base/gsparaml.c
+++ b/base/gsparaml.c
@@ -1046,3 +1046,31 @@ int gs_param_list_to_string(gs_param_list *plist, gs_param_name key, char *value
         *value = 0;
     return to_string(plist, key, &out);
 }
+
+int gs_param_list_dump(gs_param_list *plist)
+{
+    gs_param_enumerator_t enumerator;
+    gs_param_key_t key;
+    int code;
+    char buffer[4096];
+    int len;
+
+    param_init_enumerator(&enumerator);
+    while ((code = param_get_next_key(plist, &enumerator, &key)) == 0) {
+        char string_key[256];	/* big enough for any reasonable key */
+
+        if (key.size > sizeof(string_key) - 1) {
+            code = gs_note_error(gs_error_rangecheck);
+            break;
+        }
+        memcpy(string_key, key.data, key.size);
+        string_key[key.size] = 0;
+        dlprintf1("%s ", string_key);
+        code = gs_param_list_to_string(plist, string_key, buffer, &len);
+        if (code < 0)
+            break;
+        dlprintf1("%s ", buffer);
+    }
+    dlprintf("\n");
+    return code;
+}
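
The key handling in gs_param_list_dump() above relies on copying a counted key (bytes plus a size, with no terminating NUL) into a fixed buffer before printing it. A minimal standalone sketch of that pattern, assuming a hypothetical counted_key type that only mirrors the data and size fields used in the diff, and a hypothetical helper key_to_cstring():

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a counted key: bytes plus a length, no trailing NUL. */
typedef struct counted_key {
    const unsigned char *data;
    unsigned int size;
} counted_key;

/* Copy key into dst and NUL-terminate it.
 * Returns 0 on success, -1 if the key does not fit (leaving dst untouched). */
int key_to_cstring(const counted_key *key, char *dst, size_t dst_size)
{
    assert(key != NULL && dst != NULL);
    if (dst_size == 0 || key->size > dst_size - 1)
        return -1;                      /* reserve one byte for the terminator */
    memcpy(dst, key->data, key->size);
    dst[key->size] = '\0';
    return 0;
}
```

The size check comes before the copy, so an oversized key is rejected rather than truncated, which matches the rangecheck error path in the function above.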