PDF OCR results review form

We are processing some PDFs with AWS Textract. The data it outputs is great, but we need to run it past a human for review, especially the low-confidence items. What I am looking to make here is a quick utility that renders a PDF on the left and the recognized form on the right so that a human can go through and review it. AWS takes a similar approach in their online demo, which looks like this: https://www.dropbox.com/s/mz017b2foxdnmv3/2021-06-21_11-39-18.png?dl=0 My feeling is that we want to use the results JSON to build the form on the right, and use the coordinates to overlay the bounding boxes on the PDF on the left. Unlike the AWS demo, we want to be able to overwrite values/inputs and then dump out a new JSON file in the same format as the original input.
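The overwrite-and-dump step could be sketched roughly as below. This assumes the results JSON has the usual Textract shape (a top-level `Blocks` list where `KEY_VALUE_SET` blocks with `EntityTypes` of `VALUE` point at `WORD` children through `CHILD` relationships). The function name `overwrite_value` is hypothetical, and writing the whole corrected string into the first `WORD` while blanking the others is only one possible convention, not something Textract prescribes.

```python
import copy
import json


def overwrite_value(doc, value_block_id, new_text):
    """Return a copy of the results JSON with one VALUE block's text replaced."""
    edited = copy.deepcopy(doc)  # leave the original input untouched
    by_id = {block["Id"]: block for block in edited["Blocks"]}
    value_block = by_id[value_block_id]

    # Collect the WORD blocks that make up this value.
    words = [
        by_id[child_id]
        for rel in value_block.get("Relationships", [])
        if rel["Type"] == "CHILD"
        for child_id in rel["Ids"]
        if by_id[child_id]["BlockType"] == "WORD"
    ]
    if not words:
        raise ValueError("VALUE block has no WORD children to overwrite")

    # Assumed convention: the reviewer's text goes into the first WORD,
    # and the remaining WORD blocks are blanked so the structure is kept.
    words[0]["Text"] = new_text
    for word in words[1:]:
        word["Text"] = ""
    return edited


# Tiny hand-made sample in the assumed Textract shape.
sample = {
    "Blocks": [
        {
            "Id": "v1",
            "BlockType": "KEY_VALUE_SET",
            "EntityTypes": ["VALUE"],
            "Relationships": [{"Type": "CHILD", "Ids": ["w1", "w2"]}],
        },
        {"Id": "w1", "BlockType": "WORD", "Text": "Jhon", "Confidence": 61.2},
        {"Id": "w2", "BlockType": "WORD", "Text": "Doe", "Confidence": 98.7},
    ]
}

edited = overwrite_value(sample, "v1", "John Doe")
new_json = json.dumps(edited, indent=2)  # same layout as the input, ready to save
```

Because every block keeps its `Id`, `Geometry`, and relationships, the dumped file should still load anywhere the original results JSON did.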

This bounty is an attempt to get something like what's described above. I know it likely won't, or can't reasonably, be perfect, but it should be something we can take and extend. I'm mostly focused on rendering the bounding boxes on top of the PDF via pdfjs or similar, and then rendering the input form based on the relationships implied in the results JSON.
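As a rough illustration of reading those relationships, the sketch below pairs each `KEY` block with its `VALUE` block and returns one entry per form input, lowest confidence first. It assumes the standard Textract block layout (`KEY_VALUE_SET` blocks, `VALUE` and `CHILD` relationships, `Geometry.BoundingBox`). The helper names `block_text` and `form_fields` are hypothetical, and the logic would need porting to JavaScript for a browser UI.

```python
def block_text(block, by_id):
    """Join the text of a block's WORD / SELECTION_ELEMENT children."""
    parts = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = by_id[child_id]
            if child["BlockType"] == "WORD":
                parts.append(child["Text"])
            elif child["BlockType"] == "SELECTION_ELEMENT":
                parts.append(child["SelectionStatus"])  # checkbox state
    return " ".join(parts)


def form_fields(doc):
    """Return one dict per key/value pair, sorted by ascending confidence."""
    by_id = {block["Id"]: block for block in doc["Blocks"]}
    fields = []
    for block in doc["Blocks"]:
        is_key = (
            block["BlockType"] == "KEY_VALUE_SET"
            and "KEY" in block.get("EntityTypes", [])
        )
        if not is_key:
            continue
        for rel in block.get("Relationships", []):
            if rel["Type"] != "VALUE":
                continue
            for value_id in rel["Ids"]:
                value = by_id[value_id]
                fields.append({
                    "label": block_text(block, by_id),
                    "value": block_text(value, by_id),
                    "value_id": value_id,  # lets the form map edits back to the JSON
                    "confidence": min(block["Confidence"], value["Confidence"]),
                    "box": value["Geometry"]["BoundingBox"],
                })
    return sorted(fields, key=lambda field: field["confidence"])


# Tiny hand-made sample in the assumed Textract shape.
box = {"Left": 0.1, "Top": 0.2, "Width": 0.3, "Height": 0.05}
sample = {
    "Blocks": [
        {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
         "Confidence": 92.0, "Geometry": {"BoundingBox": box},
         "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                           {"Type": "CHILD", "Ids": ["w1"]}]},
        {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
         "Confidence": 55.5, "Geometry": {"BoundingBox": box},
         "Relationships": [{"Type": "CHILD", "Ids": ["w2", "w3"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "Name:"},
        {"Id": "w2", "BlockType": "WORD", "Text": "Jane"},
        {"Id": "w3", "BlockType": "WORD", "Text": "Roe"},
    ]
}

fields = form_fields(sample)
```

Each returned entry carries enough to render a labelled input on the right and to highlight the matching box on the left. Sorting by confidence puts the items most in need of review at the top.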

Please post your result to repl.it so we can easily test.

Here is a sample PDF and JSON output to work with.

Re: UI, I think sticking to vanilla Bootstrap is probably a good idea here. The AWS UI layout is similarly just Bootstrap, so I don't see why we shouldn't follow it.

Let me know if you have any questions.

Note: it might be tough to render things on top of the PDF entirely client-side. If we need to convert the PDF to an image, I think it's fine for your code to make that assumption. I just need to know what size to convert to so the coordinates still line up with what we get in the results.
Qdev 7 months ago
Found this for dealing with the polygons. It seems the coordinates are ratios of the page size rather than pixels: https://docs.aws.amazon.com/textract/latest/dg/text-location.html
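If that reading is right, the rendered image size should not matter: each value just gets multiplied by the width or height of whatever canvas or image the page is drawn on. A minimal sketch, assuming the `BoundingBox` (`Left`, `Top`, `Width`, `Height`) and `Polygon` (`X`, `Y`) fields described on that page, with the origin at the top-left corner; the function names are hypothetical:

```python
def bbox_to_pixels(bbox, page_width, page_height):
    """Scale a ratio-based BoundingBox to (left, top, width, height) in pixels."""
    return (
        round(bbox["Left"] * page_width),
        round(bbox["Top"] * page_height),
        round(bbox["Width"] * page_width),
        round(bbox["Height"] * page_height),
    )


def polygon_to_pixels(polygon, page_width, page_height):
    """Scale a list of ratio-based {X, Y} points to pixel (x, y) tuples."""
    return [
        (round(point["X"] * page_width), round(point["Y"] * page_height))
        for point in polygon
    ]


# Example: a page rendered at 800 x 1000 pixels.
rect = bbox_to_pixels(
    {"Left": 0.25, "Top": 0.5, "Width": 0.5, "Height": 0.125}, 800, 1000
)
corners = polygon_to_pixels(
    [{"X": 0.25, "Y": 0.5}, {"X": 0.75, "Y": 0.5}], 800, 1000
)
print(rect)     # (200, 500, 400, 125)
print(corners)  # [(200, 500), (600, 500)]
```

The same multiplication works for a pdfjs canvas, as long as the width and height passed in are those of the rendered page rather than the browser window.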
Qdev 7 months ago


0 Solutions