Find paragraphs in an image

Looking for some code to find text blocks in an image and output a separate image for each block found, moving top to bottom then left to right. For example, this image http://imgur.com/IRvC6HP should split into these 3: http://imgur.com/WdeIJVD , http://imgur.com/617pvYO , and http://imgur.com/3Riexu7 .
Then it creates a folder called "Used" and moves all original images into it. Output images should be named filename_p1.format, filename_p2.format, etc. Output should be the same DPI as the input (no shrinking or enlarging, and no loss of image quality).
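As a rough illustration of the block-finding step, here is a minimal sketch, not a finished solution. It assumes the page has already been binarized into a grid of 0/1 values (1 = ink), which in practice would need an imaging library to load and threshold the scan. The function name `find_text_blocks` and the `gap` parameter are made up for this example. It groups ink pixels that lie close together and returns one bounding box per group, sorted by top edge and then left edge, which is only one possible reading of "top to bottom then left to right".

```python
from collections import deque

def find_text_blocks(grid, gap=2):
    """Return bounding boxes (top, left, bottom, right) of ink clusters.

    grid: list of equal-length rows, 1 = ink, 0 = background.
    gap:  two ink pixels join the same block when their row offset and
          column offset are both at most `gap` (an assumed tuning knob).
    """
    height = len(grid)
    width = len(grid[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for r in range(height):
        for c in range(width):
            if not grid[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited ink pixel.
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny in range(max(0, y - gap), min(height, y + gap + 1)):
                    for nx in range(max(0, x - gap), min(width, x + gap + 1)):
                        if grid[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    # Tuples sort by top first, then left.
    return sorted(boxes)
```

Cropping each box out of the original, unscaled image and saving it losslessly would keep the output at the input resolution. A pure-Python flood fill like this will be slow on full-size scans, so it is meant to show the idea rather than to be used as is.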

The code should operate on every image in the selected folder sequentially.
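The folder handling could look something like the sketch below, using only the Python standard library. The names `block_output_path`, `process_folder`, and `IMAGE_SUFFIXES` are hypothetical, and `split_image` is a placeholder callback standing in for whatever actually finds and writes the blocks. The list of image extensions is also an assumption.

```python
from pathlib import Path
import shutil

# Assumed set of extensions to treat as input images.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}

def block_output_path(source, index):
    """Map filename.ext to filename_p<index>.ext in the same folder."""
    source = Path(source)
    return source.with_name(f"{source.stem}_p{index}{source.suffix}")

def process_folder(folder, split_image):
    """Call split_image on each image in name order, then archive it.

    split_image(path) is a hypothetical callback that writes the block
    images for `path`. Originals are moved into a "Used" subfolder.
    Returns the names of the processed originals.
    """
    folder = Path(folder)
    used = folder / "Used"
    used.mkdir(exist_ok=True)
    # Snapshot the inputs first so freshly written blocks are not picked up.
    images = sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
    for image in images:
        split_image(image)
        shutil.move(str(image), str(used / image.name))
    return [p.name for p in images]
```

Note that running this a second time on the same folder would treat the earlier _p1, _p2 outputs as new inputs, so a real tool would probably want to write the blocks somewhere else or skip files that already carry the suffix.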
Tip conditions for selected solution:

  • Add code at the front end to convert from PDF to images (some examples might use the following: http://codetheory.in/convert-split-pdf-files-into-images-with-imagemagick-and-ghostscript/ ). The process would then be: a PDF is input; a folder named filename_images is created in the file's parent folder; each page of the PDF is split into an individual JPG named filename1.jpg, filename2.jpg, filename3.jpg, etc.; then the process outlined at the start of the bounty executes.

  • Correct the orientation of text blocks. For example, this http://imgur.com/617pvYO becomes this http://imgur.com/MWy8nWg

  • Separate executable to merge all images back into PDF in name order with 1 inch open space margin between them.
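For the merge step, two details are easy to get wrong: "name order" and the size of the margin. The sketch below is only one interpretation. It assumes the images are stacked vertically on a single canvas, that a 1-inch margin equals `dpi` pixels, and that filename_p2 should come before filename_p10 (natural ordering). The helper names `natural_key` and `stack_layout` are invented for this example, and the actual PDF writing is left out.

```python
import re

def natural_key(name):
    """Sort key that compares digit runs as numbers, so p2 precedes p10."""
    parts = re.split(r"(\d+)", name)
    return [int(p) if p.isdigit() else p.lower() for p in parts]

def stack_layout(sizes, dpi, margin_inches=1.0):
    """Plan a vertical stack of images separated by a fixed margin.

    sizes: list of (width, height) in pixels, already in name order.
    Returns (canvas_width, canvas_height, offsets), where offsets[i] is
    the y pixel at which image i starts. No margin is added above the
    first image or below the last one (an assumption).
    """
    gap = round(dpi * margin_inches)
    offsets = []
    y = 0
    for _, height in sizes:
        offsets.append(y)
        y += height + gap
    canvas_height = y - gap if sizes else 0
    canvas_width = max((w for w, _ in sizes), default=0)
    return canvas_width, canvas_height, offsets
```

Sorting the filenames with `sorted(names, key=natural_key)` and feeding their pixel sizes to `stack_layout` gives the paste positions; the images themselves are never resized, so the DPI is preserved.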

Satisfying the tip conditions for the selected solution greatly increases the tip size. The more tip conditions satisfied, the fewer additional bounties I have to post and the fewer potential code conflicts I have to deal with.

In "Separate executable to merge all images back into PAF", you mean "PDF", right?
slang almost 4 years ago
Also, is this entire process just to increase the margin between sections? I ask because you'd probably be better off just doing a straight conversion to plain-text with something like https://github.com/tesseract-ocr/tesseract so the text spacing can be manipulated arbitrarily with something like CSS.
slang almost 4 years ago
I corrected my typo. Thanks slang800. No, it's actually intended as a preprocessing step for OCR on relatively poor scans of books.
Need More Marbles almost 4 years ago
If all you want is to improve the OCR performance, then there may be other ways to achieve it. How about inserting a marker (such as a line) between the paragraphs in the scanned images? This would force the OCR engine to distinguish between the paragraphs.
rehangit almost 4 years ago
The other issue I'm battling is the OCR picking up extraneous marks and reading them as characters. There is also page orientation: 2 paragraphs side by side at slightly different angles on the same page cause junk character recognition on one or both. I'm open to any solution which results in the paragraphs being isolated and, hopefully, correctly oriented. However, as there are 2 hours until this bounty concludes, I don't think that's likely to happen at this point.
Need More Marbles almost 4 years ago


0 Solutions