How to approach building a PDF comparison like

We need to be able to compare PDFs in a system we are working on. I started doing some research, and most of what I found attempts a visual comparison of the files, which seems to fail wherever content overflows into a new column or across pages. I was able to find a system that appears to do a neat job without the pitfalls of image comparison, and I am wondering what pointers there might be toward building something like this on an open source stack. (some demos about half way down) - upload your own file

A couple of observations

They seem to stick with PDF for the output and render the files as PDFs with pdfjs viewers. I can imagine they store the diff markers (shown in the middle scroll area) as some sort of offset while rendering the style changes on the server. The part I am at a loss with is how to actually compare the text and then reformat it for the output.

I did some quick inspection of the PDF/HTML file rendered by one of the demos, and it wasn't immediately obvious how they show the formatting either. I had assumed they restyle it on the server and output it as part of the actual document (i.e. inline styles), but then I found the above, which shows that the formatting of the diff is separated from the content itself. I started researching PDF annotations, and it's a bit of a rabbit hole, but I was able to find things like this and others that suggest that, if we can figure out the diff, we could then present it to the user as a highlight annotation.

So, I am looking for some guiding ideas on how one might approach both parts of this: the diff and the visualization. I am only superficially aware of "annotations", which might be supported by PDFjs; maybe, once the diff has been worked out on the server, annotation objects are the right way to represent the change visually.

This bounty is mainly for directions or helpful info that will get us started on the right path. Since this will be hard to judge, we will give the 3 best posts $100, $50 and $50: one as the award and the other two as bonuses. Each post should be unique and offer something different from the other posters'. It is totally fine to expand on someone's idea, but make sure the addition is material enough to be valuable.

Hey QDev, will be sending these outstanding bounties out to the charity on Monday unless you pick a winner. Let me know if you need more time.
bevan 2 years ago


5 Solutions

I have two suggestions:
1. PDFUtil. This utility gets the page count, extracts page content as plain text, extracts attached images from a PDF, stores PDF pages as images, and compares PDF files either in text mode (optionally excluding certain text from the comparison) or in visual mode. The utility is open source and can be found here
2. pdf-diff
Finds differences between two PDF documents:
Compares the text layers of two PDF documents and outputs the bounding boxes of changed text in JSON.
Rasterizes the changed pages in the PDFs to a PNG and draws red outlines around changed text.
Can be found here

Based on your needs, a convert-to-text solution would be the easiest and most direct.
There is also an alternative bitmap-based solution that I found more helpful. It is already written up, so I will share the link here



I hope it will give you an idea

If the issue is that text simply reflows onto the next page, why not merge the pages seamlessly so that there are no next pages? Maybe redraw the PDF with a different page size? Then you could just use the standard image diff you didn't want to use in the first place.

Reflowing the PDF would be a bit weird. What attracted me to the example system was its ability to annotate the change, similar to how Adobe and Office handle change flagging. Comparing the raster versions of the pages seems like one way to go, but it's less helpful at scale. For example, if a single pixel moves each time you save to PDF, everything in the document would be flagged as changed. Thx for your post!
Qdev over 2 years ago
Yeah, I sort of meant we could do something like hashing it. Basically, create a much lower res, monochrome copy. That way, there's at least a reasonable room for error.
B44ken over 2 years ago
I don't see why we couldn't go overkill, even, and hash the raw text too.
B44ken over 2 years ago
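
The "hash the raw text" idea from the comments above could be sketched roughly as follows. This is only a minimal sketch with assumptions: the text is assumed to have been extracted already (for example with pdftotext, whose output is assumed here to separate pages with a form feed character), and the names page_fingerprints and changed_pages are hypothetical. Whitespace is collapsed before hashing so that spacing changes alone do not flag a page.

```python
import hashlib
from itertools import zip_longest


def page_fingerprints(text):
    """Return one SHA-256 hex digest per page of extracted text.

    Assumption: pages are separated by a form feed ("\f").
    """
    pages = text.split("\f")
    # Drop the empty piece left behind by a trailing form feed.
    if pages and pages[-1].strip() == "":
        pages.pop()
    digests = []
    for page in pages:
        normalized = " ".join(page.split())  # collapse all whitespace runs
        digests.append(hashlib.sha256(normalized.encode("utf-8")).hexdigest())
    return digests


def changed_pages(old_text, new_text):
    """Return the 0-based indices of pages whose fingerprints differ."""
    pairs = zip_longest(page_fingerprints(old_text), page_fingerprints(new_text))
    return [i for i, (old, new) in enumerate(pairs) if old != new]


print(changed_pages("one\ftwo  words\f", "one\ftwo words\fthree\f"))
```

Because the hash is taken per page, text that reflows onto another page still changes the fingerprints of both pages, so this is at best a cheap pre-check before a real diff.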


Something really easy you can do is combine wdiff and pdftotext:

wdiff <(pdftotext a.pdf -) <(pdftotext b.pdf -)

This will return the differences for each changed word, but it will not pick up changes in line spacing or styling. For example, suppose the only difference between the two files is a single word:

In a.pdf you have the line "Proin ac semper risus." and in b.pdf "Proin ad semper risus." The above command will return:

Proin [-ac-] {+ad+} semper risus.

Then you can just match the [-...-] and {+...+} markers and style them to show the differences.
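
A minimal sketch of that matching step in Python, assuming wdiff's default [-...-] and {+...+} markers as in the output above. The function name parse_wdiff and the "same"/"del"/"ins" labels are made up for illustration.

```python
import re

# wdiff's default markers: [-deleted-] and {+inserted+}
WDIFF_TOKEN = re.compile(r"\[-(.*?)-\]|\{\+(.*?)\+\}", re.DOTALL)


def parse_wdiff(output):
    """Split wdiff output into (kind, text) pieces: 'same', 'del' or 'ins'."""
    pieces = []
    pos = 0
    for match in WDIFF_TOKEN.finditer(output):
        if match.start() > pos:
            # Unchanged text between two markers.
            pieces.append(("same", output[pos:match.start()]))
        if match.group(1) is not None:
            pieces.append(("del", match.group(1)))
        else:
            pieces.append(("ins", match.group(2)))
        pos = match.end()
    if pos < len(output):
        pieces.append(("same", output[pos:]))
    return pieces


for kind, text in parse_wdiff("Proin [-ac-] {+ad+} semper risus."):
    print(kind, repr(text))
```

Each piece can then be styled according to its kind. Note that this breaks down if the documents themselves contain the marker sequences.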

If instead you want block-level matching, you may want to look at the Myers diff algorithm as implemented in Meld, a tool written in Python for comparing code files. Look here for more information:

One way to do this would be to convert both of the pdf documents to html files. Then perform a diff on either the elements or the whole text of the files and wrap the changed values with <ins>/<del> tags.

This way you wouldn't have to worry about calculating the correct coordinates for changed regions.

Here's a writeup on HTML diffing.
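
As a rough illustration of the <ins>/<del> wrapping, here is a minimal sketch using Python's standard difflib on the plain text of the two documents. It assumes the text has already been pulled out of the converted HTML, works word by word, and does not try to preserve the original markup. The function name html_word_diff is hypothetical.

```python
import difflib
import html


def html_word_diff(old_text, new_text):
    """Return HTML with removed words in <del> and added words in <ins>."""
    old_words = old_text.split()
    new_words = new_text.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    parts = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Escape the text so it cannot be mistaken for markup.
        old_chunk = html.escape(" ".join(old_words[i1:i2]))
        new_chunk = html.escape(" ".join(new_words[j1:j2]))
        if tag == "equal":
            parts.append(old_chunk)
            continue
        if old_chunk:
            parts.append("<del>" + old_chunk + "</del>")
        if new_chunk:
            parts.append("<ins>" + new_chunk + "</ins>")
    return " ".join(parts)


print(html_word_diff("Proin ac semper risus.", "Proin ad semper risus."))
```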

Another way would be to extract and diff the whole text of the documents. The diff output could then be transformed into index/length pairs (or similar), and pdf.js could be hooked to iterate over the invisible text elements and wrap the text at each matching index/length in <ins>/<del> tags on the client side.
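
The index/length transformation could look something like the sketch below, which diffs the two extracted texts character by character with difflib. The function name changed_spans is hypothetical, and the pdf.js side (mapping offsets onto text-layer elements) is not shown.

```python
import difflib


def changed_spans(old_text, new_text):
    """Return (deleted, inserted) lists of (index, length) pairs.

    Deleted spans are offsets into old_text and inserted spans are
    offsets into new_text.
    """
    matcher = difflib.SequenceMatcher(a=old_text, b=new_text, autojunk=False)
    deleted, inserted = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            deleted.append((i1, i2 - i1))
        if tag in ("insert", "replace"):
            inserted.append((j1, j2 - j1))
    return deleted, inserted


deleted, inserted = changed_spans("Proin ac semper risus.", "Proin ad semper risus.")
print(deleted, inserted)
```

The pairs are small enough to ship to the client as JSON alongside the unmodified PDFs.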

You could of course also draw the changed background rectangles in the PDF file itself, but that would require calculations that take font family, font size and line breaks into account. This is because the PDF file only holds the font family, font size and starting coordinate for the text.

This could then be put together by diffing the extracted text of both documents and then iterating through the document text values (+ diff output) and calculating/drawing a rectangle at the specific coordinates accordingly.

Here's a simple writeup on PDF structure.

Edit: I dug a bit more into this. If you're OK with using Python, then the pdfminer library might be your go-to choice for text/coordinate extraction. It provides bounding box coordinates not only for text lines, but also for individual characters.

This can be used to extract the text (characters) along with their coordinates. You then do a diff on the text and cross-reference the results with the extracted coordinates to build rects that contain the changes.

Finally, you can either draw the rects back onto copies of the original files with a tool like PyPDF2/4 or send them to the client to be rendered client-side.
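
The cross-referencing step might be sketched like this. It is a simplified illustration, not a full pipeline: it assumes the characters have already been extracted as (char, (x0, y0, x1, y1)) tuples in reading order (the kind of per-character box pdfminer reports) and that the diff has produced a set of changed character indices. The function name merge_char_boxes and the line_tolerance parameter are hypothetical.

```python
def merge_char_boxes(chars, changed, line_tolerance=2.0):
    """Merge boxes of consecutive changed characters into rectangles.

    chars:   list of (char, (x0, y0, x1, y1)) in reading order
    changed: set of indices into chars flagged by the diff
    A new rectangle starts when the indices are not consecutive or the
    baseline (y0) jumps by more than line_tolerance, i.e. on a new line.
    """
    rects = []
    current = None
    prev_index = None
    for i in sorted(changed):
        x0, y0, x1, y1 = chars[i][1]
        same_run = (
            current is not None
            and i == prev_index + 1
            and abs(y0 - current[1]) <= line_tolerance
        )
        if same_run:
            # Grow the current rectangle to cover this character too.
            current = [min(current[0], x0), min(current[1], y0),
                       max(current[2], x1), max(current[3], y1)]
        else:
            if current is not None:
                rects.append(tuple(current))
            current = [x0, y0, x1, y1]
        prev_index = i
    if current is not None:
        rects.append(tuple(current))
    return rects


chars = [
    ("a", (0, 10, 5, 20)),
    ("b", (5, 10, 10, 20)),
    ("c", (10, 10, 15, 20)),
    ("d", (0, 0, 5, 10)),  # first character of the next line
]
print(merge_char_boxes(chars, {1, 2, 3}))
```

The resulting rectangles are in the same coordinate space as the input boxes, so they can be drawn into the file or sent to the client as described above.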
