Approach to identifying similar images

I am working on an interesting problem where I need to normalize an image so that I can compare it to a bunch of other versions of the image and determine that they are in fact all the same. I am wondering if someone has a creative way to do this or can point me in the right direction.

Here I have the following images:

Original

Original-BW.jpg

Original-color-balance-red.jpg

Original-resized-1000-no-compression.jpg

Original-photoshop-compression-90.jpg

Original-as-png.png

In my problem, suppose I have something like 10K images and I need to quickly find that 5 of the 10K match the original.jpg image. My first approach was to break the image into a lot of parts, hash those, and simply compare hashes, but any sort of manipulation of the image, even just re-encoding it, throws everything off, so hashes no longer work. So my thought is that we need to reduce the image to a more fundamental base from which to do our hashing and comparison.
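
To make the failure mode concrete, here is a minimal sketch (filenames assumed to match the zip contents) showing that byte-level hashes diverge as soon as an image is re-saved, even when the pixels look identical:

```python
# Minimal illustration, not a solution: cryptographic hashes only match
# byte-for-byte identical files, so any re-encode or resave breaks them.
# Filenames are assumed to match the bounty zip contents.
import hashlib

def file_md5(path):
    """MD5 of the raw file bytes -- changes with any re-encode or resave."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

for name in ["Original.jpg", "Original-as-png.png",
             "Original-photoshop-compression-90.jpg"]:
    print(name, file_md5(name))
# All three digests differ even though the pictures look the same, which is
# why exact hashing (whole image or tiles) falls apart here.
```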

All of the above images are here for you to check out.

https://crts-assets.s3.amazonaws.com/bounty/image-analysis-set.zip

I think it would be fine if this bounty wasn't for the actual coded working solution but more the high-level academic approach to how to solve it. I do need to stay away from any sort of ML or non-deterministic type solutions. We can open another bounty for actually coding something or making a POC.

Found some interesting links, a bit beyond my lane in terms of tactical solutions, but they might give you some ideas:

https://stackoverflow.com/questions/75891/algorithm-for-finding-similar-images

http://www.phash.org/

Seems neat; if I read it right, they use gradients: https://medium.com/machine-learning-world/feature-extraction-and-similar-image-search-with-opencv-for-newbies-3c59796bf774

Page 45 here seems like it might be on to something around a transform method to reduce dimensionality: http://www-i6.informatik.rwth-aachen.de/publications/download/10/DeselaersThomas--FeaturesforImageRetrieval--2003.pdf

Hey everyone, still working through some of these replies.


4 Solutions


My idea was to take the images and simplify them into a series of points (I did mine manually; maybe use contrast levels in a 3x3-pixel chunk?) like this: http://bit.ly/2CoBQcc. Then we could find the distance to the closest points in another image; the average of these distances would be the similarity factor.
The similarity threshold for being considered truly similar could be weighted based on the number of points and the resolution, if that has an impact.
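
A minimal sketch of that idea, assuming "contrast in a 3x3 chunk" means the local standard deviation and that both images are first scaled down to a common size (Pillow and NumPy assumed, filenames taken from the zip):

```python
# Rough sketch of the point-cloud idea: pick the highest-contrast 3x3 chunks
# as "points", then score similarity by average nearest-point distance.
import numpy as np
from PIL import Image

def contrast_points(path, size=64, top_k=50):
    """Return (row, col) coordinates of the top_k highest-contrast 3x3 chunks."""
    img = np.asarray(Image.open(path).convert("L").resize((size, size)), dtype=float)
    scores = np.zeros_like(img)
    for r in range(1, size - 1):
        for c in range(1, size - 1):
            scores[r, c] = img[r-1:r+2, c-1:c+2].std()   # local contrast
    flat = np.argsort(scores, axis=None)[-top_k:]
    return np.column_stack(np.unravel_index(flat, scores.shape)).astype(float)

def similarity(path_a, path_b):
    """Average distance from each point in A to its nearest point in B (lower = more similar)."""
    a, b = contrast_points(path_a), contrast_points(path_b)
    dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)   # pairwise distances
    return dists.min(axis=1).mean()

# print(similarity("Original.jpg", "Original-as-png.png"))   # filenames assumed
```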


Have a look at this article: https://realpython.com/fingerprinting-images-for-near-duplicate-detection/

This kind of hashing/fingerprinting would be my initial go at comparing near-identical images. It also goes over why conventional hashing methods can't be used to achieve this.
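
For reference, the fingerprinting in that article boils down to a difference hash; a minimal sketch, assuming the Pillow and imagehash packages are installed (pip install pillow imagehash) and filenames from the zip, looks roughly like this:

```python
# Minimal fingerprinting sketch in the spirit of the article.
from PIL import Image
import imagehash

def fingerprint(path):
    """Difference hash: resilient to re-encoding, resizing and mild edits."""
    return imagehash.dhash(Image.open(path))

original = fingerprint("Original.jpg")          # filenames assumed from the zip
candidate = fingerprint("Original-as-png.png")

# Near-duplicates differ by only a few bits (Hamming distance).
print(original - candidate)        # imagehash overloads '-' as Hamming distance
print(original - candidate <= 5)   # e.g. treat <= 5 differing bits as "same image"
```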


I would go with the pHash you already posted. Here is a Python library that supports pHash among other image hashing algorithms (a quick sketch follows the links below). You can find out how some of the algorithms work in the following links:

pHash

dHash

wHash
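
Here is a rough sketch of how the lookup over ~10K images could work with pHash, assuming the imagehash and Pillow packages and a hypothetical images/ folder; the 10-bit threshold is just a starting point to tune:

```python
# Sketch: index every image's pHash once, then match by Hamming distance.
import os
from PIL import Image
import imagehash

def build_index(folder):
    """Map filename -> perceptual hash for every readable image in the folder."""
    index = {}
    for name in os.listdir(folder):
        try:
            index[name] = imagehash.phash(Image.open(os.path.join(folder, name)))
        except OSError:
            pass   # skip files Pillow can't read
    return index

def find_matches(query_path, index, max_distance=10):
    """Return filenames whose pHash is within max_distance bits of the query."""
    query = imagehash.phash(Image.open(query_path))
    return [name for name, h in index.items() if query - h <= max_distance]

# index = build_index("images/")                 # hypothetical folder name
# print(find_matches("Original.jpg", index))     # should surface the 5 variants
```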


I think machine learning is the best approach for what you want to achieve. Why? Simply because the concept of similarity is something non-deterministic, related to the human mind.

When are two images similar?

1) when you resize
2) when you compress
3) when you change encoding
4) when you change contrast
5) when you lose information due to noise
6) when you take a picture with your camera at two different moments

And infinitely many other cases. Each of these cases can be handled by a deterministic algorithm, but a deterministic algorithm is designed to solve a single problem; it can't handle everything.

But just as the brain can handle all possible cases, a machine learning algorithm can do the same, and if it can't, it just needs to be trained to.

Today ML is the most common, easiest and fastest approach to these problems, but since you want a deterministic algorithm, we need to study an approach case by case.

First we need to understand what can cause two images that are the same to become different:

1) Different source

  • If I take two pictures of the same target with different cameras at the same time, or with the same camera at two different times, I get two images that carry the same information, but they are different.
  • If I have a 3D model and I render and export it with two different programs, I'll get two pictures of the same model, very similar, but in fact different.
  • etc

2) Loss of information
- If you resize the image, you lose data.
- If you convert from BMP to JPG you lose some data, because JPEG is a lossy compression.
- If you transmit the image via the UDP protocol, you can lose some bits during transmission.
- etc

3) change of encoding

  • If you convert an image from CMYK to RGB, you change the format of the image without losing information (pay attention: the opposite conversion causes loss of information).
  • If you swap the red and blue channels, you change the encoding without loss of information.
  • If you convert from BMP to TIFF, you change the file format, but you will not lose information.
  • If you convert an image from 16-bit BMP to 24-bit BMP, you change the file format without losing information.

Each of these causes should be treated in a different way.

1) This is the hardest: comparing images generated by different sources requires an image detection algorithm.

An example of a deterministic algorithm to handle this case is the Generalised Hough transform.

https://en.wikipedia.org/wiki/Hough_transform (this works only for lines)

https://en.wikipedia.org/wiki/Generalised_Hough_transform (this can detect any shape)

I found this document that can help with image matching: https://pdfs.semanticscholar.org/275f/5d86a95cd50730a3480aa87319df1af77176.pdf

Today Randomized Hough Transforms are preferred because they are faster, but they aren't deterministic; in fact they are used in ML: https://en.wikipedia.org/wiki/Randomized_Hough_transform
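
To give a flavour of what the deterministic Hough family looks like in code, here is a sketch of the classic line-detecting variant with OpenCV (opencv-python and a filename from the zip assumed); the generalised version follows the same edge-image-in, votes-out pattern, but matches an arbitrary template shape:

```python
# Sketch of the classic Hough transform (lines only), assuming opencv-python
# and numpy are installed and that Original.jpg sits in the working directory.
import cv2
import numpy as np

img = cv2.imread("Original.jpg", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(img, 50, 150)            # Hough works on an edge map

# Each detected line is a (rho, theta) pair in the Hough parameter space;
# the last argument is the minimum number of votes a line must collect.
lines = cv2.HoughLines(edges, 1, np.pi / 180, 200)
print(0 if lines is None else len(lines), "strong lines found")
```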

2) Loss of information. This is not easy to handle; it all depends on the cause.

Usually the first step consists in reducing the noise, which can be done with filters.

For example, if you lose random bits, you can replace them with a median filter: https://en.wikipedia.org/wiki/Median_filter

If you lose color depth, you can apply a smoothing filter, or you can repair it by analyzing the frequencies with a low-pass filter; check this: https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Zurbenko_filter
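
A minimal sketch of this denoising step with Pillow (assumed installed, filename from the zip); a plain Gaussian blur stands in here for the low-pass idea:

```python
# Sketch of the noise-reduction step with Pillow.
from PIL import Image, ImageFilter

img = Image.open("Original.jpg")                                 # filename assumed

denoised = img.filter(ImageFilter.MedianFilter(size=3))          # kills isolated bad pixels
smoothed = denoised.filter(ImageFilter.GaussianBlur(radius=1))   # simple low-pass smoothing

smoothed.save("Original-denoised.png")
```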

The second step can be done by applying the same filters to the original image to reproduce the same loss of information.

This is very effective in the case of image resizing, but you need to know the algorithm; remember that there are many algorithms to reduce image size:

- Nearest-neighbor interpolation

- Bilinear and bicubic algorithms

- Sinc and Lanczos resampling

- etc.

If you have original.bmp and scaled.bmp and you know that scaled.bmp has been resized with nearest-neighbor interpolation, you just need to apply nearest-neighbor interpolation to original.bmp and you will get the same image.
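
As a sketch, assuming Pillow and that the scaled copy really was produced with nearest-neighbor resampling:

```python
# Sketch of reproducing the same loss of information as the scaled copy.
from PIL import Image

original = Image.open("original.bmp")     # filenames from the example above
scaled = Image.open("scaled.bmp")

# Downscale the original with the same algorithm and target size...
redone = original.resize(scaled.size, resample=Image.NEAREST)

# ...then a plain byte comparison is enough.
print(redone.tobytes() == scaled.tobytes())
```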

If different algorithms were used, the best way is to compare using a hashing algorithm, for example pHash, or you can just use the Hamming distance.

Take a look also at demosaicing: https://en.wikipedia.org/wiki/Demosaicing

More complicated cases must be handled with image detection algorithms, as in point 1).

3) This is the easiest case: you just need to convert both images to the same encoding. Things can get more complicated if you don't know which algorithm was used.

Most encodings are tied to the file format, so as a first step convert both files to the same format.

To handle other cases, like swapping red with blue, you can use a sum-of-colors algorithm for the comparison.
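
A small sketch of that sum-of-colors comparison, assuming Pillow and NumPy and filenames from the zip; summing R+G+B per pixel makes the comparison blind to a channel swap:

```python
# Sketch of the "sum of colors" comparison.
import numpy as np
from PIL import Image

def color_sum_matrix(path, size=(64, 64)):
    """Per-pixel sum of R, G and B after scaling to a common size."""
    rgb = np.asarray(Image.open(path).convert("RGB").resize(size), dtype=int)
    return rgb.sum(axis=2)

a = color_sum_matrix("Original.jpg")                    # filenames assumed
b = color_sum_matrix("Original-color-balance-red.jpg")

print(np.abs(a - b).mean())   # small average difference -> likely the same image
```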

Everything can become more complicated and harder to analyze if you mix the cases together, e.g. you change the encoding and also lose information.

Based on the examples that you provided, I suggest this approach (a rough Python sketch follows the steps):

1) Convert all images to 24-bit BMP; that will give you an uncompressed format that is easy to analyze.

If you want more detail you can use 32-bit BMP, but 24-bit BMP is easier to work with because it stores each pixel as a matrix of 3 separate bytes for red, green and blue.

Pay attention when converting from TIFF or GIF, as those formats can contain multiple images inside.

2) We define a function compare(original.bmp, similar.bmp).

3) We create a copy of both images for processing, so we don't edit the originals.

4) We downscale both images to the same size. I suggest nearest-neighbor interpolation because it's fast and simple; we don't need to care about quality, and the original must be downscaled to the worse quality of the two.

You must keep a limit, for example 20%, otherwise every image will end up equal to a 1x1-pixel image.

If you have one picture that is much too big and one that is much too small, you can consider them different.

5) We apply a median filter to remove any bad pixels.

6) We apply a light Gaussian filter to smooth the sharpness of the images.

7) We apply a Kolmogorov-Zurbenko filter to remove spectral leakage in the images.

8) Now we create a matrix of integers with the per-pixel sum of the colors (red, green, blue) and calculate the Hamming distance; if the average distance is too big, the images are different.

9) If you want to try to handle more complicated cases, instead of step 8 you can use a perceptual hashing algorithm: https://en.wikipedia.org/wiki/Perceptual_hashing

http://bertolami.com/index.php?engine=blog&content=posts&detail=perceptual-hashing

10) If none of this works for your needs, try an image detection algorithm; I suggest Generalised Hough Transform matching.
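
Here is a rough sketch of how steps 1-8 could hang together in Python, assuming Pillow and NumPy; a mean absolute difference stands in for the Hamming distance, a Gaussian blur stands in for the Kolmogorov-Zurbenko filter, and the threshold is only a placeholder to tune:

```python
# Rough end-to-end sketch of the suggested pipeline (steps 1-8).
import numpy as np
from PIL import Image, ImageFilter

def preprocess(path, size):
    """Decode, downscale (nearest-neighbor), denoise and smooth one image."""
    img = Image.open(path).convert("RGB")                 # step 1: common pixel format
    img = img.resize(size, resample=Image.NEAREST)        # step 4: same size, NN scaling
    img = img.filter(ImageFilter.MedianFilter(size=3))    # step 5: drop bad pixels
    img = img.filter(ImageFilter.GaussianBlur(radius=1))  # steps 6-7: low-pass smoothing
    return np.asarray(img, dtype=int).sum(axis=2)         # step 8: per-pixel color sum

def compare(original_path, candidate_path, threshold=0.2):
    """True if the average per-pixel difference is below the (tunable) threshold."""
    a_img, b_img = Image.open(original_path), Image.open(candidate_path)
    size = (min(a_img.width, b_img.width, 256),            # downscale to the smaller image,
            min(a_img.height, b_img.height, 256))          # capped so it never hits 1x1
    a, b = preprocess(original_path, size), preprocess(candidate_path, size)
    return np.abs(a - b).mean() / 765.0 < threshold        # 765 = max color sum (3 * 255)

# print(compare("Original.jpg", "Original-BW.jpg"))        # filenames assumed from the zip
```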
