PixiePlus Users Guide: Section 1.10
When cleaning up your image folders you'll often want to check for duplicate images, but searching for exact duplicates is not enough. An exact-match search will fail to identify two versions of the same picture if one is stored in a different format, scaled to a different size, carries a different website logo, or has had effects applied such as contrast or brightness adjustments. All of these are common in images downloaded from the web. What you need is something that finds images that are similar to each other, not just exact matches.
PixiePlus allows you to do this by selecting "Find Similiar Images" from the "File" menu. This will scan all the images in the browser folder and show you which ones are more than 90% similar according to its comparison algorithm.
This is done in two steps. The first step scans each image and generates a "fingerprint" that can be compared against the fingerprints of other images. A progress dialog will pop up and show you how many images are left to scan. If you have write permission to the current folder, PixiePlus will also save this data so it doesn't have to be regenerated the next time.
Once this is done, the next step is to compare all the fingerprints. Again, the progress dialog will keep you updated on its status. This step is relatively quick and should take only a few seconds even with a large number of images.
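To make the comparison step concrete, here is a minimal sketch in Python. It assumes each fingerprint is a fixed-length string of bits stored as an integer, and that "percent similar" means the fraction of bits that match. That scoring rule, the 256-bit length, and the function names are my assumptions for illustration, not a description of the actual PixiePlus code.

```python
from itertools import combinations

def similarity(fp_a, fp_b, bits):
    """Fraction of matching bits between two fingerprints, from 0.0 to 1.0."""
    differing = bin(fp_a ^ fp_b).count("1")  # XOR leaves a 1 wherever the bits differ
    return 1.0 - differing / bits

def find_similar(fingerprints, bits=256, threshold=0.90):
    """Return (name_a, name_b, score) for every pair scoring above the threshold.

    fingerprints maps a file name to its fingerprint, held as an integer.
    """
    matches = []
    for (name_a, fp_a), (name_b, fp_b) in combinations(fingerprints.items(), 2):
        score = similarity(fp_a, fp_b, bits)
        if score > threshold:
            matches.append((name_a, name_b, score))
    return matches
```

Comparing two fingerprints this way is only an XOR and a bit count, which is consistent with this step being much quicker than the scan, although the number of pairs does grow with the square of the number of images.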
Finally, you will be presented with a results window. It shows a tree of all the images that are similar. You'll have the original image (actually just the first one found), with the matches listed below it along with their percentage of similarity and some file information. You can click on any of the thumbnails to view the image full size, drag and drop them elsewhere, or delete them via the right-click menu.
Much attention was paid to making the search for similar images as fast as possible while still being accurate. Compared with other Unix offerings, PixiePlus performs quite well in this respect.
There are two main algorithms for finding similar images. One is used by GQView and ShowImg. It is based on averaging blocks of color channels and comparing the results. This is the obvious algorithm and the one I was originally going to adopt. In GQView it performs with acceptable speed; in ShowImg it does not. I'm not sure why, since the code is pretty much directly copied from GQView. Either way, it tends to miss some matches that the other available algorithm finds.
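For readers who want a feel for the block-averaging idea, here is a rough sketch in Python. It is not the GQView or ShowImg code: the 4x4 grid, the scoring formula, and the function names are assumptions chosen to keep the example short. An image is represented as a list of rows of (r, g, b) tuples.

```python
def block_signature(pixels, grid=4):
    """Average each color channel over a grid x grid set of blocks.

    pixels is a list of rows, each row a list of (r, g, b) tuples.
    The image must be at least grid pixels wide and tall.
    """
    height, width = len(pixels), len(pixels[0])
    signature = []
    for by in range(grid):
        y0, y1 = by * height // grid, (by + 1) * height // grid
        for bx in range(grid):
            x0, x1 = bx * width // grid, (bx + 1) * width // grid
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            for channel in range(3):
                signature.append(sum(p[channel] for p in block) / len(block))
    return signature

def block_similarity(sig_a, sig_b):
    """1.0 for identical signatures, falling toward 0.0 as the averages diverge."""
    total = sum(abs(a - b) for a, b in zip(sig_a, sig_b))
    return 1.0 - total / (255.0 * len(sig_a))
```

Because the signature is made of raw color averages, a brightness or contrast change shifts every value in it, which is one plausible reason this approach can miss matches.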
The other algorithm isn't based on colors at all but on identifying the general patterns in the image. This is the method used by the Perl utility findimagedupes and is the method I adopted for PixiePlus. It resamples an image to a standard size, applies a couple of effects to get rid of abnormalities, scales it down again, then converts it into a string of bits suitable for use as a fingerprint. While it sounds like it would take much longer than the above method, in reality it doesn't, and it finds similar images that the other algorithm misses.
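The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PixiePlus or findimagedupes implementation: the sizes (32 and 16), the use of box averaging for scaling, a simple contrast stretch standing in for the "couple of effects", and thresholding at the mean are all my choices for the example. The input is a grayscale image given as a list of rows of 0-255 values, at least 32 pixels on each side.

```python
def shrink(gray, size):
    """Box-average a grayscale image (list of rows) down to size x size."""
    height, width = len(gray), len(gray[0])
    out = []
    for oy in range(size):
        y0, y1 = oy * height // size, (oy + 1) * height // size
        row = []
        for ox in range(size):
            x0, x1 = ox * width // size, (ox + 1) * width // size
            cells = [gray[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(cells) / len(cells))
        out.append(row)
    return out

def normalize(gray):
    """Stretch values to the full 0-255 range (assumed stand-in for the effects)."""
    flat = [v for row in gray for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in gray]
    return [[(v - lo) * 255.0 / (hi - lo) for v in row] for row in gray]

def fingerprint(gray, standard=32, final=16):
    """Return an integer whose final*final bits record the light/dark pattern."""
    small = shrink(normalize(shrink(gray, standard)), final)
    flat = [v for row in small for v in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for v in flat:
        bits = (bits << 1) | (1 if v > mean else 0)  # 1 = brighter than average
    return bits
```

Since only the pattern of lighter and darker areas survives, a brightened or rescaled copy of a picture should produce the same, or nearly the same, bits as the original.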
Like findimagedupes, and unlike GQView and ShowImg, PixiePlus uses a persistent database of fingerprints so you don't have to regenerate fingerprints or calculate image blocks each time you compare images. A binary database is used in order to avoid parsing ASCII representations of the binary fingerprint. Unlike findimagedupes, the database also contains timestamps so it can tell if an image has been modified. A large hash table is used and should perform well for folders of up to 6,000 images. After that, expect performance degradation. Findimagedupes especially suffers from this.
On my test machine (AMD K6 450 MHz, 128 MB), initially comparing 47 large images takes 11 sec with PixiePlus and 20 sec with findimagedupes. This is really good speed for findimagedupes considering it's written in Perl. But when the number of images is increased to 5,049, findimagedupes slows down to 1 hr 16 min while PixiePlus takes only 23 min 14 sec (the fastest of the bunch).
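The idea of a binary fingerprint database with timestamps can be sketched like this. The record layout (a length-prefixed file name, an 8-byte modification time, and a 32-byte fingerprint) and all of the function names are hypothetical; the real PixiePlus file format is not documented here and will differ.

```python
import os
import struct

# Hypothetical record: modification time in whole seconds + 256-bit fingerprint.
RECORD = struct.Struct("<q32s")

def load_db(path):
    """Read {file name: (mtime, fingerprint bytes)} from a binary file."""
    db = {}
    if not os.path.exists(path):
        return db
    with open(path, "rb") as f:
        while True:
            header = f.read(2)
            if len(header) < 2:
                break  # end of file
            (name_len,) = struct.unpack("<H", header)
            name = f.read(name_len).decode("utf-8")
            mtime, fp = RECORD.unpack(f.read(RECORD.size))
            db[name] = (mtime, fp)
    return db

def save_db(path, db):
    """Write the database back out, one binary record per image."""
    with open(path, "wb") as f:
        for name, (mtime, fp) in db.items():
            encoded = name.encode("utf-8")
            f.write(struct.pack("<H", len(encoded)))
            f.write(encoded)
            f.write(RECORD.pack(mtime, fp))

def cached_fingerprint(db, image_path, compute):
    """Reuse the stored fingerprint unless the file changed since it was stored.

    compute is any function taking a path and returning 32 bytes.
    """
    mtime = int(os.path.getmtime(image_path))
    name = os.path.basename(image_path)
    entry = db.get(name)
    if entry is not None and entry[0] == mtime:
        return entry[1]  # timestamp matches, so the stored fingerprint is still valid
    fp = compute(image_path)
    db[name] = (mtime, fp)
    return fp
```

Storing the raw bytes means loading the database is a matter of fixed-size reads with no text parsing, and the timestamp check is what lets a modified image be rescanned while everything else is reused.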