Finding duplicate images, redux

Ben Okopnik [ben at linuxgazette.net]

Tue, 28 Dec 2010 12:58:41 -0500

Amusing example of serendipity: one of our readers just sent me an email letting me know that a link to a SourceForge project in one of our articles was outdated and needed to be pointed to the new, renamed version of the project. I changed it after taking a moment to verify the SF link - and noticed that some of the project functionality was relevant to Neil Youngman's question of a couple of months ago.

Pulling down the (small) project tarball and reading the docs supported that impression:

  'repeats' searches for duplicate files using a multistage process.
  Initially, all files in the specified directories (and all of their
  subdirectories) are listed as potential duplicates. In the first stage,
  all files with a unique filesize are declared unique and are removed
  from the list. In the second stage, any files which are actually a
  hardlink to another file are removed, since they don't actually take up
  any more disk space. Next, all files for which the first 4096 bytes
  (adjustable with the -m option) have a unique filehash are declared
  unique and are removed from the list. Finally, all files which have a
  unique filehash (for the entire file) are declared unique and are
  removed from the list. Any remaining files are assumed to be duplicates
  and are listed on stdout.
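
The staged filtering described in that excerpt can be sketched roughly as follows. This is not the littleutils code (which I haven't reproduced here), just a minimal Python illustration of the same idea; the function names, the choice of SHA-256, and the `prefix_bytes` parameter are all my own assumptions. It also keeps one path per inode rather than dropping hardlinks in exactly the way 'repeats' might.

```python
import hashlib
import os
from collections import defaultdict


def _hash_file(path, limit=None, chunk_size=65536):
    """SHA-256 hex digest of a file, or of only its first `limit` bytes."""
    digest = hashlib.sha256()
    remaining = limit
    with open(path, "rb") as handle:
        while remaining is None or remaining > 0:
            want = chunk_size if remaining is None else min(chunk_size, remaining)
            block = handle.read(want)
            if not block:
                break
            digest.update(block)
            if remaining is not None:
                remaining -= len(block)
    return digest.hexdigest()


def _groups(paths, key):
    """Bucket paths by key(path); keep only buckets holding more than one path."""
    buckets = defaultdict(list)
    for path in paths:
        buckets[key(path)].append(path)
    return [group for group in buckets.values() if len(group) > 1]


def find_duplicates(roots, prefix_bytes=4096):
    """Return lists of paths whose contents appear to be identical."""
    # Start with every regular file under the given directories.
    candidates = []
    for root in roots:
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                if os.path.isfile(path) and not os.path.islink(path):
                    candidates.append(path)

    # Stage 1: a file with a unique size cannot have a duplicate.
    groups = _groups(candidates, os.path.getsize)

    # Stage 2: extra hardlinks to the same inode use no extra disk space,
    # so keep only one path per (device, inode) pair.
    survivors = []
    for group in groups:
        seen, kept = set(), []
        for path in group:
            info = os.stat(path)
            ident = (info.st_dev, info.st_ino)
            if ident not in seen:
                seen.add(ident)
                kept.append(path)
        if len(kept) > 1:
            survivors.append(kept)

    # Stage 3: hash only the first prefix_bytes; stage 4: hash the whole file.
    for key in (lambda p: _hash_file(p, prefix_bytes), _hash_file):
        survivors = [sub for group in survivors for sub in _groups(group, key)]

    return [sorted(group) for group in survivors]
```

Calling `find_duplicates(["/some/photo/dir"])` (a made-up path) returns a list of groups, each group being a sorted list of paths that survived all four stages. The ordering of the stages is the whole point: the cheap checks (size, inode) discard most files before anything has to be read from disk.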

The project is called "littleutils", by Brian Lindholm. There's a number of other handy little utilities in there, all worth exploring.

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *


Mulyadi Santosa [mulyadi.santosa at gmail.com]

Wed, 29 Dec 2010 01:25:33 +0700

On Wed, Dec 29, 2010 at 00:58, Ben Okopnik <ben at linuxgazette.net> wrote:

> The project is called "littleutils", by Brian Lindholm. There's a number
> of other handy little utilities in there, all worth exploring.

Reminds me of ssdeep as well (http://ssdeep.sourceforge.net/). The author calls it a sliding hash or something like that...

-- regards,

Mulyadi Santosa Freelance Linux trainer and consultant

blog: the-hydra.blogspot.com training: mulyaditraining.blogspot.com


Ben Okopnik [ben at linuxgazette.net]

Tue, 28 Dec 2010 14:02:51 -0500

On Wed, Dec 29, 2010 at 01:25:33AM +0700, Mulyadi Santosa wrote:

> On Wed, Dec 29, 2010 at 00:58, Ben Okopnik <ben at linuxgazette.net> wrote:
> > The project is called "littleutils", by Brian Lindholm. There's a number
> > of other handy little utilities in there, all worth exploring.
> 
> Reminds me of ssdeep as well (http://ssdeep.sourceforge.net/). The
> author calls it a sliding hash or something like that...

I've just taken a look at "ssdeep" (it uses a "rolling hash" method to compute "Context-Triggered Piecewise Hashes"); very interesting, but wouldn't help Neil much, unfortunately.

The point of CTPHs is to allow you to identify small differences (and identify the blocks where the difference occurs) in a large number of files. That's useful because a system attacker may modify a program to add a back door - and then obscure their tracks by randomly changing one harmless bit in thousands of other files (the example given was changing "This program cannot be run in DOS mode" to "This program cannot be run on DOS mode"), which would create a huge list of MD5 mismatches, thus hiding the hacked file.

"ssdeep" is a proof of concept for the CTPH technique. It's interesting to note, though, that CTPHs are not guaranteed to be unique; in fact, because they use a 6-bit hash, there's a 1 in 2^6 (that is, 1 in 64) probability of hash collision. CTPH computation is also relatively slow - O(n log n) in the worst case. However, when combined with a more traditional hash, they can quickly separate files with large modifications from files with small ones. I can see where, e.g., geneticists would find this useful.
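
To make the "context-triggered" part a bit more concrete, here is a toy sketch in Python. It is emphatically not ssdeep's algorithm: the rolling value is just a running sum of the last seven bytes, the block size is a fixed parameter I picked, and `toy_ctph` is a made-up name. What it does share with the description above is the shape of the technique: a rolling value over a small window decides where a piece ends, and each piece contributes only the low 6 bits of a conventional hash (32-bit FNV-1 here) as one base64 character.

```python
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
WINDOW = 7               # bytes of context the trigger looks at (assumed value)
FNV_OFFSET = 0x811C9DC5  # 32-bit FNV offset basis
FNV_PRIME = 0x01000193   # 32-bit FNV prime


def toy_ctph(data, block_size=64):
    """Return a piecewise signature for `data` (a bytes object)."""
    signature = []
    rolling = 0              # sum of the most recent WINDOW bytes
    piece_hash = FNV_OFFSET  # FNV-1 hash of the current piece
    for i, byte in enumerate(data):
        # Fold the byte into the hash of the current piece.
        piece_hash = ((piece_hash * FNV_PRIME) ^ byte) & 0xFFFFFFFF
        # Update the rolling sum: add the new byte, drop the oldest one.
        rolling += byte
        if i >= WINDOW:
            rolling -= data[i - WINDOW]
        # The context "triggers" a piece boundary at certain rolling values.
        if rolling % block_size == block_size - 1:
            signature.append(B64[piece_hash & 0x3F])  # keep only 6 bits
            piece_hash = FNV_OFFSET
    signature.append(B64[piece_hash & 0x3F])  # whatever is left at the end
    return "".join(signature)
```

Because a boundary depends only on the seven bytes before it, flipping one bit in the middle of a file can disturb only the few signature characters around that spot; the characters before and after it stay the same. That is what lets two signatures be compared for similarity, where two MD5 sums could only report "different".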

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *


Mulyadi Santosa [mulyadi.santosa at gmail.com]

Wed, 29 Dec 2010 02:12:53 +0700

On Wed, Dec 29, 2010 at 02:02, Ben Okopnik <ben at linuxgazette.net> wrote:

> The point of CTPHs is to allow you to identify small differences (and
> identify the blocks where the difference occurs) in a large number of
> files. That's useful because a system attacker may modify a program to
> add a back door - and then obscure their tracks by randomly changing one
> harmless bit in thousands of other files (the example given was changing
> "This program cannot be run in DOS mode" to "This program cannot be run
> on DOS mode"), which would create a huge list of MD5 mismatches, thus
> hiding the hacked file.

Wow, Ben, I am amazed at how fast you deduced all these things right after my reply a few minutes ago :D

Looks like age hasn't slowed you down :D

-- regards,

Mulyadi Santosa Freelance Linux trainer and consultant

blog: the-hydra.blogspot.com training: mulyaditraining.blogspot.com
